Optimizing event-related potential based brain–computer interfaces: a systematic evaluation of dynamic stopping methods

Objective. In brain–computer interface (BCI) research, systems based on event-related potentials (ERP) are considered particularly successful and robust. This stems in part from the repeated stimulation which counteracts the low signal-to-noise ratio in electroencephalograms. Repeated stimulation leads to an optimization problem, as more repetitions also cost more time. The optimal number of repetitions thus represents a data-dependent trade-off between the stimulation time and the obtained accuracy. Several methods for dealing with this have been proposed as ‘early stopping’, ‘dynamic stopping’ or ‘adaptive stimulation’. Despite their high potential for BCI systems at the patient's bedside, those methods are typically ignored in current BCI literature. The goal of the current study is to assess the benefit of these methods. Approach. This study assesses for the first time the existing methods on a common benchmark of both artificially generated data and real BCI data of 83 BCI sessions, allowing for a direct comparison between these methods in the context of text entry. Main results. The results clearly show the beneficial effect on the online performance of a BCI system, if the trade-off between the number of stimulus repetitions and accuracy is optimized. All assessed methods work very well for data of good subjects, and worse for data of low-performing subjects. Most methods, however, are robust in the sense that they do not reduce the performance below the baseline of a simple no stopping strategy. Significance. Since all methods can be realized as a module between the BCI and an application, minimal changes are needed to include these methods into existing BCI software architectures. Furthermore, the hyperparameters of most methods depend to a large extend on only a single variable—the discriminability of the training data. For the convenience of BCI practitioners, the present study proposes linear regression coefficients for directly estimating the hyperparameters from the data based on this discriminability. The data that were used in this publication are made publicly available to benchmark future methods.


Introduction
Recent advances in brain-computer interfaces (BCIs) have resulted in both applications and methods that come closer to practical use [1][2][3][4]. BCI systems allow a person to control a device without the use of the brain's normal efferent pathways. In other words, by inducing recognizable brain states, a person can convey her/his intention directly to a device. This is of particular interest for people with severe motor disabilities, but can also prove useful in other settings [5,6].
BCI systems based on event-related potentials (ERPs) have proven to be particularly robust [7]. The principle idea of any ERP-BCI is to stimulate the user with two or more different stimuli. The user focuses on one of these stimuli, called the target. This target is generally less frequent than the non-targets, either by design or by the larger number of different non-focused stimuli. The target and non-target stimuli elicit different brain patterns, which can be detected and exploited by the BCI in real-time. Currently, the most commonly employed method for detecting this signal is noninvasive electroencephalography (EEG).
ERP-based BCIs are most widely used for communication applications and were originally introduced as such by [8]. Since then, several variations have been proposed, changing the interface [9][10][11] or the modality [12][13][14]. The original idea, however, remains the same; the user communicates by focusing attention on one of several offered options and thereby conveys her/his intention to the BCI.
Typically, a single selection step is referred to as a trial. It contains two main time intervals which are important for the estimation of the efficiency of the system: the period which contains the actual stimulation activity, and the preand poststimulus pauses. The latter two are collectively called the overhead and are required to inform the user of a selection and to prepare a new trial. Though the overhead is typically fixed, the length of the stimulation period can be adjusted.
Generally, a round (or iteration) of one stimulus for each option is performed before proceeding to the next iteration. Several such iterations make up the stimulation part of a trial. This repetition is done in order to improve the relatively low signal-to-noise ratio (SNR) of the resulting ERPs, and thereby increase the accuracy of the decision (see figure 1(A)). However, as each additional iteration prolongs the length of a trial, there exists a trade-off between accuracy and speed, which calls for data-driven optimization. A longer stimulation typically means higher accuracy but lower speed, whereas a shorter stimulation may result in lower accuracy but higher speed. The optimum in this trade-off of course depends on the SNR of the ERPs, but is to a large extent also determined by application factors, such as the number of trials needed for correcting a wrong decision, the number of trials needed to write a letter and automatic error correction features of the application. All these should be taken into account for finetuning this trade-off.
Typically, the above trade-off is not dealt with for the online use of a BCI, and an arbitrary number of iterations is fixed beforehand. Subsequently, offline post-hoc analyses of the online data are used to generate performance curves over the number of iterations, given some metric (for an example, see figure 1(B)). The peak performance is then taken as a measure of the possible performance of the system. This simple strategy of a fixed number of iterations is widely used in BCI, but generalization might be troublesome due to their post-hoc The bitrate as a metric does not include any error correction strategy of the application. Peak performances are at very low numbers of iterations. (C) Using the SPM metric, peak performances are reached for larger iteration numbers. In fact, for three out of four profiles, the lower number of iterations as suggested by the bitrate peak performances would not have allowed the user to write. nature; the determined number of iterations would have been optimal in retrospect, but this does not necessarily hold for future data, as the SNR is prone to changes.
This leads to the challenge of optimizing the number of iterations of ERP-based BCIs. The simplest method, and an extension of the above, is to learn the optimal number of iterations from the calibration data and fix this for online use. This will be referred to as fixed stopping, as it does not adapt to changes in the online data. More dynamic methods have been proposed in the literature to account for the varying SNR or user performance over time [15][16][17][18][19][20]. They stop the stimulation at any point in a trial when enough evidence (in the form of classifier scores) has been accumulated for a correct decision. The hyperparameter of the stopping method should be learned based on the calibration data. In order to outperform fixed stopping, such dynamic methods should ideally be robust against the distortions which are commonly observed in EEG data, and they should preferable be trained on a small set of training data.
To test and compare these proposed methods, they were implemented and validated, both on artificial data (checking for common data distortions) and on real EEG data of several paradigms. All stopping methods were provided with the output of a linear classifier for training and testing purposes. Though more sophisticated classifiers are sometimes used (outputting for instance probabilities), the most widely used classifier in ERP based BCI remains linear discriminant analysis (LDA). For this reason, all methods were restricted to work with data obtained from a regularized LDA. Each stimulus was classified independently, and the accumulation of the evidence was left to the stopping methods. This ensures the applicability of the results, as the methods could be realized as a simple module between the existing BCI and the application.

Material and methods
The dynamic stopping methods which were proposed in the literature [15][16][17][18][19][20] were tested on a variety of data. Unless stated otherwise, the implementations used in the following comparisons are equal to those reported in the original work. All of the methods use a different terminology in their original manuscripts, yet their fundamental principles are very similar. For better readability, the terminology for all methods has been unified.
Classifiers were trained to assign (more) negative scores to target epochs. All analyses were done using Matlab (Mathworks, Version 7.13 R2011b). Some basic notations that will be used throughout are: J: number of iterations; j indexes iterations. C: number of classes; c indexes classes. T : number of trials; t indexes trials. D ∈ R C×J×T : tensor containing classifier outputs. D train : tensor containing the training classifier outputs. D test : tensor containing the test classifier outputs. l ∈ Z T : the vector with the true binary class labels.
: set of trained parameters of a stopping method. L: training function which returns parameters . F: online decision function. M: sub-function, describing the data.
All of the methods have access to D train , l and for training purposes. For testing purposes only D test , and are available. As there is no access to raw EEG data, the methods are implemented as a module in between the BCI and the application. This ensures maximum compatibility with existing processing pipelines.
Given the training data D train , all dynamic stopping methods are trained with a function L, such that The type of parameters contained in differs for each method. Given , the function F is called after each iteration of an online trial. It is defined for the current trial t at iteration j by where c w j,t indicates the winning class, and the empty set ∅ indicates that no early stop should be taken. In case a trial reaches the maximum number of J iterations, the winning class is returned (see equation (3)). In most cases, function F relies on a subfunction M, which summarizes the data. These functions are core to each dynamic stopping method.
Unless defined directly by the method, the winning class c w j,t is defined for trial t and iteration j as where the median is taken over all iterations up to iteration j, for each class. Throughout this manuscript, the classification data are often summarized in terms of its discriminability. The receiver-operator characteristic (ROC) is a measure to estimate the performance of a two-class classifier, expressed in the ratio between true-and false positive decisions for various values of the bias of the LDA decision function. It can be summarized in a single value as the area under the ROC curve (AUC). This measure is used here to describe the quality of the data. The AUC is bound between 1 (perfect separation, always estimates the correct class label) and 0 (perfect separation, always estimates the wrong class label). An AUC value of 0.5 means that the classifier is random and does not realize a class separation.

Tested methods
All methods are conceptually described here, within the above framework. A more detailed description is given if the implementation deviates from the original implementation. For a full account, the reader is referred to the original publications. All evaluations are done in the context of a text spelling application and thus by optimizing the hyperparameter S of a stopping method with respect to the symbols per minute (SPM) metric (see section 2.4 and 2.5). The following methods are investigated: No stopping. In this condition, no early stopping is performed. It uses all available iterations and is therefore the performance baseline. No parameters need to be estimated. Formally, this means that = {J}, the maximum number of iterations possible. F (D test , , t, j) returns c w j,t (according to equation (3)) when j = J.
Fixed (optimal). The simplest way of optimizing the BCI is to estimate an optimal fixed number of iterations on the calibration data and to apply this for the online data. Because the number of iterations can be set in virtually any BCI system, this method requires no implementation effort. The number of iterations j † which results in the highest SPM on the offline/training data is selected for online use. is thus defined as = { j † }. As j † is directly optimized it equals S. During online use, F (D test , , t, j) returns c w j,t (equation (3)) when j = j † . For the fixed condition, the optimal number of iterations is estimated once from all of the training data; it serves as a evaluation of the post-hoc estimation of BCI performance which is generally given in the BCI literature. For the fixed optimal condition, a re-sampling scheme (see section 2.5) is used to increase robustness.
Liu [17]. This approach simply sums the outputs of a linear classifier per class and over iterations. M is defined by for trial t and iteration j, thus taking the most negative class. The average of all target classification scores in D train , calledd, is taken as the initialization for optimizing = Sd (section 2.5). then contains this threshold for online use. During online operation, the decision for an early stop is taken as follows: where c w j,t is the class with the most negative summed score. In the original work, EEG features from iteration i are smoothed together by averaging them with features from iterations i − 1 and i − 2 before classification. Here, access is only given to the instantaneous classifier output by design, so this pre-processing step is left out.
Jin [20]. This method sets a minimum number of consecutive iterations ( j min ) which must result in the same class prediction. To measure this, M(D, t, j) is defined as 1 + n, where n denotes the number of iterations directly prior to iteration j which all lead to the same decision as j does (according to equation (3)). The original paper performs an additional pre-processing step on D to give D * where D * c, j,t = 1 j j j =1 D c, j ,t for each j j mean . Since it only includes processing the classifier outputs, the same is done here. Both values of = {j min , j mean } are themselves the hyperparameters (S 1 = j min and S 2 = j mean ). They are optimized directly according to the procedure in section 2.5.
In the online case, the decision function F is as follows: if M(D * ,test , t, j) j min and j j mean ∅, otherwise.
Here, both j min and j mean were bound between 2 and 5. As there are no decisions taken before j < j mean , the earliest possible stop is at iteration j mean + j min .
Lenhardt [16]. This method was originally proposed for a visual speller, and defines two thresholds. For both max(X−min(X )) , a vector normalization over the class dimension. Lenhardt requires the target class to have positive labels. Since the target class was defined to be negative in this framework, a sign reversal is done within this transformation.
The first threshold is based on a measure called the 'intensity', or M 1 , defined for iteration j and trial t as M 1 is thus simply the sum of all values in D * 1...C,t, j . The second threshold is based on a slightly different measure M 2 , defined as where . M 2 is thus the ratio of the highest to the second highest value in D * 1...C,t, j . To find both thresholds, the training function L(D * ,train , l, ) first calculates the intensity M 1 (D * , t, j) for the earliest possible stop which leads to a correct decision (according to equation (3)). This is done for each training trial, and the average of these intensities,ĩ, serves as an initialization for optimizing 1 = S 1ĩ (see section 2.5). The second threshold, 2 , is optimized on the training data directly The online decisions for an early stop are taken by F as follows: The original publication calculates the thresholds using data of several subjects. Here, it is restricted to the more common case of only having the calibration data of a single subject. Rank diff [19]. Rank diff was designed for the AMUSE paradigm. However, the AMUSE dataset used in the current performance comparisons was not available during the design of Rank diff. The method is based on M(D, t, j), defined for iteration j and trial t as j,t is found using equation (3), c * runs over all other classes, and med is the median operator.
= { 1 , . . . , J } contains J iteration dependent thresholds , allowing it to model the within-trial variation of ERP responses. To determine j (the threshold for iteration j), L(D train , l, , j) calculates M(D train , t, j) for each training trial. These are assigned to one of two vectors, depending on whether the estimated class label was correct (+ j ) or incorrect (− j ), using equation (3). To implement a security margin that prevents overly optimistic early stops, j is then defined as The hyperparameter S is set globally over all iterations, and is optimized as described in section 2.5.
A larger value for M means a better separation of the class medians, and thus more confidence in a decision. For online use, F is defined as Höhne [18]. This method was again proposed for an auditory speller paradigm. A t-test with unequal variances (also referred to as Welch's t-test) is calculated on D 1... j for each class against all other classes. Equation (12) shows the test statistic, where X i , s 2 i , N i are the sample mean, sample variance and sample size, respectively. In order to account for a varying number of data points, the t-test statistic t(D c,1... j,t , Dc ,1... j,t ) is transformed into a p-value (computed from a onesided test, see supplementary appendix A (available from stacks.iop.org/JNE/10/036025/mmedia)). This transformation is part of any statistics toolbox. M(D, t, j) is then defined as this p-value.
The natural threshold on such a p-value is the significance level, or = {α}. It is optimized for SPM on the training data directly, with α = S (section 2.5). During online operation, the decision for an early stop is taken according to the following where c w j,t is the class with the smallest M value. Zhang [15]. This method proposes a probabilistic approach to continuously tracking whether a BCI user is in control or non-control state. The core of the algorithm is the formulation of a joint likelihood function, which accumulates the evidence for being attended for each stimulus over iterations. Here, this evidence is used as a criterion for early stopping.
The method requires that the distributions of the classifier outputs for target and non-target stimuli, f a and f u , are known. In practice, they are estimated from the training data. Here, univariate normal distributions for the classifier outputs are assumed, which are fully characterized by their means and variances μ a/u and σ 2 a/u . Note that this assumption is compatible with the assumption of multivariate normal distributed data made by the LDA classifier (see supplementary appendix B (available from stacks.iop.org/JNE/10/036025/mmedia)).
Making the further simplifying assumption that the EEG data are independent across iterations, as well as across stimulations within one iteration, the joint likelihood of observing the classifier outputs of the first j iterations given that stimulus c is attended is According to Bayes' theorem, the posterior probability of stimulus c being attended after the first j iterations is given by where P(c), c = 1, . . . ,C denote the prior probabilities with which the C classes are selected by the user. These probabilities may be estimated in the training phase or, e.g., derived from a language model in spelling applications. If all classes are assumed to be equally likely, uniform priors P(c) = 1/C, c = 1, . . . ,C are used. Note that in practice, numerical inaccuracies may occur when evaluating equations (14) and (15) due to excessively small values of the joint likelihood function. Therefore, all computations are carried out on log probabilities, as in the original publication [15]. The joint likelihoods and posterior probabilities of each class are updated after each iteration. The function M(D test , t, j) is defined as the maximum of the posterior probabilities of all classes: A natural stopping condition is attained, if equation (16) exceeds 1 − α, where α = S is a hyperparameter to be optimized based on the training data (section 2.5). Thus, The decision at iteration j then reads: where c w The upper bound method uses knowledge about the class labels and is not applicable for normal online use. It is applied for reasons of comparison and to deliver an upper bound on the performance of any of the other stopping methods. It simply stops a trial at the first occasion which leads to a correct decision. If a trial does not result in a correct decision at any iteration, the full number of iterations is applied.

Artificial data
The most important requirement for the methods is that their performance should at worst drop to the baseline performance of the no stopping strategy whenever data become difficult to separate. In other words, they should gracefully degrade. Several artificial datasets were constructed to test the robustness of the stopping methods for data distortions which are typically observed in classifier outputs during BCI online sessions (see figure 2). Unless stated otherwise, data are sampled from two normal distributions (target and non-target) with unit variance. They represent the classifier outputs and not the EEG data or features thereof. A training set and a validation set were generated independently. For the calculation of the SPM metric, all of the following datasets were assumed to have a SOA of 200 ms, six stimulus classes, a two-step interface, and an overhead of 10 s per step. The influence of the following distortion types were investigated (but not interactions between different types): Separation. To mimic data that is generally difficult to classify, the distance between the target and non-target class mean is varied from zero standard deviations (STD) for no separability, to three STD (good separability), effectively changing the AUC. This is done equally for both training and validation data. Based on the results, all other datasets were constructed with a relative high AUC value of 0.85.
Outliers. Artifacts in the EEG create outliers in the classifier outputs. These can generally be removed from the calibration data before training the classifier. Particularly during longer use, the number of artifacts in the online phase will be larger and removing them is not straightforward. Thus, between 0% (no outliers) and 65% (many outliers) are introduced to both classes, but in the validation data only. Outliers are randomly sampled from a uniform distribution on [−3, 3].
Confusions. A typical phenomenon in some BCIs is a systematical confusion, where one particular non-target is mistaken for a target more often than other non-targets [21]. This leads to a partial overlap of non-target and target classifiers scores. To simulate this, one non-target class was shifted from the non-target distribution (normal separation) toward the target distribution (no separation) in several steps. This is a special case of the separation dataset, where only one non-target stimulus class is affected by bad separability. It is applied to both training and validation data.
Datasets with changes in terms of drift, scale, distribution and number of classes were also tested. Since their influence on the performance was minimal, they are not described fully, but only depicted in figure 2.

EEG data
EEG data from 83 sessions were taken, originating from AMUSE [19], an auditory ERP experimental BCI paradigm, and five visual ERP paradigms: rapid serial visual presentation (RSVP) [22], movement-onset visual evoked potential (MVEP) [23], and standard visual evoked potentials from the paradigms HEX-O-SPELL, center speller and cake speller [10]. Datasets were previously recorded for the respective publications. All datasets consisted of an offline calibration part and an online spelling part; in all the original studies the users' task was to write text. Details on the datasets can be found in table 1. All calibration datasets were reduced to 24 trials. Due to the varying number of iterations per trial, there was still a discrepancy between datasets. Since it is not the primary goal here to compare datasets, this was not considered a problem. All online results are based on 21 training trials, to resemble the amount available in one cross-validation fold. Only results explicitly testing the influence of training data had a varying number of training trials.

Pre-processing and classification.
All data were high-pass filtered above 0.1 Hz during recording. For the current analyses they were further low-pass filtered (below 30 Hz) before being downsampled to 100 Hz and segmented according to the stimulus onset. Single epochs were baselined to the interval 150 ms prior to the stimulus onset. For feature extraction, data were further downsampled using a priori knowledge to use a higher sampling density for early components (30-350 ms: average over 30 ms bins) than for late components (360-800 ms: average over 60 ms bins). All data prior to 30 ms, and after 800 ms post stimulus were discarded for classification.
The resulting feature vector of an epoch is rather large, which requires regularization of the classifier. Here, an LDA classifier is used with shrinkage regularization of the covariance matrix [24]. It outputs distances to the learned separating hyperplane, the sign of which indicates the predicted class label.

Metric
For the evaluation and optimization of early stopping methods, a metric is needed which respects the speed-accuracy trade-off under consideration of the overhead. The information transfer rate (ITR) [25,26] is an obvious performance measure, as it is predominant in the BCI literature. However, for spelling applications it is more straightforward and intuitive to use the SPM metric [27]. In addition to the overhead, it considers the application specific error handling, the benefit of which is aptly shown in figure 1. Typically, a correct selection counts as +1 symbol, and an erroneous selection counts as −1 symbol, to account for the added effort of a necessary subsequent corrective action. SPM is defined as follows: effective correct = percent correct -percent erroneous (18) (N), number of channels, modality (A = auditory, V = visual), number of classes, steps (number of trials per character), the maximum number of iterations in the original data, the SOA and the overhead (time per trial that needs to be added to the stimulation period). As AMUSE is exclusively auditory, the labels were read out to the user, explaining the larger overhead. where decisions per minute considers the overhead and stimulation, as discussed in the introduction. In a two-step selection process, errors can be made on the first level (group selection) and on the second level (single symbol selection). When the group selection is false, the second level contains the possibility to select an 'undo' action, which cancels the previous selection. As no symbol is selected after such a meander, it counts as a 0 symbol selection. Percent correct is now interpreted as the accuracy on both levels and percent erroneous as the percentage of erroneous selections on the second level. Note that the simplifying assumption that the accuracy is the same for each symbol was made (similarly to the assumption of the ITR as defined in [26]).

Dynamic stopping optimization
All methods have one or more hyperparameters S, which allow the user to influence the trade-off between speed and robustness. It requires some form of optimization within the training function L. Interestingly, most of the original publications do not provide a way to properly estimate the value of this hyperparameter. Therefore, the hyperparameter S is optimized using a resampling procedure on the classifier scores obtained from a two-fold cross-validation on the training trials (see section 3). Two-thirds of the training trials are randomly selected to estimate the necessary data statistics (for instance the intensities in Lenhardt or the means, priors and variances in Zhang). Given these statistics, the remaining data is used to optimize hyperparameter S for SPM, using a simple linear search. This procedure is repeated 15 times and the median hyperparameter is subsequently applied to the test trials.
Some methods ((Fixed) optimal, Höhne, Jin) do not require such a compound optimization. However, they too are estimated using the resampling procedure for extra robustness and better comparability. In these cases the hyperparameter is optimized on the larger two-thirds of data split directly. In the case of two mutually dependent hyperparameters (Jin and Lenhardt), they are estimated simultaneously, by extending the linear search to a two dimensional grid search.

Directly estimating S
The size and heterogeneity of the datasets can be used to find proxies for the optimization of hyperparameter S. For this purpose, the optimized values of S for all training sets were investigated for dependencies on data features. Any resulting independent variable should be chosen such that it can easily be derived from a calibration dataset. As the AUC of the training data is highly correlated with S, and is independent of the sample size, it was chosen as the independent variable. The dependent variable S was then linearly fitted to the AUC of the training data. Fitting was done using iteratively reweighted least-squares analysis [28] to reduce the influence of outliers. This gives two coefficients for each method, which transform the AUC of a dataset into a corresponding hyperparameter S † , which approximates S. All EEG datasets were re-analyzed using this approximation by S † , and the performance was compared to the optimized variant. In the case of multiple hyperparameters, the one with the highest correlation coefficient was approximated whilst fixing the second one. For Jin, j mean was fixed to the minimum of two iterations, and for Lenhardt, S 1 was fixed to 1.

Experimental procedure
All analyses were performed offline, but maximum generalization was ensured by leaving out all online data for validation. The experimental protocol is depicted in figure 3 and followed these steps: (i) EEG data are split into training and test trials (see conditions below). (ii) Training trials are transformed into classifier scores through two-fold cross-validation. Additionally, a classifier is trained on all training trials and applied to the test trials. (iii) The classifier scores from the cross-validation are used to obtain for each stopping method. The trained stopping methods are then applied to the classifier scores from the test trials, to determine the stopping point for each trial. (iv) Accuracy and stimulation duration are transformed into SPM. The resulting SPM of the validation data is used throughout this manuscript, unless stated otherwise.
Two conditions need to be distinguished. Offline: the above procedure takes place inside an eight-fold crossvalidation. Data comes from the original calibration data only, and at each fold 21 trials are used for training and three trials are used for testing. All parameters are estimated within the cross-validation procedure. Quasi-online: this condition was performed once, after the optimization procedures performed satisfactorily. Here, 21 trials of the calibration data were used as training trials, and the test trials consisted of all online data.
The procedures were the same for the artificially generated data, with the addition that the experiments were run 40 times per condition. All reported performance gains are calculated with the no stopping condition as a baseline condition, by subtracting the no stopping performance. Negative performance gains thus mean that the stopping method results in a worse performance than when no stopping is applied.

Artificial data
All of the methods are quite robust to most types of data distortion. The largest performance deteriorations were caused by the class separation, outliers and systematical confusions, which are shown in figure 4(B). Results are summarized by their mean performance gain over the no stopping condition (see figure 4(A)) over the given parameter space. The gain of a specific setting was considered significant if at least 90% of the 40 simulations yielded an output on the same side of the baseline (black bars). Bars are gray where no such significance was found.
Separation. Most of the methods behave well for data that is difficult to separate, meaning that they do not decrease performance below the no stopping baseline when data separation goes to zero. Liu clearly struggles with the data as they becomes difficult to separate, as they severely undershoot the baseline performance. This is reflected in the relatively low average gain. To a lesser extent, Jin suffers from the same effect for the most difficult data. For difficult data, Rank diff shows a constant and tiny, but significant reduction. Zhang is exceptionally resilient to inseparable data, with a significant performance gain for all but the most inseparable data.
Outlier. All of the methods have severe problems in dealing with outliers. None of the methods reached a positive average gain over the given parameter space. Though Liu performs worst for outliers, here the picture is not that clear. In fact, all of the methods severely deteriorate performance compared to the baseline when many outliers are present, though the level of tolerance differs slightly between methods. It is important to note that the number of outliers went as high as 65%, and that these outliers were not observed in the calibration data.
Confusion. All methods are influenced by systematic confusions, but they degraded gracefully to the baseline. Interestingly, also Liu does not show any significant performance reduction from the baseline, although confusion is a special case of separation.
The results for all of the datasets can be found in the supplementary material (see supplementary figure 1 (available from stacks.iop.org/JNE/10/036025/mmedia)). It is worth noting that Liu is the only method that is clearly not robust against drifts, showing a performance peak around the no-drift parameter and falling off to both sides with positive and negative drifts. Furthermore, Liu, Rank diff, and Zhang show reduced performance gain when the data is downscaled. For all of the other methods, and for upscaling, only marginal influences were found. An increased number of classes reduced the gain for all methods, and a more skewed distribution showed higher gains for all methods.

EEG BCI data
While the results on the artificial data give an intuition of the properties of the various methods, the truly relevant question is their behavior on real BCI data, which is summarized in figure 5. On average over all datasets, all methods improve the performance. However, this does not hold when observing the paradigms individually.
All of the methods significantly increase the performance on two datasets (center speller and RSVP speller), and all methods apart from fixed and fixed optimal managed to increase the performance on the cake speller dataset (for significance levels see figure 5). Performance on the hexo-spell dataset is significantly increased by all but Liu. A significant increase in performance for the MVEP dataset is found for Rank diff, Zhang, and Lenhardt. None of the methods significantly increases (or decreases) performance on the AMUSE dataset.  In summary, no method increased performance significantly on all datasets. Rank diff, Zhang, and Lenhardt did so for five datasets, Höhne and Jin for four datasets, and fixed, fixed optimal, and Liu for three datasets. Figure 6 shows the average gain over no stopping for each stopping method, accumulated over datasets. Over all datasets, Höhne, Lenhardt, and Zhang performed particularly well, as indicated by the high cumulative average gain (4.17, 4.14, and 4.47 respectively). Directly after these three methods, Jin, and Rank diff follow with cumulative gains of 3.11, and 3.49. The cumulative gain for Liu is competitive (2.83), in particular for the 'easier' datasets. When subtracting the negative gain for AMUSE and MVEP it reduces to 2.60.
Both of the fixed methods (fixed and fixed optimal) show a cumulative gain over no stopping (2.08 and 2.36, after subtraction of negative paradigms). However, their gains are clearly smaller when compared to most dynamic methods. A full image depicting individual gains for all subjects and all stopping methods can be found in supplementary figure 2 (available from stacks.iop.org/JNE/10/036025/mmedia). In absolute terms, the fixed methods and Liu reduce performance for about 26.5% to 30% of the subjects, against 8.4% to 15.7% for the dynamic methods.
In short, Höhne, Lenhardt, and in particular Zhang perform best in terms of gain. Liu performs good for good subjects, but underperforms on difficult EEG data. Figure 7 shows the effect of the training set size on the estimated online performance, as averaged over all datasets. With the minimum number of three training trials, all methods but Jin and Lenhardt decrease in performance with respect to no stopping, on average. With 12 training trials or more, all methods increase performance over the no stopping condition, on average. Performance for the no stopping condition converges after about 12 trials, as the accuracy saturates to 100% for several paradigms. For all stopping methods, the relative gain increases substantially during these first four conditions (three to 12 training trials). After 12 training trials, most dynamic methods keep increasing in performance until trial 21, showing a performance gain that is impossible without such stopping methods.

Required training data
Importantly, on average all dynamic methods, except for Liu, consistently outperform the fixed and fixed optimal method from six training trials upwards. Most dynamic approaches can thus exploit the data better than simply fixing a number of iterations, even with small amounts of training data. Liu needed on average 15 trials or more to outperform fixed (optimal). Though the effect size varies, the average findings largely transfer to the individual datasets (not shown).

Typical parameter settings
The hyperparameter S of most methods is particularly correlated with the AUC of the training data. Even across different paradigms, with all the different settings, the optimized S is a rather robust function of this data feature (see supplementary figure 3 (available from Figure 5. Quasi-online performance. Averaged SPM performance is given for all stopping methods (columns) and experimental paradigms (dots). The connected black dots represent the averages over experimental paradigms. It was calculated over all subjects directly and accounts for the different number of subjects per paradigm. An asterisk marks the significance level of a performance increase over no stopping (*: p < 0.05. **: p < 0.01), as found by a Bonferroni-corrected paired t-test. Figure 6. Cumulative gain. The y-axis shows the cumulative average gain with respect to no stopping), for each of the stopping methods. The height of a segment indicates the average performance gain which was realized for a particular dataset. Negative average performance gains are displayed below the zero line. It is interesting to note that the performance metric is SPM, and thus directly interpretable. Figure 7. The influence of the number of training trials on the performance of stopping methods. With a low number of training trials, most dynamic methods deteriorate the performance with respect to no stopping. When increasing the number of training trials, the dynamic methods start outperforming no stopping. Their performance increase converges at a larger number of trials, indicating a gain which is impossible to achieve without using dynamic methods. Figure 8. Performance comparison between optimized S and global S † , which has been estimated by linear regression. The optimized S boxplots are the quasi-online performance results of each method, when trained using the subject specific optimization (S). The global S † boxplots indicate the results, when the methods are re-run and S † is directly estimated from the AUC of the training data, using the coefficients in table 2. Performance differences are not significant, as confirmed by a Bonferroni-corrected paired t-test. This advocates in favor of using the simple regression method for estimating the stopping hyperparameters, which improves the usability of stopping methods for the practitioner.
The result of re-analyzing all of the data, using the above equation to determine S † , can be found in figure 8. Results are summarized by their median, 25 and 75 percentiles and minimum and maximum gain. Though the use of S † significantly decreased the training performance for several methods when compared to the optimal S (not shown), it did not significantly change the online performance for any method. Significance was tested using a Bonferroni-corrected, paired t-test.

Discussion
Several methods for dynamic stopping have been proposed in the literature, but a direct performance comparison has been difficult due the variety of paradigms, pre-processing steps, timings and evaluation metrics. Here, these methods are systematically tested within the same theoretical framework, on the same datasets and under the same conditions. Artificial data are generated and evaluated to test the methods' performances under strictly controlled conditions. Subsequently, their performance is evaluated on pre-recorded real EEG data from a variety of different BCI experiments, which cover the spectrum of ERP paradigms reported in the BCI literature well. This evaluation strategy allows us to draw conclusions on the benefits and drawbacks of these methods and to directly compare between them.
In general it can be said that early stopping, be it dynamic or fixed, increases the performance in the majority of cases. This is particularly true when a paradigm results in highly discriminative classifier scores, with average performance gains of up to 1.19 SPM for the RSVP data with the stopping method of Zhang-which represents a 78% performance increase over no stopping. Most methods can be applied even when the classifier scores become less discriminative; though they might no longer increase the performance, they reduce to the baseline performance at worst. Artificial data (figure 4) show that this is the case for all methods but for Liu and-in case of very difficult data-also for Jin. Considering patient applications of stopping methods, this is an important feature, as patient data are typically less clean and more difficult to classify. Most methods can thus be applied without worrying about performance degradation.
Adding a robust early stopping module to the ERP based BCI is a good idea, provided that enough data are available. Training an early stopping method on top of the classifier effectively means estimating more parameters from the same amount of data. This can pose a problem, as data are often limited. However, over all datasets, 9 to 12 training trials suffices for the methods to increase the online performance on average. Given the difference in epochs per trial between datasets, the actual number may vary for different paradigms. Though nine trials is an amount that is normally available for training, the more difficult the data are, the more trials may be needed. For difficult datasets (such as MVEP or AMUSE), dynamic stopping methods need more data to perform reliably.
Before additional modules are added to a complex BCI system, another aspect needs to be taken into consideration: the tested methods have a different implementation complexity, though it is difficult to quantify this. The fixed methods are obviously the simplest. Taking the best performing methods Lenhardt, Rank diff, Höhne and Zhang, the simplest may arguably be considered Höhne, given that the necessary statistical tests are part of most data analysis toolboxes. The Bayesian statistics behind Zhang could be considered more complex and requires a custom implementation. Rank diff and Lenhardt are rather simple to implement, and fall somewhere between the other two. All methods showed runtime performances that are well in the range of what would be applicable online (typically between 3 and 27 msec per decision). As the methods were not optimized for runtime performance, these values can most likely be reduced.

Dynamic versus fixed
Early stopping methods allow the BCI to adapt to the current state of the user. In doing so, they can reduce the time required for a selection, and thus increase the robustness of the system. The latter seems to be particularly true for dynamic methods, since they can adapt to changes in the data. Though fixed and fixed optimal are the simplest methods and require very little adaptation of the BCI system, they are not dynamic; during online application the number of iterations is fixed regardless of the data coming in.
On average the fixed methods do increase the performance with respect to no stopping. Unfortunately, their performance gains leave much room for improvement, which is observed both in artificial data and real EEG data. This indeed is due to the fact that these methods are incapable of adapting to changes in the data distributions over time. When observing the online data, parts of the data that have an AUC below average show the largest reduction in performance for the fixed methods. Interestingly, they also show the largest performance improvement when non-stationarities temporarily lead to a higher SNR of the data.

User-optimized S versus global S †
As most of the original publications do not specify how to find an appropriate value for the hyperparameter (here called S), a general optimization scheme is proposed. With the SPM, it uses a realistic metric as objective function and leads to considerable performance gains. To optimize S, realistic simulations for the online performance for different values of S are required, which can be time consuming and error-prone to implement. For the BCI practitioner, a look-up table or regression function would be convenient. For instance, such a regression function would help to determine an at least suitable hyperparameter for a new BCI user, given some collected training data. The empirical observation that S is highly correlated with the AUC led to a set of coefficients that translate a given AUC of the training data into an approximation of S, called S † . Using S † , the online performance gains were similar to the performance gains using S.
Apart from an interesting basic finding, this result has practical relevance. The AUC can be estimated reliably with a relatively low amount of data, say a few trials. Using the regression coefficients, the problem with training on a low number of training trials may thus be overcome. But even with appropriate training size, they alleviate the practitioner of the task of running simulations to optimize S. They thereby reduce the barrier for using stopping methods in similar paradigms.

Benchmark data
The classifier scores of the EEG data used in this study are available in the supplementary material (available from stacks.iop.org/JNE/10/036025/mmedia). It contains all data required for reproducing the current results and, more importantly, for benchmarking future methods. Classifier scores for each fold are given, such that over-fitting can be avoided and realistic comparisons can be made. Furthermore, example scripts were created to facilitate the use of these data. They take care of the proper cross-validation. Only one training and one test function need to be written for any new method.
Be aware that the implementation that is the basis for this paper differs slightly and contains a random factor; results will thus not exactly match those from the example script.

Limitations of the study
With offline analyses, the danger of overfitting the data is always tempting, and online validations are considered to be the last conclusive evidence. In this case, the protocol was kept strict so as to mimic such a validation, leaving the online data untouched until the very end. As running online trials with several stopping methods active at the same time would be impossible, a block structured protocol would be necessary. The current quasi-online protocol bypasses this need, whilst keeping the validity of the test intact. One clear difference between the two is that for the latter, the human is not in the loop. Potentially, shortening the trials can increase or reduce the workload on the user. In particular when an early stop results in a false decision, the user might get frustrated. This study shows the principle validity of the tested methods. Influences of such methods on the human in the loop remain an open topic, though at least for Jin [20], Lenhardt [16], and Rank diff [19] there is partial online evidence from direct comparison.
Method Liu differed slightly from the original implementation, to comply with the testing framework. The linear kernel support-vector machine [17] originally employed, is very similar to the LDA used here. This design difference is not likely to be the cause for the poor performance. The effect of temporally smoothing the features, as proposed by Liu, could result in cleaner data. At the same time it introduces strong dependencies between samples, which violates the assumptions of most classifiers. The influence of this on the current results is unclear.
For the sake of comparability, the current analyses were done using a single linear classifier. LDA is the most prominent classifier in the ERP based BCI literature, and particularly suited for ERP data [24]. However, given that for all methods the gained performance depends on the AUC of the classifier data, it could be expected that any improvement made to the AUC-by smarter feature extraction or better classificationcould pay off double: it can improve the performance in itself, and additionally benefit more from the stopping methods. The influence of different, potentially nonlinear methods remains an open question.

Conclusion
This study for the first time benchmarks several methods for optimizing event-related potentials (ERP) based braincomputer interface (BCI) under the same conditions. The results clearly show the benefit of stopping methods, both fixed and dynamic. In particular, methods Zhang, Höhne and Lenhardt can be recommended for high performance. Caution is to be taken with Liu, in particular for subjects with less discriminative data, as it can clearly reduce performance with respect to the baseline. All in all, these methods could be highly instrumental in making ERP based BCIs even more efficient and adaptable to transient changes in data characteristics, both of which are important for long term use of a BCI.