Paper

Channel capacity in brain–computer interfaces


Published 18 February 2020 © 2020 IOP Publishing Ltd
Citation: Thiago Bulhões da Silva Costa et al 2020 J. Neural Eng. 17 016060. DOI: 10.1088/1741-2552/ab6cb7


Abstract

Objective. Adapted from the concept of channel capacity, the information transfer rate (ITR) has been widely used to evaluate the performance of a brain–computer interface (BCI). However, its traditional formula assumes the model of a discrete memoryless channel whose transition matrix presents very particular symmetries. As an alternative way to compute the ITR, this work indicates a more general closed-form expression—also based on that channel model, but under less restrictive assumptions—and, with the aid of a selection heuristic based on a wrapper algorithm, extends such a formula to detect classes that deteriorate the operation of a BCI system. Approach. The benchmark is a steady-state visually evoked potential (SSVEP)-based BCI dataset with 40 frequencies/classes, on which two scenarios are tested: (1) our proposed formula is used and the classes are gradually evaluated in the order of the class labels provided with the dataset; and (2) the same formula is used but with the classes evaluated progressively by a wrapper algorithm. In both scenarios, canonical correlation analysis (CCA) is the tool to detect SSVEPs. Main results. Before and after class selection using this alternative ITR, the average capacity among all subjects goes from 3.71 ± 1.68 to 4.79 ± 0.70 bits per symbol, with p-value < 0.01, and, for a supposedly BCI-illiterate subject, her/his capacity goes from 1.53 to 3.90 bits per symbol. Significance. Besides indicating a consistent formula to compute the ITR, this work provides an efficient method to perform channel assessment in the context of a BCI experiment and argues that such a method can be used to study BCI illiteracy.


1. Introduction

Brain–computer interfaces (BCIs) are systems that allow a connection between brain activity and machine instructions to be established in such a way that the traditional nervous and muscular pathways need not be used (Vidal 1973, Wolpaw et al 2002, Kübler and Müller 2007, Abiri et al 2019). To make this direct connection trustworthy, several indices have been proposed to assess the effectiveness of these systems (Schlögl et al 2007, Thompson et al 2014). In the BCI community, the most common of them is the information transfer rate (ITR) (Yuan et al 2013, Sadeghi and Maleki 2019). In fact, many works have used this metric to evaluate their contributions to the area (Yin et al 2015, Xu et al 2016, Cao et al 2017, Eggers et al 2018, He et al 2018, Shin et al 2018, Jiang et al 2018, Han et al 2019, Ingel et al 2019, Zhang et al 2019).

Originally grounded in the fields of information theory and communication systems, the ITR is in essence a measure of the quantity of information conveyed by a channel during a data transmission protocol (Cover and Thomas 2006, Proakis and Salehi 2008). Evidently, all sorts of channels found in nature or built by human beings have a maximum ITR above which no reliable communication is possible (Shannon 1948). In the aforementioned fields, this limit rate is known as the channel capacity.

Some of these concepts have undeniably already been brought to the BCI context (Vidal 1977, Wolpaw et al 1998, Schlögl et al 2003, Kronegg et al 2005, Fatourechi et al 2006). Consider, for instance, the usual ITR formula, given by equation (1).

Equation (1): $\mathrm{ITR} = \frac{60}{T}\left[\log_2 M + p\log_2 p + (1-p)\log_2\left(\frac{1-p}{M-1}\right)\right]$

In this expression, M is the number of classes, p is the expected value of the classifier accuracy and T is the time window, in seconds, to transmit each decision (Wolpaw et al 1998, Yuan et al 2013). It is basically a transmission rate, in symbols per minute, times a particular capacity of a discrete memoryless channel, in bits per symbol (Fano 1961, Ash 1965, Cover and Thomas 2006). In theory, such an ITR imposes the limit for lossless communication in a BCI experiment.
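For concreteness, equation (1) can be sketched as a small function; we use Python here for illustration (the study itself, as noted later, used MATLAB), and the function name and the special-casing of the error-free limit are our own choices:

```python
import math

def wolpaw_itr(M, p, T):
    """Equation (1): symbols per minute (60 / T) times the bits per
    symbol of a doubly symmetric channel with accuracy p.

    M: number of classes; p: expected classifier accuracy, 1/M < p <= 1;
    T: seconds needed to transmit each decision.
    """
    if p >= 1.0:
        bits = math.log2(M)  # no errors: reduces to C0 = log2(M)
    else:
        bits = (math.log2(M)
                + p * math.log2(p)
                + (1 - p) * math.log2((1 - p) / (M - 1)))
    return (60.0 / T) * bits  # bits per minute
```

For instance, 40 classes at perfect accuracy and one decision every 4 s would give fifteen times log2(40) bits per minute.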

However, even within the model of a discrete memoryless channel, a limiting hypothesis by itself, equation (1) is only valid under even more restrictive assumptions: (1) the numbers of possible sent and received symbols are equal, (2) the hit rates of all sent symbols are equal and (3) the respective misclassifications are equally distributed among the unsent symbols. These three statements refer to a particular case known as the doubly symmetric—or doubly uniform—channel (Fano 1961, Ash 1965). By opting for this model, one discards most of the information in the (rarely symmetrical) confusion/error matrix—the main descriptor of the classifier behavior (Fukunaga 1990, Congalton 1991)—when evaluating a BCI.

In view thereof, the present work proposes the use of a more general channel capacity—one that considers the error matrix—to compute the ITR and, in addition, it argues that this approach can be extended to rank and to select the best symbols to be transmitted through the channel. The last proposal represents a convenient form to identify classes that worsen the overall performance of a BCI system.

2. Capacity of a discrete memoryless channel

Consider a process $X\rightarrow Y$ , with input and output symbols, respectively, modeled by the random variables $X\in \mathcal{X}$ and $Y\in \mathcal{Y}$ , in which Y is related to X by the conditional probability distribution $p(Y/X)$ . So, a discrete memoryless channel is a system of probabilities—with an input alphabet $x_1, x_2, \ldots, x_L$ ; an output alphabet $y_1, y_2, \ldots, y_M$ ; and a fixed transition matrix ${\bf P}=[P_{ij}]$ , where $P_{ij} = p(Y=y_j | X = x_i) = p(y_j | x_i)$ (equation (2))—wherein the output at every moment does not rely on past or future entries (Fano 1961, Ash 1965).

Equation (2): ${\bf P} = [P_{ij}] = \begin{bmatrix} p(y_1/x_1) & p(y_2/x_1) & \cdots & p(y_M/x_1) \\ p(y_1/x_2) & p(y_2/x_2) & \cdots & p(y_M/x_2) \\ \vdots & \vdots & \ddots & \vdots \\ p(y_1/x_L) & p(y_2/x_L) & \cdots & p(y_M/x_L) \end{bmatrix}$

In order to assess the capacity of this channel, the general procedure employs the mutual information measure $I(X; Y)$ (equation (3a)), which denotes the entropic difference between $H(Y)$ and $H(Y/X)$—the former the average self-information in the realization of Y and the latter the average uncertainty remaining on Y after the realization of X (Shannon 1948, Fano 1961, Ash 1965, Cover and Thomas 2006, Rioul 2018).

Equation (3a): $I(X; Y) = H(Y) - H(Y/X)$

Equation (3b): $I(X; Y) = -\sum_{j=1}^{M} p(y_j)\log_2 p(y_j) + \sum_{i=1}^{L}\sum_{j=1}^{M} p(x_i)\,p(y_j/x_i)\log_2 p(y_j/x_i)$

Because the transition probabilities $p(y_j/x_i)$ are constants, $I(X; Y)$ is a function of p (xi) alone. On the one hand, the first term in equation (3b), $H(Y)$ , is strictly concave in p (yj)—so, if it has a maximum, it is unique. Moreover, $p(y_j) = \Sigma_{X} p(x_i)\,p(y_j/x_i)$ is linear in p (xi), that is, $H(Y)$ is also concave in p (xi). On the other hand, the second term, $H(Y/X)$ , is clearly linear in p (xi). Accordingly, $I(X; Y)$ is concave in p (xi) and thus attains a maximum over the probability simplex (Cover and Thomas 2006, Rioul 2018). Therefore, the task of evaluating the channel capacity C amounts to finding a probability distribution $p(X)$ that maximizes the mutual information (equation (4)) (Shannon 1948, Fano 1961, Ash 1965).

Equation (4): $C = \max_{p(X)} I(X; Y)$
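A minimal numerical sketch of equations (3a)–(3b), in Python with NumPy (helper names are ours): the function below evaluates $I(X;Y)$ in bits for a given input distribution and transition matrix.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability terms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(px, P):
    """I(X;Y) = H(Y) - H(Y/X) for input distribution px and transition
    matrix P, with P[i, j] = p(y_j / x_i) as in equation (2)."""
    px = np.asarray(px, dtype=float)
    P = np.asarray(P, dtype=float)
    py = px @ P  # p(y_j) = sum_i p(x_i) p(y_j / x_i)
    h_cond = np.sum(px * np.array([entropy(row) for row in P]))
    return entropy(py) - h_cond
```

The capacity of equation (4) is then the maximum of this function over the probability simplex, which in general must be found numerically.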

This computation is straightforward when the transition matrix exhibits special properties. For instance, a channel is doubly symmetric if each row of ${\bf P}$ has the same set of probabilities $p_1, p_2, \ldots, p_M$ and each column has the same set of probabilities $q_1, q_2, \ldots, q_L$ . In this case, $H(Y/X)$ does not depend on p (xi), but only on $p_1, p_2, \ldots, p_M$ . In turn, $H(Y) \leqslant \log M$ , with equality holding if and only if $p(Y)$ is uniform. But this is precisely the case when $p(X)$ is uniform. Thus, the capacity of doubly symmetric channels reaches a closed-form expression (equation (5)) (Ash 1965).

Equation (5): $C = \log_2 M + \sum_{j=1}^{M} p_j \log_2 p_j$
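Equation (5) is simple enough to sketch directly (illustrative Python; the function name is ours), taking any one row of the doubly symmetric transition matrix as input:

```python
import numpy as np

def doubly_symmetric_capacity(row):
    """Equation (5): C = log2(M) + sum_j p_j log2(p_j), valid when every
    row (and column) of P holds the same set of probabilities, so that
    the uniform input distribution achieves the capacity."""
    row = np.asarray(row, dtype=float)
    M = row.size
    nz = row[row > 0]
    return np.log2(M) + np.sum(nz * np.log2(nz))
```

For a row of the form $(p, (1-p)/(M-1), \ldots, (1-p)/(M-1))$ this reduces to the C1 of equation (7) below, and for an identity row it reduces to the C0 of equation (6).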

In the BCI community, two doubly symmetric channels, both considering L  =  M, have already been suggested to calculate the ITR (Farwell and Donchin 1988, Wolpaw et al 1998). The first one is when ${\bf P}$ is the identity matrix, which represents the assumption that no error occurs during the experiment. That is, the capacity formula becomes dependent only on the number of symbols (equation (6)).

Equation (6): $C_0 = \log_2 M$

The second one is when ${\bf P}$ contains the following properties: $p(y_j/x_i) = p$ if j   =  i, which denotes that the symbol successes of the input/output M-ary alphabet are equal; and $p(y_j/x_i) = (1 - p)/(M - 1)$ if $j \neq i$ , which denotes that the symbol errors are equally distributed. As mentioned before, these assumptions lead to the most used capacity formula (equation (7)) to calculate the ITR of BCIs. It should be noted that, in these formulas, the units of capacity are given in bits per symbol when taking logarithms to the base two (Hartley 1928, Shannon 1948).

Equation (7): $C_1 = \log_2 M + p\log_2 p + (1-p)\log_2\left(\frac{1-p}{M-1}\right)$

Notwithstanding the lack of generality so far, there is a formulation for the problem of finding the capacity of a discrete memoryless channel using the method of Lagrange multipliers $(\lambda)$ (Shannon 1948, Muroga 1953, Fano 1961, Ash 1965). Although this approach does not cover constraints of the form $p(x_i) \geqslant 0$ , its goal is to maximize equation (3b), subject to $\Sigma_X p(x_i) = 1$ , by setting the partial derivative with respect to each p (xi) to zero (equation (8)), in the expectation that the result does not encompass negative values for p (xi) (Ash 1965).

Equation (8): $\frac{\partial}{\partial p(x_i)}\left[I(X; Y) - \lambda\left(\sum\nolimits_{X} p(x_i) - 1\right)\right] = 0, \quad i = 1, 2, \ldots, L$

Equation (8) leads to a set of transcendental equations (equation (9)):

Equation (9): $\sum_{j=1}^{M} p(y_j/x_i)\log_2 \frac{p(y_j/x_i)}{p(y_j)} = C, \quad i = 1, 2, \ldots, L$

that, in the more general case, is only computed by numerical methods, which provide a range for C (Meister and Oettli 1967, Arimoto 1972, Blahut 1972). However, when L  =  M and ${\bf P}$ is nonsingular, the capacity attains a closed-form expression (equation (10))—in which rji is the element in the j th row and ith column of ${\bf P}^{-1}$ and H(Y/X  =  xi) is the entropy of Y given that X  =  xi (Muroga 1953, Fano 1961).

Equation (10): $C_2 = \log_2 \sum_{j=1}^{M} 2^{-\sum_{i=1}^{M} r_{ji} H(Y/X = x_i)}$

But equation (10) is only valid if, for each $k = 1, 2, \ldots, M$ , there is a corresponding dk  >  0 (equation (11)) such that $p(x_k) = 2^{-C_2}d_k$ , which is the distribution that achieves the capacity (Ash 1965).

Equation (11): $d_k = \sum_{j=1}^{M} r_{jk}\, 2^{-\sum_{i=1}^{M} r_{ji} H(Y/X = x_i)}$

If any dk is negative, the solution, when it exists, is not acceptable, because it falls outside the region determined by $p(x_i) \geqslant 0$ , a natural constraint not included in the method of Lagrange. In this case, the largest $I(X;Y)$ occurs at the intersection between $\Sigma_X p(x_i) = 1$ and one of the p (xk)  =  0, which indicates that at least one of the input/output symbols must be removed to reevaluate the capacity (Fano 1961). Conveniently, this strategy of eliminating symbols, whenever any dk is negative, can be adapted to perform class selection in a BCI experiment. Although there is no direct way to decide which classes of a set must belong to a supposedly optimal subset, a reasonable and efficient choice is a progressive wrapper as the search algorithm, including one class at a time via forward selection (Kohavi and John 1997). In general, this method incorporates knowledge from an evaluation function, such as a classifier output, without storing its specific structure and coefficients (Lal et al 2006). Similarly, dk and C2 can jointly play this role of evaluator.
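Under the stated assumptions (L = M and nonsingular P), equations (10) and (11) translate directly into code. The sketch below (ours, in Python) returns C2 together with the indices dk whose signs decide whether the solution is admissible:

```python
import numpy as np

def muroga_capacity(P):
    """Capacity C2 of equation (10) for a square, nonsingular transition
    matrix P (rows indexed by inputs), plus the indices d_k of
    equation (11). The result is only admissible when every d_k > 0,
    in which case p(x_k) = 2**(-C2) * d_k achieves the capacity."""
    P = np.asarray(P, dtype=float)
    # h_i = H(Y / X = x_i): entropy of each row of P
    h = np.array([-np.sum(r[r > 0] * np.log2(r[r > 0])) for r in P])
    R = np.linalg.inv(P)        # elements r_ji of P^{-1}
    e = 2.0 ** (-(R @ h))       # e_j = 2^(-sum_i r_ji h_i)
    C2 = np.log2(e.sum())       # equation (10)
    d = e @ R                   # d_k = sum_j r_jk e_j, equation (11)
    return C2, d
```

When some dk turns out negative, the associated class is a candidate for removal before the capacity is reevaluated, exactly as described above.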

As a last consideration regarding what has been introduced so far, we can conveniently establish a conceptual grounding for the given capacities in the context of BCIs. In this sense, C0 represents an upper bound related to the ideal lossless experiment; C1 a lower bound considering that symbol errors are equally distributed, usually the worst case because of the maximum error entropy; and C2 an intermediate value providing a better estimate. Clearly, such an interpretation can be readily extended to a general formula for the ITR (equation (12)):

Equation (12): $\mathrm{ITR}_n = \frac{60}{T}\, C_n, \quad n \in \{0, 1, 2\}$

3. Experimental design

An open dataset with 40 classes is the underlying scenario used to analyze the usability and relevance of the aforementioned channel capacities. It consists of a steady-state visually evoked potential (SSVEP)-based BCI experiment performed with 35 healthy volunteers (Wang et al 2017) which, compared to other BCI databases, has the interesting feature of allowing a large number of classes to be varied and tested—the main reason it was chosen.

The classes of the dataset consist of 40 visual stimuli ranging from 8 to 15.8 Hz with a step of 0.2 Hz. For each stimulus, six sessions of 5 s were registered with electroencephalography (EEG) of 64 electrodes, according to the extended 10–20 system. During acquisition, the data were sampled at a rate of 1 kHz and filtered with a bandwidth of 0.15–200 Hz and a notch at 50 Hz—while the skin-contact impedances did not exceed 10 k$\Omega$ . Subsequently, the data were downsampled to 250 Hz. The original work presents further details (Wang et al 2017).

In this study, the only additional preprocessing is the concatenation of the six sessions into a single data block and the subsequent windowing into segments of 4 s with an overlap of 3 s, to simulate an online BCI with response delay.
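As a quick check of this segmentation (illustrative Python, with our own variable names): six 5 s sessions give 30 s of signal per class, and 4 s windows with a 1 s step yield 27 windows.

```python
# Per class: six 5 s sessions concatenated into one 30 s block, then
# segmented into 4 s windows with 3 s overlap (i.e. a 1 s step).
fs = 250                  # Hz, sampling rate after downsampling
total_s = 6 * 5           # seconds of signal per class
win_s, step_s = 4, 1      # window length and step, in seconds
n_windows = (total_s - win_s) // step_s + 1   # 27 windows per class
samples_per_window = win_s * fs               # 1000 samples each
```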

Given the nature of the dataset, canonical correlation analysis (CCA) is a convenient choice to identify the SSVEP classes (Lin et al 2006, Chen et al 2015). It is a multivariate technique employed to infer a joint statistical measure of two sets described by multi-dimensional random variables, $\boldsymbol{X}$ and $\boldsymbol{Y}$ . To do so—with ${\bf R}_{\boldsymbol{XX}}$ , ${\bf R}_{\boldsymbol{YY}}$ and ${\bf R}_{\boldsymbol{XY}}$ denoting the expected values of $\boldsymbol{XX}^T$ , $\boldsymbol{YY}^T$ and $\boldsymbol{XY}^T$ respectively—CCA seeks the coefficient vectors $\mathbf w$ and $\mathbf v$ that maximize the correlation $\rho$ between the projections ${\bf w}^T\boldsymbol{X}$ and ${\bf v}^T\boldsymbol{Y}$ (equation (13)) (Hotelling 1936, Härdle and Simar 2007).

Equation (13): $\rho = \max_{{\bf w}, {\bf v}} \frac{{\bf w}^T {\bf R}_{\boldsymbol{XY}} {\bf v}}{\sqrt{({\bf w}^T {\bf R}_{\boldsymbol{XX}} {\bf w})\,({\bf v}^T {\bf R}_{\boldsymbol{YY}} {\bf v})}}$

In order to detect SSVEPs from M different stimuli, $\boldsymbol X$ refers to each windowed block of EEG channel data and $\boldsymbol Y = \boldsymbol Y_{m}$ refers to a series of sine and cosine waves at the harmonics of the mth ($m = 1, 2, \ldots, M$ ) stimulation frequency fm, composing a set of reference signals (equation (14)):

Equation (14): $\boldsymbol{Y}_m = \begin{bmatrix} \sin(2\pi f_m n/F) \\ \cos(2\pi f_m n/F) \\ \vdots \\ \sin(2\pi Q f_m n/F) \\ \cos(2\pi Q f_m n/F) \end{bmatrix}, \quad n = 1, 2, \ldots, N$

in which Q is the number of harmonics, N is the number of sampling points and F is the sampling rate (Lin et al 2006, Zhang et al 2014).

The solution to this problem leads to canonical vectors, which map the space of EEG measurements onto a new space of canonical variables related to the stimulus frequencies, and also yields canonical correlations, which rank those variables. Therefore, for each possible evocation, the largest canonical correlation $\rho _{m}$ is the feature of interest, which feeds a maximum-value decision-making process (equation (15)) to indicate the corresponding SSVEP (Lin et al 2006, Zhang et al 2014).

Equation (15): $m^{*} = \underset{m}{\arg\max}\; \rho_m, \quad m = 1, 2, \ldots, M$
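Equations (13)–(15) can be prototyped in a few lines. The sketch below (ours, in Python with NumPy, whereas the study itself used MATLAB) builds the reference matrix of equation (14), computes the largest canonical correlation through a whitened SVD, and applies the maximum rule of equation (15):

```python
import numpy as np

def reference_signals(f, Q, N, F):
    """Equation (14): sines and cosines at the Q harmonics of the
    stimulation frequency f (Hz), N samples at F Hz; shape (2Q, N)."""
    t = np.arange(1, N + 1) / F
    return np.vstack([fn(2 * np.pi * q * f * t)
                      for q in range(1, Q + 1)
                      for fn in (np.sin, np.cos)])

def max_canonical_corr(X, Y, reg=1e-9):
    """Largest canonical correlation of equation (13) between data sets
    X (p x N) and Y (q x N), via a whitened SVD."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    Rxx, Ryy, Rxy = X @ X.T, Y @ Y.T, X @ Y.T

    def inv_sqrt(R):
        # inverse matrix square root, regularized for stability
        w, V = np.linalg.eigh(R)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, reg))) @ V.T

    K = inv_sqrt(Rxx) @ Rxy @ inv_sqrt(Ryy)
    return np.linalg.svd(K, compute_uv=False)[0]

def detect_ssvep(X, freqs, Q, F):
    """Equation (15): index of the stimulus with the largest rho_m."""
    N = X.shape[1]
    rhos = [max_canonical_corr(X, reference_signals(f, Q, N, F))
            for f in freqs]
    return int(np.argmax(rhos)), rhos
```

For a noiseless 10 Hz sinusoid and candidate stimuli at 8, 10 and 12 Hz, the rule of equation (15) picks 10 Hz with a canonical correlation close to one.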

With the offline execution of this classification procedure and the subsequent composition of the error matrix, the indices dk and the capacity C2 are matched with a wrapper algorithm and can be used to distinguish suitable classes for a posterior online BCI experiment. That is, in the sense of communication systems, to perform a channel assessment before actually transmitting new symbols.

For pattern recognition, the wrapper algorithm normally accomplishes feature selection using the performance of some discriminant structure (Kohavi and John 1997, Lal et al 2006, Carvalho et al 2015). In this work, the concept is similar, but in terms of class selection using the channel capacity. Its implementation requires three aspects: the search strategy, the stop criterion and the evaluation function.

The first one consists of an efficient search in the space formed by all possible class combinations with at least two classes. Due to the large number of subspaces, the greedy heuristic based on forward selection, compared to recursive elimination (another common strategy), accelerates the search process despite not ensuring global convergence. The second one indicates to what extent that space must be inspected. Even though full selection is time-consuming, the complete search is essential in this case to study the behavior of the algorithm. Finally, during iterations, the third one assesses the outcomes of each class combination, provides the basis for the maximum-value decisions and then influences the final subspace. As already noted, dk and C2 perform this role, through the steps summarized in table 1.

Table 1. Steps of a wrapper algorithm with dk and C2.

  Forward selection from a set with M classes
1: Start a subset S  =  {1} and a subset $T = \{2, 3, \ldots, M\}$
2: Repeat
3:   Combine S with each element of T
4: For each combination
5:     Perform decision-making
6:     Compose the error matrix
7:     Compute dk
8:     If all dk  >  0 compute C2 Else make C2  =  −1
9:     Store C2
10: End
11:   Identify the combination with the highest C2
12:   Extract the element of T associated with this combination
13:   Transfer this element to S
14:   Store the highest C2
15: Until $T = \emptyset$
16: Identify the subset with the highest among the highest C2's
17: Choose this subset as the one with optimal classes
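The steps of table 1 can be sketched as follows (illustrative Python, with 0-based class indices instead of the table's 1-based ones). In the actual experiment the error matrix of step 6 is recomputed by the CCA decision-making at every iteration; here, as our own simplifying device, `sub_channel` restricts and row-normalizes a fixed matrix of confusion counts:

```python
import numpy as np

def c2_and_d(P):
    """Muroga closed form: C2 (equation (10)) and d_k (equation (11))."""
    h = np.array([-np.sum(r[r > 0] * np.log2(r[r > 0])) for r in P])
    R = np.linalg.inv(P)
    e = 2.0 ** (-(R @ h))
    return np.log2(e.sum()), e @ R

def sub_channel(E, classes):
    """Restrict a full confusion matrix E (counts) to `classes` and
    row-normalize it into a transition matrix."""
    P = E[np.ix_(classes, classes)].astype(float)
    return P / P.sum(axis=1, keepdims=True)

def wrapper_select(E):
    """Forward selection of table 1: grow the subset S one class at a
    time, scoring each candidate combination by C2, or by -1 when some
    d_k <= 0 or the restricted matrix is singular (steps 4-10)."""
    M = E.shape[0]
    S, T = [0], list(range(1, M))
    best_caps = []
    while T:                                   # steps 2-15
        scores = []
        for c in T:                            # steps 3-10
            try:
                C2, d = c2_and_d(sub_channel(E, S + [c]))
                scores.append(C2 if np.all(d > 0) else -1.0)
            except np.linalg.LinAlgError:
                scores.append(-1.0)
        k = int(np.argmax(scores))             # steps 11-14
        S.append(T.pop(k))
        best_caps.append(scores[k])
    n = int(np.argmax(best_caps))              # steps 16-17
    return S[:n + 2], best_caps[n]
```

On a toy confusion matrix where three classes are perfectly recognized and a fourth collides with the first, the selection keeps the three distinguishable classes and reaches log2(3) bits per symbol.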

Beyond the proposed class selection, the data are also examined in the original order of the class labels provided with the dataset package—only to establish a benchmark, since this order has no special meaning. So, starting from the first two classes, the remaining ones are incorporated one by one into an inspection set to compute the error matrix and the respective dk and C2.

Finally, it should be emphasized that all processing was performed using MATLAB® 2015 on a workstation configured with two Intel® Xeon® E5-2670 v2 processors—10 cores and 20 threads each—and 128 GB of RAM.

4. Results and discussion

Considering the preliminary inspection in the original order of the class labels, most subjects, after a specific inclusion, start presenting negative values for dk and, consequently, inconsistent values for C2. Indeed, as displayed in figure 1, all individuals have the first and second classes jointly suitable, but from the third onward at least one person no longer keeps the entire set. At the end, only 13 subjects retain all 40 classes.


Figure 1. Reduction in the number of subjects with full performance as the classes are incorporated into an inspection set and evaluated by dk and C2.


For these 13 best subjects, the curves of average capacity computed with C0, C1 and C2 show almost no difference (figure 2), since the average accuracy throughout the class inclusions is always above 97%—an outstanding performance. This initial observation confirms C0 as a reasonable approximation for the capacity in a BCI experiment when the volunteers maintain excellent hit rates and, alongside its simplicity, justifies its use in early works (Farwell and Donchin 1988). In such circumstances, C1 and C2 do not reveal additional insights.


Figure 2. Average accuracy and capacities C0, C1 and C2 for the 13 best subjects, throughout class inclusions one at a time.


For the other subjects, however, the curves of capacity are quite diverse, as shown in figure 3 for six representative cases—including examples of low, reasonable and near-optimal performance according to the first inspection with dk and C2. In these other scenarios, C0 is the ideal/faultless limit, because all systems in a certain sense are built to perform with the lowest possible error. But C1 and C2 are more significant metrics, inasmuch as the errors of a practical system are in general not negligible.


Figure 3. Accuracy and capacities C0, C1 and C2 for six representative subjects, throughout class inclusions one at a time. In all graphs, capacity is expressed in bits per symbol.


A first remark is the saturation tendency of C1 as the number of classes increases and the average accuracy decreases, as depicted in the charts of subjects 1, 22 and 28. Although some curves exhibit slightly upward or downward slopes (subjects 2 and 24) and others exhibit a late plateau (subject 20), after a particular inclusion, C1 seems not to exceed a certain threshold.

A second remark is the stop tendency of C2 close to or exactly on that threshold, denoting a maximum number of classes (7 for subject 28, 12 for subject 2, 11 for subject 24 and so on) beyond which, apparently, the capacity would not increase substantially, a fact confirmed by C1. Beyond this number, as each new class inclusion makes at least one dk negative, as already commented, there is no probability distribution that reaches the capacity.

In spite of the fact that these preliminary results introduce important aspects about the use of C0, C1 and C2, only a channel assessment can indicate the real capacity of a BCI experiment. In this sense, employing the wrapper algorithm with dk and C2 as a joint evaluation function to select an optimal subset of classes, the progression and final shape of the capacity denote a much more auspicious operating point (figure 4).


Figure 4. Accuracy and capacities C0, C1 and C2 for the same six representative subjects, but employing the wrapper algorithm with dk and C2 as a joint evaluation function to perform class selection. In all graphs, capacity is expressed in bits per symbol.


In this new scenario, a first remark is the practical overlap between C0, C1 and C2 for at least the first twenty runs of the wrapper algorithm. Because the class selection tends to form optimal subsets, the hit rates are virtually one hundred percent in those iterations and thereby the capacity values are very close or equal. Right after this overlapping trend, until C2 ceases, the curves smoothly detach from each other. From then on, since C2 draws on all the information in the confusion matrix, it provides a better estimate than C1 for the channel capacity in the vicinity of the maximum.

A second remark is that, beyond the stop point of C2, C1 starts asymptotically decreasing to the threshold found in the previous analysis, that is, the same final result as when using the entire set of classes. This falling behavior, although not explored in this study, indicates the stop point of C2 as a reasonable stop criterion for the wrapper algorithm and, besides, suggests avoiding, as a first approach, the use of all classes, especially with subjects whose performance is far from ideal.

When comparing figures 3 and 4, one last decisive remark is the great improvement in the number of suitable classes and hence in the value of the reached capacity, notably for volunteers with low and medium performance. Subject 28, for example, goes from 7 to 28 classes while her/his capacity goes from 2.56 to 4.50 bits per symbol. Furthermore, considering the 35 subjects, before and after class selection (table 2), the average C1 goes from 4.00 $\pm$ 1.39 to 4.62 $\pm$ 0.82 bits per symbol and the average C2 goes from 3.71 $\pm$ 1.68 to 4.79 $\pm$ 0.70 bits per symbol, both with p-value < 0.01. Curiously, such improvement does not necessarily mean, in all cases, an increase in the number of suitable classes. Instead, subjects 13, 14 and 20 achieve the capacity after a slight reduction in that number—occasionally, an economy of symbols seems to be the best option. Essentially, this evidence supports the need to accomplish channel assessment before proceeding to online operation.

Table 2. Performance of all subjects (S.) before and after class (Cl.) selection. C1 and C2 are expressed in bits per symbol.

        Before              After
S.      C1    C2    Cl.     C1    C2    Cl.
1       3.30  3.57  15      4.42  4.65  30
2       3.66  2.79  12      4.43  4.60  29
3       5.27  5.28  39      5.27  5.28  39
4       5.08  4.95  33      5.16  5.22  39
5       5.32  5.32  40      5.32  5.32  40
6       5.27  5.30  40      5.27  5.30  40
7       3.32  1.70  4       4.59  4.68  28
8       4.26  4.45  32      4.74  4.98  37
9       3.03  2.05  5       3.83  4.27  27
10      5.06  5.21  40      5.06  5.21  40
11      2.92  2.55  9       4.66  4.80  31
12      5.28  5.30  40      5.28  5.31  40
13      4.80  5.09  40      4.93  5.10  39
14      4.85  5.13  40      5.07  5.17  38
15      5.21  5.27  40      5.21  5.28  40
16      1.48  1.53  4       3.39  3.90  23
17      4.43  3.16  11      4.77  4.95  34
18      2.51  0.74  2       3.81  4.21  27
19      2.91  1.58  4       3.63  3.99  23
20      4.85  5.07  38      4.98  5.10  37
21      1.32  1.42  3       4.50  4.66  28
22      3.87  4.13  23      4.64  4.81  31
23      4.37  1.85  5       4.64  4.87  35
24      2.50  2.90  11      4.26  4.61  31
25      5.27  5.29  40      5.27  5.29  40
26      5.25  5.30  40      5.25  5.30  40
27      3.51  2.02  5       4.28  4.59  30
28      3.09  2.56  7       4.23  4.50  28
29      1.47  1.57  4       3.33  3.71  19
30      4.94  4.94  32      5.05  5.15  38
31      5.27  5.30  40      5.27  5.30  40
32      5.23  5.29  40      5.25  5.30  40
33      0.52  0.43  3       1.30  1.68  8
34      5.31  5.31  40      5.31  5.32  40
35      5.32  5.32  40      5.32  5.32  40
Average    4.00  3.71  23.5    4.62  4.79  33.4
Deviation  1.39  1.68  16.3    0.82  0.70  7.6

The latter discussion also gives us the possibility of addressing a very important issue, that of BCI-illiterate users. Broadly speaking, these are people with non-existent or very low performance when participating as volunteers in a BCI experiment (Allison and Neuper 2010, Allison et al 2010). However, applying the proposed class selection to a subject with these characteristics in the benchmark test (figure 5, left) can greatly improve her/his capacity curves (figure 5, right).


Figure 5. Improvement of a supposedly BCI-illiterate subject. In both graphs, capacity is expressed in bits per symbol.


Subject 16, with 4 classes and a capacity of 1.53 bits per symbol, is, in principle, unable to properly use an SSVEP-based BCI (figure 5, left); after the proposed class selection, however, she/he reaches 23 classes and a capacity of 3.90 bits per symbol. This means that a person cannot be labeled BCI-illiterate without first passing through a procedure to find out which symbols/classes are most appropriate for her/his full involvement in the experiment. Evidently, subject 16 does not show excellent performance, but does reach a much better operating point.

Even though it affords an improved estimate of the capacity of a BCI experiment, and consequently of the ITR, the proposed algorithm has a reasonably long runtime, as figure 6 reveals. The average elapsed time to compute the capacity attainable with 40 SSVEP classes is about 12 hours, using the hardware and software previously described. This excessive time opens up a range of perspectives for optimizing the procedure, in order to search for the best classes fast enough for online execution.


Figure 6. Average elapsed time of the wrapper algorithm with dk and C2 as a joint evaluation function to perform class selection.


5. Conclusion

Although the ITR has traditionally been based on C1, in this work we show that C2 provides a more general closed-form expression for the capacity of a discrete memoryless channel; that it has a solid interpretation, associating the suitable classes of a BCI experiment with the existence of a probability distribution for the symbols to be transmitted through the channel; that it supports search algorithms to perform channel assessment; and that it yields a better estimate in the vicinity of the maximum. For these reasons, C2 is a consistent measure of the effectiveness of a BCI experiment and can equally be used to compute the ITR, even though the final formula (equation (16)) is not as simple as the previous one (equation (1)).

Equation (16): $\mathrm{ITR}_2 = \frac{60}{T}\,\log_2 \sum_{j=1}^{M} 2^{-\sum_{i=1}^{M} r_{ji} H(Y/X = x_i)}$

Furthermore, from a broader perspective, C2 represents a performance measure for multi-objective problems with the same number of input and output variables. After all, C2 returns a single value for a general process, provided the confusion matrix is known. In BCI problems, ultimately, such a measure, by transferring the concept of evaluation to a common framework, can be used to compare the realizations of different subjects exposed to the same system, as worked out in this study, or of different subjects exposed to distinct systems, which is left as a future proposal.

Acknowledgments

The authors thank CAPES (Finance Code 001), CNPq (305621/2015-7, 305616/2016-1, 310582/2018-0), FAPESP (2013/07559-3, 2019/09512-0) and FINEP (01.16.0067.00) for the financial support.
