Assessing impact of channel selection on decoding of motor and cognitive imagery from MEG data

Objective. Magnetoencephalography (MEG) based brain–computer interface (BCI) involves a large number of sensors allowing better spatiotemporal resolution for assessing brain activity patterns. There have been many efforts to develop BCI using MEG with high accuracy, though an increase in the number of channels (NoC) means an increase in computational complexity. However, not all sensors necessarily contribute significantly to an increase in classification accuracy (CA) and specifically in the case of MEG-based BCI no channel selection methodology has been performed. Therefore, this study investigates the effect of channel selection on the performance of MEG-based BCI. Approach. MEG data were recorded for two sessions from 15 healthy participants performing motor imagery, cognitive imagery and a mixed imagery task pair using a unique paradigm. Performance of four state-of-the-art channel selection methods (i.e. Class-Correlation, ReliefF, Random Forest, and Infinite Latent Feature Selection were applied across six binary tasks in three different frequency bands) were evaluated in this study on two state-of-the-art features, i.e. bandpower and common spatial pattern (CSP). Main results. All four methods provided a statistically significant increase in CA compared to a baseline method using all gradiometer sensors, i.e. 204 channels with band-power features from alpha (8–12 Hz), beta (13–30 Hz), or broadband (α + β) (8–30 Hz). It is also observed that the alpha frequency band performed better than the beta and broadband frequency bands. The performance of the beta band gave the lowest CA compared with the other two bands. Channel selection improved accuracy irrespective of feature types. Moreover, all the methods reduced the NoC significantly, from 204 to a range of 1–25, using bandpower as a feature and from 15 to 105 for CSP. The optimal channel number also varied not only in each session but also for each participant. Reducing the NoC will help to decrease the computational cost and maintain numerical stability in cases of low trial numbers. Significance. The study showed significant improvement in performance of MEG-BCI with channel selection irrespective of feature type and hence can be successfully applied for BCI applications.


Introduction
Motor disabilities and severe neurological injury require an extra measure in rehabilitation for active and effective environmental interaction. Motor imagery (MI) practice through brain-computer interface (BCI) has been found to be useful as a therapeutic substitute for standard rehabilitation practices for post-stroke patients [1,2], and an 3 Author to whom any correspondence should be addressed.
alternative approach for interaction with the environment [3][4][5][6][7] for people with severe movement disability. For post-stroke rehabilitation, the patient is required to voluntarily practice activities of daily living with very high focus [2,[8][9][10]. Advancements in BCI based technologies have shown promising results in terms of focused interaction for stroke patients [11]. Current BCI systems may use magnetoencephalography (MEG), electroencephalography (EEG), functional magnetic resonance imaging or electrocorticography approaches for mapping brain responses [7,[12][13][14][15][16][17]. Although the majority of the available BCI research is focused on EEG, MEG may provide better performance due to its higher signalto-noise ratio (SNR) and spatio-temporal resolution compared to EEG. Moreover, unlike EEG, MEG sensors are placed in a dedicated helmet rather than physically placed on subjects' scalps resulting in significant signal attenuation [18].
One of the key challenges of MEG-based BCI systems is their low accuracy [19][20][21]. Previous studies have focused on either development of novel feature extraction methods or improvement of current classification methods. Signal processing methods such as channel selection have been completely ignored in the case of MEG. It has been observed that signal preprocessing methods can improve the performance of an EEG based BCI system [22]. There is a substantial literature supporting the effectiveness of channel selection methods with EEG-based BCIs and it is thus intuitive to explore their effectiveness with MEGbased systems.
MEG systems typically have a large number of channels (NoC) for very high spatial resolutions, e.g. 306 for Elekta Neuromag Triux system. It is known that classification performance of a BCI system is dependent upon data pre-processing, feature extraction, and the use of an appropriate classifier but by selecting optimum channels, classification performance can be improved further. Provided with a large number of MEG channels, the extracted features can outnumber the trials resulting in an overparameterized classification scenario. Dimensionality reduction can help to select the most important features without affecting performance [18]. It is however observed that the use of a large NoC can affect the performance in a negative way, as those channels that are not contributing to feature separability can make the feature set noisier.
Prasad et al [2] presented promising results with five stroke patients in their neuro-feedback based BCI which facilitated motor recovery with moderate BCI accuracy. Neurorehabilitation, using MEGbased BCI by extraction of relevant information from brain activity, is a challenge. Selecting 204 gradiometers provides a higher SNR as compared to 102 magnetometers, as gradiometers are more sensitive to rate of change in cortical activations nearer to the scalp. Hence, it would be more appropriate in this instance to choose gradiometers instead of magnetometers for an MI-BCI application. This still leaves a huge NoC which will result in higher computation, irrespective of their positive or negative contribution to the performance of a BCI system.
Common spatial pattern (CSP) and its extended algorithms like Sparse CSP, have been used for dimensionality reduction and show that the channels can be selected based on large CSP vector coefficients [23][24][25] whilst maintaining a sufficient level of accuracy. Further, different CSP methods have been implemented in an attempt to increase the accuracy [19]. A channel selection method was presented by He et al [26] based on the Bhattacharya bound CSP for classification of MI-related signals. It considers the Bhattacharyya bound as an index and progressively searches for optimized channel combinations. A 95% classification accuracy (CA) was later claimed with an average of 33 channels, which was a higher accuracy than that of any other channel selection method. Zhang et al [27] used ReliefF (RF) method for EEG sensor selection for emotion classification and reported approx 9% improvement by using 30 EEG channels. Roy et al [18] presented promising results using correlation and ReliefF based channel selection in MEG. A maximum increase of 24.22% was observed for cross-validation performance. For a review of other methods implemented in the field of EEG MI, readers can refer to Alotaiby et al and Lotte et al [28,29].
In this study, our previous work [18] has been extended by implementing four state-of-theart feature selection methods, i.e. class correlation [30], RF [31,32], random forest (RandF) feature ranking [33], and infinite latent feature selection (ILFS) [34] and evaluating their performances using a four-class MEG MI BCI dataset. These methods have been used widely for image and time-series datasets. To the best of authors' knowledge, this is the first attempt to evaluate the effect of the channel selection process on performance of MEG-based MI-BCI system.
The remainder of this paper proceeds as follows: section 2 provides details of the acquired data sets, experimental paradigm, signal processing pipeline, and channel selection methods. Next, section 3 presents the performance analysis. The outcomes of the study are discussed in section 4, and section 5 summarizes the findings of this study.

Experiment and data description
For this study, an MEG dataset of 15 healthy participants was acquired using a typical MEG-based experimental paradigm which can be used for BCI as well. The participants included 12 males and 3 females with a mean age of 29.3 years ±5.96, with 13 right-handed and 2 left-handed as per self-report. The participants signed a written consent form before starting the experiments and the experimental protocol was approved by the ethics committee of Ulster University. MEG data were acquired over two sessions (each session on different days) using the same experimental paradigm. Figure 1 presents the timing diagram of the MEG-based experimental paradigm. Each trial starts with a rest period of 2 s followed by 5 s of imagery task period. The cue remained visible during the imagery task period. A randomly selected inter-trial-interval of 1.5-2 s was presented after the imagery task period. Participants were seated on  a comfortable chair approximately 80 cm away from the projector screen. Elekta Neuromag Triux system was used for recording the MEG data at a sampling rate of 1 kHz.
The experimental paradigm was designed to cover MI tasks and cognitive imagery tasks. The dataset includes four mental imagery tasks: both hand movement, both feet movement, subtraction, and word generation. During the MI-related tasks, participants imagined movement of both hands/both foot when the cue appeared at the screen. Similarly, for cognitive imagery tasks, participants either subtracted two numbers presented as cues or generated words related to an English language letter presented as a cue. Each session consisted of 50 trials for each of the imagery tasks with a total of 200 trials.

Data processing
The MEG dataset was acquired from all 306 sensors (204 gradiometers and 102 magnetometers). However, for this study, the raw data from only 204 gradiometers were used before being band-passed into three frequency bands, i.e. alpha(α) band (8)(9)(10)(11)(12), beta(β) band (13)(14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24)(25)(26)(27)(28)(29)(30) and alpha and beta combined (α + β) band (8-30 Hz) using a two pass Butterworth filter. Selecting 204 gradiometers provides a higher SNR as compared to 102 magnetometers, as gradiometers are more sensitive to changes in cortical activations. Bad channels were discarded due to extraneous noise, interference or artifacts. For feature extraction, data were down-sampled from 1 kHz to 500 Hz. After pre-processing, 3.5 s of the data segment related to imagery activity was selected from each trial (0.5 s after the cue onwards) and signal power was estimated for each MEG gradiometer channel for each task, i.e. hand, feet, math and word generation in each band separately. For the selection of best channels, four ranking based methods were used, i.e. classcorrelation (CC), RF, ILFS and RandF. These methods were considered for the evaluation of the effect of channel selection in MEG in the three frequency bands. Furthermore, a 10-fold cross validation CA was estimated for six binary classification tasks, i.e. hand versus feet (H-F), hand versus word (H-W), hand versus math (H-M), feet versus word (F-W), feet versus math (F-M), and word versus math (W-M) using a linear discriminant analysis (LDA) classifier (figure 2). The performance of each channel selection method was compared with the baseline condition, i.e. 204 gradiometer channels, in all three conditions.

Channel selection
Feature selection, or feature reduction, is a process of selecting a subset of relevant features based on positively contributing criteria. This process aims toward significantly reducing the noise created by relatively less important features and counter the curse of dimensionality. In this study, overall signal bandpower was estimated for each MEG channel separately and the following four methods were applied to find the best set of channels for each binary combination of classes.

CC method
The Pearson correlation coefficient is used to determine the statistical relationship between two random variables A and B. The values of the coefficients range from −1 to 1 representing no relation to direct relation, i.e. it is a measure of the linear dependence of the two random variables [30].
where µ A is the mean and σ A is the standard deviation of A. Similarly, µ B is the mean and σ B is the standard deviation of B. A represents the bandpower feature vector and B represented labels of corresponding class, i.e. 1, −1. The values of the correlation coefficients were calculated by creating a dummy label corresponding to the observations of each of the features. The dummy class label had a numeric value which indicated (L = 1) for class 1 and (L = −1) class 2.
As per the coefficients, the channels were ranked in decreasing order keeping the highest correlated channel at the top and were stored for further evaluation. The computational complexity of CC method is O(n) and running time is 0.10 section

RF method
The RF algorithm is an extension of Relief algorithm proposed by Kira et al [31,32]. The RF algorithm differs from Relief algorithm on the basis of the selection of nearest hits (same class) and misses (other class) and also on the update of attributes [32,35]. The RF method firstly initialises a random instance R i and then searches for K nearest hits (H j ) and k nearest misses(M j (C)). The weights W[A] are updated based on the values of R i , H j and M j (C) as shown in Algorithm 1. Using the RF method feature importance along with weights was found by two inputs, i.e. feature vector and their corresponding labels of 1 or −1 and was stored in an array after sorting according to their weights for further evaluation. The size of the nearest neighbors (i.e. k) was set to 25. The computational complexity of RF method is O(iSnC) and running time is 0.47 sec, S is the number of samples, n is the number of initial features, i is the number of iterations in the case of iterative algorithms, and C is the number of classes.

Infinite latent feature selection (ILFS)
This feature selection method was proposed by Roffo et al [34] where a training set is represented by a feature distribution set P = {P 1 , P 2 , . . . , P m }. Each n × 1 vector P i , is considered as the distribution of values assumed by the ith feature concerning n samples to build an undirected graph G, where nodes correspond to features and edges model relationships among pair of nodes. Edges are represented weights as a ij which is an element of A, 1 ≤ i, j ≤ n, and which model the pairwise relationship between features, assuming that feature x i and x j are good candidates. Weights can be associated with a binary function of graph nodes; where ϕ(., .) is a real valued potential function learned by the proposed approach in a probabilistic latent semantic analysis-inspired framework. Considering a weighted graph G, ILFS represents a subset of features as path connecting them. Similarly, like other methods, a feature vector of 2 classes were fed along with corresponding labels using the method developed by Roffo et al [34,36,37]. The computational complexity of ILFS method is O(n 2.37 + iN + S + C) and running time is 0.26 s. Here S is the number of samples, N is the number of initial features, i is the number of iterations in the case of iterative algorithms, and C is the number of classes.

RandF based ranking
This is a permutation technique presented by Breiman to measure the importance of features in the prediction [33] referred to as an out-of-bag importance score. At each node t in a decision tree, a split is determined by the decrease in node impurity ∆R(t).
The node impurity R(t) is the gini index. If a subset in node t contains samples from c classes, Gini(t) is defined as, whereâ 2 j is the relative frequency of class j in t. After splitting into two child node t 1 and t 2 with sample size N 1 (t) and N 2 (t), the gini index of split data is defined as: The feature providing the smallest Gini split (t) is chosen to split the node. The importance score of feature vector X j in a single decision tree T k is where I is the importance of feature X of class j and t represents node. The same is computed over all k trees in a RandF, defined as A feature vector of 2 classes was used to create an ensemble along with their corresponding labels. The minimum number of observations per tree leaf was set to 5. The time complexity for constructing a complete decision tree is O(a * n log(n)), where n is the number of records and a is the number of attributes. While constructing the RandF, it is required to define the number of trees needed to build (ntree) and number of the attributes wished to sample at each node (mtry). Since only mtry variables will be used at each node the complexity to build one tree would be O(mtry * n log(n)). Hence, for building a RandF with ntree, the complexity would be O(ntree * mtry * n log(n)) and running time is 1.81 s in our case. This is the worst case scenario, i.e. assuming the depth of the tree is going to be O(log(n)). The methods discussed above have been used for ranking of channels for session one and session two data individually. But to calculate the final set of channels, a forward elimination approach is used. The forward elimination approach helps to select a subset of channels that positively contribute toward accuracy from the set of all ranked channels. Let X be a set of 204 MEG channels. After the ranking of channels, the top-ranked channel is selected and added to an empty set X_ranked which is used for calculating CA. If the second-ranked channel positively contributes to the first ranked channel, then it is added to the set Y else the next channel is tested for its accuracy contribution. The process terminates after the addition of a channel that does not contribute positively to CA. At the end of the process, Y has only positively contributing channels. The classifier used in this instance is LDA.

Common spatial patterns (CSP)
CSP is an efficient tool to analyze multichannel data such as EEG/MEG for binary classification and provides a supervised method for decomposition of signals parameterized by a matrix W ∈ R { C×C } (C: NoC). The matrix is used to project the original sensor space E into the surrogate sensor space, using equation (7) Z = WE (7) where, is the MEG measurement of a single-trial and T is the number of samples per channel. W is the CSP projection matrix. The rows of W are the spatial filters and the columns are the CSPs. A small number of spatially filtered signals are used as features for classification purposes. These are generally the first and last m rows of Z, i.e. Z t , where t ∈ { 1, ..., 2m } . In our case, m = 1, which implies that the first and last component were considered. The feature vector is derived from Z t by equation (8) as, After the temporal filtering in mu( [8][9][10][11][12] Hz) and beta( [13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30] Hz) bands and spatial filtering as shown in figure 3, and finally taking the log variance using (8), we obtain the feature vector.
MATLAB 2018b was used for creating scripts for building learning models and evaluating their performance. For statistical analysis, paired-sample t-test was used. The system used Windows 10 with an i7 8th gen processor, and an NVidia RTX2080 Ti.  . The data processing pipeline using CSP. The raw MEG signals are first passed through bandpass filters in mu (8)(9)(10)(11)(12) and beta (13)(14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24)(25)(26)(27)(28)(29)(30). Then these bandpass filtered signals are spatially filtered with their respective CSP weight matrices. After spatial filtering the feature vector is formed and subsequently classified by the LDA classifier. The CSP weight matrices and the LDA classifier was trained using the session 1 data. frequency bands, CC, RF, RandF, and ILFS provided statistically significant improvement (p < 0.05) as compared to the baseline in terms of CA for all the six binary classification tasks. Moreover, for all combinations, the mixed imagery task pairs (H-W, H-M, F-W & F-M) provided higher separability as compared to the H-F and W-M task pairs.

Performance comparison using a band power feature
In reference to figure 4, the RandF method performed better than ILFS with an increased performance of 1.82% on average across all subjects over six binary classification tasks. RandF provided a statistically significant improvement over ILFS in H-F, H-M, W-M, F-W and W-M task pairs (p < 0.05). RandF did not provide a statistically significant improvement over other methods. RF and CC provided a statistically significant improvement over RandF in just two task pairs, i.e. for F-W and F-M respectively. The overall mean CA across subjects using RandF is 81.11% (±6.02), ILFS is 79.30% (±6.51), CC is 81.72% (±6. 25 Figure 5 shows the mean CA obtained using the β frequency band (12)(13)(14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24)(25)(26)(27)(28)(29)(30). This figure shows the same behavior as figure 4, where RandF has performed better than other channel selection methods. However, a point worth highlighting is the drop in accuracy compared to the α frequency band. The overall mean accuracy for session 1 using RandF is 75.65% (±5.59) Figure 6 shows the mean CA obtained using α + β frequency band  for baseline. CA is significantly higher in the α band as well as the α + β band compared to the β band alone. However, after performing channel selection the grand mean accuracy of the α band at 79.30% (±6.51) is higher than in the α + β band at 78.08% (±7.15). Figure 7 shows the CA in the three bands when the classifier was trained on session 1 data and tested on session 2. For this experiment, the channels selected for session 1 using each ranking method, were considered for evaluation in session 2. Since the session 2 is unseen data, the accuracies drop as compared to the cross-validation CAs shown in figures 4, 5 and 6. However, overall the α band (64.24% ±8.34) performed better than the β (57.88% ±7.34) and α + β (63.67% ±7.85) bands. Similarly, in the cross-session evaluation, the H-W group showed higher accuracy than the MI (H-F) group.
The effect of the addition of the top channels can be seen on CA as it starts to fall after a certain NoC (figure 8). The result was calculated on session 1 data for each participant using the MI paradigm (H-F) in the α frequency band. As observed, the CA tends to rise up to a certain NoC, i.e. 15 and then starts decreasing with further addition of channels. A few channels can also be observed to contribute negatively before reaching the peak accuracy. Therefore, by using a forward elimination approach the CA accuracy was calculated and the maximum CA was achieved with just nine channels ( figure 8(b)). Table 1 gives an overview of the total NoC involved for achieving the maximum CA for session 1 (S01) and session 2 (S02) based on the RandF method for the α + β frequency band. The NoC contributing to a maximum CA using forward elimination using all the four ranking methods in the respective bands, ranged from 1 to 23. It is to be noted that the mean of the number of selected channels across the subjects for session 1 and session 2 looks more or less similar, although there were variations in the number of selected channels for individual subjects between session 1 and session 2. Figure 9 shows the classification performance after channel selection using CSPs in all the three bands i.e. α, β, and α + β. Firstly, all the channels were ranked using RandF method with data from session 1. Then the ranked channels were added block by block starting from 15 in intervals of 10 till 125. Then the ordered channel blocks were used to train a classifier on session 1 data and test on session 2 data. This figure contains the global mean of all the binary classification tasks. A steady rise in accuracy can clearly be observed before either saturation or drop. Going by the trendlines shown in figure 9, for the band α + β, the CA hits a plateau at NoC = 75 with 72%, for α the CA hits the plateau at NoC = 85 with 69.3%, and for beta it is NoC = 95 near the plateau where the CA is 65.7%. It is to be noted that the performance of the beta band is much lower than the other two bands. This can be due to the fact that the event related desynchronization (ERD) is more consistent during inter-trial and inter-session transitions in alpha frequency range between 8 and 12 Hz. This was found in our previous study on EEG-EMG correlation where the desynchronization in 8-12 Hz gave more consistent correlation with the EMG activity than for beta band [38]. In the same study, ERD distribution for alpha band showed significantly less inter-trial variability than for beta band from the data across eight healthy participants. This could be a possible reason why alpha band performed better than beta band. Consequently, alpha + beta band also performed better than beta because the addition of alpha band led to a more consistent pattern in the sensori-motor rhythm than for beta band only. Suppression of alpha band power around 10 Hz is also a popular marker for movement planning, execution and imagery than beta band [39]. These findings indicates that the alpha band may have more impact on classification performance in the context of MI. However beta band is more related to the longitudinal changes in motor behavior, as revealed by a study on post stroke rehabilitation [40]. CSP is a spatial filtering technique which needs larger NoC to optimize the CSP projection matrix. The CSP projection matrix transforms the signals in such a way that it maximizes the variance of one class and minimizes the variance of the other class [41,42]. Providing larger NoC to the bandpower features may overfit the data due to high dimensionality as they do not have the ability to maximize the discrimination between the two classes unlike CSP. Hence, a feature vector with a lower dimension works better in the case of bandpower. On the other hand, if we feed the CSP with a larger NoC it can optimize the CSP projection matrix very well which could lead to higher discriminability between the two classes. However, it is also true that the performance of CSP saturates and drops beyond a certain NoC (as evident in figure 9) which may be due to the inclusion of lower ranking MEG channels which contribute very little or no useful information but primarily noise.

Performance evaluation using a CSP feature
As can be seen from figure 9, the performance of the α + β band is clearly higher than the alpha or beta band in the case of CSP features with channel selection, hence we further show the comparison between the CA at optimal channel set and at all channels in figure 10. Hand vs. feet shows a significant (p < 0.05) increase in performance of MI task when the optimal channel set was chosen as compared to selection of all the channels. The mean accuracy of all subjects is 70.67%(±12.69) at optimal channel set, while the mean CA is 55.8%(±13.44) when all the channels are considered. The mean of the optimal channel set length contributing maximum accuracy in hand vs. feet is 64. 33 ( Table 2 describes the maximum CA attained by each subject using the respective NoC in a binary classification task in α + β band using CSP. The channels were selected from session 1 data and tested in the same order on session 2. Figure 11 displays the contributing channels giving maximum accuracy in binary tasks classification in α + β frequency band (8-30 Hz) using RandF method. There are three images for every task, i.e. left hemisphere (LH) view, right hemisphere (RH) view, and top View. For generating figure 11 a fieldtrip brain template was used along with the SPM toolbox. The channel locations on the template are matched to the list of optimal channels obtained from the analysis via MEG channel labels and these channels are shown as markers on the head template. The plot shows the approximate location of channels with respect to the helmet. Figure 11 displays the common contributing channels for subject's binary classification tasks in α + β frequency band. The channels which are contributing positively and are common to a minimum of three subjects are plotted. First, the channels were ranked by RandF feature importance method and then they were selected using the forward elimination approach. The first triplet represents selected channels in the case of hand vs. feet task using the LH view, RH view and top view respectively. The total number of contributing channels are shown in table 1 for all the subjects and for all binary classes. The red dot   represents the common channel in session 1 and the green diamond represents common channels for session 2. The channels can be observed in motor areas for hands and feet. Some shift in channel location can also be observed. The next triplet represents the Top, LH, and RH view of the hand vs. word imagery task, respectively. Channels in motor imagery are selected in frontal, parietal lobe, and occipital lobe. The next triplet presents the channels for hand vs. math tasks. Channels are primarily in motor areas. The next triplet is for the feet vs. word classification task. Again the majority of channels are in the motor areas. The next triplet is for word vs. math imagery tasks where channels are evenly distributed over all areas of the brain. The last triplet represents contributing channels for feet vs. math. Channels in the motor areas as well as the occipital lobe can be observed. It is worth noticing that in all the categories, channels in the occipital region are selected for MI classification. As the experimental paradigm involved presentation   (8b) show the classification accuracy for a subject in a single session for the hand vs feet category calculated from the sequential addition of ranked channels and channels obtained by forward elimination method respectively using RandF method for motor imagery. of a cue (an image) to the participants, the visual cue may result in positive contributions from the occipital region channels due to visual display/ processing.

Discussion
This paper provided an empirical assessment of the impact of channel selection on MEG based BCI performance. MEG-based BCI recordings are conducted using spatially fixed sensor array, and since participants' heads are positioned inside helmet, which is not the exact same size as a participant's head size, it is possible to have slight head movements, which is unintentional. The head movements are not wanted during any MEG experiment (participants were advised to keep their head still during the experiment), but there are unintentional head movements and also different positioning in the different sessions causes change in activated sensors. Unintentional head movements and variable positioning across sessions may thus cause non-stationarity in the acquired MEG data and may result in change of optimal selection of channels across different sessions (for the same participant) and across different participants. Any movement in the initial head positions results in lower accuracy. There is a statistically significant effect of head movement on the dipole reconstruction as reported by Stokes et al [43], which can also result in lower accuracy from session-to-session transfer. Also, the NoC in MEG is much higher compared to EEG. Fewer trials in MEG based BCI demands dimensionality reduction, hence there is a need for  Figure 9. Mean between-sessions accuracy averaged over all classification tasks classification performance with the channels selected using CSP in all the three bands.
the selection of a minimum number of sensors which contributes positively in the classification of different mental tasks. Different participants have different head sizes and it would be difficult to acquire MRI of all subjects, thus a head template provided by fieldtrip has been used for this study. This paper provides an approach to improve CA in real-time with the existing MEG system by accounting for low accuracy due to head movement and head position in the helmet. The performance is evaluated with binary CAs across four different classes. Channel ranking was performed using state-of-the-art feature selection methods, i.e. CC [30], RF [31], RandF [33], and ILFS [34] followed by the selection of a set of best channels based on improvement of CA using a forward elimination method. The impact of channel selection was studied with two feature types, i.e. power in various frequency bands (α, β, and α + β) and CSP. Furthermore, the analysis was performed within-session (10-fold CV) and across-session conditions (training on S01 and testing on S02). The study provided several outcomes. Firstly, all four methods significantly improved the performance as compared to a baseline method wherein all channels were included in both within-session and cross-session conditions. Secondly, the RandF method provided more stability and overall better CA than the rest of the ranking methods, i.e. CC, ILFS, and RF. Thirdly, MEG-based imagery BCI performed better with miximagery classification tasks (i.e. combinations of CI and MI classes e.g. H-W, H-M, and F-M) as compared to motor-imagery tasks (i.e. H-F). Fourthly, the  MEG-BCI system performed better for the withinsession condition as compared to the across-session condition. Lastly, the number of best channels selected varies a lot both across sessions and subjects. Moreover, the list of best channels for a particular binary classification task changes across sessions and the pattern is consistent across all classification comparisons.
Using bandpower as a feature and the RandF method for channel selection, on average the accuracy improved by 15.79% using the α band, 15.77% using β band and 15.67% using α + β band in 10-fold cross validation over the corresponding accuracies obtained with all the channels in the respective bands. There was a marginal increase in CA in the α band compared to α + β band but there was no statistically significant improvement overall. It was also observed that ranking and selecting positively contributing channels helped to improve the accuracy. From the results it is evident that the performance of CSP for between-session accuracy is higher than the bandpower based approach and the improvement is also statistically significant (p <0.05, paired-sample ttest). The power of CSP to minimize the inter-session inconsistencies in feature distribution is evident in previous literature [44,45]. However, the use of CSP in this study is two-fold: first of all using two different features (bandpower and CSP) we showed that the application of a channel selection algorithm is feature agnostic. This means that if we use channel selection on top of any feature extraction technique popular in the BCI domain we should get higher performance than without channel selection; the second objective was to enhance the between-session accuracy to a level sufficient for practical uses such as issuing neurofeedback for rehabilitation purposes. Due to the presence of a large NoC in MEG data acquisition it is hard to avoid overfitting even after using CSP. Therefore, further channels selection was used to reduce the NoC, possibly close to a typical EEG data acquisition setting using the forward elimination procedure.
When the classifier was trained on session 1 data and tested on session 2 data there was a sharp decrease in CA with or without channel selection. However, an overall increase in CA of 4.64% was observed in the α band including all experimental conditions. With regards to binary classification an increase in CA was observed of 1.8% (H-F), 8.73% (H-W), 4.73% (H-M), 6.27% (F-W), 1.13% (F-M) and 5.20% (W-M) in the α band (p < 0.05). There was no statistically significant improvement in H-F and F-M binary classification categories. By examining the top-ranked channels in table 1, it is clear that the numbers of channels are different between session 1 and session 2. But if the mean across all subjects is observed, it will be noticed that there is not a statistically significant difference overall. Taking the same participant, and the same activity in two different sessions, a clear reduction in the channel number is seen. Also if figure 11 is observed, a substantial shift in the selected channels' pattern can be seen in H-F(LH). A similar pattern can be observed in W-M(LH). Unintentional head movements and variable positioning across sessions may cause non-stationarity in the acquired MEG data and may result in change of optimal selection of channels across different sessions (for the same participant) and across different participants. Since, this conclusion is drawn on a low number of trials, so the study provides a foundational base for classification methodology for data with a low trial count.
It is to be noted that positively contributing channels appear over all brain areas (figure 11) but there are some definite changes in selected optimal channels in cases of mixed imagery and MI. The plots show the channels common to three or more subjects which might include variability among participants. This can be one of the plausible causes of channels spread over multiple brain areas. The density of channels is however higher in the functional brain area of specific activity. For example the density of channels in motor task is more for motor area compared to other brain regions. Additionally, it is now widely known the brain connectivity networks can be altered while performing imagery tasks which might contribute toward involvement of channels from other brain regions. Several studies indicated significant changes within the fronto-parietal brain networks during cognitive and motor tasks. Thus, some additional channels may appear due to brain region connectivity changes contributing positively for classification [46,47]. The contribution of frontal and occipital areas during hand vs feet classification is likely to be because the tasks involve imagination of different real life actions which may activate the frontal area while occipital channels may be relevant due to activation of visual processing of different visual cues [48,49]. The contribution of frontotemporal reason for feet-vs-word task classification is likely to be because tasks involving word generation and motor action may activate Broca's areas which is the language center in the human brain [50]. An interesting application for the outcome of this study would be for a real-time BCI. For instance, it may be possible to design a machine learning model that determines the best channels during the initial trials of an experiment, as part of calibration. Over the complete session, this should result in higher CA. It was also observed that channels contributing positively to an MI-based classification remain almost the same for good participants. Also, it can be observed that mixed imagery classification performance is significantly higher than MI; this observation can be used to design an enhanced paradigm.
Using the forward elimination approach, in some cases, there were just two channels contributing positively which would make the application of CSP techniques problematic as CSP can only be used on three or more channels. So, in this case CSP channels were ranked using a RandF method and then a minimum of the first 15 channels were selected to calculate accuracy. Then a block of ten channels was added to see the change in performance. It was noticed that the accuracy improved up to 75-85 channels overall and then plateaued or dropped as seen in figure 9. Now, as per the previous literature, researchers have used only sensori-motor area channels to reduce the total NoC for motor-imagery classification. But even when these channels are selected manually, the number is still in the region of 80. Channels selected in the frontal and occipital lobes can also be observed. Similarly for cognitive tasks, a higher NoC must be used for classification. Thus, reliance upon the classical brain areas may not yield the best results. Using feature importance helps in improving the classification performance with a minimum NoC. As the trend can be observed from figure 8, by addition of channels the CA increases and after a certain point it either falls or saturates. Though channel selection helps to improve accuracy, head movement needs to be accounted for. Channel selection not only reduces the dimensionality of data but also helps in online BCI as the computational cost will be lower. Using channel selection, the NoC was reduced from 204 to a range of 1-25 using bandpower and 15-105 using CSP. This method is also helpful in handling data if the ratio of trials to NoC is low, as in this case, it was 50:204.
From the results it can be concluded that mixed imagery (i.e. H-W, H-M, F-W or F-M) has shown better CA than MI tasks. For making an effective BCI a blend of channel selection and mixed imagery can be used. In order to incorporate channel selection in a real-time MEG BCI implementation, the minimum number of trials needed to perform a numerically stable optimal channel selection was investigated. It was observed that the order of the ranked channels remains almost the same after 15 trials. Thus in practice, an optimal channel selection can be implemented in two ways for a real-time BCI using MEG: (1) a classifier can be trained on session 1 data and selected channels from session 1 can be used for session 2 classification using bandpower and/or CSP; (2) channels can be selected, based on the first few (mixed) trials (e.g. 15) based on bandpower or the first n NoC can be considered for CSP, where n will be decided by observing data from same session. Using the second method, improved performance can be observed in the same session.

Conclusion
This paper analyzed the effect of channel selection in improving classification performance for MEG for the first time. It has been observed, that using positively contributing channels reduces the channel count dramatically while the CA can be significantly improved both in the case of bandpower and CSP features, which was consistent across all the six binary classification tasks. However, CSP requires more channels to perform better than the bandpower feature. There was a statistically significant increase (p < 0.05) in performance in all cases. Additionally, it is also observed that mixed imagery (e.g. H-W) performed significantly better than pure MI (e.g. H-F) and CI tasks (e.g. W-M), which can be helpful in designing a paradigm where performance is the priority.