Decoding study-independent mind-wandering from EEG using convolutional neural networks

Objective. Mind-wandering is a mental phenomenon where the internal thought process disengages from the external environment periodically. In the current study, we trained EEG classifiers using convolutional neural networks (CNNs) to track mind-wandering across studies. Approach. We transformed the input from raw EEG to band-frequency information (power), single-trial ERP (stERP) patterns, and connectivity matrices between channels (based on inter-site phase clustering). We trained CNN models for each input type from each EEG channel as the input model for the meta-learner. To verify the generalizability, we used leave-N-participant-out cross-validations (N = 6) and tested the meta-learner on the data from an independent study for across-study predictions. Main results. The current results show limited generalizability across participants and tasks. Nevertheless, our meta-learner trained with the stERPs performed the best among the state-of-the-art neural networks. The mapping of each input model to the output of the meta-learner indicates the importance of each EEG channel. Significance. Our study makes the first attempt to train study-independent mind-wandering classifiers. The results indicate that this remains challenging. The stacking neural network design we used allows an easy inspection of channel importance and feature maps.


Introduction
Mind-wandering is a thought process that is characterized by being not directly relevant to the primary goals in the current context (Smallwood and Schooler 2015). Mind-wandering tends to manifest itself as attentional lapses, which often contribute to errors in a task (Cheyne et al 2006). However, mind-wandering does not result in errors in all cases. Sometimes people can handle their primary tasks well when they start enjoying these periods of self-distraction as a temporary escape from the current situation (Schooler et al 2011). This positive effect of mind-wandering is particularly common when the current task is associated with low cognitive load; in other words, it is a task in which performance can be achieved with little executive control involved (Randall et al 2019). Low performance is therefore not a fool-proof indicator of mind-wandering. Another behavioral measure that has been proposed to characterize mind-wandering is increased response time variability. Several studies have shown that even when no obvious mistakes are observed, participants display increased variance in their response times when their mind wanders (Bastian and Sackur 2013, Seli et al 2013, Zheng et al 2019, Zanesco et al 2021b). In addition to using behavior, mind-wandering can also be detected by means of physiological and neural measures. For example, researchers found that the pupil diameter in reaction to stimuli becomes smaller in an off-task state (Huijser et al 2018), possibly related to a vigilance decrement that tends to co-occur with mind-wandering (Unsworth and Robison 2016). On the level of the cerebral cortex, mind-wandering appears to be associated with inhibited sensory processing of visual stimuli, referred to as 'perceptual decoupling' (Schooler et al 2011).
This perceptual decoupling manifests itself in the electroencephalogram (EEG) as a reduced P1 and increased alpha power (frequency range 8.5-12 Hz) observed at the parietal-occipital regions (Kam and Handy 2013, Compton et al 2019, Jin et al 2019). In functional magnetic resonance imaging (fMRI) studies, mind-wandering is associated with increased activation of the default mode network (DMN), together with changes in the connectivity between the DMN and other networks (Christoff et al 2009, Ho et al 2019). By indicating a memory retrieval process, the involvement of the DMN supports the functional role of mind-wandering as 'spontaneous future cognition' (Cole and Kvavilashvili 2019), potentially aiding problem-solving and creativity (Schooler et al 2011).
Given this set of neural and physiological correlates of mind-wandering, several studies have explored the possibility of predicting mind-wandering on a single-trial level using machine learning. Mittner and colleagues used multimodal signatures of the (co)activations in the DMN, the anti-correlated network (ACN), and the pupil diameter as the features for a machine learning model to predict mind-wandering (Mittner et al 2014). They found that neural data could reliably predict mind-wandering with a median accuracy of 79.7% using leave-one-participant-out cross-validation (LOPOCV). They also found that mind-wandering was not linked to the DMN, the ACN or pupil diameter alone, but that instead all the features were necessary for optimal predictive performance (Mittner et al 2014). This seems sensible, given that the contents of mind-wandering can vary considerably. In more recent work, they achieved 65% accuracy across participants by training another multimodal machine learning model (Groot et al 2021). The authors attributed the different performance levels achieved across studies to individual biases in self-reports, differences in levels of meta-awareness, and heterogeneity in the thought content being reported (Groot et al 2021). Kawashima and Kumano (2017) found that an EEG-based mind-wandering classifier trained with a non-linear support vector machine (SVM) performed better than one trained with a linear SVM. They also found that training with a selected subset of electrodes from frontal and parietal-occipital regions performed better than using all the electrodes. This suggests that the association between EEG and mental state fluctuations is more complex than a simple linear relationship, and that a subset of electrodes at key positions might contain sufficient information for discriminating mind-wandering.
Jin et al (2019) explored whether it was possible to predict mind-wandering on the basis of various features of EEG data, ranging from power and inter-site phase clustering (ISPC) in the alpha and theta bands, to single-trial event-related potentials (stERPs). They were able to predict mind-wandering with an average accuracy of 64% for a sustained attention to response task, and 69% for a visual search task. Additionally, they performed across-task predictions between the sustained attention to response and visual search tasks, with an accuracy of 59% and 60% (Jin et al 2019). In a related study, the researchers used ICA-decomposed EEG in the alpha band to predict the occurrence of mind-wandering, albeit this time in a model that generalized across participants. They again reported an average accuracy of around 60% when testing the models on a left-out dataset (LOPOCV), for both within- and across-task predictions (Jin et al 2020).
Several conclusions can be drawn from the studies above. The accuracy with EEG seemed to hit a ceiling at around 60% when generalizing across individuals or across tasks. One possible cause is the relatively low signal-to-noise ratio of scalp EEG compared to other neural imaging techniques, such as intracranial EEG or fMRI. However, scalp EEG is still one of the main signals used for Brain-Computer Interfaces (BCIs) with healthy users. Another possible cause of low accuracy is that labels come from thought probes, which depend on the accuracy of each individual's introspection. Other setups, for example facial video recording, eye-tracking measures and pupillometry, can potentially help to validate and revise the subjective thought probes. A third possible cause of low classifier performance is that learning is performed with pre-computed EEG features (e.g. P3 or alpha power). These features are selected based on previous studies, but they do not represent all the temporal-frequency information from EEG sensors across the whole scalp. Also, the studies discussed above trained SVMs to learn the relationship between pre-computed features. An SVM, when used with a nonlinear kernel, is a powerful tool for learning features based on non-linear relationships, but its computational cost is relatively high (O(n^3) in the number of training samples; Abdiansah and Wardoyo 2015). To handle a large quantity of frequency information and/or evoked potentials from all the sensors, we need more powerful machine learning models.
The convolutional neural network (CNN) is a good candidate for addressing this type of problem. A CNN uses a kernel to detect the features of the input signals. With deeper CNN layers, the learned features become more abstract, while at the same time the dimensions of the input decrease. Convolutional layers use far fewer parameters than fully connected (FC) layers, which reduces the computational and storage cost. In practice, multiple CNN layers are designed to detect useful features. Then the FC layers learn the relationship of the features for the final prediction. Hosseini and Guo (2019) trained a CNN architecture to detect mind-wandering in a meditation task. They trained on datasets from two participants and achieved the highest performance of 91.78% during ten-fold cross-validation (CV) within individuals. The authors also tried to predict across individuals by training a classifier on each individual's dataset only and using it to predict the other dataset. The performance dropped to 66% for across-individual predictions. Their results indicate that the generalizability of mind-wandering classifiers is challenging, even with state-of-the-art neural networks.
Nevertheless, Hosseini and Guo (2019) only used the raw EEG signal to train their CNN classifiers, which may have limited their performance. Even though all information is necessarily present in the raw EEG signal, not all classifier architectures can extract all informative features for decoding. For example, it is possible that pre-analyses such as single-trial ERPs and temporal-frequency analyses reveal information that otherwise remains hidden to the classifier. In addition, they had a limited sample size. In the current study, we include band-frequency information and stERPs as input types for the CNN models, as well as the raw EEG. Furthermore, we endeavor to train study-independent mind-wandering classifiers: we trained and validated classifiers with data from Jin et al (2019, referred to as Study A) and tested the classifiers on the data from Jin et al (2020, referred to as Study B). In addition, each study contained two independent tasks, and different groups of participants took part in each of these studies. The design and the probes of both studies can be found in figure 1.
For developing the CNN model, we first transformed the raw EEG into the frequency power spectrum and stERP contour maps. We also included the ISPC, reflecting connectivity between EEG channels. The raw EEG, power and stERP from each channel, and the ISPC from each channel pair, were each used to train an independent CNN classifier. The output of each model represents the activations of one on-task neuron and one mind-wandering neuron using one input type from one spatial point (channel or channel pair). To figure out the relationship between the input types and their spatial sampling points, we trained a meta-learner with the concatenated binary outcomes of each input model (figure 2). This allowed us to evaluate the contribution of each input model by mapping their weights to the final output layers of the meta-learner. To compare the performance of our classifier to other state-of-the-art neural networks, we trained two other CNN models: the aforementioned Hosseini and Guo (2019) model and EEGnet (Lawhern et al 2018). EEGnet excels at learning frequency patterns with only raw EEG as the input. Its architecture also allows feature contributions to be mapped to spatial patterns. EEGnet has been proven effective at solving classic BCI problems such as motor imagery and the P300 speller (Lawhern et al 2018).
Our study makes the first attempt to train study-independent mind-wandering classifiers. With the meta-learner, our network will allow for an easy inspection of channel importance and feature maps.

Datasets
The research was conducted in accord with the Declaration of Helsinki and approved by the Research Ethics Committee of the Faculty of Arts (CETO), University of Groningen. Participants gave written informed consent. They were financially compensated for their participation during the initial data collection stage. They were debriefed with the main goal of the task after the experiment.

Training dataset
The training dataset is derived from Jin et al (2019). Thirty participants (13 females, ages 18-30 years, M = 23.33, SD = 2.81) took part in the study. They performed a visual search task and a sustained attention to response task (SART) for six blocks each in two sessions (figure 1(a)). The main stimuli in the SART were English words that occurred in lowercase 89% of the time and in uppercase 11% of the time. Participants were required to press 'm' whenever they saw a lowercase word and to withhold their response when an uppercase word appeared. In the visual search task, participants were given a target shape to search for at the beginning of each block. They were required to look for the target in each trial of that block and to indicate whether the target was in the search panel (yes/no) by pressing the left or right arrow key corresponding to their response. Target-present and target-absent trials were equally probable. An SART block had 135 trials and a visual search block had 140 trials. Details can be found in the original study (Jin et al 2019).
Participants were interrupted by probe questions asking them about the content of their thinking at that moment (figure 1(a)). They could respond to this question with one of six options: (1) I was entirely concentrated on the ongoing task; (2) I evaluated aspects of the task (e.g. my performance or how long it takes); (3) I thought about personal matters; (4) I was distracted by my surroundings (e.g. noise, temperature, my physical condition); (5) I was daydreaming, thinking of task-unrelated things; (6) I was not paying attention, but my thought was not anywhere specifically. Each task had 54 probes that were interspersed with the trials. Two consecutive probes were separated by 7-24 trials, roughly accounting for 34-144 s. In the analysis, responses 1 and 2 were labeled as an on-task state, responses 3 and 5 were labeled as a mind-wandering state, and the other responses were ignored, following the same classification rule as the original work.

Testing dataset
Performance of the classifier was tested on an entirely different experiment from the one on which the classifier was trained. The testing dataset is derived from Jin et al (2020). Thirty participants (16 females, age 18-31 years, M = 23.73, SD = 3.47) took part in the study (figure 1(b)). In the visual search task, participants either counted the specified target in the following search panel and indicated their response by pressing the number key (counting condition), or passively viewed the search panel and pressed 'j' as a standard response (non-counting condition). In the SART, participants viewed single digits ranging from 1 to 9 drawn from a uniform distribution. They pressed 'j' whenever they saw a digit other than '3'; if '3' appeared (11%), they were to withhold their response. The visual search task had 21 trials in each block and 20 blocks in total. The SART had twelve blocks with a block length that varied from two to seven repetitions of the nine digits (18-63 trials). Probes were shown at the end of each block in both tasks. Block length varied from one to three minutes. Further details can be found in the original paper (Jin et al 2020).
In this dataset, the probes appeared at the end of each block (the blocks were much shorter than in the training dataset). Participants indicated their momentary attentional state on a rating scale from −5 to 5, with anchors '−5' for 'totally mind-wandering', '−2' for 'mind-wandering', '0' for 'uncertain', '2' for 'focused', and '5' for 'highly focused'. Positive ratings were classified as on-task and negative ratings as mind-wandering.
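The labeling rules of the two studies can be summarized in a short Python sketch. The numeric label codes and function names are hypothetical, and the handling of a rating of 0 ('uncertain') in the testing dataset is an assumption, since the text only specifies positive and negative ratings.

```python
from typing import Optional

ON_TASK, MIND_WANDERING = 0, 1  # hypothetical label codes


def label_study_a(response: int) -> Optional[int]:
    """Map a six-option probe response (1-6) of the training dataset to a label."""
    if response in (1, 2):
        return ON_TASK
    if response in (3, 5):
        return MIND_WANDERING
    return None  # responses 4 and 6 are ignored


def label_study_b(rating: int) -> Optional[int]:
    """Map a rating on the -5..5 scale of the testing dataset to a label."""
    if rating > 0:
        return ON_TASK
    if rating < 0:
        return MIND_WANDERING
    return None  # 0 = 'uncertain'; assumed to be excluded
```

Probes that return None would simply contribute no trials to either class.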

EEG preparation

Recording hardware
The training dataset had 128 channels, while the testing dataset had 32 channels. All electrode locations were within the International 10-10 System. In the current analysis, we only considered the 32 channels that overlapped between the training and testing studies (figure 2). The data were recorded using the Biosemi ActiveTwo recording system. The online sampling rate was 512 Hz. The Biosemi hardware does not have any high-pass filtering. An anti-aliasing filtering is performed in the ADC's decimation filter (www.biosemi.com/faq/adjust_filter.htm).

Preprocessing
Data had already been preprocessed in the original studies, and we reused these preprocessed data here. The offline EEG preprocessing was done with the EEGLAB toolbox in MATLAB. Continuous EEG was re-referenced to the average signal of both mastoids. The band-pass filtering was set to be 0.5-40 Hz and 0.1-42 Hz for the training and testing datasets, respectively. Both datasets were down-sampled to 256 Hz. The original segmentation was [−400 1200] ms with respect to stimulus onset for Study A, and [−1000 3000] ms for Study B. In the current study, we took the overlapping [−400 1000] ms as the time window for analysis. The 200 ms before the stimulus onset was used as the baseline. Ocular artifacts were detected and removed using the infomax independent component analysis.

Figure 2. Channel locations (32-channel 10-10 system). EEG from each channel is transformed into power spectrum and single-trial ERP (stERP) contour maps. The 16 channels in red font were also used to compute the inter-site phase clustering (ISPC). Each input type from each spatial location (channel or channel pair) was trained with three convolutional layers (of which neuron numbers are 16, 32 and 32) with a max-pooling layer after each convolutional layer. For raw EEG, we used a 1D CNN with a kernel length of 5 to account for the pattern of every five temporal sampling points. For other types of input, we used a 2D CNN with a kernel size of (5,3) to learn the pattern of 5 temporal sampling points and 3 frequency sampling points. The obtained model with each input type from one spatial location serves as the input model for the meta-learner, which learned the relationship of the binary outcomes between each input model with two fully connected (FC) layers and generated the final prediction.
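The cropping to the common analysis window and the baseline correction described in the preprocessing can be illustrated with a minimal NumPy sketch. The original preprocessing was done in EEGLAB; the function name, argument names and array layout here are hypothetical.

```python
import numpy as np


def crop_and_baseline(epoch, t_start_ms, fs=256.0,
                      window_ms=(-400.0, 1000.0), baseline_ms=(-200.0, 0.0)):
    """Baseline-correct one epoch and crop it to the common analysis window.

    epoch: array of shape (n_channels, n_samples)
    t_start_ms: time of the first sample relative to stimulus onset
    """
    times = t_start_ms + np.arange(epoch.shape[1]) * 1000.0 / fs
    base = (times >= baseline_ms[0]) & (times < baseline_ms[1])
    keep = (times >= window_ms[0]) & (times < window_ms[1])
    # Subtract the mean of the 200 ms pre-stimulus interval per channel.
    corrected = epoch - epoch[:, base].mean(axis=1, keepdims=True)
    return corrected[:, keep], times[keep]
```

Under these assumptions, a Study B epoch starting at −1000 ms is reduced to the same 1400 ms window as a Study A epoch starting at −400 ms.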

Transformation
Apart from the raw EEG data, three kinds of transformations were performed on the original EEG matrix: power and ISPC derived from a time-frequency decomposition using complex Morlet wavelets (Cohen 2014), and stERP analysis (Bostanov 2004, Bostanov and Kotchoubey 2006, Jin et al 2019). These transformations were performed in MATLAB with an in-house script.
A complex Morlet wavelet is created using the equation

$w(t) = e^{i 2\pi f t} \, e^{-t^{2}/(2\sigma^{2})}, \quad \sigma = \frac{n}{2\pi f},$

where f denotes the frequency in Hz, and n refers to the number of wavelet cycles. Convolving the complex wavelet with the original signal gives complex dot products, from which the amplitude can be extracted through the vector length of each complex data point (power is obtained from the squared amplitudes, figure 2(c), and the phase angle is obtained from the angle with respect to the positive real axis). A set of wavelets was created with frequencies ranging from 4 to 40 Hz with 35 frequency sampling points in logarithmic space, covering multiple frequencies in each of the theta/alpha/beta/gamma bands. Delta oscillations (1-3 Hz) were excluded because our time window of 1400 ms was not long enough to estimate delta power with sufficient accuracy. The upper boundary of the analyzed frequency was limited by the band-pass filtering during the preprocessing. The number of cycles used for the wavelet increased from 3 to 7 in a logarithmic spacing on the frequency axis.
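As an illustration of the time-frequency decomposition, the following NumPy sketch convolves a single-channel signal with a family of complex Morlet wavelets in the standard form of Cohen (2014). The original analysis used an in-house MATLAB script; the 2 s wavelet support, the amplitude normalization and the function name are assumptions.

```python
import numpy as np


def morlet_power_phase(signal, fs=256.0, f_min=4.0, f_max=40.0,
                       n_freqs=35, cyc_min=3.0, cyc_max=7.0):
    """Return (freqs, power, phase); power and phase have shape (n_freqs, n_samples)."""
    freqs = np.logspace(np.log10(f_min), np.log10(f_max), n_freqs)
    cycles = np.logspace(np.log10(cyc_min), np.log10(cyc_max), n_freqs)
    half = int(round(fs))
    t = np.arange(-half, half + 1) / fs          # 2 s wavelet support (assumption)
    start = (t.size - 1) // 2
    power = np.empty((n_freqs, signal.size))
    phase = np.empty((n_freqs, signal.size))
    for k, (f, n) in enumerate(zip(freqs, cycles)):
        sigma = n / (2.0 * np.pi * f)            # width of the Gaussian taper
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2.0 * sigma**2))
        wavelet /= np.abs(wavelet).sum()         # amplitude normalization (assumption)
        full = np.convolve(signal, wavelet, mode="full")
        analytic = full[start:start + signal.size]  # align output with the input samples
        power[k] = np.abs(analytic) ** 2
        phase[k] = np.angle(analytic)
    return freqs, power, phase
```

The phase output is what a connectivity measure such as ISPC would take as input, while the power output corresponds to the power maps.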
ISPC was computed as a measure of the connectivity between channels (Cohen 2014):

$\mathrm{ISPC}_{xy}(t) = \left| \frac{1}{N} \sum_{t' \in T(t)} e^{i\left(\Phi_x(t') - \Phi_y(t')\right)} \right|,$

in which Φ_x and Φ_y are the phase angles from electrodes x and y, and T(t) is the moving time window of N sampling points around t. ISPC is computed as the averaged phase angle differences (which were also mapped to the complex plane) in a moving time window. The length of the averaged angle difference denotes the clustering (i.e. the more the phase difference remains constant, the more their length adds up). The window length increased from 3 to 5 cycles linearly with the frequency. As shown in figure 2, there is an empty region of half the wave length at both sides of each frequency with no averaging data. Those regions were set to be zero during machine learning. The ISPC was computed in a 16-channel layout, resulting in 120 channel pairs (channel layout in red font, figure 2), to reduce the number of channel pairs.
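A moving-window ISPC for one channel pair at one frequency can be sketched as follows. The function name is hypothetical, and the window length in samples is left as an argument; in the setup described above it would follow from the number of cycles at each frequency.

```python
import numpy as np


def sliding_ispc(phase_x, phase_y, win):
    """ISPC between two phase time series in a centred moving window of `win` samples."""
    unit = np.exp(1j * (phase_x - phase_y))      # phase differences on the unit circle
    kernel = np.ones(win) / win
    mean_vec = np.convolve(unit, kernel, mode="valid")  # windowed average vector
    ispc = np.zeros(unit.size)                   # edges without a full window stay zero
    half = win // 2
    ispc[half:half + mean_vec.size] = np.abs(mean_vec)
    return ispc
```

A constant phase difference gives an ISPC of 1 wherever a full window is available, whereas randomly varying phase differences pull the value towards 0.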
Both the power and ISPC matrices were downsampled to 50 Hz given the fact that time-frequency analysis smears out the signal over time.
The stERP analysis, which quantifies the bumps in the EEG in terms of temporal location and amplitude, was performed by computing the cross-covariance between the signal f(t) and the kernel ψ(t), shifted by the time lag τ and stretched by the scale s:

$W(s, \tau) = \frac{1}{\sqrt{s}} \int f(t) \, \psi\!\left(\frac{t - \tau}{s}\right) \mathrm{d}t.$

With two varying parameters, time lag τ and scale s (an indication of wavelength), the cross-covariances can be mapped to a contour graph (figure 2(e)). In the current study, the time lag (τ) ranges from 0 ms (stimulus onset) to 1000 ms (end of the EEG epoch) following the same temporal resolution as the raw EEG. The scale (s) ranges from 1 to 2500 ms in logarithmic space with 300 sampling points. The obtained matrix is down-sampled to 65 sampling points in the time lag and 30 points in the scale (frequency dimension).
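The mapping from a single-trial signal to a lag-by-scale matrix can be sketched in NumPy as below. The Mexican-hat form of the kernel is an assumption based on the cited stERP literature rather than something stated in this section, and the function names are hypothetical.

```python
import numpy as np


def mexican_hat(u):
    # Assumed kernel shape: psi(u) = (1 - 16 u^2) exp(-8 u^2)
    return (1.0 - 16.0 * u**2) * np.exp(-8.0 * u**2)


def sterp_map(signal, times_ms, lags_ms, scales_ms):
    """Cross-covariance of the signal with the kernel; shape (n_scales, n_lags)."""
    dt = times_ms[1] - times_ms[0]
    W = np.empty((len(scales_ms), len(lags_ms)))
    for i, s in enumerate(scales_ms):
        for j, tau in enumerate(lags_ms):
            kernel = mexican_hat((times_ms - tau) / s)   # shifted by tau, stretched by s
            W[i, j] = np.sum(signal * kernel) * dt / np.sqrt(s)
    return W
```

For a symmetric positive deflection in the signal, each row of the resulting matrix peaks at the lag of that deflection, which is what makes the contour map readable in terms of ERP components.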
Together, this creates four types of inputs for the CNNs: raw EEG, power, ISPC, and stERP (figure 2).

Neural network

Architecture of the neural network
The CNN input model and the meta-learner structure are shown in figure 2. Its design was based on prior work reviewed by Roy et al (2019), complemented with an extensive parameter exploration. Specifically, for the raw EEG input, we used three layers of a 1D CNN with a kernel size of five temporal points. The neuron numbers are 16, 32 and 32 sequentially (see footnote 5). Each convolutional layer is followed by a max-pooling layer with a pooling size of 2. The output of the third max-pooling layer is flattened and connected to two FC layers with 200 and 50 neurons. The activation function is ReLU for all the hidden layers. The output layer uses a softmax activation function to calculate the activation of one on-task neuron and one mind-wandering neuron. The CNN structure for the power, ISPC or stERP is similar to that of the raw EEG except that a 2D CNN replaced the 1D CNN. The kernel size of the 2D CNN is (3,5) to account for the pattern of three frequency points and five temporal points. Each 2D CNN layer is followed by a max-pooling layer with a pooling size of (2,2). The learning was performed with categorical cross-entropy as the loss function and optimized by Adam at a learning rate of 0.0001. The training batch size is 120, iterated over 200 epochs.
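To illustrate how the feature-map dimensions shrink through the three convolution and pooling stages, the following plain-Python sketch traces the sizes. It assumes unpadded ('valid') convolutions with stride 1, which is the Keras default but is not stated in the text, a raw-EEG input of 358 samples (1400 ms at 256 Hz), and an stERP input of 30 scales by 65 lags.

```python
def conv_pool_shapes(input_shape, kernel, pool, n_layers=3):
    """Trace feature-map sizes through repeated conv ('valid', stride 1) + max-pool."""
    shapes = []
    shape = tuple(input_shape)
    for _ in range(n_layers):
        shape = tuple(d - k + 1 for d, k in zip(shape, kernel))  # convolution
        shape = tuple(d // p for d, p in zip(shape, pool))       # max-pooling
        shapes.append(shape)
    return shapes


raw_shapes = conv_pool_shapes((358,), kernel=(5,), pool=(2,))
sterp_shapes = conv_pool_shapes((30, 65), kernel=(3, 5), pool=(2, 2))
print(raw_shapes)    # [(177,), (86,), (41,)]
print(sterp_shapes)  # [(14, 30), (6, 13), (2, 4)]
```

Under these assumptions, the flattened input to the first FC layer would have 41 × 32 = 1312 values for the raw EEG and 2 × 4 × 32 = 256 values for the stERP maps.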
The meta-learner is trained with the concatenated binary outcomes from each input model. For the raw EEG, power and stERP, we trained 32 input models (using data from each channel). For the ISPC, we trained 120 input models (using phase clustering from each channel pair). Thus, altogether, we had 2 × 32 × 3 + 2 × 120 = 432 outputs from all the input models to form a (1, 432) vector as the input to the meta-learner (Meta_Full). We also trained two other meta-learners using a subset of those input models. One is without the power but with the other three input types (Meta_NoPower), because power was the most overfitting input according to a preliminary analysis (suppl. II). The third meta-learner used only stERP (Meta_stERP), as this input type performed best during validations in the preliminary study (suppl. II).
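The construction of the meta-learner input for a single trial can be sketched as a simple concatenation. The dictionary keys, their ordering and the function name are hypothetical; only the counts (32 channels, 120 channel pairs, two output neurons per input model) come from the text.

```python
import numpy as np

N_CHANNELS, N_PAIRS, N_CLASSES = 32, 120, 2


def build_meta_input(outputs):
    """Concatenate per-model softmax outputs for one trial into a (1, 432) vector.

    outputs: dict mapping input type -> array of shape (n_models, 2)
    """
    order = ("raw", "power", "sterp", "ispc")    # hypothetical ordering
    flat = [np.asarray(outputs[key]).ravel() for key in order]
    return np.concatenate(flat)[np.newaxis, :]


example = {
    "raw": np.full((N_CHANNELS, N_CLASSES), 0.5),
    "power": np.full((N_CHANNELS, N_CLASSES), 0.5),
    "sterp": np.full((N_CHANNELS, N_CLASSES), 0.5),
    "ispc": np.full((N_PAIRS, N_CLASSES), 0.5),
}
meta_input = build_meta_input(example)
```

Dropping an input type from the ordering would give the input of the reduced meta-learners in the same way, for example 64 values when only the stERP models are used.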
The meta-learner used two FC layers consisting of 100 and 20 neurons. The final output consisted of two neurons that provide the prediction of on-task or mind-wandering. The hidden layers of the meta-learner used standard ReLU activation functions and the output layer used softmax. The meta-learner was trained with batches of 200 trials for 50 epochs, with categorical cross-entropy as the loss function, and optimized by Adam at a learning rate of 0.0005.
We set the dropout to be 0.4 to train the input models and 0.2 to train the meta-learner in order to prevent overfitting. The proposed CNN models and meta-learner were implemented on a workstation with an Intel Xeon 2.2 GHz, 512 GB RAM, and a TITAN X, a TITAN Xp, and two GeForce RTX 2080 graphics cards with CUDA V10.1.243 (one or two of the four GPUs, depending on availability) using Python 3.7 and the Keras machine learning library.

Footnote 5. The choice of neuron numbers was based on preliminary results, in which we tested CNN designs with [16,16,8], [16,32,32] and [64,64,32] neurons. The results (Suppl. III) indicate an improvement in ROCAUC during validation when increasing the neurons from [16,16,8] to [16,32,32] (.513 and .519, respectively). However, adding more neurons to make a [64,64,32] design did not increase the validation ROCAUC (.519). Therefore, we chose [16,32,32] as the proposed network design, following Occam's razor.

Validation and testing
A leave-N-participant-out cross-validation (LNPOCV) was used to assess the performance. N was set to 6, corresponding to 20% of the total sample of 30 participants. In that sense, we performed five-fold CV across individuals. In each fold, we set aside 6 individual datasets for validation purposes and trained the classifiers with the other 24 individual datasets. Each individual dataset was used once for validation. The performance was indicated by both the accuracy and the area under the curve (AUC) of the Receiver Operating Characteristic (ROC).
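The participant-wise splitting can be sketched in a few lines of plain Python. The shuffling with a fixed seed and the function name are assumptions; the text does not state how participants were assigned to folds.

```python
import random


def lnpo_folds(participants, n_out=6, seed=0):
    """Split participant IDs into folds of (train_ids, validation_ids)."""
    ids = list(participants)
    random.Random(seed).shuffle(ids)             # assumed random assignment
    folds = []
    for start in range(0, len(ids), n_out):
        val = ids[start:start + n_out]
        train = [p for p in ids if p not in val]
        folds.append((train, val))
    return folds


folds = lnpo_folds(range(1, 31))
```

Because the split is made on participant IDs rather than on trials, no trial of a validation participant can leak into the training set of the same fold.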

Comparison models
We trained a CNN using the architecture by Hosseini and Guo (2019) and an EEGnet for comparison purposes. Both used only raw EEG from multiple channels as input. The validation and testing procedures remained the same.

Network performance
We used trials within 12 s before each probe for both the training and testing samples. This time window results in 3169 on-task (OT) trials and 1976 mind-wandering (MW) trials from Study A, and 1266 OT and 424 MW trials from Study B.
We first examined the best normalization approach for each input type in a preliminary analysis. Given that the classes were not balanced originally, we adjusted the weight of the cross-entropy loss of the MW class while keeping the weight of the loss of OT constant at 1 (by setting the class_weight argument of the Keras Model.fit() method). The best normalization and class weights for each input type can be found in Suppl. I and II. Input types were normalized with the best normalization and trained with the best class weights to be the input models for the meta-learner.
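As a small worked example of the class weighting, the sketch below builds a Keras-style class_weight dictionary with the OT weight fixed at 1. The inverse-frequency ratio is shown only as an assumed reference point; the weights actually selected are those reported in the supplementary material, and the class indices and function name are hypothetical.

```python
def class_weights(n_ot, n_mw, mw_scale=1.0):
    """Return {class_index: weight} with OT (index 0) fixed at 1 and MW (index 1) scaled."""
    balanced = n_ot / n_mw                       # inverse-frequency ratio (reference point)
    return {0: 1.0, 1: balanced * mw_scale}


weights = class_weights(3169, 1976)              # trial counts from Study A
```

With the Study A counts the reference MW weight is roughly 1.6, and candidate values around it could be compared on the validation folds.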
The performance of the meta-learners, as well as the comparison models, is listed in table 1. The models learned well during the training, with an AUC above .75 for all models. However, the classification performance dropped when predicting the left-out datasets. The meta-learner with stERP performed best during validation, achieving 59% accuracy and an AUC of .57, indicating a mild level of generalizability across individuals. The meta-learner that included power as an input (Meta_Full) and the Hosseini2019 model showed strong signs of overfitting. They both correctly identified almost 100% of the training labels. However, their performance on the validation datasets was almost at chance level (ROCAUC .505 and .509 during LNPOCV).
Finally, we tested all the models by predicting the data of Study B. The performance decreased further, as expected. Only two of our meta-learners (Meta_Full and Meta_stERP) predicted above chance level in both accuracy and AUC. The meta-learner with stERP (accuracy .519 and AUC .506) is slightly better than the meta-learner with all input types. Interestingly, Hosseini2019 achieved the best accuracy (.545) during the testing while the ROCAUC was .492. EEGnet2018 had a relatively high ROCAUC (.539), while the accuracy was lower (.457). Our Meta_stERP model achieved balanced performance in both accuracy and ROCAUC.

Learned features
We investigated the importance of the input locations by analyzing the weights of the input model in the meta-learner. Given that Meta_stERP performed the best, we mapped location importance using weights derived from this model. The output of the meta-learner consisted of one neuron responding to the OT state and another neuron responding to the MW state. As each input model also represented one OT neuron and one MW neuron, the mapping gives two weight topoplots for the meta-OT neuron (OT-OT, MW-OT) and two weight topoplots for the meta-MW neuron (OT-MW, MW-MW). To simplify the interpretation, we performed PCA on the two weight topoplots for each meta-output neuron and obtained the largest PC, accounting for 100% of the variance (figure 3(a)). The importance of channel locations in the PC topoplots looks similar between the meta-OT and the meta-MW neurons. Some channels are important in both OT and MW activations (e.g. P4). Other channels contribute more to one of the two classes (e.g. C3 drives the activation of the meta-OT neuron more than that of the meta-MW neuron, whereas O2 drives the activation of the meta-MW neuron more than that of the meta-OT neuron).
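The reduction of two weight topoplots to a single component can be sketched with a PCA computed through the singular value decomposition. This sketch assumes that the two topoplots are treated as two observations over the 32 channels, which the text does not state explicitly; the function name and the random example weights are hypothetical.

```python
import numpy as np


def first_pc(topo_a, topo_b):
    """First principal component of two channel-weight maps and its explained variance."""
    X = np.vstack([topo_a, topo_b])              # shape (2, n_channels)
    Xc = X - X.mean(axis=0, keepdims=True)       # centre across the two maps
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return vt[0], explained[0]


rng = np.random.default_rng(0)
w_ot = rng.normal(size=32)                       # stand-in for one weight topoplot
w_mw = rng.normal(size=32)                       # stand-in for the other topoplot
pc, ratio = first_pc(w_ot, w_mw)
```

Under this assumption the centred data have rank one, so the first component necessarily explains all of the variance and is proportional to the difference between the two maps.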
We chose P4 as an example to plot the feature maps because it is informative in activating both the meta-OT and meta-MW neurons (figure 3(a)). Feature maps are based on the output of each convolutional layer (figure 3(b)). Comparing the feature map of one OT trial and one MW trial, especially the feature map of the third convolutional layer, we found that the features are identified during the whole time series and across the whole scale range of the stERP, indicating that all the ERP components are likely to be predictive. The activation of the middle scale range seems to be highlighted in the MW trial compared to the OT trial. Given that the middle scale range is equivalent to the wavelength range looking for early sensory evoked potentials, this indicates that early sensory processing is likely to feature mind-wandering.

Note (table 1): The model that performs best during the leave-N-participant-out cross-validation (LNPOCV) is indicated in bold.

Discussion
In the current study, we trained meta-learning neural networks with multiple CNN input models to classify mind-wandering. Each input model was trained with one EEG input type from one spatial sampling point to predict mind-wandering separately. Thus, the meta-learner not only learned a prediction based on combining all the input models but also allowed us to map the importance of each channel by examining the weights of the input models to the meta-learner outputs.
The current results demonstrate the difficulty of achieving mind-wandering detection that is generalizable across individuals or studies. The generalization across individuals within the same study is shown during the LNPOCV. While the CNN classifiers and the meta-learners performed well on the training datasets (above .75 for the ROCAUC), their performance on datasets that they had never seen dropped below .6 in both accuracy and AUC. Furthermore, the performance on the testing datasets derived from another study indicated that across-study predictions were hardly achieved.
Nevertheless, the currently proposed meta-learner with stERP as the input achieved the best performance during the validation and testing stages. We examined the channel importance by mapping the weights of each input model to the outcomes of the meta-learner and found a similar channel importance map between the final OT and MW neuron as learnt by the meta-learner. Finally, by looking at the feature maps between classes, we could understand how the CNN learned the patterns from the stERP contour maps.
The largest limitation of the current study is the generalizability of the CNN classifiers, even though we addressed the problem with state-of-the-art neural networks. We attribute the main cause to the heterogeneity of mind-wandering. On the one hand, individuals differ in their mind-wandering thoughts as well as in the patterns of the neural activation associated with mind-wandering (Christoff et al 2016, Wang et al 2018, Zanesco et al 2021a). On the other hand, mind-wandering while performing another task is essentially a dual-tasking process: individuals are likely to keep working on their primary task without the performance being interrupted if the primary task is low-demanding or habitual (van Vugt et al 2015). In that case, 'free' cognitive resources are available to be used by mind-wandering (Taatgen et al 2021), making mind-wandering generation 'hidden' and difficult to discriminate in neural data. How to improve the accuracy of self-reports, or more generally the precision of mind-wandering data collection, is a methodological issue in experimental psychology, which is outside the scope of the current EEG decoding study.
Interestingly, we trained the current neural network with the same datasets used in Hosseini and Guo (2019) and tested the modeling performance during similar across-individual predictions. We achieved 76.0% and 70.4% accuracies for the across-individual prediction, which are higher than the 67.63% and 65.26% as reported in the original study. This indicates that the current neural network is even suitable for learning EEG signals in BCI studies. It seems that the modeling performance varies according to the classification goals: inter-individual generalizability is easier to achieve with a model based on the same task than on multiple tasks.
As indicated, the current architecture can also be used to detect mind-wandering in an online setting (i.e. for BCI). Based on the current results, for such applications we recommend considering individual classifiers instead of inter-individual classifiers to detect task-general mind-wandering, or alternatively to use inter-individual classifiers to study mind-wandering within the same task. Training task-general inter-individual mind-wandering classifiers is at this point too challenging to achieve sufficiently high performance.

Conclusion
The current study indicates that a generalizable classifier to detect study-independent mind-wandering episodes with scalp EEG remains challenging. Nevertheless, we found that the meta-learner with input models trained with stERP contour maps performed the best. We also showed how this work can contribute to explainable artificial intelligence by giving an example of how channel contributions and the learned features can be examined by means of the weights of the input models and the feature maps.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://unishare.nl/index.php/s/T94LXPQqw5FEA4J.