Exploring inter-trial coherence for inner speech classification in EEG-based brain–computer interface

Objective. In recent years, electroencephalogram (EEG)-based brain–computer interfaces (BCIs) applied to inner speech classification have garnered attention for their potential to provide a communication channel for individuals with speech disabilities. However, existing methodologies for this task fall short of acceptable accuracy for real-life implementation. This paper explores the use of inter-trial coherence (ITC) as a feature extraction technique to enhance inner speech classification accuracy in EEG-based BCIs. Approach. To address the objective, this work presents a novel methodology that employs ITC for feature extraction within a complex Morlet time-frequency representation. The study involves a dataset comprising EEG recordings of four different words for ten subjects, with three recording sessions per subject. The extracted features are then classified using k-nearest-neighbors (kNN) and support vector machine (SVM) classifiers. Main results. The average classification accuracy achieved using the proposed methodology is 56.08% for kNN and 59.55% for SVM. These results demonstrate comparable or superior performance relative to previous works. The exploration of inter-trial phase coherence as a feature extraction technique proves promising for enhancing accuracy in inner speech classification within EEG-based BCIs. Significance. This study contributes to the advancement of EEG-based BCIs for inner speech classification by introducing a feature extraction methodology based on ITC. The obtained results, on par with or superior to previous works, highlight the potential of this approach for improving the accuracy of BCI systems. The exploration of this technique lays the groundwork for further research toward inner speech decoding.


Introduction
Brain–computer interfaces, or BCIs, are a cutting-edge technological advancement that creates direct communication channels between the human brain and computers. Their applications are rapidly expanding and evolving, now encompassing a wide range of medical and non-medical services. BCIs can be used for brain state monitoring, neuro-rehabilitation, gaming, and even cognitive augmentation (Gao et al 2021).
Currently, one of the most promising applications of BCIs is inner speech recognition, also known as covert speech. This technology's particular objective is to restore communication for individuals facing speech impediments through augmentative and alternative communication (AAC) devices (Gonzalez-Lopez et al 2020). These devices, known as silent speech interfaces (SSIs), can decode the brain signals generated during the inner speech process. Essentially, inner speech recognition with SSIs allows users to communicate by converting their thoughts to artificial speech using speech synthesis.
Among the several techniques that can be used to record brain signals, electroencephalogram (EEG) is the preferred choice for most of the research works that have explored inner speech recognition (Panachakel and Ramakrishnan 2021). Its popularity comes from its easy setup, wearability, and good temporal resolution (Torres-García et al 2022, Munavalli et al 2023). These advantages make EEG an invaluable tool for capturing brain activity during various tasks, including inner speech recognition.
While BCIs based on EEG hold great promise, it is important to mention that external noise sources, such as electrical devices, and artifacts like blinks, breathing, and gaze changes can easily contaminate EEG signals. This contamination leads to a low signal-to-noise ratio (SNR), highlighting the importance of pre-processing techniques to increase the SNR. Typically, the pre-processing stage involves downsampling (Jahangiri et al 2018, 2019), band-pass filtering (Roy et al 2019), and artifact removal (Brigham and Kumar 2010, Jahangiri et al 2018, Cooney et al 2019). Nevertheless, the specific pre-processing steps are not unique and may vary depending on how the data are collected and on the implementation.
Moreover, when attempting inner speech decoding, it is important to extract relevant features that better characterize each of the brain signals. For effective inner speech decoding, the most informative feature extraction methods encompass the time, frequency, and time-frequency domains. Most authors consider that analyzing the signals in the frequency domain (focusing on specific wave patterns) or the time-frequency domain (looking at how these patterns change over time) is necessary to capture the non-stationary properties of EEG (Chinara et al 2021). In the frequency and time-frequency domains, three essential methods for feature extraction stand out in recent research: mel-frequency cepstral coefficients (MFCCs), the short-time Fourier transform (STFT), and the wavelet transform (WT) (García-Salinas et al 2018, Panachakel et al 2019, Pan et al 2021). These feature extraction techniques play a crucial role in enhancing the performance and accuracy of inner speech recognition systems by allowing researchers to examine the different frequencies that exist in an EEG signal and how they change over time. However, it is important to remember that these strategies may increase computational time and spatial complexity.
In the case of BCIs for inner speech recognition, the key to successfully identifying meaning from EEG signals lies in recognizing the essential patterns extracted from them. Hence, once the most relevant features are extracted, the EEG signal has to be classified. Researchers have explored both classical machine learning and deep learning algorithms to achieve this goal. These algorithms play a vital role in translating the brain's language into clear commands or thoughts for inner speech recognition systems. Some of the common algorithms include the k-nearest-neighbors (kNN) classifier (Brigham and Kumar 2010).

In most approaches in the literature, the recorded trials for a particular word in each subject and session are considered distinct events; however, there is an overlooked potential for analyzing signal phase consistency among the trials of a recorded inner speech event as a relevant feature. This possibility arises from the feasible existence of an event-related phase response concerning an event onset (Roach and Mathalon 2008, Michalopoulos et al 2011). Thus, analyzing consistency across trials could lead to a more specific representation by highlighting features that maintain stability throughout the trials. In simple words, looking for consistency across trials could identify more specific features that reliably represent inner speech.
The technique called inter-trial coherence (ITC) focuses on these possibilities. In fact, for BCIs, diverse works have explored the usefulness of the ITC framework to decode information from EEG-recorded signals. For instance, Yagura et al (2020) analyzed phase synchronization of EEG brain wave signals to examine indicators for selective attention and cognitive load. Similarly, Butler et al (2017) used ITC to investigate neural variability in children on the autism spectrum. Also, Saxena and Gupta (2021) explored the application of the ITC method in characterizing brain signals during motor imagery processes. However, the use of the ITC framework for inner speech decoding is still in its early stages and remains relatively limited.
Consequently, building on the preceding discussion of SSIs and feature extraction methods, the main objective of this study is to introduce a novel approach for feature extraction of brain signals in inner speech decoding. The proposal considers the cross-trial coherence that may be present in the analyzed EEG signals.
Specifically, the research aims to use ITC to identify the frequencies and time instances that best represent each word for a given subject and session in the dataset, along with their corresponding phases at those specific instances.
This study uses Morlet wavelets to obtain the time-frequency representation (TFR), a widely used technique for capturing time-frequency power and phase information from time-domain signals. A critical advantage of Morlet wavelets is their Gaussian nature, which helps minimize ripple effects caused by sharp edges (Cohen 2019). Additionally, this method offers computational efficiency compared to other time-frequency domain analysis methods, such as the fast Fourier transform (FFT) (Cohen 2019). To our knowledge, no previous work has employed ITC alongside Morlet wavelets for feature extraction in inner speech recognition. This approach could potentially improve the accuracy of inner speech decoding systems significantly.
The rest of the paper is organized as follows: the second section describes the four steps of the methodology, explaining the ITC framework and its application as a feature extractor. Moving forward, the third section delves into the acquired results, and the fourth section engages in a discussion, where we compare our findings with those of previous works. Finally, the fifth section includes the conclusions drawn and outlines avenues for future research.

Methodology
The methodology of this work comprises four main steps: database exploration, signal pre-processing, signal processing, and machine learning classification, as described in figure 1. Each of these components is detailed in the following sections.

Database exploration
The EEG database utilized in this study is referred to as 'Thinking out loud,' recently published by Nieto et al (2022). This database contains brain signals from ten participants, consisting of six males and four females, with an average age of 34 years (σ = 10 years). All participants were considered to be in good health, with no reported neurological, psychiatric, or movement disorders. The participants were instructed to perform three different tasks: pronounced speech, inner speech, and visualization, for four different words: up, down, right, and left (figure 2). For the purposes of this research, the recordings for the pronounced speech and visualization tasks were excluded, focusing only on the inner speech recordings. For these recordings, the participants were instructed to mentally simulate their voice and repeat the associated word until the white circle transitioned to a blue one. To foster a natural imaginative process, no rhythm cues were given. Additionally, the dataset's authors noted a lack of correlation between the various imagined words and the event-related potentials (ERPs). This observation could be attributed to the consistent nature of visual cues across all recording trials (Nieto et al 2022). Moreover, the dynamics of event-related desynchronization/event-related synchronization (ERD/ERS) are often influenced by visual stimuli. However, as the visual cues in this dataset remained constant across words, trials, and subjects, the analysis of ERD/ERS dynamics is not conducted in this study (Pfurtscheller 2001). The brain signals were all recorded within a single day for all participants. The recording procedure involved three consecutive sessions where the signals were captured using a 128-channel EEG system, supplemented with 8 additional recording channels for electrooculography/electromyography (EOG/EMG). The sampling rate for all channels was set at 1024 Hz. A total of 559 inner speech trials were recorded for each word, resulting in a total of 2236 inner speech trials.

Time interval rejection
According to the authors, all the inner speech trials in the dataset had a duration of 4.5 s. Within this duration, the first 0.5 s of each trial were dedicated to the concentration interval. During this time interval, participants were instructed to focus on a white circle displayed on a monitor and maintain their gaze fixed on it without blinking until the circle disappeared (Nieto et al 2022). At t = 0.5 s, a cue indicating the word to be thought about appeared on the monitor, represented by a triangle pointing in the direction of the given word. Following this, at t = 1.0 s, a white circle reappeared on the monitor, specifying the beginning of the action interval during which participants engaged in the inner speech task. This interval had a duration of 2.5 s. Finally, at t = 3.5 s and until t = 4.5 s, a blue circle appeared on the monitor, signifying the rest interval period. This workflow is visually represented in figure 2, adapted from the illustration presented by Nieto et al (2022).
It is important to recall that the main objective of this study is to extract features that enable the classification of the four distinct words in the dataset. As a result, three of the four time intervals described earlier were deemed non-essential for achieving this objective, because they do not contain information about the inner speech action interval. In particular, the brain signal behavior presented similarities even across trials of different inner spoken words during the concentration, cue, and rest intervals. Consequently, the information belonging to these time intervals was discarded, as it did not contribute significantly to the feature extraction process. Therefore, the focus remained on extracting representative features from the action interval, where the brain signals could contain distinctive patterns relevant to the classification of the target words.
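As a simple illustration of this interval selection, the following minimal sketch (ours, not from the original study) crops the action interval from a trial stored as a channel-by-sample array; the array layout and the post-decimation sampling rate are assumptions.

```python
import numpy as np

FS = 256             # sampling rate after decimation (Hz); see pre-processing below
ACTION = (1.0, 3.5)  # action interval within the 4.5 s trial (seconds)

def crop_action_interval(trial: np.ndarray, fs: int = FS, window=ACTION) -> np.ndarray:
    """trial: (n_channels, n_samples) array covering the full 4.5 s trial.

    Returns only the action-interval samples; at 256 Hz the 2.5 s window
    yields 640 samples per channel."""
    start, stop = int(window[0] * fs), int(window[1] * fs)
    return trial[:, start:stop]
```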

Signal pre-processing
The EEG signals went through several pre-processing steps to eliminate line noise and artifacts. The pre-processing steps for this work were applied as proposed by Nieto et al (2022). Specifically, a finite impulse response (FIR) notch filter was applied to remove the power line noise at 50 Hz. It is important to recall that raw EEG signals are characterized into five frequency bands: gamma (30-100 Hz), beta (13-30 Hz), alpha (8-12 Hz), theta (4-8 Hz), and delta (0.5-4 Hz). Accordingly, a band-pass filter from 0.5 to 100 Hz was applied, excluding frequencies outside this spectrum.
To further decrease the data size without causing aliasing, the EEG signals were decimated, reducing the sampling rate from 1024 Hz to 256 Hz. Additionally, blind source separation through independent component analysis (ICA) was employed to eliminate artifacts from the recorded signals.
The EOG/EMG channels were particularly useful in identifying sources of blinks, mouth movements, and gaze movements, allowing their elimination during the ICA procedure. All the previous steps led to the acquisition of 128 independent components with a 256 Hz sampling rate. The whole pre-processing procedure is shown in figure 3, where the purple rectangles represent the pre-processing steps and the blue rectangles represent the inputs and outputs.
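For illustration, what follows is a minimal sketch of this pre-processing chain using SciPy and scikit-learn. The filter lengths, the use of FastICA with unit-variance whitening, and leaving artifactual component rejection as a manual step guided by the EOG/EMG channels are our assumptions; they are not specified at this level of detail in the original pipeline.

```python
import numpy as np
from scipy import signal
from sklearn.decomposition import FastICA

FS_RAW, FS_TARGET = 1024, 256  # original and decimated sampling rates (Hz)

def preprocess(eeg: np.ndarray, fs: int = FS_RAW) -> np.ndarray:
    """eeg: (n_channels, n_samples) raw EEG recording at 1024 Hz."""
    # FIR notch (band-stop) filter around the 50 Hz power-line component.
    notch = signal.firwin(513, [48.0, 52.0], fs=fs)  # two cutoffs -> band-stop
    eeg = signal.filtfilt(notch, [1.0], eeg, axis=-1)
    # FIR band-pass filter keeping the 0.5-100 Hz range.
    bandpass = signal.firwin(1025, [0.5, 100.0], pass_zero=False, fs=fs)
    eeg = signal.filtfilt(bandpass, [1.0], eeg, axis=-1)
    # Decimate by 4 (1024 Hz -> 256 Hz); scipy applies an anti-aliasing filter.
    eeg = signal.decimate(eeg, fs // FS_TARGET, ftype="fir", axis=-1)
    # Blind source separation: one independent component per channel.
    ica = FastICA(n_components=eeg.shape[0], whiten="unit-variance", random_state=0)
    sources = ica.fit_transform(eeg.T).T  # (n_components, n_samples)
    # Components correlating with the EOG/EMG channels would be dropped here.
    return sources
```

In practice the ICA decomposition would be fitted on the full continuous recording rather than on single trials, so that blink and gaze sources are well estimated.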

Signal processing
The signal processing procedure comprised three main steps: Morlet WT (time-frequency representation), ITC process, and feature extraction. The flow diagram of this process is shown in figure 4 and each of the previous steps is fully explained in the next sections.

Time-frequency representation
It is well known that brain signals possess a non-stationary nature, indicating that their features change over time and vary among recorded participants. As a result, the relevant characteristics are best captured with time-frequency analysis. The STFT, an extension of the standard Fourier transform (FT), offers a segmented analysis of the windowed signal instead of providing frequency information for the entire signal as the FT does (Krishnan 2021). This allows the STFT to yield representations that capture both the local time and frequency content of the signal, which helps to better characterize the dynamic behavior of brain signals (Rajoub 2020). However, the STFT provides only equally spaced time-frequency localization (Kehtarnavaz et al 2019).

In contrast, the WT is a generalization of the STFT, offering a precise representation at both low and high frequencies. This difference arises because the WT utilizes wavelet functions that can adapt to different signal characteristics at various scales and time frames. In particular, the Morlet wavelet is commonly employed in analyzing neural signals due to its capacity to provide a balanced time and frequency resolution (Ali et al 2022). This key characteristic makes the Morlet WT an excellent tool for localized feature extraction of frequency components across different time frames of brain signals. Although other wavelets exist, the Morlet wavelet is also a good starting point since it has been widely used in the literature (González-Castañeda et al 2017, Yao et al 2021, Agarwal and Kumar 2022, Seleznyev et al 2023). With that being said, in this study the TFR of the pre-processed time-domain EEG signals was obtained using Morlet wavelets. As mentioned earlier, the frequency range of the EEG signals goes from 0.5 Hz to 100 Hz, and 300 logarithmically spaced frequencies were selected within this range for study.
All of the previous processes, considering an action time interval of 2.5 s at a sampling rate of 256 Hz, resulted in a final complex time-frequency representation matrix with dimensions of 300 × 640 for each word. The first dimension represents the different frequencies, while the second dimension corresponds to the number of samples.
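To make the transformation concrete, below is a minimal sketch of a complex Morlet TFR computed via FFT convolution, in the spirit of Cohen (2019). The number of wavelet cycles (n_cycles = 7) and the amplitude normalization are assumptions, as the original work does not report these parameters.

```python
import numpy as np

def morlet_tfr(x, fs=256, fmin=0.5, fmax=100.0, n_freqs=300, n_cycles=7):
    """Complex Morlet TFR of a 1-D signal (one channel, one trial).

    For a 2.5 s action interval at 256 Hz (640 samples) this returns the
    300 x 640 complex matrix described in the text."""
    n = len(x)
    freqs = np.logspace(np.log10(fmin), np.log10(fmax), n_freqs)
    t = np.arange(-(n // 2), n - n // 2) / fs  # wavelet time support
    n_conv = 2 * n - 1                         # full linear convolution length
    X = np.fft.fft(x, n_conv)
    tfr = np.empty((n_freqs, n), dtype=complex)
    for i, f in enumerate(freqs):
        s = n_cycles / (2 * np.pi * f)         # Gaussian width of the wavelet
        wavelet = np.exp(2j * np.pi * f * t - t**2 / (2 * s**2))
        wavelet /= np.abs(wavelet).sum()       # amplitude normalization
        conv = np.fft.ifft(X * np.fft.fft(wavelet, n_conv))
        tfr[i] = conv[n // 2 : n // 2 + n]     # keep the centered segment
    return freqs, tfr
```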

Inter-trial coherence framework
When we perform the time-frequency transformation using the WT, the result is a series of complex numbers, each describing a position in a two-dimensional plane. From the real and imaginary parts of these numbers we can calculate the phase of the signal at a given time and frequency. Then, using ITC, also known as inter-trial phase coherence, we can explore phase synchronicity across multiple trials.
In simple terms, it provides information about the level of synchronization among several neuroelectric signals at different frequencies and time instances. ITC is calculated as shown in equation (1), where $N$ represents the number of trials, $F_k(f,t)$ is the spectral estimate of trial $k$ at frequency $f$ and time $t$, and $|\cdot|$ denotes the modulus:

$$\mathrm{ITC}(f,t) = \frac{1}{N}\left|\sum_{k=1}^{N}\frac{F_k(f,t)}{\left|F_k(f,t)\right|}\right| \quad (1)$$

ITC values range from 0 to 1, with 0 indicating no consistency between trials and 1 representing perfect phase synchronicity among them (Delorme and Makeig 2004, Chikara et al 2020, Engel et al 2020). Moreover, ITC projects the time-frequency components from multiple trials onto the unit circle (Martínez-Montes et al 2008). Figure 5 illustrates two distinct cases, where each line represents a trial: when the lines are closely aligned, phase synchronicity is high, corresponding to high ITC, and vice versa. In brain signal analysis, understanding the degree of synchronicity helps determine the uniformity of the phase angle across trials and identify time-frequency features that represent a specific neural behavior.
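Given the complex spectral estimates of all trials for one channel, equation (1) reduces to a few lines of NumPy, as in the following sketch (our illustration):

```python
import numpy as np

def inter_trial_coherence(tfrs: np.ndarray) -> np.ndarray:
    """Equation (1): ITC from complex spectral estimates.

    tfrs: complex array of shape (n_trials, n_freqs, n_times), holding one
    Morlet TFR per trial for a single channel."""
    phase_vectors = tfrs / np.abs(tfrs)        # project each F_k(f, t) onto the unit circle
    return np.abs(phase_vectors.mean(axis=0))  # mean resultant length, in [0, 1]
```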
For this work, the first step of the ITC process was to compute the ITC between all the trials of each word, session, and subject of the study. It is worth mentioning that a different ITC representation was obtained for each of the 128 EEG channels. The objective of this step was to check whether any additional time instances, beyond those previously discarded, exhibited substantial similarity even across different words.
After computing the ITC representation, as shown in figure 6, it was found that the first 0.5 s of the action interval displayed high phase synchronicity, irrespective of the word, session, or subject under consideration. This observation is logical considering that the white circle presented at the beginning of the interval may have caused a P300 evoked potential (Gutiérrez-Martínez et al 2021). Moreover, this figure demonstrated that there was considerable ITC in the previously discarded time intervals. Taking these findings into account, it was decided to also exclude the initial 0.5 s of the action interval, concluding that only the time frame from t = 1.5 s to t = 3.5 s of the trials contained information relevant to the inner speech process. It is important to mention that the same findings were obtained for all the subjects and sessions of the experiment, although figure 6 shows the ITC for only one subject.
The next step, after defining the relevant time interval to be analyzed, involved obtaining the ITC representation for each word and session individually for each subject. After this, we selected the top 15 frequencies and time instances that displayed the highest ITC values for each of the 128 EEG channels. This process resulted in 12 (4 words × 3 sessions) distinct 128 × 15 × 2 tensors for each of the 10 subjects. In this context, the first dimension of the tensors corresponds to the EEG channels, while the 15 × 2 slice holds the 15 frequencies and their 15 corresponding time instances. Then, the ITC representation of each EEG channel of the previously obtained 120 tensors was analyzed separately to obtain the Morlet circular mean phase for each of its top frequencies and time instances. This was done separately for each word, subject, and recording session.
Hence, to obtain this mean value, the initial step involved generating the complex Morlet time-frequency representation for each channel. Later, the phase at each of the selected top frequencies and time instances was determined from the TFRs. It is important to mention that, on average, a recording session consisted of 20 trials, resulting in approximately 20 phase values (one per trial) for a specific channel and instance within each recording session, word, and subject. Then, to acquire a representative phase value for a given channel and frequency-time instance, the circular mean of these phase values was computed. As explained previously, the top 15 instances were chosen, thus resulting in a final representation per word per session per subject comprising a tensor of dimensions 128 × 15 × 3. In this new tensor, the first dimension represents the channels, while the 15 × 3 represents the top 15 frequency-time instances and their respective mean phase values. Considering that there were 4 different words, the resulting overall tensor was of size 128 × 60 × 3. Each recording session of each subject was characterized by such a tensor.
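The sketch below (our illustration, not the authors' code) combines the two operations just described for a single channel: selecting the top-15 ITC bins and taking the circular mean of the per-trial phases at those bins. Returning array indices rather than the frequency and time values themselves is a simplification we make for brevity.

```python
import numpy as np

def top_instances_and_mean_phase(tfrs: np.ndarray, n_top: int = 15):
    """tfrs: (n_trials, n_freqs, n_times) complex Morlet TFRs of one channel.

    Returns the frequency/time indices of the n_top highest-ITC bins and the
    circular mean phase across trials at each bin (the 15 x 3 slice per channel)."""
    itc = np.abs((tfrs / np.abs(tfrs)).mean(axis=0))
    flat = np.argsort(itc, axis=None)[-n_top:]               # n_top largest ITC bins
    f_idx, t_idx = np.unravel_index(flat, itc.shape)
    phases = np.angle(tfrs[:, f_idx, t_idx])                 # (n_trials, n_top)
    mean_phase = np.angle(np.exp(1j * phases).mean(axis=0))  # circular mean
    return f_idx, t_idx, mean_phase
```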

Feature extraction
The last step of the proposed feature extraction process involved obtaining the characteristic features for each trial of every word based on the tensor obtained in the previous step. It is worth noting that this procedure was carried out for every recording session and subject within the dataset. Individual analysis was conducted for each channel of a specific trial of a particular word, where the main focus was the 60 relevant phase values at the frequency-time instances contained in the overall tensor of the previous step.
During this individual examination, the phase value of a specific channel at a particular frequency-time instance was compared with the previously calculated 60 mean phase values to determine the error between the two values. The error was calculated as shown in equation (2), where $\epsilon^{N}_{k,m}(f,t)$ represents the error between the phase $\phi^{N}_{k}(f,t)$ of channel $k$ of trial $N$ and the previously calculated mean phase value $\bar{\phi}_{k,m}(f,t)$ of that channel at one of the 60 particular frequencies $f$ and times $t$:

$$\epsilon^{N}_{k,m}(f,t) = \left|\phi^{N}_{k}(f,t) - \bar{\phi}_{k,m}(f,t)\right| \quad (2)$$

The objective of this procedure was to select the top 15 instances with the lowest error, representing the characteristic frequency-time instances for each trial. Since this procedure was performed on all 128 channels, each trial of every word in each session was represented by 5760 features. This number results from having 15 relevant frequencies, 15 relevant time instances, and 15 phase values for each of the 128 EEG channels (45 × 128).
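A compact sketch of this per-channel step is given below (our illustration). Using a wrapped circular distance as the phase error and encoding each kept instance by its frequency index, time index, and phase value are assumptions; the paper does not state how the phase difference is wrapped or how the instances are numerically encoded.

```python
import numpy as np

def trial_features(trial_tfr, f_idx, t_idx, mean_phase, n_keep=15):
    """Features for one channel of one trial, following equation (2).

    trial_tfr: (n_freqs, n_times) complex TFR of the trial;
    f_idx, t_idx, mean_phase: the 60 template instances (15 per word x 4 words)
    and their circular mean phases for this channel."""
    phase = np.angle(trial_tfr[f_idx, t_idx])
    # Wrapped phase error between the trial phase and the template mean phase.
    err = np.abs(np.angle(np.exp(1j * (phase - mean_phase))))
    keep = np.argsort(err)[:n_keep]  # the 15 instances with the lowest error
    # 45 values per channel: frequency, time, and phase of each kept instance,
    # giving 45 x 128 = 5760 features per trial over all channels.
    return np.concatenate([f_idx[keep], t_idx[keep], phase[keep]])
```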
Taking this into consideration, the final database for word classification was represented by a (N × 4) × 5760 matrix as shown in figure 4. The first dimension, represented by (N × 4), corresponds to the N trials of each of the four different words, while the second dimension represents the distinctive features of each trial.We use the letter N to indicate the number of trials because, despite there being an average of 20 trials per session, this number may vary depending on the session.

Classification algorithms
In this study, we applied the kNN and SVM algorithms for the classification task. kNN classifies data by considering features and labels from the training data: it examines the k nearest training data points (neighbors) to the testing point and then employs a majority voting rule to determine the final classification (Uddin et al 2022). On the other hand, SVM is a classification algorithm that aims to find a hyperplane separating different classes with a maximum margin, using support vectors. It can handle non-linear data through kernel functions, such as linear, polynomial, and RBF, among others (Shi and Zhang 2020). As previously mentioned, these machine learning algorithms have been widely employed in previous research for inner speech recognition (Brigham and Kumar 2010, Deng et al 2010, González-Castañeda et al 2017, Cooney et al 2018, Gasparini et al 2022). It is important to recall that deep learning methods have also been employed for inner speech classification (Zhao and Rudzicz 2015, Saha and Fels 2019, Saha et al 2019a, 2019b, Chengaiyan et al 2020, Tamm et al 2020, van den Berg et al 2021, Gasparini et al 2022). However, this study represents an initial test of the proposed feature extraction method, and further investigation is required to evaluate the effectiveness of deep learning algorithms in conjunction with it. To robustly determine the performance of the classifiers, we evaluated their accuracy using a 5-fold cross-validation approach.
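As an illustration of this evaluation setup, the sketch below runs both classifiers with 5-fold cross-validation in scikit-learn. The random stand-in data, the feature standardization, and the RBF kernel for the SVM are assumptions; the text reports only k = 9 for kNN (see Results) and does not specify the SVM kernel.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (n_trials, 5760) feature matrix; y: word labels (0-3).
# Hypothetical random data stands in for the real features here.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((80, 5760)), rng.integers(0, 4, 80)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

for name, clf in [("kNN", knn), ("SVM", svm)]:
    scores = cross_val_score(clf, X, y, cv=5)  # stratified 5-fold CV
    print(f"{name}: {scores.mean():.2%} +/- {scores.std():.2%}")
```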

Results
The main goal of this work is to introduce a novel method that utilizes the time-frequency representation of Morlet wavelets and the ITC framework to extract relevant features from EEG signals for inner speech classification. It is important to recall that inner speech data is highly variable. Therefore, in order to take this into consideration, this work focused on multi-class classification.
Several experiments with the kNN classifier indicated that k = 9 nearest neighbors yielded the best results for the studied trials. The classification accuracy achieved using the kNN classifier for each subject and recording session is presented in table 1. The SVM classifier demonstrated a higher average classification accuracy, as shown in table 2. Moreover, a comparison with other works that have explored multi-class classification for inner speech recognition is given in tables 3 and 4: the former compares against works that used a different dataset than the one presented here, while the latter compares against works that used the same dataset as this work.

Discussion
This work focuses on inner speech recognition with a four-class classification, utilizing the ITC framework and employing SVM and kNN classifiers, achieving accuracies of 59.55% (SVM) and 56.08% (kNN). Several works have attempted to tackle this same problem using different datasets, as shown in table 3, which may contain more or fewer classes than the dataset used in this work.
We first compared our work with studies featuring significantly more classes, highlighted in yellow in table 3. This comparison aimed to underscore the challenges associated with handling increased class complexity. The work presented by Cooney et al (2018) dealt with 11 classes using MFCCs to extract the most relevant features, yet their SVM accuracy of 20.8% falls significantly short of the results achieved here. Similarly, Lee et al (2020) faced difficulties with 12 classes, achieving an average accuracy of 16.16% using an LDA classifier and common spatial pattern (CSP) features.
Next, when comparing our results with those from studies addressing one additional class (highlighted in green in table 3), García-Salinas et al (2018) achieved an accuracy of 75.90% using SVM, demonstrating the potential advantages of exploring canonical decomposition as a feature extraction technique for improving classification performance. Remarkably, our study exhibits lower accuracy even with one fewer class compared to the findings reported by García-Salinas et al (2018). This discrepancy may arise from differences in the feature space construction, potentially leading to less data sparsity in their approach compared to ours. Additionally, García-Salinas et al (2018) employed the CAR algorithm to enhance the signal-to-noise ratio and utilized other techniques, such as channel selection algorithms. These approaches may have contributed to mitigating overfitting in their machine learning models and could explain their superior results.
On the other hand, Cooney et al (2019) and Tamm et al (2020) achieved lower performance with 5 classes using a CNN as a feature extractor and a dense layer as a classifier, reaching accuracies of 35.68% and 23.98%, respectively. While the results achieved here surpass theirs, it is important to recall that those works used one extra class, which points to the potential of deep learning methods. However, such methods usually require more data than is available in this dataset to perform well, so they must be applied to this dataset with care.
Moving on to works with fewer classes, marked in gray in table 3, Brigham and Kumar (2010) achieved 61% accuracy in a binary setting with 2 classes using autoregressive coefficients and kNN. Deng et al (2010) applied the Hilbert spectrum as a feature extractor and used a Bayesian classifier, achieving 66.5% accuracy in a binary setting with 2 classes, and DaSalla et al (2009) attained 71.33% accuracy with 3 classes using CSP features and an SVM classifier. Despite having more classes, the accuracy achieved here is competitive, suggesting the strength of this approach even in scenarios with increased class complexity.
Finally, focusing on works dealing with binary classification, also marked in gray in table 3, Jahangiri et al (2018) achieved an accuracy of 82.5% in a binary classification using the discrete Gabor transform, higher than the accuracy achieved here. However, in a multiclass scenario, the results with this method may be less favorable, so exploring their feature extraction technique in such a scenario could be insightful. Nguyen et al (2017) achieved 50.1% accuracy with 3 classes and 66.2% accuracy in a binary setting using Riemannian manifold features and an RVM. This work outperforms theirs, indicating the strength of the feature extraction and classification methods employed.
Table 1. Accuracy (%) and standard deviation (std. dev.) results for kNN classification for every subject and all recording sessions. The greatest average accuracy using kNN was obtained for the first recording of subject 7, while the lowest was for the second recording of subject 5.

While the aforementioned comparative analysis provides valuable insights, it is important to acknowledge the inherent challenge of directly comparing the presented work with these studies due to variations in the utilized datasets. This study uses a recent dataset (2022), which stands out as a unique contribution in a relatively unexplored domain. To the best of the authors' knowledge, this dataset has only been analyzed in two other works. Gasparini et al (2022) employed power spectral density (PSD) for feature extraction and various classifiers, including SVM, XGBoost, LSTM, and BiLSTM, achieving accuracies ranging from 26.2% to 31.3%. In contrast, van den Berg et al (2021) utilized a CNN-based approach (EEGNet) and achieved an accuracy of 29.67%. When compared with these works on the same dataset, as shown in table 4, this study, employing the ITC framework and SVM/kNN classifiers, demonstrated superior performance with accuracies of 56.08% (kNN) and 59.55% (SVM). This discrepancy underscores the efficacy of the chosen methodology.
However, it is crucial to acknowledge the imbalance between the extensive feature set (5760 features) and the relatively low average number of samples per class (50). This imbalance leads to data sparsity, making the algorithms prone to overfitting. In fact, this can be observed in the average squared correlation ($R^2$) coefficient of the extracted features between different words, which has a value of 0.0163. Although the resolution of this issue is beyond the scope of this paper, we acknowledge that addressing it could lead to better classification performance.
Moreover, it is important to note that a substantial proportion of frequencies exhibiting high ITC were concentrated within the gamma frequency band. This observation highlights the importance of the gamma band in the context of recognizing imagined speech through inner speech decoding, which has also been detected by other authors using the electrocorticogram (ECoG) to record the signals (Crone et al 2001, Towle et al 2008). The prevalence of coherent neural activity within this frequency range during EEG recordings, together with previous ECoG findings on the relevance of the gamma band in language-related tasks, underscores its significance in capturing and characterizing the brain signals produced during inner speech processes.
In this study, the Morlet WT is used to obtain the TFR. As mentioned in the introduction, this method demands fewer computational resources compared to commonly used time-frequency transformations like the FFT, but it does introduce increased computational complexity relative to time-domain analysis. It is important to highlight our decision to address this challenge by employing classical machine learning algorithms: unlike the deep learning approaches seen in comparative works, this work opted for simpler algorithms like SVM and kNN. It is important to note that our primary focus was not on minimizing computational time; however, this is a potential area for future exploration and refinement. The decision to prioritize accuracy over computational speed is aimed at achieving robust and reliable inner speech decoding, recognizing the inherent trade-off between complexity and efficiency in the current landscape of BCI.

Conclusion
This work proposed a novel feature extraction methodology for the imagined speech classification of four inner spoken words. The proposed method is based on the ITC framework, in which the characteristic time-frequency instances and phases of the EEG signals are fused to provide highly discriminative features. The classification algorithms employed were kNN and SVM. The obtained results were compared with works utilizing different datasets and the same dataset as this study.
When compared to works utilizing different datasets, this study outperformed most of them. However, a direct comparison with those works is unfair and challenging to interpret due to differences in the datasets, number of classes, protocols, and techniques employed across studies. Nevertheless, when comparing the results of the proposed methodology with two other works using the same dataset, the present study yields significantly superior results in terms of accuracy. In fact, one of these works utilized the same dataset and SVM classifier as the present study; however, the feature extraction technique employed was PSD, in contrast to the ITC used in this work. Specifically, in this classification task, our study achieved an accuracy 33.35 percentage points higher, highlighting the capabilities of the proposed methodology. Therefore, the use of ITC as a feature extraction method demonstrates feasibility in inner speech classification.
Furthermore, it is crucial to note that, due to potential computational efficiency concerns, this method is not yet intended for use in online EEG-based BCIs. In fact, given the complexity of this field, there has been minimal research conducted online, and the number of classes has been severely limited. For instance, in the work by Sereshkeh et al (2017), the authors achieved a binary EEG classification accuracy of 69% when classifying the imagined words 'yes' and 'no'. However, as highlighted by the authors, further online sessions and additional classes are imperative to advance toward clinical translation.
Other opportunities for future research include exploring deep learning algorithms, focusing on specific brain frequency bands, such as the gamma band, and analyzing phase coherence between channels. Moreover, other evaluation schemes, such as leave-one-out cross-validation, could provide additional insights into the algorithm's performance and robustness. Despite the limitations of a small dataset and potential computational efficiency concerns, this study contributes by being the first to utilize the ITC framework for feature extraction in inner speech decoding. However, these limitations emphasize the need for caution when interpreting the results, which may vary depending on the dataset used. In conclusion, this study provides a novel method that demonstrates feasibility in inner speech classification and lays the foundation for future investigations into more sophisticated approaches that consider the consistency among EEG recordings.

Figure 1. General methodology flow diagram. The diagram moves through the four main stages of this study: database exploration, signal pre-processing, signal processing, and classification.

Figure 2. Workflow for trial recording as presented by Nieto et al (2022). The initial 0.5 s of each trial displayed a white circle, defined as the concentration interval. Subsequently, during the cue interval, a triangle appeared, pointing in the direction of the word to be thought. Following this, a white circle reappeared, marking the commencement of the action interval during which the subject performed the inner speech process of the word without any prescribed rhythmic cue. Finally, a blue circle emerged, signaling the conclusion of the action and the onset of the rest interval.

Figure 3. Signal pre-processing flow diagram. The blue squares denote the inputs and outputs at various stages, while the purple ones symbolize the processes executed for signal pre-processing. The initial input consisted of raw EEG data with 128 channels and a 1024 Hz sampling rate, leading to the final output of EEG data featuring frequencies ranging from 0.5 Hz to 100 Hz and 128 independent components, sampled at 256 Hz.

Figure 4. Signal processing flow diagram for one subject. The yellow squares denote the inputs and outputs at various stages, while the orange ones represent the signal processing procedures. The input for this process consists of the filtered EEG signals from each recording trial, and the output is the feature matrix containing samples and their features, obtained through the ITC framework.

Figure 5. Unit circle projections for (a) the ITC representation of trials with low phase synchronicity and (b) the ITC representation of trials with high phase synchronicity.

Figure 6. ITC analysis of the 'right' word across all trials in the initial session of subject 1. The red spots indicate instances of higher inter-trial coherence in both time and frequency domains. The vertical lines delineate the various intervals in the recording. The highlighted region, denoted as the interval of interest, spans from 1.5 s to 3.5 s.

Table 2. Accuracy (%) and standard deviation (std. dev.) results for SVM classification for every subject and all recording sessions. The greatest average accuracy using SVM was obtained for the first recording of subject 3, while the lowest was for the third recording of subject 6.

Table 3. Comparison of the mean accuracy (%) between this work and other works using different datasets. Yellow highlights works addressing multiclass classification with a significantly larger number of classes than the current study. Green highlights multiclass classification works with one additional class compared to this study. Works marked in gray either involve one fewer class than this study or focus on binary classification tasks.

Table 4. Comparison of the mean accuracy (%) between this work and other works using the same dataset as the present study (Nieto et al 2022).
References
Uddin S, Haque I, Lu H, Moni M A and Gide E 2022 Comparative performance analysis of the k-nearest neighbour (kNN) algorithm and its different variants for disease prediction Sci. Rep. 12 6256
van den Berg B, van Donkelaar S and Alimardani M 2021 Inner speech classification using EEG signals: a deep learning approach 2021 IEEE 2nd Int. Conf. on Human-Machine Systems (ICHMS) (IEEE) pp 1-4
Yagura H, Tanaka H, Kinoshita T, Watanabe H, Motomura S, Sudoh K and Nakamura S 2020 Analysis of selective attention processing on experienced simultaneous interpreters using EEG phase synchronization 2020 42nd Annual Int. Conf. of the IEEE Engineering in Medicine & Biology Society (EMBC) (IEEE) pp 66-69
Yao B, Taylor J R, Banks B and Kotz S A 2021 Reading direct speech quotes increases theta phase-locking: evidence for cortical tracking of inner speech? NeuroImage 239 118313
Zhao S and Rudzicz F 2015 Classifying phonological categories in imagined and articulated speech 2015 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP) (IEEE) pp 992-6