Machine learning decoding of single neurons in the thalamus for speech brain-machine interfaces

Objective. Our goal is to decode the firing patterns of single neurons in the left ventralis intermediate nucleus (Vim) of the thalamus related to speech production, perception, and imagery. For realistic speech brain-machine interfaces (BMIs), we aim to characterize the number of thalamic neurons necessary for high-accuracy decoding. Approach. We intraoperatively recorded single-neuron activity in the left Vim of eight neurosurgical patients undergoing implantation of a deep brain stimulator or RF lesioning during production, perception, and imagery of the five monophthongal vowel sounds. We utilized the Spade decoder, a machine learning algorithm that dynamically learns specific features of firing patterns and is based on sparse decomposition of the high-dimensional feature space. Main results. Spade outperformed all algorithms it was compared with, for all three aspects of speech: production, perception, and imagery, and obtained accuracies of 100%, 96%, and 92%, respectively (chance level: 20%), based on pooling neurons across all patients. The accuracy was logarithmic in the number of neurons for all three aspects of speech. Regardless of the number of units employed, production yielded the highest accuracies, whereas perception and imagery equated with each other. Significance. Our research renders single-neuron activity in the left Vim a promising source of inputs to BMIs for the restoration of speech faculties in locked-in patients, or patients with anarthria or dysarthria, to allow them to communicate again. Our characterization of how many neurons are necessary to achieve a certain decoding accuracy is of utmost importance for planning BMI implantation.


Introduction
Human communication is largely based on speech. Deprivation of speech faculties is therefore a major cause of unhappiness in locked-in patients [1]. Speech brain-machine interfaces (BMIs) are devices that aim to bypass damaged parts of the human speech system and translate the patient's speech-related neuronal activity into the corresponding artificial speech sounds.
The first single-unit BMI in humans was implanted by Kennedy and colleagues [21, 22] in a speech-related area in the left precentral gyrus. The BMI translated neural signals in the brain of a locked-in patient who tried to speak into continuous control parameters for a speech synthesizer. Following patient practice, the hit rate reached 70%. A recent study synthesized speech directly from neural activity in the dorsal motor cortex during a long passage reading task [23]. Other studies have decoded the neuronal activity generated during perception of human speech, in both macaques [24] and humans [25].
In several studies, electrocorticography (ECoG) was decoded offline to infer parts of speech, for example vowels [26] or complete words [27–30]. However, these studies made use of a small dictionary for decoding. A recent study employed a small dictionary and, using next-word probability, could predict sentences in real time [31]. Recently, ECoG served in decoding when the user silently mimed sentences [32]. The decoded signal was used for synthesizing speech that follows the articulatory dynamics of a speaking individual. Notwithstanding, this type of BMI could not be activated by imagined speech, due to the lack of articulatory movements. Another use of ECoG was to infer answers to questions that the patient produced or perceived [33]. Even though this study employed prior knowledge about the probability of answers to specific questions, the resultant accuracies were only 61% and 76%, respectively. ECoG, mainly from the speech motor cortex, also served to reconstruct the sound waveform, which was fed into a speech synthesizer [34]. The reconstruction in this study was highly coupled to the sound signal produced by the user and did not employ a general representation of parts of speech. This approach is less suitable for implantation in silent paralyzed persons.
Scalp electroencephalography (EEG) has also been recently utilized for decoding vowels, with a classification accuracy of 68.9% for five Bengali vowels [35]. Another recent study reached 92.2% accuracy, but for three vowels only: a, i, and u [36].
In our previous study, we found single units that encode the production of vowel sounds in the human rostral anterior cingulate and adjacent medial orbitofrontal cortex (rAC/MOF) and in the superior temporal gyrus (STG) [37]. This study also uncovered the different types of encoding employed in the two regions. Whereas neurons in the former area were sharply tuned, i.e. showed highly specific responses to one or two vowels, neurons in the latter were broadly tuned, with sinusoidal modulation over the vowel space, in analogy to the well-known sinusoidal tuning curves of hand movements in the primary motor cortex [38] (recently, two more neural population dynamics features previously reported for arm movements were also found during speaking [39]). Prediction of the five monophthongal vowel sounds by decoding the aforementioned neuronal activity resulted in an accuracy of 93% (cross-validated). This high accuracy was achieved by a novel machine learning decoding algorithm, named Spade, that utilizes sparse decomposition of the high-dimensional neuronal feature space [40].
In another study, we found single neurons in the subthalamic nucleus (STN) that encode vowel phonemes during speech production, perception, and imagery [41, 42]. However, in contrast with the homogeneous encoding in the rAC/MOF and STG regions, some subthalamic neurons were sharply tuned, whereas others were broadly tuned. We decoded the subthalamic neuronal activity to infer the five monophthongal vowel sounds [43]. Spade outperformed all other algorithms it was compared with and predicted the vowels with 100% accuracy for speech production, 96% for perception, and 88% for imagery. In the perception condition, the accuracy was linear in the number of neurons employed for decoding, whereas in production and imagery, it increased faster than linearly.
Another brain region shown to be involved in language and speech is the left thalamus. Thalamotomy of the ventralis intermediate nucleus (Vim) resulted in deterioration of temporal processing of words, semantic comprehension, syntax construction, verbal memory, and name generation [44]. The thalamus is known to be part of the articulatory networks [45] and an integrative hub for functional brain networks [46]. Stimulation of the thalamus was shown to evoke alterations in object naming, anomia, and perseveration [47]. It systematically reacts to syntactic and semantic parameters of language in coordination with cortical regions [48–51]. Representations of motoric speech and prosody were also demonstrated in the thalamus [48, 52]. We recently found single neurons that encode vowel phonemes during speech production, perception, and imagery also in the left Vim of the thalamus [53]. This neuronal population contained both sharply- and broadly-tuned units.
Thalamic neuronal activity has been decoded to infer tactile information in the rat [54]. fMRI studies have demonstrated that the thalamus is involved in the regulation of slow cortical potentials that control an EEG-based brain-computer interface (BCI) [55, 56]. The lateral geniculate nucleus of the cat thalamus has served as an input source for reconstructing input images in a visual BMI [57].
Despite the aforementioned studies utilizing thalamic neural activity as input for several types of BMIs, the potential of the thalamus, and specifically the left Vim, to serve as an implantation site for speech BMIs is yet unknown. Here, we decode the electrical activity of single neurons in the human left Vim to predict speech features during speech production, perception, and imagery. We compare the Spade algorithm [40] with other supervised machine learning algorithms for speech decoding. For a practical speech BMI, we also evaluate the number of left Vim neurons necessary to achieve a certain level of decoding accuracy. This is aimed at providing neurosurgeons with an estimate of how many left Vim units are required for a high-accuracy speech BMI.

Patient population
Subjects in this study were 8 neurosurgical patients, 4 with Parkinson's disease and 4 with essential tremor, undergoing implantation of DBS electrodes (3389, Medtronic Inc., Dublin, Ireland) or RF lesioning (5 patients) in the left Vim for clinical reasons (4 male; 6 right-handed, 1 ambidextrous; 74.1 ± 1.6 years old; disease duration 14.1 ± 3.4 years, unknown for 1 patient, so the average disease duration was computed over 7 patients only) [53]. Movement disorders neurologists at our medical centre found none of the patients to have speech disorders in presurgical clinical assessments. No sedatives or anaesthetics were administered during or immediately before the neuronal recordings. Medications were withheld overnight before surgery. All patients were included in the analysis.

Neurophysiology
Our clinical procedure followed that of [41–43, 53]. In brief, two microelectrodes (NeuroProbe-Sonus, Alpha Omega, Nof Hagalil, Israel) were temporarily implanted to locate the thalamus, based on neural activity patterns, using the Neuro Omega recording system (Alpha Omega, Nof Hagalil, Israel), advancing from 25 mm above the planned target along the selected trajectory, in 0.1–0.2 mm steps, until the exit from the thalamus (as detected by microelectrode recordings). The BenGun was oriented in the anteroposterior direction, with the second electrode posterior to the central tract, in order to assess the distance from the ventral caudal nucleus. Electrodes were localized to the Vim using the THOMAS atlas [58] (see [53] for details).
We used conscious sedation with propofol for the incision and burr-hole craniotomy. We used less than 1% propofol at a low dose of 1 mg/kg per hour. The context-sensitive half-time of propofol is very short, especially since we administered it only for a short time. Sedation was stopped immediately after completion of the drilling. The exact times between removal of sedation and the beginning of experimental recording were not logged in our study, but in general, they ranged roughly between 15 and 20 min. We verified that patients were awake and cooperative before starting microelectrode recordings.
Bandpass-filtered signals were spike sorted using WaveClus. The classification by the automatic algorithms was manually revised by an experienced researcher (AT). The SUMU algorithm decided which units were single units [59], and its output was manually verified. Stability of cell waveforms was assessed based on the maximal cross-correlation between the mean waveforms of 30 s time bins, following the method of [60]. Firing-rate stability was evaluated according to the slopes of linear fits of successive average firing rates computed in 30 s baseline time bins [61, 62]. A unit was considered stable if none of the slopes was significantly different from 0. Sound and neuronal recordings were synchronized. All studies were approved by and conformed to the guidelines of the Medical Institutional Review Board at Tel Aviv Sourasky Medical Center, and all patients provided written informed consent to participate in the study. The research was conducted in accordance with the principles embodied in the Declaration of Helsinki and with local statutory requirements.
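As a rough illustration of the waveform-stability idea (in Python rather than the MATLAB used in the study, with our own function names), the sketch below compares the mean waveforms of consecutive 30 s bins. Note one simplification relative to the paper: it uses the Pearson correlation at zero lag instead of the maximal cross-correlation of [60].

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def stable_waveform(mean_wf_bins, r_min=0.95):
    """mean_wf_bins: mean spike waveform per 30 s bin (list of lists).
    Crude stability check: every pair of consecutive bins must correlate
    at least r_min (assumption; the paper uses maximal cross-correlation)."""
    return all(pearson_r(mean_wf_bins[i], mean_wf_bins[i + 1]) >= r_min
               for i in range(len(mean_wf_bins) - 1))
```

On stable units, consecutive mean waveforms are nearly identical (r close to 1), matching the reported 91% of r values above 0.95 for same-cell comparisons.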

Experimental paradigms
The paradigm consisted of 3 experimental conditions: speech production, perception, and imagery. Each condition had 5 parts, according to the 5 monophthongal vowel sounds: a /ä/, e /e̞/, i /i/, o /o̞/, u /u/. In each part, following an oral instruction, a sequence of randomly spaced (2–3 s) auditory cues ('beeps') was played. Following each beep, the patient produced, perceived, or imagined articulating the instructed vowel sound. Listening was to an audio recording played via loudspeakers. In the imagery task, patients were asked to imagine themselves speaking, but without any lip or other articulatory movements. The imagery condition was monitored by an experimenter to verify the absence of lip movement. To control for pure auditory responses to the beep itself, at the beginning of each session the patient listened to a sequence of beeps. All instructions were in Hebrew.
Patients performed a total of nine sessions. Each patient repeated each vowel between 5 and 10 times ('trials') in each of the three experimental conditions: production, perception, and imagery.

The data
This paper utilizes, for the decoding task, the same dataset that served for identifying the population of thalamic speech-encoding neurons [53].
For each trial, firing rates were computed in 100 ms bins aligned to speech onset for production and perception, and to the beep for imagery (where onset is unavailable). Trials longer than 2 s were omitted, due to lack of patient response or an excessively long delay in response, usually indicating lack of attention to the task.
Only the first 5 valid trials of each vowel sound (in chronological order) were analysed, so that all sessions had the same number of trials (i.e. for the 5 vowels, a total of 25 trials per experimental condition, per patient).
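As a minimal illustration of this preprocessing step (in Python rather than the MATLAB used in the study; the function and parameter names are our own), the sketch below bins the spike times of one unit in one trial into 100 ms firing-rate bins aligned to an onset event:

```python
def binned_rates(spike_times_s, onset_s, n_bins=20, bin_s=0.1):
    """Firing rate (Hz) in 100 ms bins, aligned to an onset event.

    spike_times_s: spike timestamps (s) of one unit in one trial.
    onset_s: speech onset (production/perception) or beep time (imagery).
    Only post-onset bins are shown here; the production analysis also
    uses pre-onset bins (an extension of this sketch).
    """
    rates = [0.0] * n_bins
    for t in spike_times_s:
        b = int((t - onset_s) // bin_s)
        if 0 <= b < n_bins:
            rates[b] += 1.0 / bin_s  # count per bin, converted to Hz
    return rates
```

For example, a trial with spikes at 1.05, 1.12, and 1.95 s and onset at 1.0 s yields 10 Hz in bins 0, 1, and 9, and 0 Hz elsewhere.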
Examples of data, in the form of raster plots and peri-stimulus time histograms of left Vim neurons during speech production, perception, and imagery, appear in our earlier study [53].

Decoding algorithms
Our decoding algorithm ('Spade') has been presented in [37, 40, 43]. In brief, the algorithm uses sparse decomposition to project the high-dimensional neuronal feature space onto a lower-dimensional space. For this, it utilizes a regularized multivariate linear solver, which minimizes ∥W∥_{1,2} under the constraint ∥AW − B∥_F ⩽ σ, where A and B are matrices containing the input feature vectors and low-dimensional label code vectors as their rows, respectively, W is a weight matrix learnt by the decoder, and σ is a parameter (see [40] for details). The decoder thus simultaneously reduces dimensionality and automatically selects task-relevant features. This algorithm already served to successfully decode speech-related neuronal information from the rAC/MOF, STG, and other frontal and temporal lobe areas in epilepsy patients [37], as well as the STN in Parkinson's disease patients [43].
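The exact solver is described in [40]. Purely as an illustration of why an L1,2 penalty selects whole feature rows, the following sketch (Python, our own naming, not the published implementation) solves the closely related penalized form min_W ½∥AW − B∥²_F + λ∥W∥_{1,2} by proximal gradient descent, where ∥W∥_{1,2} is the sum of the L2 norms of the rows of W:

```python
import numpy as np

def l12_prox(W, t):
    """Row-wise group soft-thresholding: prox of t * sum_i ||W[i, :]||_2."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return W * scale

def sparse_decode_fit(A, B, lam=1.0, n_iter=500):
    """min_W 0.5*||A W - B||_F^2 + lam*||W||_{1,2} via proximal gradient.
    Rows of W that end up all-zero correspond to discarded input features."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, L = Lipschitz const. of grad
    W = np.zeros((A.shape[1], B.shape[1]))
    for _ in range(n_iter):
        W = l12_prox(W - step * A.T @ (A @ W - B), step * lam)
    return W
```

With a sufficiently large λ, the rows of W belonging to features uninformative about the label codes B shrink to exactly zero, which is the simultaneous dimensionality reduction and feature selection described above.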

Feature selection
We employed a dynamic feature selection method based on two pre-defined time intervals: a baseline period preceding an event (e.g. the auditory cue), and a response period. The method divides the two intervals into 100 ms bins, averages the firing rates over the bins of the baseline period, and searches for the bin of maximal firing rate within the response interval. It then averages the maximal firing rate and the firing rate in its successor bin.
The baseline and response time intervals were defined as follows. For speech production, where speech-related neuronal activity is expected both before and after speech onset, we examined the response period between −300 and 500 ms and compared it with a baseline period between −800 and −300 ms. Perception of speech, on the other hand, is expected to take place only after speech onset, as the speech is produced by another individual. We therefore selected the period between 0 and 500 ms after speech onset as the response period, with a baseline between −500 and 0 ms. Imagery of speech was aligned to the auditory cue ('beep') because the onset time of imagined speech was unknown. Hence, we examined a response period between 0 and 500 ms relative to a baseline between −800 and −300 ms.
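The two-interval feature computation can be sketched as follows (Python, our own naming; the guard for a peak in the last bin is our assumption, as the paper does not specify that edge case):

```python
def dynamic_features(rates, baseline_bins, response_bins):
    """rates: firing rate per 100 ms bin (one unit, one trial).
    baseline_bins / response_bins: bin-index ranges of the two intervals.
    Returns (baseline mean, mean of the peak response bin and its successor)."""
    baseline = sum(rates[b] for b in baseline_bins) / len(baseline_bins)
    peak = max(response_bins, key=lambda b: rates[b])   # bin of maximal rate
    nxt = min(peak + 1, len(rates) - 1)                 # successor bin (guarded)
    response = (rates[peak] + rates[nxt]) / 2.0
    return baseline, response
```

For instance, with per-bin rates [2, 2, 2, 2, 10, 20, 4, 0], a baseline over bins 0-3, and a response interval over bins 4-7, the features are a 2 Hz baseline and a (20 + 4)/2 = 12 Hz response.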

Classification algorithms
We compared the Spade decoder with the following classification algorithms: linear classifier, support vector machine (SVM), k-nearest neighbours (KNN), and decision tree.
For SVM and KNN, we examined two variants each: SVM with linear and Gaussian kernels, and KNN with Euclidean and cosine distance functions. This resulted in a total of seven algorithm variants. From each of these variants, we created three further variants based on the following inputs to the classifier (21 variants in total): the raw firing rate, the normalized firing rate, and the principal components of the firing rate function (see [43] for details). All hyperparameters were the same as in [43]. Whenever an algorithm converged, the last model it generated was the one employed.
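The variant grid, i.e. seven base algorithms crossed with three input representations, can be enumerated as follows (an illustrative sketch; the strings are labels of our own, not identifiers from the study's code):

```python
from itertools import product

# Spade plus the six comparison classifiers described above
base_algorithms = [
    "Spade", "linear classifier", "SVM (linear kernel)",
    "SVM (Gaussian kernel)", "KNN (Euclidean)", "KNN (cosine)",
    "decision tree",
]
# the three input representations fed to each algorithm
input_types = ["raw firing rate", "normalized firing rate",
               "principal components"]

variants = [f"{algo} / {inp}" for algo, inp in product(base_algorithms,
                                                       input_types)]
```

The cross product gives the 7 × 3 = 21 variants compared in the study.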
In addition to classic machine learning methods, we also compared the decoder with a deep learning algorithm for classification, namely a convolutional neural network (CNN). The network architecture, as well as its training procedure, is detailed in the supplementary methods. The input to the CNN was either the raw or the normalized firing rate.

Statistical analysis
Leave-one-out cross-validation served in all runs of the decoders. The accuracies of the cross-validation runs were compared against the 20% chance level (5 balanced classes) in a one-sided one-sample t-test. When p < 0.05, decoding accuracy was considered significantly above chance. A two-sample t-test served to compare the best algorithm with the second best.
To assess significance without any parametric assumptions, we also compared the accuracy of each decoder against an empirical chance level. This chance level was computed per algorithm based on a null distribution of its own accuracies (following cross-validation) when data labels were randomly shuffled 13 times. The above t-test served also for the comparison with the empirical chance level. In addition, for the Spade algorithm, which was found to be the best algorithm (see results), we repeated the shuffling 1000 times and compared its accuracy with the 99th percentile of the resulting null distribution.
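The shuffling procedure can be sketched as follows (Python, our own naming; for illustration, the decoder's cross-validated predictions are replaced here by a fixed prediction vector, whereas in the study each shuffle involved re-running the cross-validated decoder):

```python
import random

def shuffled_chance(pred, labels, n_shuffles=1000, seed=0):
    """Empirical chance level: accuracy of fixed predictions against
    randomly shuffled labels. Returns (mean of the null distribution,
    its 99th percentile)."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_shuffles):
        sh = labels[:]
        rng.shuffle(sh)
        acc = sum(p == s for p, s in zip(pred, sh)) / len(sh)
        null.append(acc)
    null.sort()
    return sum(null) / len(null), null[int(0.99 * len(null))]

# 5 balanced vowel classes, 25 trials: theoretical chance level is 20%
labels = [v for v in "aeiou" for _ in range(5)]
mean_null, p99 = shuffled_chance(labels[:], labels, n_shuffles=1000)
```

With balanced classes, the mean of the null distribution lands near the theoretical 20% chance level, and an observed accuracy is deemed significant (non-parametrically) if it exceeds the 99th percentile.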
The whole analysis was implemented in MATLAB (Mathworks, Natick, MA). For brevity, throughout this paper we employ the terminology 'decode speech production/perception/imagery' to mean, strictly, decoding the neuronal activity recorded during the condition of speech production/perception/imagery and predicting the vowel that the patient produced/perceived/imagined, respectively.

Decoding each session by itself
For a practical speech BMI system, the decoder should be trained on data of the same patient, i.e. without pooling. We therefore ran the dynamic normalized Spade algorithm repeatedly, each time with a different number of units from the same session. For robustness, for each number of neurons we ran the algorithm multiple times, each time with a different set of randomly selected neurons. Thus, if in a specific session k units were recorded, then for each number of units n (1 ⩽ n ⩽ k − 1), we ran the algorithm k times, each with a different, randomly selected set of input neurons, for each experimental condition (production, perception, and imagery). For n = k, the algorithm was run once per condition. As described above, each run was cross-validated.
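The bookkeeping of these subsampling runs can be sketched as follows (Python, our own naming; each listed subset would be fed to one cross-validated decoder run per condition):

```python
import random

def subset_runs(k, seed=0):
    """For a session with k recorded units: k randomly chosen unit subsets
    for each subset size n < k, and a single run using all k units."""
    rng = random.Random(seed)
    runs = {}
    for n in range(1, k):
        # k runs per size n, each on a different random subset of n units
        runs[n] = [rng.sample(range(k), n) for _ in range(k)]
    runs[k] = [list(range(k))]  # n = k: one run with all units
    return runs
```

A session with k = 5 units, for example, yields 5 runs for each of n = 1, ..., 4 and one run for n = 5, per experimental condition.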

Decoding the pooled-sessions dataset
For the pooled-sessions dataset, we repeated the aforementioned procedure for single sessions, as if the whole dataset of 129 neurons came from a single session.

Results
We recorded the neuronal activity of 129 units in the left Vim of 8 patients during 9 sessions. Waveforms of the same neuron in different 30 s time bins tended to be very similar, with 91% of r values > 0.95, compared with only 5% of similarity scores calculated between waveforms of different cells [60]. Waveform similarity in the two waveform populations (same cell vs. different cells) was significantly different (two-sample t-test, p = 7.1 × 10−19). All recorded units had stable average firing rates (100%; 129/129).
The accuracy of the normalized Spade algorithm was significantly higher than that of the second-best algorithm for production (normalized CNN: p = 0.0003) and perception (linear SVM: p = 0.042), but equated with the CNN algorithm on the imagery data (p = 0.29). (Linear SVM is the second-best perception algorithm that does not belong to the Spade family; the non-normalized Spade is not significantly different from the normalized one, p = 0.082.)

How does the number of input units affect decoding accuracy?
For a real-life speech BMI, we would like to estimate how many left Vim neurons are required in order to achieve a certain level of decoding accuracy.

Single-session decoding
We computed, for each session by itself, the average decoding accuracy as a function of the number of neurons (figure 2). The figure demonstrates that speech production could be decoded significantly above the chance level in all sessions. For speech perception and imagery, decoding accuracy was significant in six of the sessions each. In order to estimate which aspects of speech would be decodable significantly above chance in an 'average session', i.e. when implanting in a random location within the Vim (assuming our dataset to be representative of the Vim), we averaged the accuracy functions of each session for each aspect of speech. On average, only speech production could be decoded above chance (figure 3).

Pooled-session decoding
In the pooled dataset, accuracy in decoding speech production was higher than that of decoding perception or imagery, for all numbers of units examined (figure 4). Decoding of perception and imagery equated with each other for the lower numbers of units (n ⩽ 40). For higher numbers of neurons, perception was decoded with higher accuracy than imagery. For production, accuracy was significantly above the chance level when the input consisted of 3 units or more, and for perception or imagery, 4 units or more. Accuracy changed as a logarithmic function of the number of units for all aspects of speech, with very high correlation coefficients between the logarithmic regression lines (marked in figure 4) and the actual accuracy functions: production: 0.995 (p = 2.5 × 10−127, for testing the hypothesis of no correlation), perception: 0.997 (p = 1.1 × 10−141), and imagery: 0.996 (p = 1.1 × 10−136). For speech production and perception, the highest accuracy was achieved when utilizing all available units, achieving 100% and 96%, respectively. The figure shows that in order to achieve an accuracy of 90% in decoding speech production or perception, one needs 106 or 124 units, respectively. To achieve the highest accuracy on imagery data (80%), 121 units are required.
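The logarithmic fit and its inversion into a "units needed for a target accuracy" estimate can be sketched as follows (Python, our own naming; the coefficients in the usage example are synthetic, not the fitted values from the study):

```python
import math

def fit_log(ns, accs):
    """Least-squares fit of acc ~ a + b*ln(n): ordinary simple linear
    regression of accuracy on ln(n)."""
    xs = [math.log(n) for n in ns]
    mx, my = sum(xs) / len(xs), sum(accs) / len(accs)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, accs))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def units_for_accuracy(a, b, target):
    """Invert the fit: smallest n with a + b*ln(n) >= target."""
    return math.ceil(math.exp((target - a) / b))
```

For instance, with hypothetical coefficients a = 0.1 and b = 0.17, reaching 90% accuracy requires ceil(exp(0.8/0.17)) = 111 units; the study's 106- and 124-unit estimates for production and perception come from its own fitted curves.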

Contribution of individual sessions to decoding
Examination of the weights of the normalized Spade decoder provides an important insight into the features that the decoder employs and, specifically, into the patient to whom these features belong (figure 5). The figure shows that each patient contributed features that were utilized for decoding of 2 or 3 aspects of speech. The correlation coefficients between pairs of decoder weights of different speech aspects also reveal contributions of individual sessions to decoding of multiple speech aspects (figure 6). The high correlations between the weights of the decoders for perception and imagery (especially in sessions 2–4 and 9) may explain why the decoding accuracy functions of perception and imagery equate with each other for low numbers of input units (see figure 4).

Order of the vowel phonemes in the neuronal representation
To understand whether there is any preference for certain vowels over others in the neural code of the left Vim, we investigated misclassification errors based on the confusion matrices of the dynamic normalized Spade algorithm (figure 7). Whereas a, i, and o were perfectly decoded in all aspects of speech, e and u were confused with each other in perception and imagery, as can be seen in the confusion matrices. This may indicate that the two vowels share a similar neuronal representation in the Vim, leading to their confusion.

Discussion
This research decoded the electrical activity of single neurons in the human thalamus, and specifically in the left Vim, to predict which vowel sound patients produced, perceived, or imagined. The Spade algorithm, when run on normalized dynamic features, achieved very high accuracies: 100% for speech production, 96% for perception, and 80% for imagery. It outperformed all other machine learning algorithms it was compared with in decoding production and perception of speech, and equated with a deep learning algorithm on the imagery data. These promising results may, in the future, be developed into speech BMIs that will allow locked-in patients, or patients with anarthria or dysarthria, to voluntarily communicate again by artificial speech. Hence, thalamic speech information may be combined with other speech neuroprostheses that are based on, for example, spiking activity in the ventral premotor cortex [63], ECoG in the sensorimotor cortex [64], or single-unit activity in the frontal lobe and hippocampus [65]. A network-level speech BMI, i.e. a neuroprosthetic device that covers multiple parts of the speech network, is expected to be more robust. Our study indicates that the thalamus, being a hub of speech information in this thalamocortical network, may provide information that is decodable with high accuracy.

Figure 5 caption: vertical red lines, emphasized by red arrows, separate the feature weights of different patients; patient IDs appear above the relevant weights. Some patients contributed to decoding all aspects of speech (e.g. patients 4, 5, and 9), whereas others contributed to only two aspects (e.g. patient 1, to perception and imagery, and patient 6, to production and imagery).
The finding that the 'average session' is expected to allow decoding of speech production information above the chance level reflects the fact that speech production was decodable above chance in all individual sessions, whereas perception and imagery were decodable in only about half of the sessions.
For an informed decision on the number of electrodes to implant in a real speech BMI, the neurosurgeon will need an estimate of the achievable decoding accuracy as a function of the number of recorded neurons. We therefore estimated the average accuracy achieved by the normalized Spade decoder for each number of input units. When decoding speech production, accuracies were higher than for perception or imagery, independently of the number of units. This may be due to the clinical targeting of motor parts of the left Vim. Nonetheless, the high accuracy rates when decoding perception and imagery (96% and 80%, respectively) indicate the involvement of the left Vim in non-motor tasks as well. Decoding of perception and imagery equated with each other for low numbers of units. This may imply that in a subpopulation of left Vim units, the same unit is involved in the two aspects. This inference is further strengthened by the high correlations between the weights of the perception and imagery decoders.
We found that to achieve 90% decoding accuracy, one needs about 107–125 left Vim units. From the graphs, we can estimate the number of units required for perfect decoding of speech perception to be around 140–150 units, and around 200 for imagery. These numbers are much higher than those routinely implanted during neurosurgeries in any single movement disorders patient. Therefore, for real-life speech BMIs, the number of depth electrodes in the left Vim will have to increase to achieve reasonable BMI operation, assuming the same type of electrodes is employed. This is further supported by the accuracies obtainable by single-session decoding, which are not sufficient for real-life speech BMIs (figure 2). New recording technologies may, in the future, allow for simultaneous recording of the electrical activity of thousands of single units by a single electrode (e.g. Neuropixels 2.0 [66]). Such technology may keep the number of implanted electrodes per patient similar to today's (1 or 2 electrodes) while recording from orders of magnitude more neurons, which should provide the high accuracy indicated by our research.
The number of left Vim units required for the Spade algorithm to reach 90% decoding accuracy for the five monophthongal vowels (107–125 units) is about half the number of subthalamic units that yield similar performance (210–230 units) [43]. When utilizing 123 rAC/MOF units along with 32 STG units as inputs to the algorithm, accuracy remained as low as 80% [37]. Although this may be explained by the fact that only 13 rAC/MOF and 24 STG units were clearly speech-related, these areas seem to have lower potential to serve as inputs for speech BMI systems. Even when units of the last two areas were combined with other units of the temporal and frontal lobes (dorsal and subcallosal anterior cingulate gyrus, entorhinal cortex, hippocampus, amygdala, and parahippocampal gyrus), to form a total of 606 input units, decoding could only achieve 93% accuracy. This renders the left Vim a more prominent site for the implantation of speech BMIs than all the aforementioned regions.
Notwithstanding, cortical areas other than rAC/MOF and STG may also be candidates for implantation.
Decoding the activity of a population of left Vim neurons results in high accuracy for all three aspects of speech: production, perception, and imagery. Notwithstanding, each specific unit by itself has its own 'preferred' aspects, which can be decoded above the chance level. This has been demonstrated by examining the weights of Spade decoders for different aspects of speech, which showed to which aspect of speech each individual neuron contributed and what the correlations between different aspects were (figure 5). We also demonstrated how small groups of highly specific units can allow decoding at relatively high accuracies; for example, 12 units yielded 70% accuracy in decoding speech production (figure 2, Patient #3; chance level is 20%), and 21 units resulted in 60% accuracy in decoding speech imagery (figure 2, Patient #8).
Decoding a pooled dataset may have an advantage over decoding a single session with the same number of units, because a single session may suffer from high noise correlations. Hence, a pooled dataset recorded at a variety of recording sites may yield better decoder performance. As such, it may serve as an initial approximation of the required number of neurons.
The activity of Vim units has been pooled from 8 patients of two different populations, Parkinson's disease and essential tremor, due to the scarcity of clinical opportunities to record from the human Vim. However, when aiming for speech BMIs, the target population for implantation, i.e. completely paralyzed individuals, is expected to suffer from neither of these diseases. Therefore, one may expect higher decoding accuracy in the target patient population. As our previous paper found differences in firing rates and percentages of speech-related units between the two patient populations recorded from [53], differences in decoding may also exist and are a subject for future research.
The left Vim employs two different encoding schemes: sharp tuning and broad tuning [53], similar to the STN [41]. These encodings therefore suggest that, as in the STN, discrete decoding, meaning classification of vowels, is more suitable for this area. Continuous decoding, on the other hand, may be more appropriate in brain areas encoding continuous speech parameters. An example of such an area is the motor cortex, whose neuronal activity served as an input for a decoder of continuous auditory parameters for a real-time speech synthesizer [21].
When examining the encodings of specific vowel sounds, our findings suggest that the thalamic representations of the vowels e and u are similar, which resulted in more confusions between them by the decoder. Stated otherwise, the vowels a, i, and o were better decoded than e and u. However, as the overall accuracy was very high, more data are necessary in order to decide whether this is a result of a different representation of the vowels in the left Vim, or merely a limitation of the available number of units.
A practical issue that future speech BMI systems in the left Vim will have to face is real-time processing of the data. An important stage in the BMI pipeline is spike sorting, performed offline here. Although some existing spike-sorting methods can be applied in real-time BMIs [67, 68], another option is to employ unsorted spike trains. Some studies even suggest that the latter may be superior to the former in real-time prediction of hand and finger movements in non-human primates [69-71].

Conclusion
Our study renders the left Vim of the thalamus a valuable source of input for single-unit speech BMI systems and exhibits very high accuracy in decoding its neuronal activity during speech production, perception, and imagery using the normalized Spade algorithm with dynamic feature selection. This bears enormous potential for speech BMI systems that will allow completely paralyzed persons to express themselves voluntarily using artificial speech.
Moreover, our research provides the neurosurgeon with estimates of the number of thalamic neurons required for high-accuracy decoding. Similar approaches were previously reported in the STN [43] and precentral gyrus [72]. Our study indicates that implantation in the left Vim requires far fewer input neurons than in the STN, rAC/MOF, STG, or the other frontal and temporal lobe areas mentioned above. By suggesting a novel target for speech BMIs and quantifying its advantages, this study is an important step towards clinical speech BMI systems.

Figure 1 .
Figure 1. Comparison of algorithm accuracy in decoding speech production (a), perception (b), and imagery (c) of the five vowel sounds. The normalized Spade algorithm performed best in all conditions and achieved 100%, 96%, and 80% accuracy, respectively. For each algorithm, a black horizontal line denotes the empirical chance level (i.e. the mean of the random shuffling distribution). The theoretical chance level is 20%.

Figure 2 .
Figure 2. Decoding accuracy as a function of the number of neurons, for neurons recorded in the same session. Asterisks denote average decoding accuracy significantly above chance level. The standard error of the accuracy function (over all runs of the algorithm, for each number of units) is displayed as a bright background around the graphs. Each run employed a different randomly selected set of units and was also cross-validated. The red horizontal dashed line denotes the chance level.
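The procedure behind such an accuracy-versus-population-size curve can be sketched as repeated random subsampling of units with cross-validation; in this illustration a simple nearest-centroid classifier stands in for Spade, and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_centroid_cv(X, y, folds=5):
    """Cross-validated accuracy of a nearest-centroid classifier."""
    idx = rng.permutation(len(y))
    correct = 0
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        labels = np.unique(y[train])
        centroids = np.array([X[train][y[train] == c].mean(axis=0)
                              for c in labels])
        for i in test:
            d = np.linalg.norm(centroids - X[i], axis=1)
            correct += labels[np.argmin(d)] == y[i]
    return correct / len(y)

def accuracy_vs_n_units(X, y, n_units, n_runs=20):
    """Mean and standard error of CV accuracy over runs, each run
    drawing a different random subset of n_units columns (units)."""
    accs = [nearest_centroid_cv(
                X[:, rng.choice(X.shape[1], n_units, replace=False)], y)
            for _ in range(n_runs)]
    return float(np.mean(accs)), float(np.std(accs) / np.sqrt(n_runs))
```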

Figure 3 .
Figure 3. Average accuracy (over sessions) as a function of the number of input units. Decoding is based on units recorded in the same session only. Only speech production was decoded significantly above chance (marked by red asterisks); the sole exception was perception, whose decoding was slightly above chance with nine input units. The figure is organized similarly to the panels of figure 2.

Figure 4 .
Figure 4. Average accuracy of decoding by dynamic normalized Spade as a function of the number of input left Vim units. Independently of the number of input units, speech production was decoded best, whereas perception surpassed imagery for high numbers of neurons (n > 40) and equated with it for lower numbers. For all aspects of speech, accuracy changed logarithmically with the number of neurons (regression lines are marked in the respective colours).
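The logarithmic dependence reported here corresponds to a least-squares fit of accuracy ≈ a·ln(n) + b. A minimal sketch of such a fit (the function name is hypothetical):

```python
import numpy as np

def fit_log_accuracy(n_units, accuracy):
    """Least-squares fit of accuracy ~ a * ln(n) + b.

    Returns the slope a, the intercept b, and the R^2 of the fit.
    """
    x = np.log(np.asarray(n_units, dtype=float))
    y = np.asarray(accuracy, dtype=float)
    a, b = np.polyfit(x, y, 1)           # degree-1 fit in log-n space
    residuals = y - (a * x + b)
    ss_res = residuals @ residuals
    ss_tot = (y - y.mean()) @ (y - y.mean())
    return a, b, 1.0 - ss_res / ss_tot
```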

Figure 5 .
Figure 5. Absolute weights of the normalized Spade decoders. (a) Speech production. (b) Perception. (c) Imagery. The decoders trained in the first cross-validation round are presented (arbitrary selection). The vertical red lines, emphasized by red arrows, delimit the weights of features belonging to the same patient. Patient IDs appear above the relevant weights. Some patients contributed to decoding all aspects of speech (e.g. patients 4, 5, and 9), whereas others contributed to only two aspects (e.g. patient 1 to perception and imagery, and patient 6 to production and imagery).

Figure 6 .
Figure 6. Correlation coefficients ρ (absolute values) between pairs of decoder weights, each trained on a different aspect of speech, for each session and vowel phoneme: (a) production and perception; (b) production and imagery; (c) perception and imagery. Very high correlations between the weights of the perception and imagery decoders are evident. For certain vowel phonemes, nontrivial correlations also exist between the production and perception decoders, as well as between the production and imagery decoders. The decoders are the same as in figure 5.

Figure 7 .
Figure 7. Confusion matrices of the dynamic normalized Spade algorithm decoding speech production (top), perception (middle), and imagery (bottom). The vowels have been ordered so that the confusion matrices are diagonal or band-diagonal. Percentages were computed relative to the true number of trials in each condition.
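The percentage computation used in these matrices is a row normalization by the true trial counts, so that each row (true vowel) sums to 100. A minimal sketch (the function name is hypothetical):

```python
import numpy as np

def row_normalize_confusion(counts):
    """Convert a confusion matrix of raw counts into percentages by
    dividing each row (true class) by its total trial count."""
    counts = np.asarray(counts, dtype=float)
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)
```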