Continuous synthesis of artificial speech sounds from human cortical surface recordings during silent speech production

Objective. Brain–computer interfaces can restore various forms of communication in paralyzed patients who have lost their ability to articulate intelligible speech. This study aimed to demonstrate the feasibility of closed-loop synthesis of artificial speech sounds from human cortical surface recordings during silent speech production. Approach. Ten participants with intractable epilepsy were temporarily implanted with intracranial electrode arrays over cortical surfaces. A decoding model that predicted audible outputs directly from patient-specific neural feature inputs was trained during overt word reading and immediately tested with overt, mimed and imagined word reading. Predicted outputs were later assessed objectively against corresponding voice recordings and subjectively through human perceptual judgments. Main results. Artificial speech sounds were successfully synthesized during overt and mimed utterances by two participants with some coverage of the precentral gyrus. About a third of these sounds were correctly identified by naïve listeners in two-alternative forced-choice tasks. A similar outcome could not be achieved during imagined utterances by any of the participants. However, neural feature contribution analyses suggested the presence of exploitable activation patterns during imagined speech in the postcentral gyrus and the superior temporal gyrus. In future work, a more comprehensive coverage of cortical surfaces, including posterior parts of the middle frontal gyrus and the inferior frontal gyrus, could improve synthesis performance during imagined speech. Significance. As the field of speech neuroprostheses is rapidly moving toward clinical trials, this study addressed important considerations about task instructions and brain coverage when conducting research on silent speech with non-target participants.

neuronal populations [6]. It is generally accepted that human intracranial recordings have long-term stability [7] and provide adequate spatiotemporal resolution [8] to achieve high-performance decoding of speech processes and reliable communication at a reasonable speed [9].
Since the implantation of a wireless intracortical electrode in a patient with locked-in syndrome who achieved real-time vowel synthesis more than a decade ago [10], the field of speech neuroprostheses has continued its rapid transition toward clinical trials with more target patients currently benefiting from chronic brain implants [3][4][5]. However, enabling research with non-target participants remains essential to accelerate the development of such devices [11,12]. To date, the use of human intracranial recordings is mostly limited to clinical monitoring prior to brain resection for the treatment of intractable epilepsy [12,13]. In this context, patients are temporarily implanted with electrocorticography (ECoG) grids placed over cortical surfaces or stereoelectroencephalography (SEEG) depth electrodes that penetrate cortical tissues. Researchers usually have no control over the choice and placement of these electrode arrays, which depend on clinical requirements and the expertise of the medical team. Consequently, the coverage of brain areas involved in language production is typically sparse and inconsistent between patients.
Cortical activation patterns during overt speech production have been observed in various areas across the frontal, parietal and temporal lobes [14]. A review on the neural organization of word production [15] proposed that: anterior parts of the middle temporal gyrus (MTG) are involved in the early stages from conceptual preparation to lexical retrieval and selection; the posterior MTG is activated during phonological code retrieval; the inferior frontal gyrus (IFG) and the middle frontal gyrus (MFG) connect with the precentral gyrus (PreCG) for phonetic encoding and articulation; and the superior temporal gyrus (STG), known to process perceived speech [16], also governs self-monitoring across all stages [12]. The wide range of brain structures introduced in this simplified representation of language production reflects the complexity of decoding speech production from intracranial recordings.
Beyond the understanding of basic neural mechanisms, there are practical considerations for the design of speech neuroprostheses [17,18]. One of them consists in the choice of feedback modality for closed-loop operation [19]. The state of the art for visual feedback comes from the pioneering clinical trial in the field, in which the recruited participant was implanted with a chronic high-density ECoG array centered around the PreCG [3,4]. Initially, a trial-based classifier was trained over multiple months [3]. Attempted utterances were decoded from a set of 50 possible words and displayed on a screen to form sentences. In a follow-up study, the same participant went on to control a speller by attempting word codes representing individual English letters [4]. This technique virtually removed any limits on vocabulary size.
The best example of auditory feedback was achieved with a patient with intractable epilepsy temporarily implanted with SEEG depth electrodes [20]. After calibrating a decoding model using the patient's own voice within a few minutes, audio outputs were continuously synthesized in real time from SEEG signals during imagined speech production. Despite the relatively low intelligibility of these artificial sounds, the participant could experience a sense of agency over the system, which can be seen as a first step toward prosthetic embodiment [21]. Purely based on acoustic parameters of speech signals, this approach was well suited for research with non-target participants, as it did not rely on physiological models of the vocal tract and did not require electrode placement in specific brain areas. In addition, the decoding model was computationally inexpensive, real-time compatible, and could be trained using a small patient-specific dataset. The absence of a limit on vocabulary size was another strength, though it might have come at the expense of intelligibility. These practicalities led us to adopt this approach and to adapt the implementation to satisfy our own clinical constraints, following guidelines in previous work [22].
In this study, our first aim was to demonstrate the feasibility of closed-loop synthesis of artificial speech sounds from human cortical surface recordings during silent speech production. Our second aim was to provide appropriate physiological interpretation of our results based on individual contributions of neural features across all available recording sites. Our third aim was to compare our original ECoG dataset against an open SEEG dataset for speech production [23] to elaborate on the overall coverage of brain structures. It must be clearly noted that the effect of feedback on decoding performance was beyond the scope of our analyses.

Original ECoG dataset
Ten participants with intractable epilepsy (mean age 33.7 years (range 18-56 years); 3 female, 7 male) were temporarily implanted with ECoG electrode arrays (PMT Corporation, Chanhassen, MN, USA; Ad-Tech Medical, Oak Creek, WI, USA). Each participant had between 2 and 6 ECoG grids or strips for a total of 26-88 individual electrodes. Electrode placement was solely based on requirements of clinical monitoring. All participants, who were native speakers of Korean, joined the study on a voluntary basis and gave written informed consent.
Figure 1. (A) Intracranial recordings were acquired at 2 kHz. (B) Audio recordings were acquired at 16 kHz using a directional microphone. (C) During overt and silent reading tasks, 108 words were individually displayed in a randomized order on a screen. Visual cues alternated between a fixation cross (1 s) and a word (3 s). (D) The closed-loop model was trained on data collected during overt reading and tested during overt, mimed and imagined reading. Artificial speech sounds synthesized from ECoG recordings were presented through loudspeakers.

Intracranial recordings (figure 1(A)) were acquired at 2 kHz using a Neuvo amplifier and ProFusion EEG software (Compumedics Ltd, Melbourne, Australia). Acoustic contamination of intracranial recordings was ruled out following separate guidelines [24]. Real-time access to raw samples was made possible via a software development kit provided by the manufacturer. Audio recordings (figure 1(B)) were acquired at 16 kHz using a directional microphone. Participants were comfortably seated on their hospital bed and were instructed to read a list of 108 Korean words (table S1), individually displayed in a randomized order on a screen (figure 1(C)), under different speaking conditions: overt, mimed, and imagined. Words were selected from the literature on children's pain vocabulary [25] and communication during mechanical ventilation in intensive care units [26]. Visual cues alternated between a fixation cross (1 s) and a word (3 s). Synchronization between intracranial and audiovisual data was achieved using StimTracker (Cedrus, San Pedro, CA, USA). When applicable, artificial speech sounds synthesized from intracranial recordings were delivered through loudspeakers (figure 1(D)).
In all speaking conditions, participants were instructed to articulate or imagine articulating the words as accurately as possible as soon as the word appeared on the screen. In mimed and imagined conditions, they were also instructed to imagine hearing their own voice. Five participants performed the mimed and imagined reading tasks in a closed-loop fashion. They were continuously presented with audio outputs synthesized from the ECoG recordings. Unfortunately, feedback delay was too long and too inconsistent, which prevented us from analyzing the effect of feedback on decoding performance.

Open SEEG dataset
In the open SEEG dataset [23], ten participants with intractable epilepsy were implanted with SEEG depth electrodes (Dixi Medical, Besançon, France) at the Epilepsy Center Kempenhaeghe in the Netherlands. Intracranial recordings were acquired at 1024 Hz or 2048 Hz using SD LTM amplifiers (Micromed, Treviso, Italy). Audio recordings were acquired at 48 kHz using the built-in microphone of a laptop. All participants, who were native speakers of Dutch, were instructed to read aloud a list of 100 words selected from a Dutch corpus. Visual cues alternated between a fixation cross (1 s) and a word (2 s). The similarity in protocol design between our original ECoG dataset and this open SEEG dataset allowed for meaningful comparative analyses.

Anatomical labeling and brain visualization
Pre-implantation T1-weighted magnetic resonance imaging and post-implantation computed tomography were co-registered for the localization of individual electrodes at the respective centers for both ECoG and SEEG datasets. Anatomical labels were found using Talairach Daemon software [27]. In this study, anatomical analyses and interpretations were restricted to the third hierarchical level of the 3D Talairach atlas, which included 48 possible gyrus level labels. For visualization purposes, individual electrodes were plotted on a brain template using BrainNet Viewer [28].

Model for closed-loop audio synthesis from ECoG recordings
Model training and testing took place in hospital settings. For each ECoG participant, a model for closed-loop speech synthesis was trained using audio and intracranial data collected during an initial overt reading task. The model was then immediately tested during subsequent overt, mimed, and imagined word reading tasks. It consisted of two parallel decoders that operated with an update step of 16 ms. Implementation details were described in a previous publication [22] and are summarized here.
The first decoder continuously predicted a new frame of an audio spectrogram from the ECoG signals at each iteration of the loop. Specifically, a set of 125 eight-class linear discriminant analysis (LDA) classifiers were trained to predict discretized amplitude levels in 125 frequency bands along the spectral dimension of the audio spectrogram, as proposed in previous studies [20,29]. These bands were pre-selected between 55 and 7500 Hz using a logarithmic scale to account for human speech perception [30]. The predicted audio spectrogram was then continuously inverted to construct the corresponding waveform sampled at 16 kHz. Missing phase information was estimated using single-pass spectrogram inversion [31], which is computationally faster than the iterative Griffin-Lim algorithm [32].
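As a rough illustration of this first decoder, the sketch below trains one eight-class LDA classifier per spectral band on quantile-discretized amplitudes and maps predicted classes back to representative amplitudes. It assumes precomputed arrays `neural` (lagged neural feature inputs) and `spec` (the time-aligned audio spectrogram); the quantile binning and the level-to-amplitude mapping are illustrative assumptions rather than details taken from the original implementation [22].

```python
# Hedged sketch only: band count and level count follow the text; the
# discretization scheme and variable names are assumptions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

N_BANDS, N_LEVELS = 125, 8

def train_band_classifiers(neural, spec):
    """Train one 8-class LDA per spectral band on discretized amplitudes."""
    classifiers, edges = [], []
    for b in range(N_BANDS):
        # Discretize the band amplitudes into 8 levels using training quantiles.
        qs = np.quantile(spec[:, b], np.linspace(0, 1, N_LEVELS + 1)[1:-1])
        labels = np.digitize(spec[:, b], qs)
        classifiers.append(LinearDiscriminantAnalysis().fit(neural, labels))
        edges.append(qs)
    return classifiers, edges

def predict_frame(classifiers, edges, x):
    """Predict one spectrogram frame (representative amplitudes) from inputs x."""
    frame = np.empty(N_BANDS)
    for b, (clf, qs) in enumerate(zip(classifiers, edges)):
        level = int(clf.predict(x[None, :])[0])
        # Map each predicted level to a representative amplitude for that band.
        centers = np.concatenate(([qs[0]], (qs[:-1] + qs[1:]) / 2, [qs[-1]]))
        frame[b] = centers[level]
    return frame
```

Each predicted frame would then be appended to the running spectrogram and converted to a waveform by single-pass spectrogram inversion, which is not reproduced here.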
The second decoder continuously updated a scalar value from the ECoG signals at each iteration of the loop. This value was obtained using a multivariate temporal response function (mTRF) [33] and interpreted as the probability of speech occurrence. An adjustable threshold was then continuously applied to this value to mute or unmute the artificial sound delivered as auditory feedback through loudspeakers. This decoder removed background noise during expected periods of silence. In practice, the threshold was kept constant for the sake of interpretability.
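A minimal sketch of this gating step is given below, with a ridge regression over lagged neural features standing in for the mTRF; the binary `speech_flag` training target, the 0.5 threshold, and all function names are assumptions for illustration.

```python
# Hedged sketch: ridge regression as a stand-in for the mTRF speech detector.
import numpy as np
from sklearn.linear_model import Ridge

N_LAGS = 10  # 160 ms of temporal context at a 16 ms update step

def add_lags(features, n_lags=N_LAGS):
    """Stack the current frame with the n_lags-1 preceding frames."""
    padded = np.vstack([np.repeat(features[:1], n_lags - 1, axis=0), features])
    return np.hstack([padded[i:i + len(features)] for i in range(n_lags)])

def train_gate(features, speech_flag):
    """Fit a linear mapping from lagged features to a speech/silence flag."""
    return Ridge(alpha=1.0).fit(add_lags(features), speech_flag)

def apply_gate(gate, features, audio_frame, threshold=0.5):
    """Mute the synthesized frame when the predicted speech value is low."""
    prob = float(gate.predict(add_lags(features)[-1:])[0])
    return audio_frame if prob > threshold else np.zeros_like(audio_frame)
```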
To minimize computation times in the loop and avoid overfitting issues, model inputs for both decoders were identical and restricted to five neural features with a temporal context of 160 ms in the past. Consequently, the closed-loop system was never at risk of exceeding the update step of 16 ms and accumulating uncontrolled delays during the real-time synthesis of artificial sounds [22]. This conservative choice is further discussed in section 4.1.

Extraction and selection of neural features
In this study, a neural feature was defined as the time-varying power of the bipolar signal obtained from two adjacent electrodes within a specific frequency band. To estimate the power, a fifth-order Butterworth bandpass filter and the rolling variance using a window of 100 ms were applied to the bipolar ECoG signal. To determine the importance of a neural feature toward synthesis, the Pearson correlation coefficient was computed between the neural feature and the speech envelope throughout the whole session. There are many ways to implement speech envelope extraction from audio recordings [34][35][36][37] (see also figure S1 in supplementary data). Here, a fifth-order Butterworth bandpass filter between 55 and 7500 Hz and the rolling variance using a window of 50 ms were applied to the audio signal. The speech envelope was the only audio feature extracted for neural feature selection purposes.
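A minimal sketch of this extraction step is shown below, assuming SciPy and a zero-phase filter for offline analysis (a causal filter would be needed for the real-time loop); the band edges and window lengths follow the text, everything else is illustrative.

```python
# Hedged sketch of band power extraction: Butterworth bandpass + rolling variance.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_power(signal, fs, low, high, window_s):
    """Fifth-order Butterworth bandpass followed by a rolling variance."""
    sos = butter(5, [low, high], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, signal)  # zero-phase, offline use only
    win = int(window_s * fs)
    padded = np.pad(filtered, (win - 1, 0), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, win)
    return windows.var(axis=-1)

# Neural feature: bipolar ECoG signal in a given band, 100 ms window, e.g.
# feature = band_power(ecog_a - ecog_b, fs=2000, low=70, high=170, window_s=0.1)
# Speech envelope: audio signal, 55-7500 Hz band, 50 ms window, e.g.
# envelope = band_power(audio, fs=16000, low=55, high=7500, window_s=0.05)
```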
During the real-time experiment, neural feature candidates were extracted from all pairs of adjacent electrodes and seven pre-selected neurophysiological bands (Theta: 4-8 Hz, Alpha: 8-12 Hz, Beta: 12-30 Hz, Gamma: 30-50 Hz, HG1: 70-110 Hz, HG2: 130-170 Hz, HG3: 70-170 Hz). For each ECoG participant, the best five candidates, ranked by their correlation with the speech envelope, were used as model inputs to both decoders. In offline analyses of our ECoG dataset and the open SEEG dataset, neural feature candidates were extracted from all pairs of adjacent electrodes and non-overlapping 10 Hz frequency bands from 0 to 250 Hz.
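The selection step could then be sketched as follows, assuming each candidate power trace has already been resampled to the frame rate of the speech envelope; the `candidates` container and function names are hypothetical.

```python
# Hedged sketch of ranking candidates by correlation with the speech envelope.
from scipy.stats import pearsonr

def select_top_features(candidates, envelope, n_keep=5):
    """candidates: dict mapping (electrode pair, band name) to a power trace."""
    scored = []
    for key, power in candidates.items():
        n = min(len(power), len(envelope))
        r, _ = pearsonr(power[:n], envelope[:n])
        scored.append((r, key))
    # Keep the five candidates most correlated with the speech envelope.
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:n_keep]
```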

Objective evaluation of synthesis performance and feature contributions
Objective evaluation of synthesis performance was based on a measure of similarity described below between (1) a corrected version of the artificial sound in the overt, mimed or imagined task during model testing and (2) its corresponding true audio recording containing the participant's voice during model training. This evaluation was performed for each word separately, both for the waveform in the time domain and for each frequency component of its short-time Fourier transform.
The selected measure of similarity is referred to as 'DTW correlation' in our results. It was obtained in the time domain in three steps. First, a fast implementation of dynamic time warping (DTW), which aligns two time series that may vary in duration [38], was applied to the artificial sound predicted from brain signals against the true audio signal. Second, the artificial sound was temporally corrected by the DTW alignment found in the previous step. Third, the Pearson correlation coefficient was computed between the DTW-aligned artificial sound and the true audio signal, as proposed in a previous study [39]. Our choice of similarity metrics was motivated by the fact that DTW was preferred over audio spectrogram correlation [30,40] in a comparable study [20]. Nevertheless, it is important to note that DTW correlation should not be interpreted as a measure of intelligibility. For frequency components, a simple Pearson correlation was applied between true and predicted values.
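A possible implementation of this metric, using the fastdtw package and SciPy, is sketched below; the choice of point distance and the function name are assumptions.

```python
# Hedged sketch of the 'DTW correlation' metric in the time domain.
import numpy as np
from fastdtw import fastdtw
from scipy.stats import pearsonr

def dtw_correlation(predicted, true):
    """Warp the predicted waveform onto the true one with DTW, then correlate."""
    _, path = fastdtw(predicted, true, dist=2)  # p=2 norm as point distance
    idx_pred, idx_true = zip(*path)
    aligned_pred = np.asarray(predicted)[list(idx_pred)]
    aligned_true = np.asarray(true)[list(idx_true)]
    r, _ = pearsonr(aligned_pred, aligned_true)
    return r
```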
Objective evaluation of neural feature contributions toward synthesis performance was based on the Pearson correlation between (1) the time-varying signal power representing the neural feature in the overt, mimed, or imagined task during model testing and (2) a binary time series representing expected utterance and silence periods. This evaluation was performed throughout the entire session. Mean onset time and duration of utterances were obtained from audio recordings in the overt reading task during model training for each participant.
A bootstrapping approach was used to generate random distributions for statistical evaluation of synthesis performance and neural feature contributions.
Significance level was set to 0.05. For synthesis performance, 100 random models were trained to generate a null distribution of random DTW correlation values. Random models were obtained by breaking the temporal alignment once at a random point between audio and intracranial recordings and swapping both partitions, as proposed in a comparable study [20]. Alternatively, ten random non-speech segments per word, outside of expected utterance periods, were used with the proposed model to generate a null distribution of random DTW correlation values. In both cases, after bringing together the DTW correlation values for all 108 words, a nonparametric Mann-Whitney U test was conducted to determine if artificial sounds synthesized under the proposed model were significantly above chance level. For neural feature contributions, the temporal alignment between audio and intracranial data was randomly broken 100 times to generate a null distribution of random correlation values. If the correlation value obtained using the correct alignment was greater than 95% of the random correlation values, the neural feature contributed significantly above chance level toward synthesis.
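For the neural feature contributions, the permutation procedure could be sketched as follows; `feature_power` and `utterance_mask` are assumed to be aligned one-dimensional arrays, and the analogous routine for synthesis performance (retraining 100 random models) is omitted for brevity.

```python
# Hedged sketch of the permutation test for a single neural feature's contribution:
# the alignment between the feature and the binary utterance mask is broken at
# random points, and the observed correlation is compared to the null distribution.
import numpy as np
from scipy.stats import pearsonr

def contribution_significant(feature_power, utterance_mask, n_perm=100, seed=0):
    """Return True if the feature correlates with utterance times above chance."""
    rng = np.random.default_rng(seed)
    observed, _ = pearsonr(feature_power, utterance_mask)
    null = []
    for _ in range(n_perm):
        split = rng.integers(1, len(feature_power) - 1)
        # Break the temporal alignment once and swap the two partitions.
        shuffled = np.concatenate([feature_power[split:], feature_power[:split]])
        r, _ = pearsonr(shuffled, utterance_mask)
        null.append(r)
    # Significant if the observed value exceeds 95% of the null values.
    return observed > np.quantile(null, 0.95)
```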

Subjective evaluation of synthesis performance
In addition to objective evaluation metrics, subjective human perceptual judgments were collected from a cohort of 26 healthy volunteers (mean age 28.3 years (range 20-48 years); 13 female, 13 male), who were all native speakers of Korean. Volunteers gave written informed consent and were compensated for their participation. They listened to artificial sounds previously delivered as auditory feedback to ECoG participants during the real-time experiment. For each artificial sound, they were allowed to listen to the audio as many times as they wanted, then had to pick one option in two- and five-alternative forced-choice tasks. In total, 60 artificial sounds were randomly selected among audible sounds from three ECoG participants (two female). Each artificial sound corresponded to a single word synthesized in either overt or mimed speaking condition.
Combining various designs proposed in previous studies [41][42][43], three separate tasks were implemented in the following order: syllable counting (60 sounds), word recognition (60 sounds), and gender recognition (24 sounds). In the first task, volunteers reported the perceived number of syllables, choosing between 1 and 5. In the second task, they indicated which of two proposed words the artificial sound resembled more. One proposed word was the correct answer, while the other was another randomly selected word with the same number of syllables from the list of 108 words. In the third task, they reported the perceived gender, either female or male. The graphical user interface was implemented as a custom Python application.
Due to the forced-choice design of all three tasks, the probability of correct answers across all listeners followed a binomial distribution. Therefore, for a given artificial sound and a significance level set to 0.05, the corresponding word or gender could be detected significantly above chance level if the number of correct answers was 18 or higher (95th percentile under chance: 17.19). Similarly, the corresponding number of syllables could be detected significantly above chance level if the number of correct answers was 9 or higher (95th percentile under chance: 8.55).
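These cut-offs appear consistent with a normal approximation to the binomial distribution under chance performance, as in the short check below (the use of the normal approximation here is our assumption).

```python
# Quick check of the forced-choice thresholds for 26 listeners under chance.
import numpy as np

def threshold(n_listeners, p_chance, z95=1.645):
    """Approximate 95th percentile of correct answers under chance, and the next integer."""
    cutoff = n_listeners * p_chance + z95 * np.sqrt(n_listeners * p_chance * (1 - p_chance))
    return cutoff, int(np.ceil(cutoff))

print(threshold(26, 1 / 2))  # ~17.19 -> 18 correct answers for word and gender recognition
print(threshold(26, 1 / 5))  # ~8.55  -> 9 correct answers for syllable counting
```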

Comparison of activation patterns between ECoG and SEEG datasets
A recording site was considered 'correlated with speech' if its corresponding feature extracted at the HG3 neurophysiological band (70-170 Hz) contributed significantly above chance level toward synthesis, following the bootstrapping method proposed in section 2.6. This analysis produced a simple ratio of recording sites correlated with speech for both ECoG and SEEG datasets and facilitated the visualization of these recording sites by gyrus type on brain templates.

Artificial speech sounds were synthesized during overt and mimed but not imagined speech production
Based on DTW correlation values averaged across all 108 words in the time domain, artificial sounds were synthesized significantly above chance level (Mann-Whitney U test, p < 0.001) from intracranial recordings during overt speech for three participants (figure 2(A), green; ECoG-04, ECoG-07 and ECoG-10) and during mimed speech for two participants (figure 2(A), orange; ECoG-04 and ECoG-07). Artificial sounds could not be synthesized during imagined speech for any of the participants (figure 2(A), red). For each participant, the locations of 3-9 electrodes, from which five neural features were selected, are visualized on brain templates in figure 2. For successful combinations of speaking condition and participant, the distribution of DTW correlation values across 108 words was significantly different (Mann-Whitney U test, p < 0.001) from null distributions generated either through random models or using non-speech segments (figure 2(C)). Because correlation values for audio signals in the time domain are difficult to interpret, even after applying DTW, these values were also computed for each frequency component of the audio spectrograms. In general, synthesis performance was significantly above chance level (bootstrapping, p < 0.05) for most of the 125 frequency components (figure 2(D)).

Contributions toward synthesis performance came from high-gamma features in various brain areas
Neural features that correlated with the speech envelope during overt reading were mostly found within the high-gamma band in a variety of brain regions across the anterior frontal lobe (e.g., MFG), the posterior frontal lobe (e.g., PreCG), the anterior parietal lobe (e.g., postcentral gyrus (PostCG)), and the temporal lobe (e.g., MTG, STG, transverse temporal gyrus (TTG)). Based on Pearson correlation with expected utterance times, significant contributions toward synthesis (bootstrapping, p < 0.05) were different across participants and speaking conditions. When artificial sounds were successfully synthesized, selected neural features were mainly found between the PreCG, STG, TTG, and MTG during overt speech (figure 3: ECoG-04, ECoG-07 and ECoG-10) and between the PreCG, STG, and TTG during mimed speech (figure 3: ECoG-04 and ECoG-07). When the system failed to generate artificial sounds, selected neural features were found in a variety of brain areas including the MFG and PostCG. Possible reasons that can explain the failure to generate artificial sounds include low prediction accuracies from both LDA and mTRF decoders and ECoG signal instability across sessions.
In general, the absence of synthesized sounds during imagined speech could be interpreted as the result of lower cortical activations across selected neural features. However, this interpretation might be partially contradicted by individual features, such as the ones in the PostCG (figure 3: ECoG-05), which contributed significantly above chance level (bootstrapping, p < 0.05) toward synthesis across all speaking conditions, including in the imagined condition. Another possible interpretation is that contributions of potentially discriminative features involved in imagined speech were canceled out by the effect of other selected features. This issue was partially due to our feature selection process and model architecture, which were kept simple for real-time compatibility [22]. Future work will require more advanced strategies to account for the complexity of neurophysiological mechanisms underlying speech imagery processes [44].

Figure 4 provides an overview of individual results obtained with participant ECoG-10 during the real-time experiment. To achieve continuous synthesis of artificial speech sounds from ECoG recordings, five neural features were automatically selected following audio-ECoG correlation analyses during initial overt reading without prior knowledge of patient-specific electrode placement (figure 4(C), blue). These neural features were used as model inputs to train both LDA and mTRF decoders, which predicted the audio spectrogram and speech probability, respectively. Figure 4(D) illustrates the process of model training with an example word, which was read aloud without feedback during initial overt reading (blue), followed by immediate model testing with the same word synthesized from ECoG signals during subsequent reading in three different speaking conditions: overt (green), mimed (orange), and imagined (red). Figure 4(E) provides an overview of all synthesized sounds, each row representing one of the 108 words. Neural activation patterns were later assessed during expected utterance times (figure 4(C)) to derive correlation values, which were interpreted as neural feature contributions toward synthesis performance after bootstrapping (figure 4(F)).

Cortical activations in the STG probably reflected perception of own voice
For participant ECoG-10, all five selected neural features were found to represent cortical activations in multiple high-gamma bands (figure 4(B)) at localized recording sites between red electrodes in the STG (figure 4(A)). Some recording sites had some overlap with the MTG and TTG. Artificial sounds were synthesized significantly above chance level (bootstrapping, p < 0.05) across all 108 words during overt speech production (figure 4(E), green) but not during silent speech production (figure 4(E), orange and red). In the analysis of neural feature contributions (figure 4(F)), the differences between overt and silent correlation values indicated that synthesis during overt speech was driven by all five features (Wilcoxon signed-rank tests, p < 0.05). Similar patterns were observed for neural features within the STG across multiple participants (figure 3(D), gray).
In general, STG activations were not observed during mimed or imagined speech production, although participants were instructed to imagine hearing their own voice. Therefore, in the absence of STG reactivation through hearing imagery [44], our results suggest that high-gamma cortical activations in the STG most likely reflected actual perception of the participant's own voice.

Figure 5 provides an overview of individual results obtained with participant ECoG-07 during the real-time experiment. For this participant, three of five selected features were found to represent high-gamma (figure 5(B)) cortical activations within the STG (figure 5(A), black) and the remaining two within the PreCG (figure 5(A), brown). Some of the STG recording sites had some overlap with the MTG and TTG. Artificial sounds were synthesized across all 108 words during both overt and mimed speech production (figure 5(E), green and orange), but not during imagined speech production (figure 5(E), red). In the analysis of neural feature contributions (figure 5(F)), the differences between overt and silent correlation values indicated that synthesis during mimed speech was driven by the PreCG, but not by the STG (Wilcoxon signed-rank tests, p < 0.05). Similar patterns were observed for neural features within the PreCG across multiple participants (figure 3(D), brown), including participant ECoG-04 despite a more ventral placement of electrodes within the PreCG (figure 3(A), brown).

Cortical activations in the PreCG probably reflected articulatory movements
In general, PreCG activations were not observed during imagined speech production, although participants were instructed to imagine moving their own speech articulators. Therefore, in the absence of PreCG reactivation through articulation imagery [44], our results suggest that high-gamma cortical activations in the PreCG most likely reflected executed movements of various components of the vocal tract. In addition, as illustrated by the timing of synthesized sounds in figure 5(E), patient-specific characterizations of the timing of cortical activations indicated that PreCG activations took place prior to speech onset, while STG activations occurred after speech onset. This further supported our interpretation of results, which would be consistent with the literature [45,46]. In future work, articulatory measurements or high-quality video recordings focusing on speech articulators could help rule out the effects of other mechanisms in the early stages of language production, such as phonological encoding.

Naïve listeners correctly identified about a third of artificial speech sounds in two-alternative forced-choice tasks
Human perceptual judgments were obtained from a cohort of 26 healthy volunteers, who completed three separate forced-choice tasks: syllable counting, word recognition, and gender recognition. Figure S5 provides details about listening counts and answering times. The correct number of syllables was detected significantly above chance level (binomial, p < 0.05) for 24 of 60 sounds tested (40%), the correct word was picked significantly above chance level (binomial, p < 0.05) for 22 of 60 sounds (37%), and the correct gender was reported significantly above chance level (binomial, p < 0.05) for 20 of 24 sounds (83%) (figure 6(A)). For the selection of artificial sounds in the forced-choice tasks, there was no significant difference in terms of syllable or word recognition between overt and mimed speaking conditions (Mann-Whitney U test, p > 0.05). The high score obtained for gender recognition suggests that artificial sounds may have preserved a basic form of speaker identity. Subjective human perceptual judgments did not correlate with objective DTW correlation values for any of the three tasks (p > 0.05).
Examples of artificial speech sounds synthesized during our real-time experiments are provided in figure 6(B) and movie S1. The first four sounds in figure 6(B) corresponded to participant ECoG-10 and were synthesized exclusively from the STG during overt speech production. As explained in section 3.3, these sounds were most likely generated because of the perception of the participant's own voice. Despite objective and subjective quantifications, the design of our study did not allow us to determine whether STG activations reflected encoding of onset or sustained speech [47,48]. In addition, there was no behavioral evidence as to whether volunteers could discriminate phonetic features of speech by their manner of articulation [49] or prosodic features of speech through pitch variations [46]. The last four sounds in figure 6(B) corresponded to participant ECoG-04 and were synthesized exclusively from the PreCG during overt and mimed speech production. As explained in section 3.4, these sounds were most likely generated because of articulatory movements. Similarly, it is unclear whether PreCG activations reflected the movements of specific components of the vocal tract, and no data were available to support the recognition of phonetic or prosodic patterns by naïve listeners. However, it is interesting to note that our results were obtained from a one-stage brain-to-audio decoding approach that did not involve intermediate articulatory features [50].

Dataset comparison highlighted sparse coverage of speech-related brain structures
The anatomical locations of intracranial recording sites that correlated with speech were compared between our original ECoG dataset and an open SEEG dataset specifically made available for studying overt speech production [23]. The purpose of this analysis was to highlight the sparse coverage of speech-related brain structures in our ECoG dataset, not to recommend either modality for any applications. For both modalities, a total of ten recruited participants in a single center was not representative of variations across centers in different countries [13].
In the SEEG dataset (figure 7(A), left), our proposed bootstrapping approach showed that high-gamma features extracted from 79 of 1103 recording sites (7.16%) were positively correlated with the speech envelope significantly above chance level (bootstrapping, p < 0.05). In the ECoG dataset (figure 7(A), right), high-gamma features extracted from 36 of 604 recording sites (5.96%) satisfied this requirement. The ratio of correlated recording sites was slightly higher in the SEEG dataset, suggesting that intracranial recordings of deeper brain structures, as part of clinical epilepsy monitoring, might offer a greater variety of features to train synthesis models. However, the distributions of DTW correlation values across all recording sites for both datasets did not differ significantly (Mann-Whitney U test, p > 0.05), suggesting that our original ECoG dataset constituted a reasonable dataset to conduct the present study.
A spatial evaluation of correlated recording sites revealed that four brain structures at the gyrus level, namely the insula, the claustrum, the inferior parietal lobule (IPL) and the IFG, were found in the SEEG dataset but not in the ECoG dataset. While the insula and the claustrum are deeper brain structures that cannot be physically accessed using ECoG grids, high-gamma cortical activations in the IPL and IFG, although covered in the ECoG dataset, were not significantly correlated with the speech envelope (bootstrapping, p > 0.05). This observation is surprising as the IFG partially colocalizes with the traditional Broca's area, which plays a central role in speech production [51]. However, taking a closer look at the distribution of electrodes, this result might be explained by the absence of coverage of posterior parts of the IFG and the MFG bordering the PreCG (figure 7(A), right). Other correlated recording sites were found bilaterally in both datasets in known areas involved in speech production, such as the STG, PreCG, MTG, MFG and PostCG.
A spectral evaluation of correlation values with the speech envelope confirmed that neural activations related to speech production were most prominent within the HG3 band (70-170 Hz) in both datasets. In figure 7(B), this result was illustrated by dark red patches between dashed lines representing positive correlations, but also by dark blue patches representing negative correlations. Positive correlations can be interpreted as neural activations that occurred simultaneously with overt utterances [6]. In contrast, negative correlations were likely due to neural activations preceding or following utterances, creating a temporal pattern that inversely reflected the speech envelope. This result was observed more frequently in the SEEG dataset because of the faster timing of visual cues in the protocol. Interestingly, dark blue patches were also observed in lower frequency bands (0-20 Hz) across various brain structures in both datasets, possibly due to event-related desynchronization in alpha and beta bands during articulatory movements [52]. These low-frequency features were not exploited in the present study. Further work will be required to determine whether they could be useful to improve prediction accuracies, although it is unlikely that they reflect fine details of speech production mechanisms [52].

Discussion
Closed-loop synthesis of whispered and imagined speech processes from intracranial recordings has been previously reported in a single participant implanted with SEEG depth electrodes that penetrated cortical tissues in various brain areas [20]. Here, the proposed approach was adjusted and replicated in ten participants implanted with ECoG grids and strips placed over cortical surfaces.

Feature selection and model architecture can be optimized
The implemented model for closed-loop speech synthesis was trained on minimal patient-specific data and was immediately ready for testing in hospital settings. It was well suited for research with non-target study participants, as it did not rely on physiological models of the vocal tract, intracranial recordings in specific brain areas, or laborious trial repetitions for categorical labeling (e.g., phonemes, words, sentences). These practicalities made the approach language-independent and limitless in terms of vocabulary size, but also came at the expense of prediction accuracy. In terms of feature selection, there was a computational trade-off between the number of neural features to extract and the number of speech bands to predict [22]. Restricting the model to five neural feature inputs in favor of a greater spectral resolution of predicted outputs turned out to be a reasonable choice, as only a handful of electrodes were sufficient to produce artificial sounds that were correctly identified by naïve listeners about a third of the time. From a translational perspective, this also suggests that practical systems may be developed with a small channel count [2].

Figure 7. (A) Each electrode is shown on a brain template, either on the left (L) or right (R) hemisphere. Its color indicates its corresponding brain structure at the gyrus level (e.g., blue for Superior Temporal Gyrus). In the SEEG dataset, 79 of 1103 available recording sites (7.16%) were significantly correlated with the speech envelope during overt speech production. In the ECoG dataset, this number was 36 of 604 (5.96%). (B) For each SEEG or ECoG participant, the corresponding heatmap contains audio-iEEG correlation values for all neural feature candidates, extracted from a bipolar recording site (x-axis) and a 10 Hz frequency band (y-axis). Frequency range was from 0 to 250 Hz. The high-gamma band between 70 and 170 Hz lies between the dashed black lines.
Increasing the size of the training dataset, as well as the number of neural feature inputs, would certainly improve the accuracy of discretized values in the predicted audio spectrograms [29], which may enhance the intelligibility of audible outputs. Using different sets of features with specific time delays to predict different speech bands would also improve the perceptual quality of artificial speech sounds. As illustrated in figure S2, a spectrotemporal receptive field analysis [35], which accounts for the causal relationship between audio and intracranial signals, would be well suited to this kind of feature selection process. In addition, dimensionality reduction techniques, such as principal component analysis, could be applied to minimize the amount of redundant information across selected neural features, as illustrated in figure S3. However, principal components are relatively expensive to compute, which may constitute a challenge for real-time operation. Currently, the most important issue in this high-dimensional parameter optimization problem is probably the absence of any suitable intelligibility-based loss functions to train speech synthesis models [53].
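As a simple illustration of the dimensionality reduction idea mentioned above, the sketch below applies PCA offline to a matrix of selected neural features; the 95% explained-variance target is illustrative and not a value used in this study.

```python
# Hedged sketch: offline PCA to reduce redundancy across selected neural features.
from sklearn.decomposition import PCA

def decorrelate_features(feature_matrix, variance_kept=0.95):
    """Project an (n_frames, n_features) matrix onto its leading components."""
    pca = PCA(n_components=variance_kept)  # keep components explaining 95% of variance
    return pca.fit_transform(feature_matrix), pca
```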
In terms of model architecture, LDA classifiers combined with spectrogram inversion constituted the most practical workaround to overcome our clinical constraints and minimize model calibration time at the patient's bedside. However, as outlined in a review [54], several recent studies have proposed deep learning approaches that significantly improved the quality of artificial sounds synthesized from intracranial recordings [41,43,50,55,56]. These deep neural networks have not yet been applied both in real time and during silent speech production, but there is no doubt that they would lead to more intelligible artificial sounds. Therefore, future directions include making them compatible with our closed-loop implementation.

Mimed and imagined utterances might be optimally synthesized from different brain areas
Despite the simplicity of our feature selection process and decoding approach, our results demonstrated the feasibility of closed-loop synthesis of artificial sounds during mimed speech production in two participants. These artificial sounds were correctly identified by naïve listeners about a third of the time, as supported by subjective human perceptual judgments obtained from a cohort of 26 healthy listeners. To explain this more objectively, the confusability of artificial sounds could be evaluated at different levels. For example, as illustrated in figure S4, forced alignment of individual phonemes using the Montreal Forced Aligner [57] could be applied on audio recordings during overt speech to assign phonetic labels to corresponding segments of intracranial data. Further investigations, which go beyond the scope of this study, could determine whether selected neural features were associated with specific phonetic features of speech during synthesis.
From a neuroanatomical perspective, artificial sounds during mimed speech were synthesized from high-gamma activation patterns in several recording sites in the PreCG, which likely reflected executed movements of various components of the vocal tract [45,46]. Interestingly, the location of electrodes within the PreCG was more ventral for participant ECoG-04 and more dorsal for participant ECoG-07. The fact that artificial sounds were successfully synthesized in both cases might relate to recent evidence supporting the coexistence of two distinct speech coordination systems [46,58] and the concentric organization of face representations centered around the tongue in the PreCG, which sharply contrasts with the linear organization from lips to larynx in Penfield's classical homunculus [59].
Artificial sounds could not be synthesized during imagined speech production despite clear task instructions. However, post-hoc analyses uncovered consistent patterns of cortical activations during imagined speech at localized recording sites in the PostCG, MFG and STG. These observations are in line with previous work on perceptual reactivation in sensory cortices following mental imagery of speech [44]. That work suggests that articulation imagery leads to stronger activations in the PostCG and the MFG, whereas hearing imagery does so in the STG. Also, pointing toward the idea that imagined speech might be better decoded from sensory areas, a recent study [60] found that a small set of words could be consistently classified from human intracortical recordings in the supramarginal gyrus, posterior to the PostCG and mainly thought to be involved in language processing.
Another recent study that involved SEEG participants [61] formulated the attractive concept of a nested hierarchy between overt, mimed, and imagined speech, in which relevant channels for decoding the lower behavioral mode (e.g., imagined speech) form subsets of relevant channels for the higher behavioral mode (e.g., mimed speech). Here, we argue that this is a feature engineering problem that depends on the design of the protocol. In the present study, this concept did not hold true due to our feature selection process based on audio-ECoG correlations during overt reading. For example, useful features for decoding silent speech may have been missed due to complex interactions between language production and perception processes during the task, as well as nonlinearities introduced by the corollary discharge circuit that dissociates self-generated sounds from external ones [62].

Model training on overt speech data prevents quick clinical translation of our findings
The idea that a speech synthesis model can be trained on overt speech and tested on silent speech is helpful when there is limited time for data collection. Assuming long-term stability of ECoG recordings, patients in early stages of neurodegenerative diseases who are still capable of articulating intelligible speech would benefit from this approach. However, this very approach also constitutes the biggest barrier to translation for patients who have already lost their ability to speak. A suitable yet time-consuming workaround would be to design an auditory repetition task, in which patients listen to voice recordings for model training and then attempt to articulate them as closely as possible during closed-loop synthesis.
Another limitation of our study resides in the lack of evidence that our model would work on a different day or even just a few hours after parameter optimization. Due to the instability of neural recordings, the need for frequent model recalibration is often identified as a major obstacle to clinical translation [63]. Finally, it is unclear whether our model tested during word reading would produce the same performance during free speech in the absence of visual cues.

Minimal feedback delay will improve sense of agency
In closed-loop speech BCI systems that rely on visual feedback [3,4], it is possible to use language models to apply corrections on inaccurate predictions displayed on a screen. In contrast, when choosing to deliver real-time auditory feedback, it is not possible to modify artificial sounds perceived in the past. Therefore, any adjustments to acoustic predictions must be applied in real time, for example, through parallel phonetic or prosodic predictions. In the latter case, it would be conceivable to apply real-time pitch shifting to create artificial tones [64] or intonation patterns [46] that carry additional meaning or emotion. Alternatively, the continuous estimation of intermediate articulatory features, fed into a physical model of the vocal tract, would likely generate more natural-sounding speech outputs [50].
The best indication that a closed-loop BCI system provides meaningful feedback is the participant's ability to voluntarily modulate their brain signals to achieve a high degree of control, which is also referred to as the sense of agency [21]. In our study, feedback delay was too long and too inconsistent to address its effect on decoding performance. Therefore, future work will not only focus on integrating objective measures of intelligibility in the training process, but also on minimizing feedback delay below 50 ms [65] toward a seamless integration of the system such that it may be treated as part of the body.

Data availability statements
The data cannot be made publicly available upon publication because they contain sensitive personal information. The data that support the findings of this study are available upon reasonable request from the authors.