Text and image generation from intracranial electroencephalography using an embedding space for text and images

Objective. Invasive brain–computer interfaces (BCIs) are promising communication devices for severely paralyzed patients. Recent advances in intracranial electroencephalography (iEEG) coupled with natural language processing have enhanced communication speed and accuracy. Notably, such speech BCIs rely on signals from the motor cortex. However, BCIs based on motor cortical activities may experience signal deterioration in users with motor cortical degenerative diseases such as amyotrophic lateral sclerosis. An alternative to using iEEG of the motor cortex is necessary to support patients with such conditions. Approach. In this study, a multimodal embedding of text and images was used to decode visual semantic information from iEEG signals of the visual cortex to generate text and images. We used the contrastive language-image pretraining (CLIP) embedding to represent images presented to 17 patients implanted with electrodes in the occipital and temporal cortices. A CLIP image vector was inferred from the high-γ power of the iEEG signals recorded while the patients viewed the images. Main results. Text was generated by CLIPCAP from the inferred CLIP vector with better-than-chance accuracy. Then, an image was created from the generated text using StableDiffusion with significantly better-than-chance accuracy. Significance. The text and images generated from iEEG through the CLIP embedding vector can be used for improved communication.


Introduction
Invasive brain–computer interfaces (BCIs) are devices designed to allow severely paralyzed patients, such as those with amyotrophic lateral sclerosis (ALS), to continue to communicate [1]. Patients with severe paralysis have been shown to communicate using chronically implanted BCIs that use signals from intracranial electroencephalography (iEEG) of the motor cortex [2][3][4][5]. The text generated from the iEEG is effective for communication. Recently, a combination of deep learning and natural language processing (NLP) significantly improved the speed and accuracy of text generated using the iEEG of paralyzed patients [6, 7]. The practical speed and accuracy of the communication make invasive BCIs indispensable to patients despite the risks inherent to the invasive nature of the procedures [8].
To date, most invasive BCIs operate by extracting motor information from the sensorimotor cortex [6, 7, 9]. However, decoding primary motor cortical activity may be difficult for ALS patients, especially those in a totally locked-in state, the target condition for invasive BCIs [10]. BCIs for communication that use signals other than those of the motor cortex are necessary for patients with a variety of functional impairments [11]. One candidate is an imagery BCI using iEEG of the visual cortex [12]. Images decoded from visual cortex iEEG can be intentionally controlled to display new images that better represent the intended meaning of the patient [12]. For example, a subject implanted with electrodes in the temporal cortex succeeded in displaying images representing 'landscape' when instructed to produce 'landscape' images. By displaying such images, patients can convey their intentions through the meaning of the images. However, it is not clear how accurately the decoded image conveys the patient's intent.
Recently, multimodal generative models have made it possible to create images and text from a common latent space learned with techniques such as contrastive learning [13]. Contrastive language-image pretraining (CLIP) is a multimodal model trained on 400 million image-text pairs to encode images and text into a common representational space [14]. The visual encoder of CLIP provides a vector representing the visual and semantic information of any given image. CLIP has been used to generate images from text and vice versa [15, 16]. CLIPCAP uses the CLIP vector to generate a caption for the corresponding image with a pretrained language model (GPT-2) [15]. Additionally, a diffusion model using CLIP successfully generated photorealistic images representing the CLIP vector [17]. When the CLIP vector can be accurately specified, it is therefore possible to generate text and images that express the corresponding meaning more precisely.
In addition, CLIP has been used to generate images from brain signals recorded with modalities such as functional magnetic resonance imaging (fMRI) [18]. Moreover, text can be generated from fMRI using CLIP and an encoding model while the participant watches movies [19], although text with arbitrary meaning cannot yet be decoded. Previous studies using fMRI have also shown that visual cortical activity during image perception can be used to generate sentences about the perceived image [20, 21]. Such multimodal estimation of text and images from brain signals could be applied to communication in paralyzed patients, although fMRI is not a realistic tool for daily use. Multimodal decoding based on CLIP from iEEG signals may therefore be a feasible way to create a communication tool that generates text and images based on the patient's intentions.
Here, we hypothesized that a decoder that infers the CLIP vector from iEEG signals could generate text representing the perceived images and generate images with the same meaning based on the generated text. The accurate generation of text and images representing the intended meaning of the patient from their iEEG signals would make imagery BCIs reliable communication tools. To assess the accuracy of generating text and images, we used the electrocorticographic signals (ECoGs) of 17 patients recorded from the occipital and temporal cortices while they watched a 60 min video. The CLIP vector for each video frame was inferred from the ECoGs. Then, text was generated using CLIPCAP based on the inferred CLIP vectors. Finally, images were generated using StableDiffusion based on the generated text. We compared the accuracy of identifying the original images based on the generated text and images to assess how much of the information was preserved through text and image generation from ECoGs.

Subjects and ECoG measurements
We analyzed the ECoGs that we used in our previous study [12], which were recorded from 17 subjects (E01-E17) while they watched the same 60 min video. All participants were recruited from among those implanted with electrodes in the occipital and temporal cortices at three hospitals in accordance with the experimental protocol approved by the ethics committee of each hospital (Osaka University Medical Hospital: Approval No. 14353, UMIN000017900; Juntendo University Hospital: Approval No. 18-164; Nara Medical University Hospital: Approval No. 2098). The implanted subdural electrodes were clinically approved and arranged in grid and strip configurations with an intercontact spacing of 7 or 10 mm and a contact diameter of 3 mm. Prior to the experiment, written consent was obtained from all subjects after the nature and possible consequences of the study were explained.
The videos consisted of 224 short clips, each of which was culled from 75 trailers or behind-the-scenes videos of 70 films or animation videos downloaded from Vimeo. The clips had a median length of 16 s and were consecutively combined to create six 10 min videos. These videos contained a wide range of semantic content: nature, animals, food, people, and text. The videos did not contain duplicate scenes.
The ECoG measurements were obtained from the subjects as they lay in bed or sat in a chair. A computer screen was placed facing the subject to show the videos. During the experiment, an EEG-1200 (Nihon Kohden, Tokyo, Japan) recorded the ECoGs at 10 kHz. As a reference potential, we used the average potential of two electrodes placed subcutaneously or on the cortex outside the visual cortex.
A low-pass filter with a cutoff frequency of 3000 Hz and a high-pass filter with a cutoff frequency of 0.16 Hz were used during the recording. The timing of the visual stimulus presentation was monitored by a DATAPixx3 (VPixx Technologies, Quebec, Canada) so that a digital pulse at the visual timing was recorded synchronously with the ECoGs.

Generation of text and images from ECoGs

Experimental paradigm for generating text and images from ECoGs
We created text and images from the ECoGs (figure 1). First, text was created from the ECoGs recorded while the subjects watched the videos. The created text was compared with the text generated from the presented image using the CLIP encoder and with the text selected from the annotations made by humans. Then, the created text was used to generate images. We compared the images generated from the ECoGs, from the original images, and from the annotations to assess how much of the image information was decoded using the CLIP model.

CLIP encoder
A vision transformer (ViT) [22] with a 16 × 16 input patch size was used as the image encoder for CLIP. The text encoder is a transformer [23] with 12 layers and 8 attention heads. Pretrained OpenAI models available on GitHub (https://github.com/openai/CLIP) were used without additional training.
The 60 min video presented to the patient was split into 3600 nonoverlapping 1 s short videos. From each short video, a still image (scene) representing the 1 s video was extracted from its middle frame and encoded by the CLIP encoder, yielding an image vector for the scene.
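The following minimal sketch illustrates this encoding step, assuming the openai/CLIP package cited above, a ViT-B/16 checkpoint, and a list of pre-extracted middle-frame images; the variable names are illustrative rather than taken from the original analysis code.

```python
# Sketch of scene extraction and CLIP encoding with the openai/CLIP package
# (https://github.com/openai/CLIP); `scene_paths` is an assumed list of image files.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # ViT with 16x16 patches

scene_vectors = []
with torch.no_grad():
    for path in scene_paths:  # one middle frame per 1 s short video (3600 scenes)
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        vec = model.encode_image(image)           # 512-dimensional CLIP image vector
        scene_vectors.append(vec.squeeze(0).cpu())
scene_vectors = torch.stack(scene_vectors)        # shape: (3600, 512)
```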

ECoG decoder
We trained a decoder to infer the image vector of each scene from decoding features calculated from the 500 ms of ECoG before and after the time at which the scene was presented (1000 ms of ECoG in total). To acquire the decoding features, the ECoGs were processed as follows. (1) The ECoGs of each patient were first downsampled to 1 kHz using the decimate function of MATLAB. (2) The decimated signals were then rereferenced by common averaging over visually determined noise-free channels. (3) The rereferenced signals were bandpass filtered in the high-γ (80-150 Hz) frequency band using the pop_eegfiltnew function of EEGLAB (https://sccn.ucsd.edu/eeglab/index.php); the filter order was set to 1000. (4) The filtered signals were then subjected to the Hilbert transform to acquire their instantaneous amplitude. (5) The amplitudes were averaged over the 1 s time window to obtain the decoding features; thus, the decoding features for each scene have the same length as the number of channels. (6) The decoding features from all 17 subjects were finally concatenated in the channel direction for further decoding.
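A minimal Python sketch of this feature extraction pipeline is given below; it substitutes scipy routines for the MATLAB decimate and EEGLAB pop_eegfiltnew functions used in the actual analysis, so the filter implementation is only an approximation under those assumptions.

```python
# Hedged sketch of steps (1)-(5) of the decoding-feature pipeline described above.
import numpy as np
from scipy.signal import decimate, firwin, filtfilt, hilbert

def high_gamma_features(ecog, fs=10_000, good_channels=None):
    """ecog: (channels, samples) array for the 1 s window centered on one scene."""
    # (1) downsample from 10 kHz to 1 kHz
    x = decimate(ecog, q=10, axis=1, zero_phase=True)
    fs_ds = fs // 10
    # (2) common-average re-reference using visually selected noise-free channels
    ref = x.mean(axis=0) if good_channels is None else x[good_channels].mean(axis=0)
    x = x - ref
    # (3) band-pass 80-150 Hz with a linear-phase FIR filter (order ~1000 in EEGLAB)
    b = firwin(1001, [80, 150], pass_zero=False, fs=fs_ds)
    x = filtfilt(b, [1.0], x, axis=1)
    # (4) instantaneous amplitude via the Hilbert transform
    amp = np.abs(hilbert(x, axis=1))
    # (5) average over the 1 s window -> one decoding feature per channel
    return amp.mean(axis=1)
```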
CLIP vectors were then inferred from these decoding features using ridge regression with ten-fold nested cross-validation to avoid overestimation of the decoding accuracy. In the nested cross-validation, the training samples of an outer fold were used to find the best regression hyperparameter (λ) by inner cross-validation, and the model trained with the optimal hyperparameter on all training samples was applied to the test samples of the outer fold. In this way, information leakage during hyperparameter optimization was avoided. For the nested cross-validation, the same division of the samples (scenes) was used as in our previous study; the samples were divided into 10 groups so that (1) scenes from the same video clip were assigned to the same group and (2) the number of samples in each group was almost equal. Thus, the nested cross-validation consisted of 10 outer folds and 9 inner folds. First, the CLIP vectors of the 3600 scenes were z-standardized along each dimension using the mean and standard deviation across all scenes, such that each dimension had a mean of 0 and a standard deviation of 1. To train a regression model, the decoding features of the training samples were z-standardized along each dimension using the mean and standard deviation of the training samples; a regression model was then trained to infer the standardized CLIP vectors from the standardized decoding features. To infer the CLIP vectors from the decoding features of the test samples, the decoding features were z-standardized using the mean and standard deviation calculated from the decoding features of the training samples before the regression model was applied; the inferred (standardized) CLIP vectors were then restored to their original scale and bias by multiplying by and adding the standard deviation and mean calculated from the CLIP vectors of the 3600 scenes. During the nested cross-validation, the λ parameter was optimized over the candidates 10⁻⁸, 10⁻⁷, …, 10⁸ to minimize the average of the mean squared error over dimensions.
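The following sketch outlines this nested cross-validation with scikit-learn's Ridge regression; the arrays X (decoding features), Y (CLIP vectors), and groups (one label per video clip) are assumed inputs, and the grouping is simplified relative to the exact sample division used in our previous study.

```python
# Hedged sketch of the nested cross-validation for the ridge decoder.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold

lambdas = [10.0 ** k for k in range(-8, 9)]             # 1e-8 ... 1e8

# Standardize the CLIP vectors once across all 3600 scenes
y_mean, y_std = Y.mean(0), Y.std(0)
Yz = (Y - y_mean) / y_std

pred = np.zeros_like(Y)
outer = GroupKFold(n_splits=10)
for tr, te in outer.split(X, Yz, groups):               # clips stay in one fold
    x_mean, x_std = X[tr].mean(0), X[tr].std(0)
    Xtr, Xte = (X[tr] - x_mean) / x_std, (X[te] - x_mean) / x_std

    # Inner loop: choose lambda minimizing the mean squared error
    best_lam, best_err = None, np.inf
    inner = GroupKFold(n_splits=9)
    for lam in lambdas:
        errs = []
        for itr, ival in inner.split(Xtr, Yz[tr], groups[tr]):
            m = Ridge(alpha=lam).fit(Xtr[itr], Yz[tr][itr])
            errs.append(((m.predict(Xtr[ival]) - Yz[tr][ival]) ** 2).mean())
        if np.mean(errs) < best_err:
            best_lam, best_err = lam, np.mean(errs)

    model = Ridge(alpha=best_lam).fit(Xtr, Yz[tr])
    pred[te] = model.predict(Xte) * y_std + y_mean       # restore original scale
```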

Caption generator
CLIPCAP was used to generate text from the inferred CLIP vector. CLIPCAP is an image captioning model that combines CLIP and GPT-2. In this model, images are transformed into image embeddings by the visual encoder of a pretrained CLIP model. The CLIP image embedding is mapped to a prefix embedding by a multilayer perceptron-based mapping network, which is trained to produce a prefix embedding carrying semantic information. An autoregressive model then predicts the next token from the prefix embedding. Finally, a caption is generated from the predicted tokens using the GPT-2 tokenizer. CLIPCAP was trained using the training dataset from MSCOCO2017 (supplementary figure 1) [24]. Training was set to run for a maximum of 10 epochs. Minibatch learning was used with a batch size of 40. AdamW [25] was used as the optimizer; a linear scheduler increased the learning rate to 0.00002 over 5000 steps and then decreased it linearly, reaching 0 after 10 epochs of training.
During training, the loss was validated using the MSCOCO2017 validation dataset every 1000 iterations. Early stopping was applied so that the optimal parameters could be used during inference; training was terminated when the loss did not decrease for five consecutive validations, and the parameters of the model with the lowest recorded loss were saved. The saved parameters were used during inference.
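For illustration, the optimizer and learning-rate schedule described above could be set up as follows; the mapping_network and train_loader objects are placeholders, and this is a sketch of the training configuration rather than the exact CLIPCAP training code.

```python
# Illustrative AdamW + linear warmup/decay setup for CLIPCAP fine-tuning.
import torch
from transformers import get_linear_schedule_with_warmup

epochs, batch_size = 10, 40
optimizer = torch.optim.AdamW(mapping_network.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5000,                         # ramp up to 2e-5 over 5000 steps
    num_training_steps=epochs * len(train_loader)  # then decay linearly to 0
)
# In the training loop, optimizer.step() and scheduler.step() are called after
# each minibatch; validation every 1000 iterations drives the early stopping.
```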

Human annotation datasets
Annotations for the scenes were created in Japanese by 24 annotators; five annotators annotated each of the 3600 scenes. Each annotation contained multiple sentences. These annotations were translated into English using the Google Translate API with Googletrans (https://github.com/ssut/pygoogletrans). To narrow the annotations to one per scene, the five annotations were split into sentences, and the sentence with the largest number of words was taken as the final annotation for that scene (supplementary figure 2); the longest sentence was used because it was assumed to carry the most information. The selected sentence was converted into a CLIP vector using the text encoder of CLIP to assess the accuracy of the text generation.
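A minimal sketch of the sentence selection and text encoding steps is given below; the simple period-based sentence splitting and the variable names are illustrative assumptions, not the original annotation-processing code.

```python
# Hedged sketch: pick the longest translated sentence per scene and encode it with CLIP.
import torch
import clip

def select_longest_sentence(annotations):
    """annotations: list of five English annotation strings for one scene."""
    sentences = []
    for ann in annotations:
        sentences += [s.strip() for s in ann.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(s.split()))  # most words -> most information

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)
with torch.no_grad():
    tokens = clip.tokenize([select_longest_sentence(scene_annotations)],
                           truncate=True).to(device)
    text_vector = model.encode_text(tokens)               # CLIP text vector for the scene
```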

Image generation
StableDiffusion [16] was used to generate images from the generated text. In our experiment, the model was implemented using the Python library diffusers.
The image generation method is illustrated in figure 2. The annotations generated from the presented images, the annotations generated from the ECoGs, and the annotations from human annotators were used as prompts for the diffusion model to generate images. The generated images were then converted into CLIP vectors using the CLIP image encoder, and the accuracy of the image generation was assessed.
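The following sketch illustrates prompt-based image generation with the diffusers library and re-encoding of the generated image with CLIP; the StableDiffusion checkpoint name and the example prompt are illustrative assumptions, as the exact version used was not specified here.

```python
# Hedged sketch of prompt-to-image generation followed by CLIP re-encoding.
import torch
import clip
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16  # assumed checkpoint
).to("cuda")

prompt = "a man brushing his teeth in front of a mirror"  # e.g. an ECoG annotation
image = pipe(prompt).images[0]                             # PIL image

# Re-encode the generated image with the CLIP image encoder for evaluation
model, preprocess = clip.load("ViT-B/16", device="cuda")
with torch.no_grad():
    gen_vector = model.encode_image(preprocess(image).unsqueeze(0).to("cuda"))
```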

Evaluation method and statistical testing
The accuracy of the text and image generation was evaluated for four sets of features: (1) CLIP image features predicted from the ECoGs, (2) text generated from the CLIP features of the corresponding images, (3) text generated from the CLIP image features inferred from the ECoGs, and (4) CLIP features of the text annotated by humans. Cosine similarity was used to evaluate the similarity between a predicted vector and the true vector in the CLIP feature space. Scene identification accuracy was defined, for each predicted vector, as the proportion of negative examples whose cosine similarity to the predicted vector was smaller than that of the positive (true) example. The CLIP vector of each of the 3600 scenes was considered the true vector for the corresponding sample.
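A minimal sketch of this scene identification metric, assuming arrays of predicted and true CLIP vectors, is given below.

```python
# Hedged sketch of the scene identification metric: for each scene, the fraction of
# the other (negative) scenes whose true CLIP vector is less similar to the
# predicted vector than the positive (true) scene's vector is.
import numpy as np

def scene_identification_accuracy(pred, true):
    """pred, true: (n_scenes, dim) arrays of predicted and true CLIP vectors."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    true = true / np.linalg.norm(true, axis=1, keepdims=True)
    sim = pred @ true.T                       # cosine similarity matrix
    pos = np.diag(sim)                        # similarity to the correct scene
    # count negatives strictly less similar than the positive example
    acc = (sim < pos[:, None]).sum(axis=1) / (sim.shape[1] - 1)
    return acc                                # one accuracy value per scene
```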
To test the statistical significance of the scene identification accuracy for each set of features, we performed a permutation test. The true vectors of the scenes were randomly shuffled to create shuffled true vectors. The same evaluation procedure was then applied to the shuffled data to obtain the chance-level scene identification accuracy. We compared the scene identification accuracies of the true data and the shuffled data using the Mann–Whitney U test. In addition, the Kruskal–Wallis test with the Steel–Dwass post hoc test was performed to examine whether there were statistically significant differences among the scene identification accuracies of the CLIP image or text features derived from the ECoGs, from the corresponding images, and from the text annotated by humans. The alpha level was always 0.05.
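The chance-level comparison can be sketched as follows, reusing the scene_identification_accuracy function above; the variable names for the three annotation conditions are assumptions, and the Steel–Dwass post hoc test is omitted because it is not available in scipy.

```python
# Hedged sketch of the permutation-based chance comparison and group tests.
import numpy as np
from scipy.stats import mannwhitneyu, kruskal

rng = np.random.default_rng(0)
acc_true = scene_identification_accuracy(pred, true)
acc_shuffled = scene_identification_accuracy(pred, true[rng.permutation(len(true))])

# Compare true vs. shuffled accuracies (per-scene values) with a Mann-Whitney U test
u_stat, p_value = mannwhitneyu(acc_true, acc_shuffled, alternative="two-sided")

# Omnibus comparison across the three annotation conditions (assumed variable names);
# a Steel-Dwass post hoc test would follow but needs a separate implementation.
h_stat, p_kw = kruskal(acc_human, acc_clip, acc_ecog)
```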

Results
The CLIP vectors were estimated by the ECoG decoder from the high-γ amplitude of the ECoGs. The scene identification accuracy for the estimated CLIP vectors was 0.698 ± 0.285, which was significantly greater than that of the shuffled data (U(n1 = n2 = 3600) = 8,924,946, p < 0.05, Mann–Whitney U test; see also supplementary figure 3).
CLIPCAP generated text from the CLIP image vectors obtained from the original images and from those inferred by the ECoG decoder from the ECoGs recorded during the presentation of the visual stimuli. Figure 3 shows representative examples of the text generated from the ECoGs and from the original images, together with the human annotations of the images. Some text generated from the ECoGs successfully captured the semantic content of the presented image, such as 'brushing teeth' or 'man and woman'. However, some text did not represent the presented image well.
The accuracy of each generated text was compared using the text encoder of CLIP. Each text was input to the CLIP text encoder to obtain a CLIP text vector. The cosine similarity between the obtained CLIP text vector and the CLIP vector of the presented image was evaluated, and the scene identification accuracy, that is, the probability that this cosine similarity is greater than that for images other than the presented image, was computed (figure 4). Scene identification accuracies were significantly greater than chance for the human-generated sentences (0.948 ± 0.131, U(n1 = n2 = 3600) = 12,144,730, p < 0.05, Mann–Whitney U test), the sentences generated from the original images (0.968 ± 0.088, U(n1 = n2 = 3600) = 12,396,640, p < 0.05, Mann–Whitney U test) and the text generated from the ECoGs (0.663 ± 0.287, U(n1 = n2 = 3600) = 8,517,910, p < 0.05, Mann–Whitney U test). The accuracy was high for all three types of text; the text generated with CLIP had the highest accuracy, followed by the human annotations and the text generated from the ECoGs (H(2) = 4729, p < 0.05, Kruskal–Wallis test with Steel–Dwass post hoc test).
We also generated images from the predicted text (figure 5(a)). The images generated from the predicted text represent the meaning of that text, and they had semantic content similar to that of the original images when the generated text correctly captured the meaning of the true image. Scene identification accuracies using the CLIP image vectors of the generated images were significantly greater than those of the shuffled data for all conditions: images generated from the human annotations (0.831 ± 0.240, U(n1 = n2 = 3600) = 10,552,917, p < 0.05, Mann–Whitney U test), from the CLIP annotations (text generated from the CLIP vector of the original image) (0.821 ± 0.244, U(n1 = n2 = 3600) = 10,430,677, p < 0.05, Mann–Whitney U test), and from the ECoG annotations (text generated from the CLIP vector inferred from the ECoGs) (0.612 ± 0.299, U(n1 = n2 = 3600) = 7,826,826, p < 0.05, Mann–Whitney U test). The identification accuracy for the ECoG annotation condition was significantly lower than that for the other two conditions (H(2) = 1728, p < 0.05, Kruskal–Wallis test with Steel–Dwass post hoc test) (figure 5). However, the accuracy based on the CLIP image vectors of the images generated from the ECoG-derived text (0.618 ± 0.289) was similar to that based on the CLIP text vectors of the text generated from the ECoGs (0.658 ± 0.279).

Discussion
We demonstrated that CLIP, CLIPCAP and the StableDiffusion model can be used to generate both text and images reflecting the meaning of visual stimuli from ECoGs of the visual cortex. Although the identification accuracy using ECoGs was significantly lower than that using the CLIP image vector of the original image, we successfully decoded visual semantic information from the ECoGs that could be translated into both text and images. Multimodal information could thus be output via the CLIP vectors decoded from ECoGs. The proposed method for generating text and images based on a common embedding space for text and images can be applied to develop a BCI that supports communication for paralyzed patients by decoding their imagery content from iEEG.
Previous studies have demonstrated that BCIs can generate text from ECoGs or intracortical signals [7, 26, 27]. The text was generated from sensorimotor activities related to mouth and hand movements using deep learning models combined with NLP [6, 7]. The accuracy of the generated text has mostly been evaluated by the word error rate, the proportion of incorrectly generated words with respect to the true sentence, and by the word generation speed (in words per minute). Using micro-ECoG over the sensorimotor cortex controlling the mouth, a stroke patient succeeded in generating 78 words per minute with a word error rate of 25%, both of which are sufficient for daily communication [26]. The word error rate cannot be assessed for our method, however, because the decoding accuracy depends on the accuracy of predicting the CLIP vector. Nonetheless, the high identification accuracy of our results demonstrates that the CLIP vector decoded from the ECoGs carries sufficient information to discern semantic differences among the presented images. In addition, text of 48 ± 10 words was generated from each CLIP vector inferred from 1 s of ECoG, which corresponds to 2880 ± 600 words per minute. Although successive ECoGs might carry similar semantic information, information transmission based on a latent space inferred from ECoGs may contribute far more efficiently to the transmission of intentions than the direct generation of text based on motor functions. However, the accuracy of estimating imagined sentences remains limited [26]. A previous study suggested that the use of a closed loop can increase the accuracy of BCIs [12]. In addition, although ridge regression is a common algorithm for decoding fMRI signals during visual stimulation to generate text [20, 21] and images [18, 28], other machine learning algorithms may improve the accuracy of inferring the CLIP vector from ECoGs. Some previous ECoG studies improved the decoding accuracy using variational Bayesian decoding [29] and long short-term memory with or without a recurrent neural network [30][31][32]. Moreover, signal feature extraction methods such as dynamic mode decomposition [33] and convolutional neural networks [34, 35] also improve the decoding accuracy. Combining these methods to improve decoding accuracy may allow severely paralyzed patients to compose sentences based on their imagery even when their motor cortical activities are impaired. Our results suggest that visual decoding is a promising target for supporting communication in paralyzed patients.
Various studies using fMRI combined with diffusion models have generated images from brain signals. For example, the latent vectors of StableDiffusion were estimated from fMRI signals to generate images [18, 28, 36]. Although these methods do not use text to generate images, inferred latent vectors, or CLIP vectors, are used to extract semantic information from the images. Another study proposed generating text from fMRI data and then generating images from that text, similar to our method [35]. When comparing these fMRI-based methods with our ECoG-based method, it should be noted that fMRI can simultaneously record whole-brain activity, including the lower and higher visual areas, from which both low-level visual features and semantic information can be inferred to generate images that are both visually faithful and semantically correct. In contrast, electrocorticography has limited simultaneous coverage of the cortex, especially of the lower visual areas. Therefore, in this study, we did not infer low-level visual features from the ECoGs but inferred only the semantic information by generating text from the ECoGs. Electrocorticography is a promising signal source for the clinical application of BCIs, but it is clinically difficult to cover the whole visual cortex with high-density electrodes. This study demonstrated that ECoGs can be used to generate images precise enough to maintain semantic information. For the clinical application of communication support using ECoG-based image generation, it would be useful to investigate the optimal location of electrode implantation and to compare ECoG-based image generation with fMRI results.
The current study has several limitations. The limited number of patients included in this case study at three institutes limits the generalizability of the results. In particular, electrode locations were not controlled, because electrode placement was determined solely by clinical necessity, which precluded analyses based on anatomical differences. Future studies with larger cohorts will be necessary to determine the optimal electrode location for text generation based on imagery. In addition, although we used a human-annotated movie to record ECoGs during the presentation of diverse semantic stimuli, most previous visual semantic decoding methods have been reported for still-image stimuli. To compare our results with those of previous studies, future work should include ECoGs recorded during still-image presentation.
The proposed method represents the brain activity measured by electrocorticography in the CLIP embedding space and then uses the resulting vectors to generate text and images. In addition, visual and text encoders trained with multimodal contrastive learning have been reported to learn in a more brain-like manner and to extract features that are closer to actual brain activity [36]. This suggests that CLIP is useful as a latent space for representing brain activity with text and images.

Figure 1 .
Figure 1. Overview of our proposed text generation method. ECoGs were recorded while the subject watched movies composed of multiple scenes. Short videos were extracted from the 60 min movie every second. Then, a scene representing each 1 s short video was extracted from its middle frame. The scenes were then converted into CLIP image vectors using a CLIP image encoder. The ECoGs were used to infer the CLIP image vector corresponding to the scene. Then, CLIPCAP was applied to both the CLIP image vectors and the inferred CLIP vectors to generate annotations from these vectors. The generated annotations were converted into CLIP text vectors to evaluate the scene identification accuracy. In addition, a human annotation was created in Japanese by an annotator for each image and then translated into English by Google Translate. The human annotations were also converted into CLIP text vectors and used to evaluate the scene identification accuracy.

Figure 2 .
Figure 2. Overview of our proposed method for generating images from ECoGs. The annotations generated from images, from ECoGs, and by humans were used to generate images through StableDiffusion. The generated images were converted into CLIP image vectors to evaluate the scene identification accuracy.

Figure 3 .
Figure 3. Examples of annotations generated by CLIPCAP. Annotations generated through CLIPCAP from three scenes are shown. Human annotations were created by annotators. CLIP annotations were generated from image vectors transformed from the scenes by the CLIP image encoder. ECoG annotations were generated from image vectors inferred from the high-γ amplitudes of the ECoGs.

Figure 4 .
Figure 4. Comparison of scene identification accuracies for generated text vectors. Scene identification accuracies derived from human, image, and ECoG annotations were evaluated. Each box plot shows the median and interquartile range of the scene identification accuracies for true (red) and shuffled (green) scenes. * denotes p < 0.05 (Mann–Whitney U test; see also supplementary figures 4 and 5). † indicates p < 0.05 (Kruskal–Wallis test with Steel–Dwass post hoc test).

Figure 5 .
Figure 5. Examples of images generated from text. (a) Three examples of images generated from text are shown. Each image was generated by StableDiffusion using three different annotations: CLIP annotation, ECoG annotation and human annotation. (b) The accuracy of CLIP vector identification for the generated images was compared among human annotations, CLIP annotations, and ECoG annotations. Each box plot shows the median and interquartile range of the scene identification accuracies for true (red) and shuffled (green) scenes. The scores for the three annotations were significantly different († p < 0.05, Kruskal–Wallis test with Steel–Dwass post hoc test). Moreover, each exhibited significantly greater accuracy than the corresponding shuffled vectors (* p < 0.05, Mann–Whitney U test).