Data augmentation for invasive brain–computer interfaces based on stereo-electroencephalography (SEEG)

Objective. Deep learning is increasingly used for brain–computer interfaces (BCIs). However, the amount of available data is limited, especially for invasive BCIs. Data augmentation (DA) methods, such as generative models, can help to address this scarcity. However, all existing studies on brain signals were based on convolutional neural networks and ignored temporal dependence. This paper attempted to enhance generative models by capturing the temporal relationship from a time-series perspective. Approach. A conditional generative network, the conditional transformer-based generative adversarial network (cTGAN), built on the transformer model, was proposed. The proposed method was tested using a stereo-electroencephalography (SEEG) dataset recorded from eight epileptic patients performing five different movements. Three other commonly used DA methods were also implemented: noise injection (NI), variational autoencoder (VAE), and conditional Wasserstein generative adversarial network with gradient penalty (cWGANGP). Artificial SEEG data were generated using the proposed method, and several metrics were used to compare data quality, including visual inspection, cosine similarity (CS), Jensen–Shannon distance (JSD), and the effect on the performance of a deep learning-based classifier. Main results. Both the proposed cTGAN and the cWGANGP methods were able to generate realistic data, while NI and VAE produced inferior samples when visualised as raw sequences and in a lower-dimensional space. The cTGAN generated the best samples in terms of CS and JSD and significantly outperformed cWGANGP in enhancing the performance of a deep learning-based classifier (significant improvements of 6% and 3.4%, respectively). Significance. This is the first time that DA methods have been applied to invasive BCIs based on SEEG. In addition, this study demonstrated the advantages of a model that preserves the temporal dependence from a time-series perspective.

However, achieving a model with broad generalisation requires substantial data, especially when dealing with intricate architectures and numerous parameters. For instance, the GPT-3 language model, with 175 billion parameters, was trained on almost a trillion words gathered over a dozen years of web crawling, amounting to petabytes of data [22]. A similar case can be observed in computer vision, where models such as the vision transformer (ViT)-G/14, with two billion parameters, were trained on three billion training images [23]. In contrast, the brain data available for BCIs, particularly invasive ones such as stereo-EEG (SEEG), are scarce. The limited data stems from factors such as suitable patient availability, surgical risks, clinical prerequisites, ethical considerations, and privacy concerns, leading to a paucity of training samples. When training data is scarce, deep learning models struggle to generalise effectively to unseen instances [24]. To mitigate overfitting, diverse methods have been introduced, including weight normalisation, dropout, and batch normalisation [25]. Conversely, data augmentation (DA) addresses this challenge from a data volume perspective. This technique, proven to be effective in image classification [26] and natural language processing [27], generates new artificial data either through transformations of existing data or using generative models that encapsulate essential characteristics of actual data. The augmented dataset, comprising original real data and synthetic data, is then utilised to train deep learning models. This expansion of training data diminishes biases and enhances model robustness and invariance [26].
Recently, DA has been applied to non-invasive BCIs utilising EEG [28–34], and even to an invasive spike-based study involving animals [35]. However, to the best of our knowledge, no study has explored DA in the context of invasive BCIs with human subjects. Additionally, all existing generative adversarial network (GAN) studies on EEG signals have employed CNN models, which were not optimised to capture temporal dependencies. This paper introduced a new perspective by proposing a novel conditional generative network, named the conditional transformer-based GAN (cTGAN), founded on the transformer model. This approach was then compared against three other well-known DA methods: noise injection (NI), variational autoencoder (VAE), and conditional Wasserstein GAN with gradient penalty (cWGANGP), all evaluated using SEEG signals.

Related work
The prevailing DA techniques applied to brain signals can be broadly categorised into two groups: methods that employ feature transformation and those rooted in generative models [36]. The first category generated new data by transforming existing data, either within the temporal or spectral domain. These techniques included NI [37, 38], sliding windows [17, 18], sampling [39–41], Fourier transformation [42, 43], and recombination of segments [44, 45].
Conversely, the generative model, which constituted the focal point of this paper, addressed the challenge by learning an unknown or intractable probability distribution from a typically small number of independent and identically distributed samples. Subsequently, the trained generative model can gauge the likelihood of a given sample and generate new samples [46]. Among the generative models commonly employed for EEG data, two stand out: the VAE and the GAN. This section will introduce the core concepts of these two generative methodologies (VAE and GAN), followed by an exploration of relevant studies, with a particular focus on augmenting EEG signals.

VAE
The VAE is a variational version of the classic autoencoder (AE) network [47]. In the AE paradigm, the network comprises encoder and decoder subnetworks. The encoder learns a low-dimensional representation z in the latent space for the real data, while the decoder reconstructs the input from this latent representation. By compelling the decoder to replicate the real input samples, the AE learns a fixed point in the latent space for each real sample, rendering it deterministic. In contrast, a VAE is trained to generate novel, plausible samples from a random variable drawn from a Gaussian distribution [48, 49]. Consequently, the VAE method can be used to generate unseen data. Its efficacy has been evidenced across diverse domains, including source separation, finance, bio-signal applications [50], and more recently, EEG DA [30].
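To make the VAE mechanism concrete, the following is a minimal PyTorch sketch of a convolutional VAE for multichannel time series. The layer sizes are illustrative assumptions, not the exact architecture of [30]; the key elements are the reparameterisation trick and the reconstruction-plus-KL loss.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, channels=10, seq_len=500, latent_dim=512):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(channels, 16, kernel_size=5, padding=2),
            nn.BatchNorm1d(16), nn.PReLU(), nn.MaxPool1d(2), nn.Flatten())
        enc_out = 16 * (seq_len // 2)
        self.to_mu = nn.Linear(enc_out, latent_dim)       # mean of q(z|x)
        self.to_logvar = nn.Linear(enc_out, latent_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, enc_out), nn.PReLU(),
            nn.Unflatten(1, (16, seq_len // 2)),
            nn.ConvTranspose1d(16, channels, kernel_size=2, stride=2),
            nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction error plus KL divergence to the unit Gaussian prior.
    rec = nn.functional.mse_loss(recon, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld
```

At generation time, only the decoder is kept: new samples are produced by decoding z drawn from the unit Gaussian.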

GAN
Another type of generative method, the GAN, approaches the DA task by training two competing networks: a generator network (G) and a discriminator network (D). In this architecture, the D network is trained to maximise the probability of assigning the correct label to both real and artificial samples (generated by the G network), while simultaneously, G is trained to generate realistic samples that confuse D.
The whole network can then be updated using the Jensen-Shannon divergence (JSD) [51]. This method has been studied in many disciplines, such as text-to-image translation [52] and image generation [53].
To generate data corresponding to a specific class, a GAN extension, called the conditional GAN (cGAN), was proposed, which gives control over the modes of the data being generated [54]. To do this, the inputs of both G and D of the original GAN are conditioned by concatenation with an extra target label.
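As a hedged illustration of this conditioning step (the sizes and names below are illustrative, not a specific published implementation), the label can be embedded and concatenated with the noise vector before entering G:

```python
import torch
import torch.nn as nn

n_classes, latent_dim = 5, 512
label_emb = nn.Embedding(n_classes, latent_dim)   # learnable label embedding

z = torch.randn(8, latent_dim)                    # random noise, one row per sample
y = torch.randint(0, n_classes, (8,))             # target class labels
g_input = torch.cat([z, label_emb(y)], dim=1)     # conditioned generator input
```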
Later, inspired by the superior performance of the CNN in image processing, Radford et al proposed using a CNN to replace the multilayer perceptron in the vanilla GAN. This approach was termed the deep convolutional GAN, or DCGAN [55]. In this method, transposed convolution was used to generate high-dimensional data from a low-dimensional latent space.
However, it was still difficult to train the GAN model, and training often led to model collapse. For example, when the discriminator D is optimally trained, minimising the loss function amounts to minimising the JSD between the real and generated data, which often leads to vanishing gradients as D saturates [51]. To tackle this problem, an alternative GAN, called the Wasserstein GAN (WGAN), was proposed, which included two adjustments [56]. First, the Earth-Mover distance (EMD, also called the Wasserstein distance) was used instead of the original JSD. Under mild assumptions, the Wasserstein distance is continuous and differentiable almost everywhere. Second, the WGAN uses weight clipping to ensure the model parameters stay within pre-defined ranges.
To further ease the training of the WGAN with weight clipping, Gulrajani et al proposed an alternative, called the gradient penalty (GP), to enforce the Lipschitz constraint [57], and this new variant was referred to as the WGAN with GP (WGAN-GP).
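A minimal sketch of the gradient penalty term is given below (the helper name gradient_penalty is illustrative; it is also reused in the loss sketches later in this paper):

```python
import torch

def gradient_penalty(critic, real, fake):
    # Sample random interpolations between real and fake batches.
    eps = torch.rand(real.size(0), 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    # Gradient of the critic score with respect to the interpolated input.
    grads = torch.autograd.grad(outputs=scores, inputs=x_hat,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    # Push the per-sample gradient norm towards 1 (one-Lipschitz constraint).
    return ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()
```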
Besides image synthesis, GANs can also be used for time-series data. In this context, not only does the distribution of the original data have to be captured, but the temporal dependency also needs to be preserved in the generated data. For example, Esteban et al proposed the recurrent neural network (RNN) based GAN (RGAN) to capture the conditional distribution of the current step given the historical data [63]. To generate time series of a specific class, they also proposed a conditional variant, the recurrent conditional GAN (RCGAN). Yoon et al proposed a novel GAN variant designed for time-series data, the TimeGAN [64]. The novelty was two-fold. First, in addition to the unsupervised adversarial loss, a step-wise supervised loss using the original data as supervision was introduced to capture the dependence in the data. Second, the feature dimension in adversarial learning was reduced by learning a surrogate representation of the original data in a low-dimensional latent space, so that training could be more efficient. Recently, Li et al proposed a new GAN variant for time-series generation, the transformer-based time-series GAN (TTS-GAN) [65]. Their model, inspired by the vision transformer (ViT) [66], used the popular transformer model as the backbone. In their implementation, time-series data were treated as an image of shape (C, H, W), in which C was the number of channels, H was the height of the image (equal to 1 for time-series data), and W was the sequence length. Similar to the ViT architecture, the time-series data were first divided into multiple patches with positional encoding along the W axis before being fed into the G and D networks. By comparing against other time-series GAN models on multiple time-series datasets, they showed that the transformer-based TTS-GAN was superior in generating realistic time-series data. However, their model introduced substantial high-frequency noise into the generated data.
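For illustration, a minimal sketch of this ViT-style patching step for multichannel time series follows; the patch length and embedding size are assumptions for demonstration, not TTS-GAN's exact values.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, channels=10, patch_len=25, emb_dim=64):
        super().__init__()
        # A convolution whose kernel and stride equal the patch size performs
        # the split-and-project step along the W (time) axis in one operation.
        self.proj = nn.Conv2d(channels, emb_dim,
                              kernel_size=(1, patch_len), stride=(1, patch_len))

    def forward(self, x):                       # x: (batch, C, 1, W)
        x = self.proj(x)                        # (batch, emb_dim, 1, W // patch_len)
        return x.flatten(2).transpose(1, 2)     # (batch, num_patches, emb_dim)

x = torch.randn(8, 10, 1, 500)                  # e.g. 10 channels, 500 samples
print(PatchEmbedding()(x).shape)                # torch.Size([8, 20, 64])
```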

GAN on EEG signals
Although not as intensively investigated as in image processing, several EEG studies have demonstrated improvements brought by DA using GAN models. For example, using a conditional DCGAN (cDCGAN), Zhang and Liu managed to increase the binary EEG classification accuracy from 83% to 86% [67]. A similar result was also reported by Fahimi et al using a cDCGAN [33]. Luo et al applied a conditional WGAN to two public EEG datasets, the SJTU emotion EEG dataset (SEED) and the DEAP dataset [29]. They obtained a 2.97% enhancement on the SEED dataset, and 9.15% and 20.13% enhancements of the arousal and valence classification on the DEAP dataset. Next, the same group proposed the conditional boundary equilibrium GAN and achieved 6% and 10% enhancements on two popular EEG datasets for emotion recognition (SEED and SEED V) [68]. The WGAN-GP has also been investigated. For example, WGAN-GP was used by Wei et al on the CHB-MIT scalp EEG database for seizure detection [69]. In their work, using data generated from the other 22 patients, they obtained a 3% detection enhancement on the remaining subjects. Aznan et al used WGAN-GP to enhance the performance of a steady-state visual evoked potential BCI [30]. Hartmann et al further proposed to improve the WGAN-GP by gradually relaxing the gradient constraint and demonstrated that the proposed improvement was more stable and superior to the original WGAN-GP [28]. Besides the aforementioned non-invasive studies, DA has also been investigated on invasive neural signals. For example, to enhance epileptic seizure detection using scalp and intracranial electroencephalograms, a random selection strategy was used to tackle the problem of sample imbalance [70]. Another study used a similar random selection strategy to enhance the classification of the seizure onset zone (SOZ) [71]. It was even possible to augment invasive data using non-invasive data, as demonstrated in a recent study [34], where the invasive SEEG signals were augmented using simultaneously recorded scalp EEG signals.

Figure 1. The workflow of this paper. First, an artificial dataset the same size as the original data was generated using four generative methods. Then, various methods were used to evaluate the generated data, including visual inspection, cosine similarity (CS) and Jensen-Shannon distance (JSD). Next, the augmented dataset was used to train a classification network (deep convNet) and the classification accuracy was obtained. In addition, the classification accuracy without augmentation was also calculated for comparison, as shown in the upper right of the plot.
All the aforementioned EEG studies implemented the GAN based on a CNN architecture. Although the CNN architecture has achieved success in various decoding tasks [12, 72, 73], it is not optimised from a time-series perspective. This paper proposed the cTGAN network, which used the transformer encoder as the backbone for both the generator and discriminator networks [74]. Similar to other implementations, conditional generation was achieved by providing the target label to the generator network.
In order to demonstrate its performance, a comparative study was performed between the proposed method and three other commonly used DA methods: NI, VAE, and cWGANGP. The recently proposed transformer-based TTS-GAN was not implemented in this manuscript because of the obvious high-frequency artefacts present in its generated data [75]. The selected methods were then used to augment an SEEG dataset, which was collected while participants performed five forearm or hand motions (five categories). The task in this paper was two-fold: first, generating realistic SEEG data; second, enhancing the five-category classification accuracy. In the first task, various metrics were used to evaluate the quality of the generated data, including visual inspection, CS and JSD. In the second task, the five-movement classification accuracy was obtained by a deep learning-based classifier before and after DA using the above four DA methods, to compare their effects on the performance of the classifier. The main workflow is presented in figure 1. In the following sections, detailed information on the SEEG data and the implementation of these four methods is presented.
The novelty of this manuscript is three-fold:
• To the best of our knowledge, this is the first time that DA has been studied for SEEG data.
• Unlike previous EEG studies that used CNN-based augmentation methods, the transformer model used in this manuscript captured the temporal dependence from a time-series perspective.
• To address the high-frequency artefacts encountered in TTS-GAN, a CNN-based filter was added to further regulate the generated signals.

Methodology
The SEEG data, proposed cTGAN method, and three other commonly used methods (NI, VAE, and cWGANGP) were introduced in this section.

Data description
The SEEG data used in this paper were acquired during a previous study in which participants performed five different hand or forearm movements [76]. There were eight human participants (referred to as 1, 2, …, 8). The participants were patients with intractable epilepsy who were implanted with SEEG electrodes for pre-surgical assessment of the seizure focus. All participants were enrolled with written consent. The clinical profiles of the selected eight participants are shown in table 1. All implantation parameters were determined solely by clinical needs. SEEG signals were acquired using a clinical recording system (EEG-1200C, Nihon Kohden, Irvine, CA) and sampled at 1000 or 2000 Hz. Each electrode shaft was 0.8 mm in diameter with 8-16 contacts (Huake Hengsheng Medical Corp., Beijing, CN). This study was reviewed and approved by the Ethical Committee of the University of Bath (ethical approval reference: EP 20/21 050) and the Ethics Committee of Huashan Hospital (Shanghai, China) (ethical approval reference: KY2019518).

Experimental protocol
The detailed experimental paradigm was explained in the previous study [76]. In brief, the participants reclined on a hospital bed during the whole experiment. Each trial lasted for 10 s (4 s rest, 1 s cue, and 5 s task). To begin the trial, the participants kept still for 4 s (resting stage). Then, a visual cue (a cross) was shown on a screen for 1 s (cue stage). When the cue stage ended, the cross disappeared and a picture of one of five tasks was presented (grasp, scissor gesture, elbow flexion, wrist supination, or thumb flexion). The participant performed the specified task repeatedly for 5 s, using the hand contralateral to the hemisphere with the majority of the implanted SEEG electrodes. The five tasks were presented in random order, 20 times per task. In total, there were 100 trials per participant (16.67 min in total).

Signal pre-processing
First, the SEEG data were down-sampled to 1000 Hz, if necessary, and then band-pass filtered from 0.5 Hz to 400 Hz using a 4th-order Butterworth filter. Then, a notch filter was used to eliminate the 50 Hz line noise. Next, channels with extensive line noise were identified and excluded using the same method as in our previous work [5].

Channel selection
Unlike other paradigms, SEEG electrodes record signals from distributed cortical and sub-cortical areas. While only some of them contain motor-related signals, most record unrelated neural activity (noise). Thus, channel selection was performed to ensure that DA was conducted only on relevant channels. This procedure selected channels that showed high correlations with the motor state, similar to previous invasive BCI studies [77, 78]. In brief, the temporal-spectral representation was calculated, and then, with visual inspection, channels that exhibited strong event-related synchronisation (ERS) or event-related desynchronisation (ERD) were selected. In total, 10, 8, 12, 10, 15, 12, 10, and 10 channels were selected for subjects 1-8, respectively.

Proposed cTGAN
This section introduced the proposed cTGAN and elaborated on its implementation.
The vanilla WGAN using weight clipping can be formulated as in equation (1):

$$\min_{\theta_G}\ \max_{\theta_D,\, D\in\mathcal{D}}\ \mathbb{E}_{x_r\sim X_r}\left[D(x_r)\right] - \mathbb{E}_{x_g\sim X_g}\left[D(x_g)\right] \quad (1)$$

where $\mathcal{D}$ is a set of one-Lipschitz functions, $X_r$ and $X_g$ are the real and generated data, respectively, and $\theta_G$ and $\theta_D$ are the parameters of the generator and discriminator networks. To enforce the Lipschitz constraint, the weights of D are clipped within a fixed range $[-c, c]$. With the alternative GP to enforce the Lipschitz constraint, the WGAN-GP discriminator loss can be reformulated as equation (2):

$$L_D = \mathbb{E}_{x_g\sim X_g}\left[D(x_g)\right] - \mathbb{E}_{x_r\sim X_r}\left[D(x_r)\right] + \lambda\, \mathbb{E}_{\hat{x}\sim \hat{X}}\left[\left(\left\|\nabla_{\hat{x}} D(\hat{x})\right\|_2 - 1\right)^2\right] \quad (2)$$

where $x_r \in X_r$ and $x_g \in X_g$ represent the real and generated samples, and the hyperparameter $\lambda$ balances the original loss and the gradient penalty. The variable $\hat{x}$ represents samples from $\hat{X}$, the interpolation between the real distribution $X_r$ and the artificial distribution $X_g$, as in equation (3):

$$\hat{x} = \epsilon\, x_r + (1-\epsilon)\, x_g, \quad \epsilon \sim U[0,1] \quad (3)$$

In our proposed cTGAN method, three modifications were made to the previous WGAN-GP formulation. First, both the generator and discriminator networks were implemented using the transformer encoder network. Second, conditional generation was achieved by supplying a target label y, along with a random variable z, to the generator. Finally, in addition to the original Wasserstein distance, an extra categorical classification error of the real and generated samples was also calculated and added to the loss function in order to capture distinguishable features among classes. These two types of error were obtained using a dual-headed output design of the discriminator. The first head (denoted as $H_1$) represented the one-Lipschitz function, while the second head (denoted as $H_2$) produced the five-class categorical output.
A simplified schematic plot of the proposed cTGAN was presented in figure 2. In this architecture, a concatenation of random noise and a target class label was fed into the generator network G to produce artificial data belonging to a certain class.Then, both the generated and real data were input into the discriminator network D. The discriminator loss was then calculated to update the G and D network alternately.
The transformer encoder network, shown on the right side of figure 2, was used in the proposed method to capture the temporal dependence within the sequence. The inputs to the transformer were continuous variables, i.e. time-series variables. The transformer used a self-attention mechanism to correlate different positions of a single sequence to compute a representation of the sequence [74]. Specifically, the attention paid by one particular element towards all other elements was calculated using a query and key mapping. Therefore, the temporal dependence among different elements in the sequence can be preserved by learning different attention scores (alignments) among them. The transformer encoders in both D and G shared the same structure, which included three stacked layers and five heads with a dropout rate of 0.5. Sine and cosine functions of different frequencies were used for positional encoding. The rest of the implementation was kept the same as in the original transformer paper [79].
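For reference, a minimal sketch of the sinusoidal positional encoding follows (assuming an even model dimension):

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # Geometrically spaced frequencies, as in the original transformer.
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions: cosine
    return pe                            # added to the patch embeddings
```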
With the above modifications, the new loss functions for the discriminator and generator can be written as equations (4) and (5):

$$L_D = \mathbb{E}_{z\sim\mathcal{N}}\left[H_1(G(z,y))\right] - \mathbb{E}_{x_r\sim X_r}\left[H_1(x_r)\right] + \lambda\, \mathbb{E}_{\hat{x}\sim\hat{X}}\left[\left(\left\|\nabla_{\hat{x}} H_1(\hat{x})\right\|_2 - 1\right)^2\right] + \eta\, \mathcal{L}\left(H_2(x_r),\, y_r\right) \quad (4)$$

$$L_G = -\,\mathbb{E}_{z\sim\mathcal{N}}\left[H_1(G(z,y))\right] + \eta\, \mathcal{L}\left(H_2(G(z,y)),\, y\right) \quad (5)$$

where $H_1$ and $H_2$ denote the two output heads, $x_r \in X_r$ and $\hat{x} \sim \hat{X}$ represent the real and interpolation samples, and $z \sim \mathcal{N}$ is a Gaussian random variable. $\mathcal{L}$ calculates the categorical classification loss (cross-entropy) between the real sample labels and the output of the second head of the discriminator ($H_2$). The hyper-parameters $\lambda$ and $\eta$ balance the weights of the gradient penalty and the classification loss, respectively, with respect to the Wasserstein distance. The detailed implementation of both the generator and the discriminator can be found in figure 3, in which the data shape is enclosed in brackets. In this figure, the original data for the discriminator is of shape [batch size, channel number, sequence length] (for example, [b, 10, 500] for subject 1 with ten channels).
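A hedged sketch of how these dual-headed losses could be computed is given below. D is assumed to return the pair (H1 score, H2 logits), and gradient_penalty is the helper sketched in the WGAN-GP discussion above; all names are illustrative rather than the authors' exact code.

```python
import torch

ce = torch.nn.CrossEntropyLoss()

def discriminator_loss(D, G, x_real, y_real, z, lam=1.0, eta=1.0):
    x_gen = G(z, y_real).detach()                  # conditional generation
    h1_real, h2_real = D(x_real)
    h1_gen, _ = D(x_gen)
    gp = gradient_penalty(lambda x: D(x)[0], x_real, x_gen)  # penalty on H1 only
    return (h1_gen.mean() - h1_real.mean()         # Wasserstein term
            + lam * gp                             # gradient penalty
            + eta * ce(h2_real, y_real))           # categorical loss on real data

def generator_loss(D, G, y, z, eta=1.0):
    x_gen = G(z, y)
    h1_gen, h2_gen = D(x_gen)
    return -h1_gen.mean() + eta * ce(h2_gen, y)    # fool H1, match the class on H2
```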
In the generator, the label number was first embedded into a vector using a linear transformation, and then concatenated with a noise vector of the same shape drawn from a Gaussian distribution. A transformer encoder, described on the right side of figure 2, was then used to generate new data. However, a preliminary experiment found that the generated data was noisier and contained more high-frequency components than the original data. Therefore, an extra 1D filter was added as the last layer. This 1D filter was implemented using a convolutional layer (Conv2d operation), where the kernels shared parameters by assigning one kernel's parameters to the other kernels in every training iteration. To guide the generator to suppress the high-frequency components, the kernel was initialised with a bior6.8 wavelet. In the discriminator (left subplot of figure 3), for input of shape [batches, 1, channels, sequence length] ([b, 1, 10, 500] for subject 1), the processing can be divided into two parts. The first part, depicted in the upper portion of the subplot, was the same as in the ViT architecture [66], and contained three main steps: patching the input along the temporal axis (the last axis), concatenation with an extra CLS token, and positional encoding. The second part, depicted in the lower portion, contained the two output heads: the binary adversarial loss ($H_1$) and the five-class categorical loss ($H_2$).

Figure 3. The architecture of the generator (right) and the discriminator network (left) of the proposed cTGAN method. The output shape of each layer is denoted in the brackets below the output. The example data was taken from participant 1 with ten channels. The 1D filter was implemented with a 2D convolution layer that shared parameters within the kernel. Abbreviations: b: batch size; dim: dimension; permut: permutation; emb: embedding; concat: concatenate; prob: probability.
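A simplified sketch of such a wavelet-initialised filter layer is shown below (a single low-pass kernel built from the bior6.8 reconstruction filter; the full parameter-sharing scheme across kernels is omitted):

```python
import torch
import torch.nn as nn
import pywt

# Low-pass reconstruction taps of the bior6.8 wavelet.
lp = torch.tensor(pywt.Wavelet('bior6.8').rec_lo, dtype=torch.float32)
k = lp.numel()

# 2D convolution acting as a 1D filter along the sequence (last) axis.
filt = nn.Conv2d(1, 1, kernel_size=(1, k), padding='same', bias=False)
with torch.no_grad():
    filt.weight.copy_(lp.view(1, 1, 1, k))   # initialise, then fine-tune in training

x = torch.randn(8, 1, 10, 500)               # (batch, 1, channels, sequence length)
y = filt(x)                                  # smoothed output, same shape
```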

Baseline methods
Alongside the proposed method, three other DA methods were implemented in this paper: NI, VAE, and cWGANGP. This sub-section presented detailed information on these methods.

NI
Adding noise is a common practice to augment training data in computer vision. However, for EEG time-series data with a low signal-to-noise ratio, adding noise directly might destroy the amplitude and phase information. Li et al demonstrated that a technique called amplitude-perturbation DA can be used to boost EEG decoding performance [38]. Therefore, in this work, rather than adding noise to the time series directly, the SEEG data were first transformed into the spectral domain, and a perturbation was introduced to the spectral amplitude while the phase information was preserved. Then, new data was generated by transforming the modified signal back to the time domain. Gaussian noise with a mean of zero and a standard deviation of 1 × 10⁻³, as suggested in other work [38], was used in this paper. A detailed pseudo-code implementation of this method can be found in table 2.

Algorithm (table 2): noise injection using amplitude perturbation
Input: X of shape (N, C, T), where N, C and T represent the sub-trial number, channel number and sequence length, respectively.
Output: X′ (same shape as the input)
for n = 1, 2, …, N do
    for c = 1, 2, …, C do
        Denote the signal from the c-th electrode of X_n as x(t)
        Calculate the STFT of x(t) and add perturbations to the amplitude
        Generate the new data x′(t) by inverse STFT
        Assign x′(t) to X′_{n,c}
    end
end
Return the generated X′
Note: N is the sub-trial number after the sliding-window operation described in section 2.4.
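A hedged Python sketch of this procedure follows (using scipy's STFT; parameter choices other than the noise level are assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

def amplitude_perturbation(X, fs=1000, sigma=1e-3, rng=None):
    # X: (N, C, T) sub-trials; perturb the STFT magnitude, keep the phase.
    rng = rng if rng is not None else np.random.default_rng()
    X_new = np.empty_like(X)
    N, C, T = X.shape
    for n in range(N):
        for c in range(C):
            f, t, Z = stft(X[n, c], fs=fs)
            mag, phase = np.abs(Z), np.angle(Z)
            mag = mag + rng.normal(0.0, sigma, size=mag.shape)  # amplitude only
            _, x_new = istft(mag * np.exp(1j * phase), fs=fs)
            X_new[n, c] = x_new[:T]          # trim the STFT padding
    return X_new
```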

VAE
The VAE implemented in this paper was taken from previous work [30]. Briefly, the encoder consists of a 1D convolutional layer, batch normalisation, and max-pooling layers, while the decoder consists of four 1D transposed convolution layers. The parametric rectified linear unit was used as the activation function, except for the last layer of the decoder, where a sigmoid function was used.
An Adam optimizer with a learning rate of 0.0001, beta 1 of 0.1, and beta 2 of 0.999 was used to update the model parameters [30].
Since there was no discriminator in the VAE model, the conditional version of the VAE was not implemented in this work. Instead, the VAE was trained for each class separately.

cWGANGP
The cWGANGP method consisted of a generator and a discriminator. The generator was similar to that of [33], except that deconvolution layers (ConvTranspose1d in Pytorch) were used in the generator model.
The discriminator was a deep CNN (deep convNet) taken from another work [17]. Conditional generation was achieved by providing the embedded label information to both the generator and the discriminator [33].

DA model training process
In this work, a participant-specific augmentation model was trained for each participant separately (within-participant augmentation). This training procedure was adopted for two main reasons. First, the channel number differed between participants, which means that a model trained on one participant was not suitable for others. Second, the electrode locations varied among participants, and therefore the recorded signals reflected different neural activities.
In addition, since there were only 100 trials for each participant, deep learning models (including the generative DA models) would not perform well. Therefore, before conducting the experiment using the proposed and baseline methods, a sliding-window strategy was adopted to augment the SEEG data as a starting point [17]. The window length and sliding step were set to 500 ms and 100 ms, respectively (sampling rate of 1000 Hz) [80]. This windowing process split the original trials (10 s) into multiple shorter sub-trials (500 ms). Thus, there were 2375 sub-trials for each subject, organised in the shape of [N, C, T] ([2375, C, 500]), where C represents the channel number.
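A minimal sketch of this windowing step is given below; note that the exact sub-trial count (2375 per subject in this paper) depends on which portion of each trial is windowed, so the numbers in the example are purely illustrative.

```python
import numpy as np

def sliding_window(trials, win=500, step=100):
    # trials: (n_trials, C, T) -> (n_subtrials, C, win)
    starts = range(0, trials.shape[-1] - win + 1, step)
    return np.concatenate([trials[..., s:s + win] for s in starts], axis=0)

trials = np.random.randn(100, 10, 10000)   # 100 trials, 10 channels, 10 s at 1 kHz
print(sliding_window(trials).shape)        # (9600, 10, 500): 96 windows per trial
```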
After windowing, the entire dataset was used to train the proposed and baseline methods, i.e. there was no training/validation/testing partitioning. The model parameters were initialised randomly for all methods unless explicitly stated. The learning rate was 0.0001 for the VAE and 0.0002 for the other methods. λ and η from equations (4) and (5) were set to 1. The coefficients used for computing running averages of the gradient and its square (beta1, beta2) were set to (0.9, 0.9), while the dropout ratio was 0.5. The same latent space dimension (random noise) of 512 was used for the VAE, cTGAN, and cWGANGP. All models were trained using the Adam optimiser, and the model scales are presented in table 3. Training was implemented using Pytorch [81] on a computer with an Intel(R) Xeon(R) Gold 5118 CPU, 64.0 GB RAM, and one NVIDIA Quadro P5000 GPU card. The average training time of the DA models was also provided (in minutes).

Model selection
EEG signals (invasive or non-invasive) are non-stationary and typically have a low signal-to-noise ratio, which makes it difficult to evaluate signal quality by visual inspection. This is particularly the case when there are many electrodes, as with SEEG, which normally involves one to two hundred contacts. In this paper, the Wasserstein distance (equation (1)) was used to reflect the quality of the generated data. Since the Wasserstein distance indicates the distance from the generated data to the actual data, the best model can be identified as the one corresponding to the minimum Wasserstein distance.

Quality evaluations
Unlike image processing, in which data quality can often be visually evaluated, SEEG signals are intrinsically non-stationary. In this work, four methods were used to evaluate the generated samples: one visual method (in a lower-dimensional space) and three quantitative methods. The first three were visual inspection in two dimensions using the t-distributed stochastic neighbour embedding (t-SNE) technique [82], and two quantitative evaluation metrics, CS and JSD. Finally, the five-movement classification accuracy was obtained before and after DA to indicate the enhancement brought by each DA method to a deep learning-based classifier. A detailed description of these methods is provided below.

Visual inspection with t-SNE
t-SNE is a technique to visualise the distribution of high-dimensional data in a low-dimensional space.
Data points that are similar in the original space will be clustered together in the low-dimensional space. Therefore, t-SNE can be used to evaluate how close the generated data is to the original data.

CS
CS is a distance measurement. In this method, the cosine of the angle between two vectors x and y is obtained from their inner product, as in equation (6):

$$x \cdot y = \|x\|\, \|y\| \cos(\theta) \quad (6)$$

which gives the CS between vectors x and y in equation (7):

$$\mathrm{CS}(x, y) = \cos(\theta) = \frac{x \cdot y}{\|x\|\, \|y\|} \quad (7)$$

The resulting similarity ranges from -1 (opposite direction) to +1 (the same direction), with 0 indicating orthogonality. Therefore, the larger the value of CS, the higher the quality of the generated data.

JSD
The JSD is defined based on the Kullback-Leibler divergence (KLD) and was used as the loss function in the first GAN paper [51]. For probability distributions P and Q, both defined on the probability space $\mathcal{X}$, the KLD from Q to P can be defined as:

$$D_{\mathrm{KL}}(P\,\|\,Q) = \sum_{x\in\mathcal{X}} P(x)\, \log\frac{P(x)}{Q(x)} \quad (8)$$

However, since $D_{\mathrm{KL}}(Q\,\|\,P) \neq D_{\mathrm{KL}}(P\,\|\,Q)$, the KLD is not symmetric and is therefore not an ideal distance measurement. On the other hand, the JSD is a symmetric and smoothed version of the KLD, which can be defined as:

$$\mathrm{JSD}(P\,\|\,Q) = \tfrac{1}{2}\, D_{\mathrm{KL}}(P\,\|\,M) + \tfrac{1}{2}\, D_{\mathrm{KL}}(Q\,\|\,M) \quad (9)$$

where $M = \tfrac{1}{2}(P + Q)$. As a distance measurement between the real and generated data, the smaller the value of JSD, the higher the quality of the generated data.
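A hedged sketch of both metrics for a single real/generated pair is shown below; the histogram-based density estimate is an assumption, as the paper does not specify how the distributions were estimated (note that scipy's jensenshannon returns the distance, i.e. the square root of the divergence):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def cosine_similarity(x, y):
    x, y = x.ravel(), y.ravel()
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def js_distance(x, y, bins=100):
    # Estimate both distributions on a shared support with histograms.
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(y, bins=bins, range=(lo, hi), density=True)
    return jensenshannon(p, q)
```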

Decoding accuracy of a deep learning-based classifier
The final task was to classify the SEEG signals into five movements using a general deep convNet classifier [17]. The classification accuracy was obtained as the ratio between the number of correctly identified trials and the total number of trials. In this last method, the classification accuracy obtained before and after DA using the four methods was used to indicate their effect on the classifier.
Unlike the previously described training procedure for the DA methods, which used the whole dataset to generate the artificial data (as stated in section 2.4), this experiment used only the training dataset (without the testing dataset). This was to prevent possible data leakage and reduce bias in the reported decoding accuracy. Using five-fold cross-validation, the whole dataset was partitioned into training/validation/testing sets in a 60/20/20 manner. To finalise the partitioning, a windowing procedure, as in section 2.4, was performed on each partition to split the 10 s trials into shorter 500 ms sub-trials.
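A minimal sketch of this per-fold partitioning (trial indices are split before windowing, which avoids leakage between overlapping sub-trials; helper usage is illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

trials = np.arange(100)   # 100 trial indices for one participant
for train_val_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                     random_state=0).split(trials):
    # 25% of the remaining 80% gives the 60/20/20 split overall.
    train_idx, val_idx = train_test_split(train_val_idx, test_size=0.25,
                                          random_state=0)
    # Windowing is then applied to each partition separately.
```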
After partitioning, the training of each fold contained three steps: training the deep convNet without DA, training a particular generative model using the training dataset, and training the deep convNet using the augmented data. In the first step, the classifier was trained on the 60% training set, while being validated and tested on the 20% validation and 20% testing sets, respectively. In the second step, the generative model was trained on the training dataset (60% of the whole dataset), after which the same amount of data as the training set was generated by the trained generative model. In the final step, the augmented training set, consisting of the original training set and the generated data, was used to train the deep convNet, while the original validation set (20%) and testing set (20%) were used for early stopping and for obtaining the final decoding accuracy, respectively.
For example, the original data contained 2375 sub-trials after the windowing operation, with shape [2375, C, 500]. The training set therefore contained 1425 sub-trials (60%), and the trained generative model produced the same number of artificial sub-trials. Finally, the deep convNet was trained on the augmented training set (1425 + 1425 = 2850 sub-trials). In the end, the mean decoding accuracy averaged across folds was reported.

Statistical analysis
Statistical analysis was performed using IBM SPSS version 26 (IBM Corp., New York). A factorial analysis of variance (ANOVA) test was conducted to compare the main effects of the DA methods on the decoding accuracy. There were two main factors in the test: the DA method, which had four levels corresponding to the four different DA methods, and the participant, which had eight levels corresponding to the eight participants. An independent t-test was used to compare the CS and JSD scores obtained by the different DA methods. Significance was assessed at P < .05.

Results
In this section, the artificial data generated with these methods was presented and the data quality and classification accuracy were evaluated.

Model selection
In both cTGAN and cWGANGP, the discriminator loss represents the distance between the real and generated data: a smaller distance means better generated data. Therefore, the model corresponding to the iteration at which the EMD was approximately zero was selected. Example training processes of cWGANGP and the proposed cTGAN for subject 1 are presented in figure 4. Clearly, both methods achieved convergence by the end of training. Therefore, the final models were identified as those corresponding to iterations 40 000 and 25 000 for cWGANGP and the proposed cTGAN, respectively.

Visual inspection
Using the models identified in the previous section, examples of the generated data and the corresponding t-SNE plots are presented in the upper part of figure 5, with a real data example presented in the lower part. This plot demonstrates that the data generated with NI was clearly a noisier version of the original data, and its t-SNE plot exhibited considerable overlap between the real and artificial data. The VAE, as illustrated in the second column, also generated low-quality data, as shown in the corresponding t-SNE plot. On the other hand, both the cTGAN and cWGANGP methods generated appealing samples, and it was difficult to compare them by visual inspection. Specifically, the data distributions of the original and generated data were very similar, as indicated in the t-SNE plots, and an independent t-test also found no significant differences (significance level of .05). Next, additional quantitative methods were used to better compare these methods.

CS & JSD metrics
Two quantitative indexes, the CS and the JSD, were used in this section to compare the data generated using the four methods. These two measurements were calculated between the real and generated samples for each sub-trial, and the average values obtained are presented in table 4. The proposed cTGAN achieved significantly better performance in both CS and JSD metrics (higher CS and lower JSD, p < 0.001). Notice that the relatively high CS and low JSD of the NI method reflect the considerable overlap between the real and artificial data shown in figure 5. On the other hand, the VAE performed the worst in both CS and JSD metrics.

Effect on the classification accuracy
This section evaluated the four DA methods according to their effects on the classification accuracy obtained by the deep convNet model, in which the classification accuracy was obtained before and after DA. This procedure was conducted on each subject separately using the cross-validation method presented in section 2.6.4.
The final decoding accuracy of eight subjects was presented in figure 6, while more details can be found in table 5.
A factorial ANOVA was conducted to compare the main effects of the DA methods on the decoding accuracy. Using the proposed transformer-based GAN model (cTGAN) and the cWGANGP, the decoding accuracy yielded significant improvements of 6% and 3.4% (averaged across the eight participants), respectively, while no significant improvement was found using NI or VAE. In addition, the ANOVA test showed a significant difference between the proposed cTGAN and cWGANGP, with a p-value of 0.011, which means that the proposed method significantly outperformed cWGANGP. In detail, the DA-method factor yielded an effect size of 0.096, indicating that 9.6% of the variance in the decoding accuracy was explained by the DA method (F(1,64) = 6.813, p = 0.011).

Discussion
This paper presented a transformer-based generative model for artificial data generation and for enhancing the performance of a deep learning-based classifier. The proposed method differs from other EEG studies using CNN- or RNN-based generative models. In this work, the transformer embedded in the proposed cTGAN method uses a self-attention mechanism to correlate different positions and calculate the attention paid by one particular element towards all other elements using a query and key mapping. Therefore, the temporal dependence was preserved by learning different attention scores (alignments) within the sequence. The comparative experiment conducted in this paper demonstrated the superiority of the model in generating high-quality data and boosting the performance of a deep learning-based classifier.
Next, this section discussed the model collapse problem encountered during the experiment, as well as an alternative loss function for training the proposed cTGAN model.

Mixed classification performance
The classification accuracy varied among the eight subjects, as shown in figure 6. This mixed decoding accuracy resulted mainly from the different electrode locations inside the brain. All eight subjects were epileptic patients requiring intracranial monitoring, and most electrodes were inserted into deep regions to capture unusual neural activities at different targets. Therefore, the selected task-related channels differed among subjects. For example, a preliminary examination showed that the ERS/ERD recorded from subject 8 was much stronger than in the other subjects, and this subject had more selected electrodes located in motor-related areas (pre-central and post-central regions). For subject 5, the ERS/ERD was much weaker, and none of the selected electrodes was located in the aforementioned pre-central and post-central regions.
One possible way to decrease this variance is to extract common features by preprocessing the data first. For example, the multiscale principal component analysis method can be used for noise removal [72, 83–85]. In addition, it is also possible to identify the existing variance by comparing the underlying patterns of EEG signals, for example by looking at graphical features across different subjects [86, 87].

The extra filter layer
As another transformer-based GAN study showed that the transformer alone led to obvious high-frequency artefacts in the generated data [75], an extra filter layer was added to the transformer model to further regulate the generated data in the spectral domain. Moreover, it has been demonstrated that a CNN can act as a filter in EEG decoding tasks [73, 80]. This design was validated with an ablation study by checking the spectral content of data generated with and without the filter layer. In this ablation study, new data were generated using models with and without the filter layer, following the same training strategy as in the main section. The experiment was conducted on participant 8, and the result is presented in figure 7. This figure shows that our cTGAN generated spectral content very similar to that of the original real signal. However, the TTS-GAN generated data with more high-frequency components, such that the PSD of the TTS-GAN data no longer obeyed the 1/f law.

Model collapse
GAN models are difficult to train and are notorious for model collapse. Most studies train the network for a predefined number of iterations [28-31, 34, 68], while others train the model until convergence [33]. With the first training strategy, model convergence cannot be guaranteed, and hence sub-optimal data may be generated. With the second strategy, even though convergence is achieved, there is still a risk of model collapse. To demonstrate the potential risk in the second strategy, this section reports a training result obtained with the cWGANGP method during the experiment. The generator used in this experiment was similar to the one used in the main section, but with two transposed convolutional layers instead of three. The training was conducted on participant 1. The training process and the generated data are shown in figure 8. It was clear that the training had stabilised and achieved convergence by the end of training. Then, artificial data was generated using the model corresponding to iteration 4000. However, the generated data (shown in the right subplot of figure 8) was clearly meaningless on visual inspection. This example demonstrates that convergence of a generative model is necessary but not sufficient for high-quality data generation.

Loss function
Another critical aspect of the GAN model is the loss function, and the most popular loss function is the Wasserstein distance. In a GAN, the discriminator is trained to differentiate between real and fake samples, and the loss is calculated as the classification error made by the discriminator (the adversarial loss). When the generator outputs sub-optimal samples, D can differentiate the two kinds of samples with high accuracy (low adversarial loss). On the other hand, if G outputs realistic samples, D can only perform at around the chance level (50% accuracy), hence a high adversarial loss. Therefore, the adversarial loss can be used to reflect the data quality and update the GAN model. However, to our knowledge, no GAN study on EEG signals has reported using this intuitive loss function. Therefore, this section evaluated the adversarial loss in the proposed cTGAN method. Similar to the loss used in the main section, an extra categorical classification (five-class) loss was also calculated and added to the adversarial loss function.
To obtain this new loss, the two output heads of the discriminator in figure 3 can be implemented as in figure 9.
In figure 9, the first head, H1, which was used to calculate the Wasserstein loss in the main section, was now revised to calculate the adversarial classification loss (real or fake, binary classification), while the second head, H2, was used to calculate the categorical classification loss (five movement classes), the same as in the main section. The revised loss function for the discriminator can now be implemented as equation (10):

$$L_D = \mathcal{L}_1(\mathrm{adv\_real},\ \mathrm{real\_labels}) + \mathcal{L}_1(\mathrm{adv\_gen},\ \mathrm{gen\_labels}) + \mathcal{L}_2(\mathrm{cat\_real},\ \mathrm{category\_real}) + \mathcal{L}_2(\mathrm{cat\_gen},\ \mathrm{category\_gen}) \quad (10)$$

while the loss function for the generator can be revised as equation (11):

$$L_G = \mathcal{L}_1(\mathrm{adv\_gen},\ \mathrm{gen\_labels}) - \mathcal{L}_2(\mathrm{cat\_gen},\ \mathrm{category\_gen}) \quad (11)$$

in which real_labels and gen_labels were the adversarial labels ([0, 1], where 0 represented generated data and 1 represented real data), category_real and category_gen were the categorical labels ([0, 1, 2, 3, 4], denoting the five categories), and cat_real and cat_gen denoted the network outputs (logits) for the real and generated inputs, respectively. $\mathcal{L}_2$ computed the cross-entropy loss between predictions and targets and was implemented with CrossEntropyLoss in Pytorch, while $\mathcal{L}_1$ computed the binary version of $\mathcal{L}_2$ and was implemented with BCEWithLogitsLoss. During training, the network tried to minimise $L_D$ and maximise $L_G$. For model selection, four terms were monitored during the training process: the binary adversarial classification accuracies of the real and generated data, acc_adv_real and acc_adv_gen, and the five-category classification accuracies of the real and generated data, acc_cat_real and acc_cat_gen. An example training process for participant 1, using the same training procedure as in section 2.4, is presented in figure 10. In this plot, the discriminator achieved a binary adversarial classification accuracy of approximately 60% for both real and generated data (acc_adv_real and acc_adv_gen) at the iterations marked by the dotted circle. According to the rationale stated above, this means the discriminator was unable to distinguish real from generated data. Therefore, a candidate model was identified at around iteration 21 000.
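A hedged sketch of equations (10) and (11) in PyTorch is given below; the reductions and the sign convention for the generator are assumptions consistent with the description above, not the authors' exact code.

```python
import torch

bce = torch.nn.BCEWithLogitsLoss()   # L1: binary adversarial loss
ce = torch.nn.CrossEntropyLoss()     # L2: five-class categorical loss

def d_loss_adv(adv_real, adv_gen, cat_real, cat_gen, y_real, y_gen):
    real_labels = torch.ones_like(adv_real)    # 1 = real
    gen_labels = torch.zeros_like(adv_gen)     # 0 = generated
    return (bce(adv_real, real_labels) + bce(adv_gen, gen_labels)
            + ce(cat_real, y_real) + ce(cat_gen, y_gen))

def g_loss_adv(adv_gen, cat_gen, y_gen):
    gen_labels = torch.zeros_like(adv_gen)
    # Maximised during training: a large binary term means D mislabels the
    # generated data, while the subtracted term keeps the class correct.
    return bce(adv_gen, gen_labels) - ce(cat_gen, y_gen)
```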
To evaluate this modified loss function, the effect of DA using this loss on the classification performance of the deep convNet was obtained again for all participants, following the same training procedure as in section 2.6.4. However, there was no significant enhancement in the five-category classification accuracy compared to that achieved using the original dataset.
The possible reasons for this inferior result are two-fold. First, the discriminator might not have been properly trained, as indicated by the acc_cat_real line in figure 10 showing a relatively low accuracy. Second, with this modified loss function, there was no guarantee that high-quality data would be generated, because the loss function was not a direct measure of the distance from the real data.
In summary, this section showed that training the proposed cTGAN using the intuitive adversarial loss was more challenging than using the Wasserstein distance.

Limitation and future work
The proposed method has only been tested on a limited number of epilepsy patients. To obtain wide acceptance, it is necessary to test it on a large and heterogeneous dataset. For example, the EEGNet model was validated on four BCI datasets: P300 visual-evoked potentials, error-related negativity responses, movement-related cortical potentials, and sensory-motor rhythms [73]. In another comprehensive study, the GigaScience EEG dataset with 52 subjects was used for the evaluation [88]. Therefore, to fully evaluate the proposed method, more data from different paradigms are needed. Besides, although this manuscript explained how the transformer preserves the temporal dependence, it did not explain why the transformer-based GAN model was superior to the CNN-based models.
Next, it is very difficult to perform inter-subject augmentation as in non-invasive EEG studies. This is mainly because SEEG signals recorded from different subjects are very different. These differences stem from two aspects. The first is the implantation location. The subjects in this study were epileptic patients who had SEEG electrodes implanted in the possible SOZ during seizure monitoring. Since the SOZs were not the same, the recorded signals reflect very different biological and physical processes. Second, the numbers of implanted electrodes in this study also differed between subjects. In contrast, scalp EEG signals are recorded from the same locations and reflect similar biological and physical processes in most studies, which follow a standard electrode placement (such as the 10-20 international system of electrode placement).
Finally, as stated in the main section, the data output by the generator network G was further regulated (filtered) by a CNN layer, and the initialisation of this layer can have an impact on the data quality. While this layer was explicitly initialised as a low-pass filter because of the noisy high-frequency components, the choice of the bior6.8 wavelet was not fully verified. The bior6.8 wavelet was chosen because of its low-pass frequency response. It would be preferable for the generator network G to learn by itself to output signals similar to the original data. One possible solution is to adjust the transformer hyper-parameters to capture the long-range temporal dependence rather than local fluctuations.

Conclusion
This paper conducted a novel study of generative DA using invasive SEEG signals collected from eight participants. Four DA methods were evaluated, including NI, VAE, cWGANGP, and the proposed cTGAN. Using various evaluation metrics, this paper demonstrated that the proposed cTGAN exhibited superior performance in generating high-quality artificial SEEG data and in improving the performance of a deep learning-based classifier.

Figure 2. Plot of the cTGAN architecture (left) and the transformer encoder block (right) used in this work. The two coloured lines in the left part represent the two adversarial training processes. The encoder (without the decoder) from the encoder-decoder architecture of the transformer was used in the proposed method, implemented with three stacked layers and five heads with a dropout rate of 0.5 for both the D and G networks.

Figure 4. Example discriminator losses of cWGANGP (left) and cTGAN (right) for participant 1. Both methods achieved convergence by the end of training.

Figure 5. Visualisation of the data generated by the four methods for participant 1. Only three channels are plotted for viewing. The upper four columns show data generated using NI, VAE, cWGANGP and the proposed cTGAN method, with the raw sequence on top and the t-SNE plot below. Real sample data taken from the same three channels is also presented in the lower part.

Figure 6. Classification accuracy obtained before and after data augmentation using the four generative models. A violin plot shows the decoding accuracy distribution, in which the top, bottom, and middle bars represent the maximum, minimum, and mean accuracy, respectively. S1-S8 represent participants 1-8.

Figure 7. Example power spectral density plots of signals generated using TTS-GAN and cTGAN on participant 1. The three lines were calculated using data generated from TTS-GAN, data generated from cTGAN, and real data. The y-axis is log-transformed for better visualisation.

Figure 8. The training process and the generated samples using the cWGANGP model on participant 1. Two example channels, generated by the model corresponding to iteration 4000, are presented.

Figure 9. The modified two heads of the discriminator network. H1 and H2 represent the binary adversarial (real or fake) and categorical classifications, respectively. adv_real and adv_gen denote the H1 outputs for the real and generated inputs, used to calculate the binary classification accuracies acc_adv_real and acc_adv_gen, respectively, while cat_real and cat_gen represent the H2 outputs for the real and generated inputs, used to calculate the categorical classification accuracies acc_cat_real and acc_cat_gen, respectively.

Figure 10. Training process of the proposed cTGAN using the adversarial loss function for participant 1. The adversarial classification accuracy for both real and generated data was approximately 60% around the iterations marked by the black dashed circle.

Table 1. Clinical profiles of the participants in the study. Abbreviations: SID: participant ID; EZ: epileptogenic zone; RH: recording hemisphere; BI: bilateral; SR: sampling rate; SMA: supplementary motor area; EL: number of electrode shafts; NC: number of contacts; DH: dominant hand; EH: experiment hand.

Table 2. Implementation steps of the noise injection method.

Table 3. Size and training time of different models.

Table 4. Mean and standard deviation of the CS and JSD scores obtained by different DA methods (significance level at .05).

Table 5. Categorical classification accuracy obtained using different DA methods (significance level at .05). Each row represents one participant. Accuracy is reported in mean (standard deviation) format.