Adversarial sample detection for EEG-based brain-computer interfaces

Deep neural networks (DNNs) play a pivotal role in brain-computer interfaces (BCIs). Nevertheless, DNNs have been shown to be susceptible to adversarial attacks. In BCIs, researchers concerned about the security of DNNs have devised various adversarial defense methods to resist such attacks. However, most defense methods degrade performance on normal samples because they alter the original model. As an alternative strategy, adversarial detection devises additional modules or uses statistical properties to identify potentially adversarial samples without changing the original model. The present study therefore provides a comprehensive evaluation of several typical adversarial detection methods on EEG datasets. The experiments indicate that the detection method based on kernel density estimation (KDE) performs best under various adversarial attacks.


Introduction
Brain-computer interfaces (BCIs) have made remarkable advances in many fields, such as motor rehabilitation, neurological disease treatment, and virtual reality control. Deep neural networks (DNNs), owing to their strong end-to-end representation learning capabilities, have gained widespread application across various domains, including EEG-based BCIs. However, DNNs have been shown to be susceptible to adversarial attacks [1]. Zhang et al. [2] demonstrated that adversarial attacks can also cause DNNs used in BCIs to misclassify, which poses significant security risks for subjects.
To address adversarial threats in EEG-based BCIs, Meng et al. [3] assessed the effectiveness of several classical adversarial defense strategies on two EEG datasets. In recent years, adversarial detection has attracted much attention as an alternative solution. Rather than enhancing the model's ability to correctly classify adversarial samples, it devises additional modules or uses statistical properties to identify potential adversarial samples. Grosse et al. [4] introduced an additional class into the classifier specifically for adversarial samples. Feinman et al. [5] proposed two distribution-estimation methods, Bayesian uncertainty estimation and kernel density estimation, to detect adversarial samples, arguing that adversarial samples exhibit a distribution distinct from that of normal samples.
Given the distinctions between BCIs and other application domains, the effectiveness of these adversarial detection methods cannot be taken for granted. Therefore, in this paper, we comprehensively evaluate several typical adversarial detectors on two EEG datasets under various adversarial attacks.

FGSM:
The fast gradient sign method (FGSM) [1] is widely employed for generating adversarial samples. Given a normal sample $x$ with label $y$, it generates an adversarial sample $x^*$ through the following formula:
$$x^* = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x \mathcal{L}(x, y)\right),$$
where $\epsilon$ represents the strength of the perturbation and $\mathcal{L}(x, y)$ denotes the loss incurred during training.
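As a concrete illustration, below is a minimal PyTorch sketch of this step; the model interface, tensor shapes, and the default $\epsilon$ are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.1):
    """One signed-gradient step of strength eps (hypothetical interface).

    `model` is assumed to map a batch of EEG epochs (channels x time)
    to class logits; `y` holds the integer class labels.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Move each input in the direction that increases the training loss.
    return (x_adv + eps * x_adv.grad.sign()).detach()
```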

PGD:
Projected gradient descent (PGD) [6] searches for a more effective adversarial sample by iteratively perturbing the input within the strength constraint $\epsilon$:
$$x_{t+1}^* = \mathrm{clip}_{x,\epsilon}\left(x_t^* + \alpha \cdot \mathrm{sign}\left(\nabla_x \mathcal{L}(x_t^*, y)\right)\right),$$
where $x_0^*$ is a constrained random initial value, $\mathrm{clip}_{x,\epsilon}(\cdot)$ is a clipping function that limits the perturbation to the $\epsilon$-ball around $x$, and $\alpha$ represents the step size of each iteration.

C&W:
The Carlini & Wagner (C&W) attack [7] aims to find the smallest perturbation that ensures misclassification:
$$\min_{w}\ \left\|\tfrac{1}{2}\left(\tanh(w)+1\right) - x\right\|_2^2 + c \cdot f\!\left(\tfrac{1}{2}\left(\tanh(w)+1\right)\right),$$
where $w$ is an auxiliary variable and the trade-off constant $c$ is chosen by binary search. The function $f(\cdot)$ is defined as
$$f(x^*) = \max\left(Z(x^*)_y - \max_{i \neq y} Z(x^*)_i,\ -\kappa\right),$$
where $y$ denotes the true label of $x$, $\kappa$ controls the confidence, and $Z(\cdot)$ represents the output before the softmax layer.
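To make the iterative procedure concrete, here is a minimal PGD sketch under the same assumed model interface as the FGSM example; the step size, iteration count, and random-start details are illustrative choices rather than the exact settings of [6].

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.1, alpha=0.01, steps=20):
    """Iterative signed-gradient attack with random start (sketch)."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)  # constrained random init
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()      # gradient ascent step
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project back into the eps-ball
    return x_adv.detach()
```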

Threat models
To conduct a more complete evaluation, our experiments applied the three threat models introduced in [8]. The classification model and the detector are denoted $F$ and $D$, respectively.

Strong attack (Zero-Knowledge):
The attacker operates without awareness of $D$, crafting adversarial samples using the parameters of the unprotected model $F$.

White-box attack (Perfect-Knowledge):
The attacker possesses knowledge of the scheme and parameters of both $F$ and $D$, enabling tailored methods that assail $F$ and $D$ simultaneously.

Black-box attack (Limited-Knowledge):
The attacker executes the attack knowing only the scheme of $F$, without information about the parameters of either $F$ or $D$.

Detection methods
The following four detection methods were used in our experiments.
Confidence: As a direct candidate, the predicted probability $p(y = c \mid x)$ of a sample $x$ belonging to Class $c$ is often used as the confidence score of a prediction. Although adversarial samples can fool the model with high confidence [1], for a more extensive detection evaluation we flag low-confidence samples as adversarial examples, following the setting in [12].
Predictive entropy (PE): Predictive entropy is a measure of the uncertainty, or disorder, associated with a model's predictions. It is calculated as
$$H(x) = -\sum_{i} p(y_i \mid x) \log p(y_i \mid x),$$
where $H(x)$ represents the predictive entropy and $p(y_i \mid x)$ denotes the conditional probability of outcome $y_i$ given input $x$. In [9], it was observed that adversarial samples have higher PE values.
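Both scores come directly from the softmax output; the short sketch below computes them for a batch, under the same assumed model interface as before.

```python
import torch
import torch.nn.functional as F

def confidence_and_entropy(model, x):
    """Softmax-based detection scores (sketch): low confidence and
    high entropy are treated as signs of adversarial input."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)   # class probabilities per sample
    confidence = probs.max(dim=1).values     # max predicted probability
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)  # H(x)
    return confidence, entropy
```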

Bayesian uncertainty estimation (BUE):
DNNs with dropout can be regarded as a Bayesian approximation of a Gaussian process [5]. Hence, the uncertainty of the output can be captured through the following formula:
$$U(x) = \frac{1}{T}\sum_{t=1}^{T} \hat{y}_t^{\top}\hat{y}_t - \left(\frac{1}{T}\sum_{t=1}^{T}\hat{y}_t\right)^{\!\top}\!\left(\frac{1}{T}\sum_{t=1}^{T}\hat{y}_t\right),$$
where $x$ is a given sample, $\hat{y}_t$ is the $t$-th prediction from Monte Carlo sampling, and $T$ is the number of Monte Carlo samplings. In [5], it was observed that adversarial samples have higher BUE values.
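A minimal sketch of this estimator, assuming the classifier contains dropout layers and the same interface as above (note that `model.train()` also affects batch-norm layers, so a real implementation would enable only dropout):

```python
import torch

def bayesian_uncertainty(model, x, T=50):
    """MC-dropout uncertainty sketch; T=50 matches the setup below."""
    model.train()  # keep dropout stochastic at inference time
    with torch.no_grad():
        preds = torch.stack([torch.softmax(model(x), dim=1) for _ in range(T)])
    # U(x) = mean of y^T y over samples minus (mean y)^T (mean y)
    mean = preds.mean(dim=0)
    return (preds * preds).sum(dim=2).mean(dim=0) - (mean * mean).sum(dim=1)
```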

Kernel density estimation (KDE):
Based on the intuition that adversarial samples stem from a distribution distinct from that of normal samples, the likelihood of a given sample $x$ can be estimated through KDE [5]:
$$\mathrm{KDE}(x) = \frac{1}{|X_y|}\sum_{x_i \in X_y} \exp\!\left(-\frac{\left\|z(x) - z(x_i)\right\|^2}{\sigma^2}\right),$$
where $X_y$ is the training set with Label $y$, $\sigma$ is the bandwidth, and $z(\cdot)$ is the feature output of the final layer of the model. KDE is suitable for distinguishing adversarial samples situated far from the normal samples, with lower KDE values indicating such instances [5].
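The following sketch scores a batch in feature space with a Gaussian kernel; the data layout (a dict of per-class training features) and the variable names are assumptions for illustration.

```python
import torch

def kde_score(feats, train_feats_by_class, y_pred, sigma=0.5):
    """Gaussian-kernel density sketch in feature space; sigma=0.5 as below.

    `feats`: final-layer features of the test batch;
    `train_feats_by_class`: dict mapping class label -> training features.
    """
    scores = []
    for f, c in zip(feats, y_pred.tolist()):
        ref = train_feats_by_class[c]                      # features with label c
        d2 = ((ref - f) ** 2).sum(dim=1)                   # squared distances
        scores.append(torch.exp(-d2 / sigma ** 2).mean())  # kernel average
    return torch.stack(scores)  # low score -> likely adversarial
```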

Experimental setup
In assessing the classification model's performance, we adopted two metrics as described in [10]: balanced classification accuracy (BCA) for the P300 dataset and raw classification accuracy (RCA) for the MI dataset. For the strong and white-box threat models, we employed EEGNet [10] as both the attack model and the defended model on both datasets. For the black-box threat model, we used CNN [11] as a substitute model to attack the defended EEGNet on the P300 dataset and ShallowCNN [10] to attack the defended EEGNet on the MI dataset. The confidence $\kappa$ in the C&W attack is fixed at 5, while the perturbation strength $\epsilon$ in the PGD and FGSM attacks is set to 0.1. The number of Monte Carlo samplings $T$ in BUE and the bandwidth $\sigma$ in KDE are set to 50 and 0.5, respectively. In our detection experiments, the evaluation set contains both normal samples and adversarial samples that cause the model to misclassify.
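Detection quality is reported as the area under the ROC curve (AUC); a minimal sketch of how such a score can be computed with scikit-learn, assuming per-sample detector scores for the normal and adversarial subsets:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detector_auc(scores_normal, scores_adv, higher_is_adversarial=True):
    """AUC of a detector's scores (sketch): label 1 = adversarial, 0 = normal.

    For confidence and KDE, lower scores indicate adversarial samples,
    so pass higher_is_adversarial=False to flip the sign.
    """
    y_true = np.concatenate([np.zeros(len(scores_normal)), np.ones(len(scores_adv))])
    scores = np.concatenate([scores_normal, scores_adv])
    if not higher_is_adversarial:
        scores = -scores
    return roc_auc_score(y_true, scores)
```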

The performance of the detection methods
We conduct a comprehensive assessment of the effectiveness of the detection methods on both the P300 and MI datasets across the three threat models and various attack methods. Table 1 and Table 2 present the model's performance under the different attacks and the detection results for adversarial samples. Under the strong attack, a substantial decline is observed in both the average BCA and the average RCA compared with the no-attack case, where the average BCA on the P300 dataset and the average RCA on the MI dataset are 0.755 and 0.720, respectively. This indicates that the model's performance is severely compromised by all three attack methods. On both datasets, the area under the curve (AUC) of the confidence, PE, and BUE detectors is notably low, while the AUC of the KDE detector consistently exceeds 0.9. This suggests that all attack methods can generate high-confidence adversarial samples that successfully mislead the classification model, yet these samples exhibit a distribution distinct from that of normal samples. Figure 1 shows the detection results on Subject A in the P300 dataset (Figure 1(a)) and Subject S1 in the MI dataset (Figure 1(b)). Notably, under the strong attack, adversarial samples in EEG-based BCIs exhibit low BUE and PE values, deviating from the observations in [5] and [9]. We posit that EEG data is more susceptible to adversarial perturbations than image data: with the confidence $\kappa$ of the C&W attack fixed at 5 and the perturbation strength $\epsilon$ of FGSM and PGD set to 0.1, the model is thoroughly misled and produces overconfident predictions. In summary, only the KDE-based detector proves effective under the strong attack.
To conduct a more thorough analysis, we introduced a white-box C&W attack (C&W-wb) [8], specially designed to fool the KDE-based detector. As shown in Table 1 and Table 2, the AUC of the KDE detector drops significantly on the P300 dataset compared with the strong-attack case, indicating that C&W-wb can degrade the detection performance of the KDE detector on EEG datasets. Interestingly, this attack generated low-confidence adversarial samples on both datasets, so the detectors based on confidence, PE, and BUE achieved higher AUC values.
We further investigated the detection performance under the black-box attack. Here, both the average BCA and RCA are higher than in the strong and white-box scenarios, implying a lower success rate for black-box attacks. Detectors relying on confidence, PE, and BUE perform better than under the strong attack; nevertheless, the KDE detector still achieved the highest AUC under most black-box attacks.

Conclusions
In this study, we conducted a comprehensive assessment of various adversarial detection methods on two distinct EEG datasets, considering different threat models and diverse attack methods. Under the strong attack, all attack methods can generate high-confidence adversarial samples, rendering detectors based on confidence, PE, and BUE ineffective. The KDE-based detector exhibits superior performance across a range of adversarial attacks; however, it can still be degraded by C&W-wb, an attack specifically designed to mislead it.
The experimental results show that adversarial samples in EEG-based BCIs follow distributions different from those of normal samples, suggesting that other detection methods based on estimating sample distributions may also be effective.
Figure 1.
The box plots of the detection results for normal and adversarial samples: (a) Subject A in the P300 dataset; (b) Subject S1 in the MI dataset. (FGSM-s denotes adversarial samples generated under the strong threat model with the FGSM attack.)

Table 1.
The AUC of the detectors and the average BCA over the two subjects in the P300 dataset under different threat models and attack methods (bold numbers indicate the highest AUC among the detection methods under the same attack).

Table 2.
The AUC of the detectors and the average RCA over the nine subjects in the MI dataset under different threat models and attack methods (bold numbers indicate the highest AUC among the detection methods under the same attack).