Deep learning-based auditory attention decoding in listeners with hearing impairment

Objective. This study develops a deep learning (DL) method for fast auditory attention decoding (AAD) using electroencephalography (EEG) from listeners with hearing impairment (HI). It addresses three classification tasks: differentiating noise from speech-in-noise, classifying the direction of attended speech (left vs. right), and identifying the activation status of hearing aid noise reduction algorithms (OFF vs. ON). These tasks contribute to our understanding of how hearing technology influences auditory processing in the hearing-impaired population. Approach. Deep convolutional neural network (DCNN) models were designed for each task. Two training strategies were employed to clarify the impact of data splitting on AAD tasks: inter-trial, where the testing set used classification windows from trials that the training set had not seen, and intra-trial, where the testing set used unseen classification windows from trials where other segments were seen during training. The models were evaluated on EEG data from 31 participants with HI listening to competing talkers amidst background noise. Main results. Using 1 s classification windows, the DCNN models achieve accuracies (ACC) of 69.8%, 73.3% and 82.9% and areas-under-curve (AUC) of 77.2%, 80.6% and 92.1% for the three tasks, respectively, with the inter-trial strategy. With the intra-trial strategy, they achieve ACC of 87.9%, 80.1% and 97.5%, along with AUC of 94.6%, 89.1% and 99.8%. Our DCNN models show good performance on short 1 s EEG samples, making them suitable for real-world applications. Conclusion. Our DCNN models successfully addressed three tasks with short 1 s EEG windows from participants with HI, showcasing their potential. While the inter-trial strategy demonstrated promise for assessing AAD, the intra-trial approach yielded inflated results, underscoring the important role of proper data splitting in EEG-based AAD tasks. Significance. Our findings showcase the promising potential of EEG-based tools for assessing auditory attention in clinical contexts and advancing hearing technology, while also promoting further exploration of alternative DL architectures and their potential constraints.


Introduction
The human brain is a highly complex and sophisticated system, characterized by its ability to perform a wide range of cognitive functions, including attention, perception, and memory. These abilities are largely attributed to the intricate network of neurons in the brain. In recent years, deep learning (DL) methods applied to electroencephalography (EEG) recordings have been used to study brain activity patterns and to classify different cognitive functions and states, such as auditory and visual attention [2] or memory recall [3,4]. Various DL algorithms, such as convolutional neural networks (CNNs), recurrent neural networks, and autoencoders, have been harnessed to tackle a range of EEG tasks, including classification, decoding, feature extraction, and representation learning [5-8].
Although DL methods have shown promising results in EEG classification tasks, they face several challenges, including small sample sizes, high variability, and limited interpretability. Previous studies demonstrate how inter-trial/subject variability can degrade the performance of most DL models, owing to the large differences that can occur in the spatial/spectral EEG patterns across subjects [9,10]. As such, for EEG-specific tasks, training trial/subject-independent models becomes a necessity for ease of use, avoiding model re-training with every new subject [11-13].
Auditory attention decoding (AAD) is a fundamental cognitive process that allows us to selectively attend to relevant sounds while filtering out irrelevant ones, commonly referred to as the 'cocktail-party effect' [14]. In everyday situations, when we engage in conversation in a noisy environment, our brain can effectively suppress background noise and focus on the attended speech. However, listeners with hearing impairment (HI) face significant challenges, particularly in tasks involving stream segregation, where the auditory system groups sounds perceptually to form coherent representations of objects and events in the auditory scene [15]. Stream segregation deficits have been shown to impact the ability to separate and focus on specific sound sources, making tasks like distinguishing attended speech from background noise more complex for HI listeners [16,17]. Therefore, developing DL-based models that can be effectively integrated into future devices like EEG-informed hearing aids for real-time AAD would greatly assist HI listeners in focusing on and comprehending speech amidst multiple talkers and background noise [18,19]. Recent studies have shown promising results in developing such models [20-23].
The objective of this study is to develop DL-based models for sufficiently fast and accurate EEG-informed AAD for HI listeners. EEG measurements were collected from 31 HI listeners who were instructed to attend to one of two frontal speech streams while also being exposed to 16-talker babble noise, as described in the previous study where the data were collected [24]. Each trial begins with 5 s of purely background babble noise (the noise data), followed by 33 s of speech from front-facing loudspeakers embedded in the same babble noise (the speech-in-noise data). Additionally, data were collected under two noise reduction (NR) conditions: with NR activated (NR ON) and deactivated (NR OFF). This aimed to understand how hearing aid NR algorithms assist listeners with HI in focusing on specific talkers in noisy environments.
The study focuses on three tasks: distinguishing between noise and speech-in-noise (Task 1), classifying the direction of attended speech (left vs. right, Task 2), and determining the activation status of NR (NR OFF vs. NR ON, Task 3). The classification of noise vs. speech-in-noise has practical implications for optimizing hearing aid functionality in real-time scenarios. Detecting a noise-only situation, among other applications, allows hearing aids to conserve battery and processing resources by refraining from engaging additional features relevant to speech, such as beamforming. It also enables the development of algorithms for assessing diverse acoustic environments. Moreover, it holds the potential for objective assessment of hearing capabilities in clinical contexts, facilitating the evaluation of speech perception across different signal-to-noise ratios. Furthermore, we anticipate distinct neural patterns between the noise and speech-in-noise classes, reflecting their distinct cognitive processes: the speech-in-noise class may display patterns linked to speech processing and attention, while the noise class may exhibit neural activity associated with non-speech sound processing. Additionally, the comparison between NR OFF and NR ON allows us to discern whether NR alters neural responses, consequently impacting the neural mechanisms of selective auditory attention in individuals with HI.
For the classification tasks, this study utilizes a deep CNN (DCNN) [25-27] as the basis for the DL models, with each of the three tasks trained on a separate DCNN model and evaluated on unseen data. In addition to designing new models for the three AAD tasks, two data splitting strategies, inter-trial vs. intra-trial, were explored. In the inter-trial strategy, the testing set used classification windows from trials not seen by the training set, while in the intra-trial strategy, the testing set utilized unseen classification windows from trials where other segments were seen during training. This distinction is needed, as improper data splitting can lead to inflated accuracy metrics that do not truly reflect the performance of DL models on new, unseen data [13,28]. The variation in performance may be attributed to variability between different trials, even within the same subject, which leads DL methods to recognize individual trials. This underscores the importance of proper data splitting when working with EEG. Contrary to Puffay et al [13], who focused on normal-hearing listeners in AAD tasks without background noise, our study delves into ecological settings, assessing data splitting effects in listeners with HI.
Our methodology relies solely on EEG signals without incorporating speech signals. Notably, our study focuses on listeners with HI, setting it apart from literature predominantly involving normal-hearing listeners. This distinction aligns our approach with real-world applications, offering insights tailored to the population with hearing difficulties. Additionally, our DCNN models use 1 s EEG samples with a 1 s classification window, enabling high AAD accuracies in practical, real-world applications.
This paper is organized as follows. Section 2 describes the methodology used for data acquisition, processing and the model architectures used in this study. Section 3 presents the results obtained with the DCNN models. Section 4 discusses the obtained results in comparison to previous literature, and section 5 concludes the paper.

Data acquisition
The experiment involved 31 native Danish speakers (24 males) aged between 21 and 84 years (mean age = 64.2 years, SD = 13.6 years) who had mild to moderately severe symmetrical sensorineural hearing loss (average hearing loss of 47.5 dB on a 4-frequency pure-tone audiometry test). They had normal or corrected-to-normal vision and no history of neurological disorders, dyslexia, or diabetes mellitus. All participants were fitted with two identical hearing aids with NR algorithms [20,21,24] and were experienced hearing aid users. The data set used in this study was previously published in [24] with a different analysis approach and adhered to informed consent and ethical standards.
EEG data were recorded from 64 scalp electrodes placed according to the international 10-20 system, digitized at 1024 Hz using a BioSemi ActiveTwo recording system (Amsterdam, Netherlands). Additional electrodes included an active electrode for common mode sensing and a passive electrode for driven right leg, functioning as reference electrodes, as well as two electrodes over the mastoids.
During EEG acquisition, participants were comfortably seated in a chair in the center of a soundproof room with six loudspeakers positioned at ±30°, ±112.5°, and ±157.5°, as illustrated in figure 1(a). During the test, two front-facing loudspeakers (T1 and T2) played news clips of neutral content read by a male and a female speaker, while four other loudspeakers (B1-B4) positioned behind the participant played 16-talker (4 × 4 talker) babble noise, increasing the complexity of the task. Speech stimuli were played at 73 dB sound pressure level (SPL) and the 4-talker babbles were played at 64 dB SPL.

Each participant underwent a total of 84 trials. Four of these trials were used to familiarize the participants with the task, while the remaining 80 were used for testing. After each trial, participants were asked a two-choice question related to the content of the attended speech. Each block comprised 20 individual trials, with each trial lasting approximately 40 s; consequently, each block for each subject had a total duration of approximately 13 min. The trial design can be seen in figure 1(b). During each trial, participants were exposed to competing speech stimuli. The trial commenced with 5 s of background babble noise, succeeded by 33 s of speech emanating from the two front-facing loudspeakers. The subjects were instructed to focus their attention on one of the two speakers during this 33 s time window. As a result, we obtained what we refer to as 'speech-in-noise' data.

Additionally, we utilized two sets of hearing aids with distinct NR systems. The first employed a 16-channel system with a fast-acting minimum variance distortionless response (MVDR) beamformer and a Wiener post-filter, while the second utilized a 24-channel system with a higher-resolution MVDR beamformer and a deep neural network (DNN) based post-filter trained to enhance the speech-noise contrast (details in [24]). We focused solely on classifying whether the NR system was activated (NR ON) or not (NR OFF), without distinguishing between the specific NR systems. Each participant completed 80 trials, evenly split between these two activation states (40 trials each). Within these, 10 trials were dedicated to each of the four selective auditory attention tasks: attending to the (1) left male speaker, (2) left female speaker, (3) right male speaker, and (4) right female speaker. The data format for each subject is shown in figure 2.

Data pre-processing
The EEG pre-processing pipeline utilized in this study followed the procedure outlined in previous work [21]. EEG signals were segmented from −15 to 53 s relative to the onset of the attended speech, with the additional data at the start and end of each trial serving as buffers to eliminate edge artifacts during filtering and preprocessing. After preprocessing, the data were trimmed to cover the period from 0 s to 38 s relative to the onset of the babble noise. We re-referenced the EEG data using the average of the two mastoid reference channels. Next, a digital bandpass filter (0.5-70 Hz) was applied using a zero-phase Hamming-window FIR filter with a filter order of 3fs/fc (where fs is the sampling rate and fc is the lower cutoff frequency). A narrow-band notch filter (49-51 Hz) of the same FIR type and filter order was also used. To maintain signal phase and prevent delays, we employed both forward and backward filtering through the 'filtfilt' function in MATLAB. The filtered signals were then downsampled to 256 Hz, and visually identified corrupted channels were removed and interpolated. The following non-brain artifacts were removed: eye movements, eye blinks, muscle activity, heartbeats, and single-channel noise, using independent component analysis with the runica algorithm based on InfoMax [29]. Manual selection was employed to discard an average of 14.6 components per subject, and one participant with noisy data was excluded from the analysis.
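For illustration, a rough Python equivalent of this pipeline using MNE is sketched below. The original analysis was performed in MATLAB, so this is only an approximation under stated assumptions: the file name, mastoid channel names, bad-channel list, and excluded component indices are hypothetical, and the exact FIR design may differ from the authors' implementation.

```python
# Approximate MNE-Python sketch of the preprocessing described above.
# Hypothetical names: "subject01.bdf", mastoids "M1"/"M2", bad channel "T7".
import mne

raw = mne.io.read_raw_bdf("subject01.bdf", preload=True)

# Re-reference to the average of the two mastoid channels.
raw.set_eeg_reference(["M1", "M2"])

# Zero-phase Hamming-window FIR band-pass (0.5-70 Hz) and 50 Hz notch,
# applied forward and backward as with MATLAB's 'filtfilt'.
raw.filter(l_freq=0.5, h_freq=70.0, method="fir",
           fir_window="hamming", phase="zero-double")
raw.notch_filter(freqs=50.0, method="fir", phase="zero-double")

# Downsample to 256 Hz, then remove and interpolate corrupted channels.
raw.resample(256)
raw.info["bads"] = ["T7"]                 # chosen by visual inspection
raw.interpolate_bads()

# InfoMax ICA; artifact components (eye movements, blinks, muscle activity,
# heartbeats, single-channel noise) are selected manually for removal.
ica = mne.preprocessing.ICA(method="infomax", random_state=0)
ica.fit(raw)
ica.exclude = [0, 1]                      # indices from manual inspection
raw = ica.apply(raw)
```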
In preparation for the DL architectures, the data underwent further processing. The study comprises three two-class classification tasks: noise vs. speech-in-noise, left vs. right attended speech, and NR OFF vs. NR ON. Non-overlapping 1 s windows were used to sample the 33 s of data for the left/right and NR OFF/ON tasks, while overlapping windows (with 0.875 s overlap) were used for the limited background babble noise data to extend the number of samples available for this class. This sampling methodology was based on previous studies using overlapping windows, which showed that greater overlap resulted in more samples and improved performance [30-32].
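As a minimal sketch of this segmentation (assuming 256 Hz data in a channels × samples array): with a 0.125 s hop, i.e. 0.875 s overlap, the 5 s noise portion yields 33 windows, matching the 33 non-overlapping 1 s windows of the speech-in-noise portion.

```python
import numpy as np

def segment(eeg, fs=256, win_s=1.0, hop_s=1.0):
    """Cut one (channels, samples) trial into (n_windows, channels, win) segments."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    starts = range(0, eeg.shape[1] - win + 1, hop)
    return np.stack([eeg[:, s:s + win] for s in starts])

trial = np.random.randn(64, 38 * 256)                 # placeholder 38 s trial
noise  = segment(trial[:, :5 * 256], hop_s=0.125)     # 33 overlapping windows
speech = segment(trial[:, 5 * 256:], hop_s=1.0)       # 33 non-overlapping windows
```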
Performance and stability during training of DL models can be improved with data normalization. Prior research [33,34] has shown the sensitivity of EEG data to normalization and its impact on neural network training. Normalization involves scaling and standardizing the input data to a uniform range; it can accelerate convergence, avoid local optima, and mitigate issues such as exploding or vanishing gradients. For the EEG data, this study normalizes each 1 s time window to the range (−1, 1) as

$$x_{\mathrm{norm}} = 2\,\frac{x - \min(x)}{\max(x) - \min(x)} - 1, \qquad (1)$$

where x refers to the 1 s time window extracted from the EEG signal, including all channels; each value within the sample is normalized using the maximum and minimum values across all channels within that specific sample. The decision to use the (−1, 1) re-scaling for normalization is guided by DL networks working better with normalized/scaled data; scaling the data to this range allows the DL models to learn more effectively.
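In code, equation (1) corresponds to the following per-window min-max scaling (a minimal sketch; the window x contains all channels of one 1 s segment):

```python
import numpy as np

def minmax_scale(x):
    """Scale one 1 s EEG window (channels x samples) to (-1, 1), equation (1).

    The maximum and minimum are taken across all channels of the window.
    """
    x_min, x_max = x.min(), x.max()
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

window = np.random.randn(64, 256)       # placeholder 1 s window at 256 Hz
scaled = minmax_scale(window)           # values now lie in [-1, 1]
```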

Data split strategies
This study incorporates two data split strategies (inter-trial vs. intra-trial) for the three classification tasks. In the inter-trial strategy, the validation/test sets for each participant comprise samples (i.e. classification windows) exclusively from trials unseen by the training set, minimizing overlap of information among the train-validation-test splits. In contrast, the intra-trial strategy utilizes data from all trials of each participant: 1 s segments are sampled from each trial, randomized, and split into train-validation-test sets. Figure 3 illustrates the data splitting for the AAD task (Task 2) using the inter-trial strategy. Each participant had four cases, namely attending to the (1) left male speaker, (2) left female speaker, (3) right male speaker, and (4) right female speaker, each comprising 20 trials. Trials were divided into training (trials #1 to #12), validation (trials #13 to #16), and test (trials #17 to #20). The inter-trial split ensures the models' robustness by testing on unseen trials, preventing the model from memorizing trial-specific identities. After partitioning, trials were segmented into 1 s classification windows and pre-processed using equation (1). This resulted in 5 s of noise (sampled with 0.875 s overlap) and 33 s of speech-in-noise (sampled with no overlap) per trial.
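The two strategies can be sketched as follows, reusing the segment helper from the earlier windowing sketch. The 12/4/4 trial partition follows figure 3; the shuffled 60-20-20 window split for the intra-trial case matches the ratio reported in the classification results. Labels are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def inter_trial_split(trials):
    """Partition whole trials first, then window: no trial spans two splits."""
    train = np.concatenate([segment(t) for t in trials[:12]])    # trials #1-#12
    val   = np.concatenate([segment(t) for t in trials[12:16]])  # trials #13-#16
    test  = np.concatenate([segment(t) for t in trials[16:20]])  # trials #17-#20
    return train, val, test

def intra_trial_split(trials):
    """Window every trial, shuffle, then split 60-20-20: windows from the
    same trial can land in train, validation and test simultaneously."""
    windows = np.concatenate([segment(t) for t in trials])
    idx = rng.permutation(len(windows))
    n1, n2 = int(0.6 * len(idx)), int(0.8 * len(idx))
    return windows[idx[:n1]], windows[idx[n1:n2]], windows[idx[n2:]]
```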

Deep convolutional neural network model
The main objective of our study is to employ DL methods for AAD from EEG signals, involving three tasks. Task 1 distinguishes the initial 5 s of background babble noise (with overlap) from the subsequent 33 s of speech-in-noise (without overlap). Task 2 focuses on classifying the direction of incoming speech within the 33 s of speech-in-noise. Task 3 involves discerning between NR OFF and NR ON within the 33 s of speech-in-noise trials. In Task 3, the inter-trial split follows a strategy similar to that of Task 2, illustrated in figure 3.
To accomplish this, the study proposes a neural network model architecture inspired by EEGNet [27]. The proposed method involves developing a neural network for the three classification tasks, which can also be expanded to cover additional directions for Task 2. This technique is known as locus of attention (LoA) classification and has previously been investigated with promising results in [12,13,35]. In contrast to stimulus reconstruction (SR) methods, which use linear or non-linear regression, including DL-based models, to reconstruct the sound stimuli and match them to the attended sound [19,36-40], LoA methods focus on classifying the direction of the attended sound. LoA methods offer advantages such as not requiring access to the speech streams. Additionally, unlike SR methods, where AAD performance significantly decreases for short classification windows [40], such a decrease has not been reported for LoA methods [12].
Figure 4 shows the model architecture for a classification task using a 1 s time window from the EEG signals. The model first extracts temporal features from each channel using 2D convolution kernels, allowing it to capture channel-specific features. Three temporal convolution blocks are used to extract features across different filter sizes, enabling the model to capture features more effectively. Each temporal convolution block is followed by batch normalization to normalize the feature maps, followed by the Mish activation function [41]. Next, the model extracts spatial features across all channels with a fixed kernel size of 1 along the temporal dimension, which allows the model to extract features across all channels simultaneously; the kernel shapes given in figure 4 outline the specifics for each layer. Batch normalization and average pooling are applied to downscale the data by a factor of 4, followed by a 2D dropout layer to prevent overfitting. The model then extracts features from the one-dimensional data, applies batch normalization and average pooling to further reduce the data, and uses the Mish activation function in both the previous and current layers. The flattened data are then processed through fully connected layers to generate an output logit that is passed through a Sigmoid activation function. This choice aligns with our use of the binary cross-entropy loss function (PyTorch implementation), which requires the output logits to be passed through a Sigmoid activation prior to computing the loss. The three models are trained separately for each classification task using the same architecture and hyperparameters. The hyperparameters were fine-tuned through manual adjustments based exclusively on the observed results on the validation data. We utilized the Adam optimizer with default parameters, except for the learning rate, set at 0.0005. This value was chosen through trial and error: larger values hindered model learning, while smaller values caused the model to learn too slowly.
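A condensed PyTorch sketch of this architecture is given below. The exact kernel shapes and filter counts are those of figure 4; the values used here (filter lengths 16/32/64, 8 filters per temporal block, 48 spatial filters, dropout 0.25) are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class DCNNSketch(nn.Module):
    """Sketch of the figure 4 architecture with assumed layer sizes."""

    def __init__(self, n_channels=64):
        super().__init__()
        # Three temporal convolution blocks with different kernel lengths,
        # each acting along time within every EEG channel, followed by
        # batch normalization and the Mish activation.
        self.temporal = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=(1, k), padding=(0, k // 2)),
                nn.BatchNorm2d(8),
                nn.Mish(),
            )
            for k in (16, 32, 64)
        )
        # Spatial convolution across all channels (temporal kernel size 1),
        # average pooling (time downscaled by 4) and 2D dropout.
        self.spatial = nn.Sequential(
            nn.Conv2d(3 * 8, 48, kernel_size=(n_channels, 1)),
            nn.BatchNorm2d(48),
            nn.AvgPool2d((1, 4)),
            nn.Dropout2d(0.25),
        )
        # Further feature extraction on the collapsed 1D representation.
        self.conv1d = nn.Sequential(
            nn.Conv1d(48, 48, kernel_size=8),
            nn.BatchNorm1d(48),
            nn.Mish(),
            nn.AvgPool1d(4),
        )
        # Fully connected head producing a single logit, squashed by a
        # Sigmoid to match the binary cross-entropy loss.
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(1), nn.Sigmoid())

    def forward(self, x):                  # x: (batch, 1, channels, time)
        x = torch.cat([blk(x) for blk in self.temporal], dim=1)
        x = self.spatial(x).squeeze(2)     # -> (batch, 48, time // 4)
        return self.head(self.conv1d(x))   # -> (batch, 1) probability

model = DCNNSketch()
probs = model(torch.randn(4, 1, 64, 256))  # four 1 s windows at 256 Hz
```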

Model training
The models for Task 1 (noise vs. speech-in-noise), Task 2 (left vs. right attended speech) and Task 3 (NR OFF vs. NR ON) were trained for 50 epochs using uniform training parameters, hyperparameters, and model initialization. This included using the same seed value for initialization, ensuring that any differences between them were solely due to the data used and its potential impact on model training. Figure 5 displays the training graphs for these tasks, employing both data split strategies to highlight training differences. Despite the significantly better generalization indicated by the intra-trial strategy, we acknowledge that this approach might yield overly optimistic results. The increased generalization results from the similarity between classification windows within each trial. In contrast, the inter-trial strategy, while not as effective in generalization, provides more realistic results.
Figure 5(a) illustrates the training graph for Task 1, showing accuracy and loss over 50 epochs with 6 steps per epoch. Under the inter-trial strategy, the model starts to overfit relatively early, evidenced by increasing validation loss, while it continues to generalize under the intra-trial strategy. In figure 5(b), the training graph for Task 2 shows a similar trend toward overfitting, though occurring later in the training process. Figure 5(c) shows the training graphs for Task 3, indicating the highest generalization for the intra-trial strategy and no overfitting observed for the inter-trial strategy. These variations across tasks align with their different nature. Notably, each task consistently performs better in the intra-trial case, potentially leading to inflated results that may not generalize well to real-world scenarios. Throughout the training process, we saved the best-performing models based on the lowest validation loss and highest validation accuracy per epoch.
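A minimal training-loop sketch consistent with this setup, reusing the DCNNSketch model from above, is shown below: a fixed seed, Adam with the stated 0.0005 learning rate, binary cross-entropy on Sigmoid outputs, 50 epochs, and checkpointing on validation loss. Here train_loader and val_loader are assumed PyTorch DataLoaders of (window, label) batches; for brevity only the validation-loss criterion is tracked, whereas the study also tracked validation accuracy.

```python
import torch

torch.manual_seed(0)                      # same seed for all three tasks

model = DCNNSketch()
model(torch.zeros(1, 1, 64, 256))         # dummy pass to materialize LazyLinear
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
criterion = torch.nn.BCELoss()

best_val_loss = float("inf")
for epoch in range(50):
    model.train()
    for xb, yb in train_loader:           # assumed DataLoader of (window, label)
        optimizer.zero_grad()
        loss = criterion(model(xb).squeeze(1), yb.float())
        loss.backward()
        optimizer.step()

    # Validation pass; keep the checkpoint with the lowest validation loss.
    model.eval()
    with torch.no_grad():
        val_loss = sum(
            criterion(model(xb).squeeze(1), yb.float()).item()
            for xb, yb in val_loader
        ) / len(val_loader)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_model.pt")
```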

Classification results
This section presents the results of the classification tasks for individual subjects for both the inter- and intra-trial versions of the models, as shown in figure 6. The test data constitute 20% of the total data, maintaining a common 60-20-20 (%) train-validation-test ratio. It is important to emphasize that the results presented for each subject are not specific to any particular model: a single model is trained on all subjects and then tested on each subject individually, with each test subject comprising the specific test trials (for inter-trial) or specific test samples (for intra-trial) relevant to that subject. As observed in figure 6, all three models perform less effectively under the inter-trial splitting strategy; the overlapping information that can exist between samples within trials results in inflated results for the intra-trial splitting, a trend consistent across all three tasks. Furthermore, some subjects show a significant performance boost when trained with the intra-trial data split. For example, in Task 3, subjects #9, #14 and #19 show a boost in performance from below 50% in inter-trial to approximately 100% in intra-trial, highlighting the critical role of proper data splitting for EEG-based AAD tasks. Task 2 exhibits the smallest increase in accuracy, possibly due to the task's inherent complexity.
The performance of the models is shown in figure 7. In figure 7(a), the intra-trial models demonstrate enhanced performance across all three tasks, with elevated mean/median values and decreased variability, possibly due to memorizing trial-specific identities. Figure 7(b) illustrates the paired differences between the inter-trial and intra-trial strategies.
In table 1, we present the average performance of our models using both the inter-trial and intra-trial data split strategies. For the inter-trial split, the noise vs. speech-in-noise model achieved 69.8% accuracy and 77.2% area-under-curve (AUC) in Task 1, the left vs. right attended speech model achieved 73.3% accuracy and 80.7% AUC in Task 2, and the NR OFF vs. NR ON model reached 82.9% accuracy and 92.1% AUC in Task 3. In Task 1, specificity gauges speech-in-noise identification; in Task 2, it evaluates identification of attended speech on the right; and in Task 3, it reflects classification of NR ON samples. Comparatively, the intra-trial data split strategy yields significantly improved performance across all three tasks compared to the inter-trial counterparts.
In Task 1, the noise vs. speech-in-noise model's accuracy increased from 69.8% to 87.9%. In Task 2, the left vs. right attended speech model showed an improvement from 73.3% to 80.1%. In Task 3, the NR OFF vs. NR ON model saw a boost from 82.9% to 97.5%. Independent two-sample t-tests confirmed the significant differences between the inter-trial and intra-trial results for all three tasks, as detailed in table 1. In all cases, a significant difference between the two sets of results is observed. The significant improvements observed with the intra-trial strategy highlight the risk of inflated performance estimates under improper data splitting.

The performance of the binary classification models can be evaluated using ROC curves, which show the trade-off between the true positive rate and the false positive rate at different classification thresholds. Figure 8 displays the ROC curves for the three tasks, with the black dashed line representing the worst possible results. The results show that the intra-trial versions of the models outperform the inter-trial versions in all three tasks, with Task 2 showing the most noticeable visual improvement of the three.
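For reference, the reported metrics and tests can be computed as sketched below with scikit-learn and SciPy; the label, score, and per-subject accuracy arrays are placeholders.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])                # placeholder labels
y_prob = np.array([0.2, 0.6, 0.7, 0.9])        # placeholder Sigmoid outputs

acc = accuracy_score(y_true, y_prob >= 0.5)    # ACC at a 0.5 threshold
auc = roc_auc_score(y_true, y_prob)            # AUC
fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points of the ROC curve

# Independent two-sample t-test between per-subject accuracies of the two
# data split strategies (placeholder arrays of 30 subjects).
inter_acc, intra_acc = np.random.rand(30), np.random.rand(30)
t_stat, p_value = ttest_ind(inter_acc, intra_acc)
```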

Discussion
The DCNN models in this study exhibited reasonably good performance with the inter-trial data split strategy, even using short 1 s time windows. However, Task 1, involving noise vs. speech-in-noise classification, showed limited accuracy (69.8%), indicating overfitting and poor generalization due to the scarcity of training data. With only 5 s of background noise per trial and limited variability, extracting meaningful features became challenging. Augmenting the data and exploring different DL architectures, including parameter adjustments and more complex structures, could prove beneficial for Task 1.

Task 2 results revealed promising accuracy (73.3%) in distinguishing left from right attended speech with the inter-trial data split strategy, indicating the potential of EEG in real-world AAD. With the intra-trial strategy, accuracy increased to 80.1%, suggesting potential effectiveness of the DCNN models. However, it is essential to acknowledge prevalent limitations for real-world applications. A two-sample t-test between the two strategies (p = 0.006) confirmed the significant difference between them.

Task 3 addressed the classification of the hearing aid NR status (NR OFF vs. NR ON) [20,21,24]. Both inter-trial and intra-trial models exhibited promising results. The model using the inter-trial split strategy achieved an accuracy of 82.9%, signifying the potential of the DCNN model for this task. The model using the intra-trial data split strategy, on the other hand, exhibited a significantly higher accuracy of 97.5%, owing largely to the model's ability to exploit trial-specific features. Nonetheless, the high accuracy observed with the inter-trial split strategy suggests that the DCNN model holds promise for this task.
The study's use of short time windows for classification and testing on new subjects is a notable feature. The models' adaptability to novel data is evident, performing well without prior knowledge of the specific trials; the test set (for inter-trial) consists entirely of unseen trials across all subjects. This boosts our confidence in the models' effectiveness in real-world applications. The DCNN models exhibit promising results, establishing their efficacy for the designated tasks. Future work could explore enhancing performance through data augmentation techniques. One potential direction could involve integrating conditional variational autoencoder models with other augmentation methods like generative adversarial networks [42-44]. Additionally, recent research suggests that electrooculogram (EOG) activity might assist in AAD, implying that EEG alone may not solely contribute to the success of DNN architectures in LoA tasks for AAD [45]. This finding is significant because it suggests that the DNN might be capturing additional information beyond EEG that aids in LoA tasks for AAD. However, in Task 3 (NR OFF/ON), where EOG activity may not be relevant, the high accuracies observed in our study suggest that the proposed DNN architecture is capable of extracting EEG-specific features while disregarding EOG-related information.
In contrast to prior studies, our research focuses on listeners with HI. People with HI often exhibit enhanced cortical representations of the specific speech envelope they concentrate on, a characteristic underscored in the investigations carried out by [46,47]. This phenomenon could potentially contribute to the heightened decoding accuracy we observed in our study. In forthcoming investigations, we plan to expand our scope to include listeners with normal hearing. Additionally, we aim to explore a range of data augmentation strategies across diverse populations with varying levels of hearing ability. While many studies involve two competing talkers, our experimental design introduces additional complexity by including a 16-talker background babble noise, enhancing realism but also increasing auditory task difficulty. Additionally, the use of NR (OFF or ON) complicates the data further. For an equitable comparison, all models should be evaluated using the same data set(s). Additionally, the intra-trial strategy should be avoided due to the heightened risk of overestimating performance.
To provide context for our results, we briefly discuss the findings of several relevant studies. Mirkovic et al [48] achieved an accuracy of 88.0%, but this was based on a lengthy 60 s time window, which may not be suitable for real-time applications. Similarly, Fuglsang et al [51] reported high accuracies ranging from 80% to 90% but used time windows of 40-50 s. Although both studies present promising results, the large time windows may make them impractical in real-world scenarios.
In contrast, Das et al [49] achieved accuracies of 76.0% and 87.2% for non-subject-specific and subject-specific decoders, respectively. While subject-specific decoders yielded higher accuracy, their feasibility for widespread use may be limited. Mirkovic et al [50] explored two EEG decoding methods, cap-EEG and ear-EEG, with average accuracies of 84.8% and 69.3%, respectively. The latter suggested that unobtrusive miniaturized ear electrodes could successfully decode attended speakers in two-speaker scenarios. However, our proposed methodology surpasses Mirkovic et al [50] in accuracy using shorter time windows. Moreover, [11] achieved an 80% accuracy in decoding attended speech direction using filterbank common spatial pattern filters with 1 s time windows, comparable to our proposed approach.
Other studies [12,53] employed DL architectures for AAD, achieving accuracies of 67.8% and 81.0%, respectively, with 2 s time windows. Su et al [35] achieved average accuracies of 71.9% and 90.1% using publicly available data sets [38,54]. Our proposed method yielded comparable results, particularly aligning with the latter accuracy. Notably, our study involved EEG data from hearing-impaired subjects in a complex listening task with competing talkers and background noise, while Su et al's study used EEG data from normal-hearing listeners in a competing-talker task without background noise.
Lastly, in alignment with the recent study by Puffay et al [13], the intra-trial models significantly outperformed their inter-trial counterparts, leveraging similarities among classification windows within each trial. This facilitates the learning of trial-specific features that generalize to unseen windows within the same trial. However, it is crucial to note that this enhanced performance, while anticipated, could be misleading in result interpretation.
In summary, our study distinguishes itself by focusing on hearing-impaired subjects, contributing to a better understanding of AAD in real-world scenarios. Our approach leverages DCNNs, achieving a high accuracy of 73.3% on inter-trial splits and 80.1% on intra-trial splits using 1 s time windows. While direct comparisons across different studies are complex due to data set variations and varying strategies for performance evaluation, our findings provide valuable insights for the neuroscience community, particularly in the context of HI.

Conclusion
Our study underscores the potential of utilizing DCNN models for decoding auditory attention from EEG data, focusing uniquely on individuals with HI. Unlike previous studies centered on individuals with normal hearing, our results provide insights directly relevant to the challenges faced by those with HI. Remarkably, our proposed method achieves high performance using short 1 s EEG time windows, indicating its robustness and practicality for real-world applications. The intra-trial strategy significantly improved the performance across all tasks compared to the inter-trial strategy; however, our findings confirm the important role of proper data splitting, as suboptimal methods can artificially inflate performance metrics. These results highlight the promising potential of EEG-based tools for assessing auditory attention in clinical contexts and advancing hearing technology. Moreover, they advocate for further exploration of alternative DL architectures and their potential limitations.

Figure 1 .
Figure 1. (a) Visualization of the experimental setup used for gathering data. (b) Trial design. Adapted from [21]. CC BY 4.0.

Figure 2 .
Figure 2. Subject-wise data acquisition. Each subject has two noise reduction scenarios, NR OFF and NR ON, each containing 40 trials, with 10 trials from each attention task.

Figure 3 .
Figure 3. Inter-trial data split for DCNN training incorporating all four cases for each subject, with 20 trials per case.

Figure 4 .
Figure 4. Deep learning architecture for classification tasks.

Figure 5 .
Figure 5. Training/Validation graphs for accuracy and loss using both data split strategies: (a) Noise vs. speech-in-noise model (b) Left vs. right attended speech model and (c) NR OFF vs. NR ON model.

Figure 6 .
Figure 6. Subject-wise classification results on test samples specific to each subject, across models for Task 1, Task 2 and Task 3 for both data splits. Dashed lines represent the average accuracies for the models.

Figure 7 .
Figure 7. Box plots for Task 1 (noise vs. speech-in-noise), Task 2 (left vs. right attended speech) and Task 3 (NR OFF vs. NR ON), for both data splits. (a) Individual model performance. (b) Paired differences between inter-trial and intra-trial models.

Figure 8 .
Figure 8. ROC curves for the DCNN models for each task across both data split strategies. The significant improvements observed with the intra-trial strategy are believed to be artificial and should not be trusted.

Table 1 .
Classification results for the three tasks across both data split strategies. Task 1: noise vs. speech-in-noise, Task 2: left vs. right attended speech and Task 3: NR OFF vs. NR ON. *Independent two-sample t-test, showing significant results at p-value < 0.05.

Table 2 .
Comparison of EEG-based auditory attention decoding performance between our study and previous literature, including stimulus reconstruction (SR) and locus of attention (LoA) models for normal-hearing (NH) and hearing-impaired (HI) subjects using linear regression (LR), deep neural network (DNN) and (deep) convolutional neural networks ((D)CNN).

Table 2 presents a comparison of our proposed AAD method with prior research. It is important to note that making a direct comparison across different studies is challenging due to variations in data sets, subject characteristics, and experimental designs.