Effects of speech transmission quality on sensory processing indicated by the cortical auditory evoked potential

Published 29 July 2020 © 2020 IOP Publishing Ltd
Citation: Stefan Uhrig et al 2020 J. Neural Eng. 17 046021, DOI 10.1088/1741-2552/ab93e1

1741-2552/17/4/046021

Abstract

Objective. Degradations of transmitted speech have been shown to affect perceptual and cognitive processing in human listeners, as indicated by the P3 component of the event-related brain potential (ERP). However, research suggests that previously observed P3 modulations might actually be traced back to earlier neural modulations in the time range of the P1-N1-P2 complex of the cortical auditory evoked potential (CAEP). This study investigates whether auditory sensory processing, as reflected by the P1-N1-P2 complex, is already systematically altered by speech quality degradations. Approach. Electrophysiological data from two studies were analyzed to examine effects of speech transmission quality (high-quality, noisy, bandpass-filtered) for spoken words on amplitude and latency parameters of individual P1, N1 and P2 components. Main results. In the resultant ERP waveforms, an initial P1-N1-P2 manifested at stimulus onset, while a second N1-P2 occurred within the ongoing stimulus. Bandpass-filtered versus high-quality word stimuli evoked a faster and larger initial N1 as well as a reduced initial P2, hence exhibiting effects as early as the sensory stage of auditory information processing. Significance. The results corroborate the existence of systematic quality-related modulations in the initial N1-P2, which may potentially have carried over into P3 modulations demonstrated by previous studies. In future psychophysiological speech quality assessments, rigorous control procedures are needed to ensure the validity of P3-based indication of speech transmission quality. An alternative CAEP-based assessment approach is discussed, which promises to be more efficient and less constrained than the established approach based on P3.


1. Introduction

The impact of degraded speech on human listeners has been independently studied in telecommunications quality engineering [1–3], audiology and auditory neuroscience [4–7]. A recent series of experiments in the former field manipulated quality impairment factors common to speech transmission pathways and terminal equipment like packet/frame loss, multiplicative noise, bandwidth limitation and bit rate reduction (for reviews, see [8, 9]). In listening-only situations, electrical brain activity was recorded using electroencephalography (EEG) [10] to attain new insights into when, how and to what extent the induced speech quality degradations affected auditory change detection and discrimination [11–14]. Amplitude and latency characteristics of the P3 (also: P300) component [15, 16] of the event-related brain potential (ERP) [10] were implicated as neural correlates of perceived quality change: Higher magnitudes of quality impairment were reliably accompanied by increased P3 amplitudes and reduced P3 latencies. Moreover, depending on the stimulus context, different kinds of quality impairment resulted in specific P3 response patterns [17, 18]. These effects on P3 have been ascribed to varying 'neuronal effort' mobilized for perceptual and cognitive processing and corresponded closely with behavioral task performance [12–14, 17, 18].

The present study aims at extending psychophysiological assessment of speech transmission quality from late portions of the ERP signal (containing P3) to earlier portions that are primarily associated with sensory processing of acoustic stimulus features. Demonstrating the existence of early quality-related neural modulations is expected to have relevant methodological implications since assessment reliability and sensitivity would depend more strongly on the acoustic surface form of transmitted speech signals than previously assumed.

1.1. P3-based assessment of speech transmission quality

Prior theoretical assertions presumably motivated the initial decision to utilize the P3 component for psychophysiological quality assessment. Conceptual models of internal quality formation have reduced 'quality' to the subjective experience associated with a cognitive evaluation process [2, 3, 19]: Accordingly, a quality judgment is understood as resulting from the comparison of expected and perceived quality features, referring to a subset of descriptive stimulus features taken into account by an individual person who evaluates quality, while being situated in a multi-layered (e.g. temporal, physical, task etc.; see [20, 21]) context. Yet, the immediate experience of quality might already be contained in the conscious percept of a given stimulus, hence rendering it a higher-order perceptual attribute of perceived quality [17, 22] that is more evaluative than descriptive in character [23–25]. Decompositions of the 'classic P3' waveform have revealed at least two subcomponents, P3a and P3b, which are linked to attentional orienting and perceptual categorization, respectively [15, 26, 27]. Thus, due to their acknowledged functional roles in perceptual and cognitive processing, both subcomponents could be considered suitable psychophysiological indicators of quality percepts and judgments.

In addition, methodological limitations might have been responsible for research on speech quality perception and cognition taking priority over sensory processing of impaired speech signals. Past studies relied mostly on psycho-acoustic techniques in order to establish relationships between different kinds and intensities of speech signal impairments and behavioral descriptions of perceived speech quality [1–3, 28, 29]. Respective protocols for subjective multimedia quality assessment have evolved into standardized recommendations by the International Telecommunication Union (ITU) [31]: Participants are usually required to reflect on their quality percepts of test stimuli, to arrive at an explicit quality judgment, encode it within the range of the quality rating scale at hand and produce the corresponding behavioral description [3, 19]. However, since most sensory and perceptual processes occur in an entirely implicit (that is, unconscious, non-intentional and automatic) fashion, they are largely inaccessible to explicit (that is, conscious, intentional and controlled) reflection. Because of these constraints of commonly employed measuring instruments, the investigative focus long remained fixed on higher-cognitive and response-related processing, the assessment of which could be ensured with sufficient validity [30].

1.2. Cortical auditory evoked potential (CAEP): P1-N1-P2 complex

Whether sensory processing is influenced by quality impairment factors during speech signal transmission has remained an open question [1]. In audiology and auditory neuroscience, the portion of the ERP signal associated with distinct sensory processes, the cortical auditory evoked potential (CAEP), has been studied under controlled listening conditions using relatively simple test stimuli (e.g. tones, syllables) [6, 32, 33]. Three long-latency CAEP components, P1, N1 and P2 (also: P50, N100, P200)—often treated as a combined single P1-N1-P2 complex—occur time-locked to the onset and offset of presented stimuli [34]. Following acoustic change within an ongoing stimulus (e.g. transitions between two syllables), P1-N1-P2 is referred to as acoustic change complex (ACC) [35, 36] or second N1-P2 (also: C[hange]-complex, see [6, 37]). If a stimulus deviates to a sufficient degree from a preceding stimulus sequence, the mismatch negativity (MMN) and P3a are triggered, reflecting sensory change detection and automatic attentional resource allocation, respectively [38, 39]. If the stimulus further demands participants to act through a cognitive or motor response, the P3b can reflect cognitive load (i.e. the amount of attentional resources mobilized for task engagement) [15, 16, 26]. The constituent components of P1-N1-P2 are describable as exogenous due to their evoked responses being primarily affected by acoustic stimulus features [6, 32, 33]. On the contrary, response modulations of late-latency, endogenous components like MMN, P3a and P3b are increasingly determined by internal factors (e.g. selective attention, task relevance categorization).

The global function associated with P1-N1-P2 lies in early obligatory processing of sensory signals to provide information for later perceptual and cognitive processing stages [6, 32–34, 36]. Consequently, the extent and sensitivity of P1-N1-P2 modulations should determine listeners' basic capacity for sensory change detection (MMN, P3a), perceptual discrimination and behavioral response initiation (P3b) [5, 6, 32, 33, 40]. Types of acoustic change known to evoke P1-N1-P2 include variation in intensity [41, 42], frequency [43] and timing (e.g. silent gap duration, voice onset time) [35, 44, 45] of tones, noise bursts and syllables, variation in initial speech sounds [46–48] and speech sounds within syllables [35]. Higher magnitudes of intensity, frequency and timing changes generally correspond with faster and larger P1-N1-P2 responses as quantified by reduced latencies and increased amplitudes, respectively.

N1 is proposed to consist of a number of subcomponents which reflect different sensory processes [49, 50], including an 'initial readout of [newly available] information from the sensory analyzers' [49] (p. 413); and the formation of a sensory memory representation of current stimulus features that is later matched to a representation of the previous stimulation [51, 52]. If both representations differ in some stimulus features, the MMN is indicative of this pre-attentive comparison process [39, 53]. Contrary to N1, in-depth knowledge about the specific functions of the other two positively polarized components, P1 and P2, remains scarce. The P1 has been associated with a process of sensory gating, that is, the initial filtering of redundant sensory input from the stimulatory environment [37]. Cumulative evidence has further demonstrated distinct neural generators for N1 and P2, pointing towards separate sensory processes underlying P2, the exact functions of which are yet to be uncovered [54].

A conceptual framework of quality perception has been introduced in a previous study [17], which integrates theoretical viewpoints held in quality engineering [2, 19, 20] and psychophysiology [15, 26, 27, 38, 39]. The derived conceptual framework already contained a sensory processing stage relying on sensory memory. The present study fills this smaller 'black box' of sensory processing with more detailed processes and links them to components of the P1-N1-P2 complex. Figure 1 shows the resulting extended version of the framework: It assumes initial filtering of incoming sensory signals (sensory gating), as reflected by P1; thereafter follow the sensory encoding and analysis of physical stimulus features, accessing of this new sensory information and forming of a sensory memory representation, as indicated by N1-P2.


Figure 1. Conceptual framework of quality perception in the context of the oddball paradigm [17], extended by a more detailed sensory processing stage (ellipse = internal process, rectangle = external event, pentagon = influencing factor related to the oddball task, cloud = subjective experience). Steps of internal information processing are functionally associated with components of the event-related brain potential (ERP), covering both cortical sensory evoked potentials (CSEPs; e.g. auditory P1-N1-P2), and late ERP (MMN, P3a, P3b) components. CSEP measures are immediately affected by any variation in physical features of the presented stimulus; specific sensory processes, presumably reflected by P1 and N1-P2 components, are marked in blue color. Action effects in the environment may be captured by behavioral measures. The schematic is based on figure 2.4 in [19], copyright 2014, with permission from Springer.


1.3. Present study: analysis of CAEP components

The present study puts forward analyses of electrophysiological data gathered in two companion studies (referred to below as study 1 [17] and study 2 [18]), each of which employed a three-stimulus variant of the classic oddball paradigm to examine effects of speech quality degradations on P3. Three-stimulus oddball tasks comprise three stimulus types that vary in presentation rate and task relevance [15]: 'Standards' occur with high frequency (e.g. 70%) and are always task-irrelevant. Two types of 'oddballs' occur with low frequency (e.g. 15% each), being either task-relevant 'targets' or task-irrelevant 'distractors'. Single oddballs are randomly interspersed into the continuous standard sequence. Task relevance is manipulated by instructing participants to actively respond to occurring targets, either through overt motor behavior (e.g. pressing a push button) or covert mental activity (e.g. mental counting). Target presentations were found to trigger both P3a and P3b, distractor presentations to elicit only P3a [15, 16].

Using psycho-acoustic techniques, the perceptual dimensions of 'noisiness' and 'coloration' have been identified as contributing to integral speech quality when loudness level is held constant [29, 55]. In study 1 [17], this underlying perceptual dimensionality of speech quality was varied in the standard and oddball (target, distractor) by applying either signal-correlated noise ('noisiness'-impaired) or bandpass filtering ('coloration'-impaired) to a spoken word stimulus. In study 2 [18], two types of targets were presented, involving either a single change in quality (T-single) or double changes in quality and spoken word content (T-double). For the content manipulation, two words with different initial phonemes were presented.

As both previous studies initially set out to examine effects on P3, their ERP parameter analyses were exclusively based on oddball (i.e. target, distractor) trials, leaving the much higher numbers of standard trials unused. Despite that, each experimental design had assigned all high-quality and quality-impaired stimuli to serve as standards. This circumstance opened up possibilities to exploit the favorable signal-to-noise ratios of aggregated standard trials. Thus, whereas the original analyses relied on ERPs elicited by oddball presentations, the analyses conducted in the present study center around CAEPs after standard presentations. Emphasis is placed on the P1-N1-P2 complex to stimulus onset, leaving the ACC and auditory evoked responses to stimulus offset aside. This procedure is not entirely new since CAEP component parameters have been extracted from standard trials of active oddball paradigms before to analyze neural correlates of speech-in-noise [40, 56–58]. Also, Porbadnigk et al first extracted features from CAEP-range time windows to train a classifier that could successfully differentiate between strongly degraded and high-quality spoken vowels [14]. In their conclusions, however, the authors rather highlighted identified correlations between P3 features and behavioral discrimination performance.

Initial inspection of ERP waveforms for target stimuli obtained in studies 1 and 2 revealed visible negative deflections in the time range of N1, which appeared to be stronger for coloration- than noisiness-impaired oddball stimuli [17, 18]. Nonetheless, these observations were entirely exploratory in character since only a priori formulated hypotheses concerning P3 parameters were tested in the previous analyses. Taken together with the available knowledge base on the P1-N1-P2 complex (see section 1.2), they raised the question of whether early evoked negativities might have carried over into the late ERP, causing or at least contributing to the observed P3 modulations. In other words: Where is the actual 'locus of experimental effect' [59]? In late ERP components like the P3, as originally assumed, or already in CAEP components reflecting sensory processing? As a first step to answer this crucial question, Porbadnigk et al examined effects of different magnitudes of signal-correlated noise (i.e. varying its perceived intensity or degree of noisiness) on unspecified evoked neural activity [14]. The present analyses aimed at a more systematic investigation into the effects of different kinds of speech quality degradations (signal-correlated noise vs. bandpass filtering) on CAEP component parameters.

To test for the existence of significant quality-related modulations in CAEP components, analyses of previously collected electrophysiological data were carried out that expanded the original ones in the following ways:

First, separate time windows were used to quantify individual components of the P1-N1-P2 complex. The a priori selection of these time windows was based on the so-called 'collapsed localizer' approach [60, 61]: According to this approach, time windows for ERP parameter extraction can be pre-specified by visually inspecting the temporal component structure of ERP waveforms that have been grand averaged ('collapsed') over the experimental conditions of interest to the investigator.

Second, in addition to peak amplitude and latency, mean amplitude was also extracted. Amplitude modulations were assumed to have higher convergent validity when observed across both amplitude parameters.

Third, P3 parameter analysis in study 2 compared experimental conditions by aggregating over presentations of different spoken German words (/haus/, /maus/; English: 'house', 'mouse'). Although both words were recorded from the same male speaker, most certainly the onset P1-N1-P2 would be modulated by initial phonological variation (e.g. unvoiced /h/ vs. voiced /m/), possibly even interacting with the induced kinds of quality impairments. Thus, an additional fixed factor word (/haus/, /maus/) was included in the present analysis.

2. Methods

In the following subsections, details on study 1 [17] and study 2 [18] are provided to the degree necessary within the scope of the present study. All methodological additions to the present analyses will be explained in full detail.

2.1. Participants

Participants were native German speakers without any reported hearing problems (study 1: N1 = 28, mean age: 29.7 ± 9 years, 17 female, 1 left-handed; study 2: N2 = 34, mean age: 27.3 ± 5.7 years, 20 female, 1 left-handed). In study 2, participants further underwent a fixed-frequency Békésy audiometry, which objectively confirmed their normal hearing capacity. All participants gave their informed consent to partake in the experiments. In both studies, experimental data were collected in accordance with ethical principles for medical research involving human subjects forwarded by the World Medical Association (WMA) declaration of Helsinki and guidelines by the ethics committee of Faculty V of Technische Universität Berlin.

2.2. Stimuli

As specified in table 1, the previous studies used high-quality (HQ) and quality-impaired (N = noisiness-impaired, C = coloration-impaired) stimuli generated from spoken word recordings (study 1: /haus/, English: 'house'; study 2: /haus/, /maus/, English: 'mouse'). Each induced quality impairment affected only a single perceptual quality dimension at a time ('noisiness' vs. 'coloration'; see [29, 55]). To induce different kinds of quality impairments of certain magnitudes in the speech files (signal-correlated/multiplicative noise, spectral distortion), appropriate impairment methods (modulated noise reference unit [62], bandpass filter) with pre-specified parameter settings were applied. All stimuli were finally normalized to an active speech level of -26 dBov (dBov: decibels relative to the overload point of the digital system) [63]. This resulted in three stimuli for study 1 (HQ, N, C) and six stimuli for study 2 (HQ-/haus/, N-/haus/, C-/haus/, HQ-/maus/, N-/maus/, C-/maus/). Oscillo- and spectrograms of all speech signals are depicted in figure 2.


Figure 2. Stimuli. Study 1 used high-quality (HQ) and quality-impaired (N = noisiness-impaired, C = coloration-impaired) variants of the German word /haus/ (a), spoken by a female voice. Study 2 used HQ and quality-impaired (N, C) variants of two German words, /haus/ (b) and /maus/ (c), each time spoken by the same male voice. Speech signals are displayed in the time domain (top row) and time-frequency domain (bottom row).


Table 1. Stimulus specifications. HQ = high-quality, N = noisiness-impaired, C = coloration-impaired. MNRU = modulated noise reference unit [62, 63] (Q: ratio of speech power to modulated noise power, dBov: dB relative to the overload point of the digital system). FIR = finite impulse response.

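| Study | Duration | Voice gender | Word | Quality | Impairment method | Settings |
|---|---|---|---|---|---|---|
| 1 | 500 ms | Female | /haus/ | HQ | - | - |
| 1 | 500 ms | Female | /haus/ | N | MNRU | Q = 18 dBov |
| 1 | 500 ms | Female | /haus/ | C | FIR bandpass filter | 0.1–2 kHz |
| 2 | 700 ms | Male | /haus/ | HQ | - | - |
| 2 | 700 ms | Male | /haus/ | N | MNRU | Q = 14 dBov |
| 2 | 700 ms | Male | /haus/ | C | FIR bandpass filter | 0.2–2.8 kHz |
| 2 | 700 ms | Male | /maus/ | HQ | - | - |
| 2 | 700 ms | Male | /maus/ | N | MNRU | Q = 14 dBov |
| 2 | 700 ms | Male | /maus/ | C | FIR bandpass filter | 0.2–2.8 kHz |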

2.3. Oddball tasks

The oddball tasks were carried out in a sound-attenuated laboratory room in accordance with ITU-T Rec. P.800 [31]. Participants binaurally listened to stimuli through closed-ear headphones (Sennheiser HD 280), fixating a white cross displayed on a monitor screen in front of them. The sound pressure level was measured (using NTI Audio AL1 Acoustilyzer) and adjusted to approximately 65 dB to ensure a comfortable listening level. Table 2 lists oddball task specifications for each study, which in all cases included three stimulus types. Figure 3 shows exemplary trial sequences, comprising presentations of standards (S) and two types of oddballs: either one target (T) and one distractor (D) in study 1 or two target types (T-single, T-double) in study 2. Each stimulus type was presented with a certain probability and either required (T, T-single, T-double) or did not require (S, D) participants to execute a defined behavioral response. In both oddball tasks, participants first listened to instances of each stimulus type for memorization before starting the actual test block; they were instructed to respond only to the current target stimulus (i.e. T in study 1; T-single and T-double in study 2) and ignore any other non-target stimuli (i.e. S; also D in study 1). The required response was to press a push button, held in the participant's dominant hand, as fast and as accurately as possible. Stimulus sequences were pseudo-randomized, such that at least one standard occurred between two consecutive oddballs. Inter-stimulus intervals were 1200 ms, jittered by ± 200 ms. Stimulus presentation was controlled by Psychophysics Toolbox Version 3 (PTB-3) for MATLAB, using a high-quality audio interface (Cakewalk by Roland UA-25 EX).
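The sequencing constraints described above can be illustrated with a minimal Python sketch. It is not the PTB-3/MATLAB code used in the studies: the function names, the default of 200 trials, the uniform jitter distribution and the slot-based placement strategy are all assumptions made for illustration. Only the 70/15/15 proportions, the rule that at least one standard separates consecutive oddballs, and the 1200 ± 200 ms interval come from the text.

```python
import random


def make_oddball_sequence(n_trials=200, p_oddball=0.15, oddballs=("T", "D"), seed=0):
    """Return a list of stimulus labels ('S' plus the oddball labels).

    Hypothetical helper: oddballs are dropped into distinct gaps between
    standards, so two oddballs can never be adjacent.
    """
    rng = random.Random(seed)
    n_each = round(n_trials * p_oddball)          # e.g. 15% per oddball type
    odd = [label for label in oddballs for _ in range(n_each)]
    rng.shuffle(odd)
    n_std = n_trials - len(odd)                   # remaining trials are standards
    if len(odd) > n_std + 1:
        raise ValueError("too few standards to separate all oddballs")
    # Gaps 0..n_std: before the first standard, between standards, after the last.
    slots = set(rng.sample(range(n_std + 1), len(odd)))
    remaining = iter(odd)
    seq = []
    for pos in range(n_std + 1):
        if pos in slots:
            seq.append(next(remaining))           # at most one oddball per gap
        if pos < n_std:
            seq.append("S")
    return seq


def jittered_isi(rng, base_ms=1200.0, jitter_ms=200.0):
    """Draw one inter-stimulus interval; a uniform jitter is assumed here."""
    return rng.uniform(base_ms - jitter_ms, base_ms + jitter_ms)


# Study-1-like sequence (target + distractor); for a study-2-like sequence,
# pass oddballs=("T-single", "T-double") instead.
sequence = make_oddball_sequence(n_trials=200, oddballs=("T", "D"), seed=1)
intervals_ms = [jittered_isi(random.Random(i)) for i in range(len(sequence))]
```

Placing at most one oddball per gap enforces the separation rule by construction, which avoids the repeated reshuffling that a rejection-sampling approach would need.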


Figure 3. Oddball tasks employed in study 1 (a) and study 2 (b). High-quality and quality-impaired spoken words (/haus/, /maus/) are serially presented as different stimulus types (S = standard, T = target, D = distractor; T-single = target involves single feature change, T-double = target involves double feature change). Only standard trials marked in blue color are considered for further analysis (see section 2.5). The schematic is based on figure 1 in [15], copyright 2007, with permission from Elsevier.


Table 2. Oddball task specifications. S = standard, T = target, D = distractor; T-single = target involves single feature change, T-double = target involves double feature change.

| Study | Stimulus type | Frequency (%) | Behavioral response |
|---|---|---|---|
| 1 | S | 70 | no |
| 1 | T | 15 | yes |
| 1 | D | 15 | no |
| 2 | S | 70 | no |
| 2 | T-single | 15 | yes |
| 2 | T-double | 15 | yes |

2.4. EEG data collection

Electrophysiological data were recorded via an active EEG system (g.GAMMAsys) with 16 electrodes positioned according to the extended 10-20 system. The reference electrode and the ground electrode were placed on the mastoids. Sampling rates were set to 512 Hz. As the P1-N1-P2 complex is usually most pronounced along the central midline [6], the present analyses only concerned EEG signals at electrode position Cz.

2.5. EEG data processing

The gathered electrophysiological data had already been preprocessed using the EEGLAB toolbox for MATLAB, which included down-sampling (new sampling rate: 256 Hz), bandpass filtering (low cutoff = 0.1 Hz, high cutoff = 40 Hz), segmentation into stimulus-locked epochs (pre-stimulus time = 0.2 s, post-stimulus time = 1 s) and baseline correction. For each participant, an amplitude threshold was determined across all channels by an iterative procedure such that 10% of available epochs would be rejected; afterwards, epochs exceeding the individual amplitude threshold were discarded. Ocular artifacts due to eye movements and blinks were removed based on an independent component analysis. Averaging of single-trial epochs resulted in single-subject and grand average epochs. Since ERPs to standards that immediately succeed oddballs have been reported to contain MMN components [52, 64], all post-oddball standard trials were removed from subsequent analysis as illustrated in figure 3 (see [58] for a similar procedure).
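The final selection step, keeping only standards that do not directly follow an oddball, can be sketched in a few lines of Python. The function name and the label coding ('S' for standards, anything else for oddballs) are illustrative assumptions; the original pipeline was implemented in EEGLAB/MATLAB.

```python
def select_standard_trials(labels):
    """Return indices of standard trials not immediately preceded by an oddball.

    `labels` is the presentation order of stimulus types, with 'S' marking
    standards and any other label marking an oddball (assumed coding).
    """
    keep = []
    for i, label in enumerate(labels):
        if label != "S":
            continue                      # oddball trials are not analyzed here
        if i > 0 and labels[i - 1] != "S":
            continue                      # post-oddball standard: excluded
        keep.append(i)
    return keep


example = ["S", "S", "T", "S", "S", "D", "S", "S"]
kept = select_standard_trials(example)    # indices 0, 1, 4 and 7 are kept
```

In this toy sequence, the standards at positions 3 and 6 are dropped because they directly follow the target and the distractor, respectively.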

2.6. CAEP quantification

Three CAEP parameters were extracted from post-stimulus time windows within single-trial epochs to quantify individual components of the P1-N1-P2 complex: Peak amplitude, defined as the maximum/minimum voltage value; Mean amplitude, defined as the average voltage value; Peak latency, defined as the time interval between stimulus onset and the maximum/minimum voltage value.
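As a rough illustration of these three definitions, the following Python sketch extracts them from one epoch within a given time window. The function name, argument names and the synthetic waveform are assumptions. The `polarity` argument selects the maximum for positive components (P1, P2) and the minimum for negative ones (N1).

```python
def caep_parameters(epoch_uv, sfreq_hz, t_start_ms, window_ms, polarity):
    """Peak amplitude, mean amplitude and peak latency within a time window.

    epoch_uv   : voltage samples of one epoch (microvolts)
    sfreq_hz   : sampling rate in Hz
    t_start_ms : time of the first sample relative to stimulus onset
    window_ms  : (lower, upper) window bounds in ms, both inclusive
    polarity   : +1 for positive components, -1 for negative components
    """
    lo_ms, hi_ms = window_ms
    idx = [i for i in range(len(epoch_uv))
           if lo_ms <= t_start_ms + 1000.0 * i / sfreq_hz <= hi_ms]
    if not idx:
        raise ValueError("time window contains no samples")
    values = [epoch_uv[i] for i in idx]
    pick = max if polarity > 0 else min
    peak_i = pick(idx, key=lambda i: epoch_uv[i])   # index of the extreme value
    return {
        "peak_amplitude": epoch_uv[peak_i],
        "mean_amplitude": sum(values) / len(values),
        "peak_latency_ms": t_start_ms + 1000.0 * peak_i / sfreq_hz,
    }


# Synthetic epoch at 256 Hz spanning -200 ms to about 1000 ms (values invented).
sfreq = 256.0
epoch = [0.0] * 307
epoch[80] = -4.0                                    # artificial negative deflection
n1 = caep_parameters(epoch, sfreq, -200.0, (90.0, 140.0), polarity=-1)
```

Because peak measures depend on a single sample while the mean amplitude integrates over the whole window, the two amplitude parameters can diverge for noisy waveforms.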

The extraction time windows were chosen in accordance with the 'collapsed localizer' approach, that is, by visually inspecting ERP waveforms averaged across all participants and all levels of quality [60, 61]: For study 1, windows 50–90 ms (P1), 90–140 ms (N1), 140–170 ms (P2) were selected; For study 2, windows 125–187.5 ms (P1), 187.5–260 ms (N1), 260–330 ms (P2) were selected.
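The 'collapsing' step itself amounts to averaging waveforms over the conditions of interest before any windows are chosen. The Python sketch below shows this under the assumption that each condition contributes one grand average waveform of equal length and equal weight. The function name and the toy values are hypothetical, and the subsequent window selection remains a visual judgment that the code does not automate.

```python
def collapsed_waveform(condition_averages):
    """Average condition-wise waveforms sample by sample.

    condition_averages maps a condition label (e.g. 'HQ', 'N', 'C') to a
    list of voltage samples; all lists must have the same length.
    """
    waveforms = list(condition_averages.values())
    if not waveforms:
        raise ValueError("no conditions given")
    n_samples = len(waveforms[0])
    if any(len(w) != n_samples for w in waveforms):
        raise ValueError("all waveforms must have the same number of samples")
    return [sum(w[k] for w in waveforms) / len(waveforms)
            for k in range(n_samples)]


# Toy three-sample waveforms (invented values) for the three quality levels.
collapsed = collapsed_waveform({
    "HQ": [0.0, 1.0, 2.0],
    "N": [2.0, 3.0, 4.0],
    "C": [4.0, 5.0, 6.0],
})
```

Since the collapsed waveform no longer carries condition labels, windows chosen from it cannot be tuned toward a particular condition difference.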

Figure 4 shows collapsed localizer grand average ERP waveforms used for choosing the time windows. Each waveform demonstrated distinctive P1-N1-P2 and ACCs (second N1-P2). Moreover, in study 1 and 2 late sustained negativities at around 300 and 550 ms onward as well as late sustained positivities at around 525 and 700 ms onward were observable, respectively. Temporal-morphological component structures of P1-N1-P2 differed considerably between studies: Individual components in study 2 versus 1 appeared to be extended over time and possess more pronounced amplitudes (with exception of P1 being larger in study 1). The visibly different onsets of ERPs (starting later in study 2) were attributable to varying pre-signal silent periods in the speech files (compare oscillograms in figure 2).


Figure 4. Collapsed localizer grand average ERP waveforms at electrode position Cz in study 1 (a) and study 2 (b). Gray areas mark time windows for parameter extraction. Error bands represent 95% confidence intervals.


2.7. Data analysis

Statistical analyses were performed in R using the 'nlme' and 'multcomp' packages. To analyze effects on CAEP parameters, linear mixed-effects models (LMEMs) with random intercepts were calculated as proposed by Tibon and Levy [65]: For study 1, nine LMEMs were computed with quality (HQ, N, C) as fixed factor and subject (1–28) as a random factor for each time window (50–90, 90–140, 140–170 ms) and CAEP parameter (peak amplitude, peak latency, mean amplitude). For study 2, another nine LMEMs were calculated with quality and word (/haus/, /maus/) as fixed factors and subject (1–34) as a random factor for each time window (125–187.5, 187.5–260, 260–330 ms) and CAEP parameter. A statistical significance level of α = 0.05 was assumed and Šidák-adjusted for the total number of 18 LMEMs. For statistically significant effects in LMEMs, additional post hoc general linear hypotheses with Holm correction were computed.
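For orientation, the two multiplicity corrections named above can be written out in a short Python sketch. This is an illustration of the standard Šidák and Holm formulas, not the R code used for the analyses. The function names and the example p-values are hypothetical.

```python
def sidak_alpha(alpha, n_tests):
    """Per-test significance level under the Sidak adjustment."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)   # enforce monotonicity
        adjusted[i] = running_max
    return adjusted


# Threshold for the 18 models described above (roughly 0.0028).
alpha_per_model = sidak_alpha(0.05, 18)

# Hypothetical post hoc p-values for three pairwise contrasts.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

With 18 models, the Šidák threshold is slightly less strict than the Bonferroni value of 0.05/18, which is why it is sometimes preferred when the number of tests is moderate.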

3. Hypotheses

The hypotheses derived for the present CAEP analyses built upon exploratory observations made in studies 1 and 2. Even though previous ERP parameter analyses had concentrated on P3, visual inspection of grand average ERP waveforms for target stimuli revealed visible negative deflections from baseline in earlier time ranges probably related to N1 and MMN components [17, 18]. More specifically, the coloration-impaired oddballs appeared to cause a stronger early negativity than the noisiness-impaired oddballs. It was therefore expected that coloration-impaired standards would also elicit a more pronounced N1, but no MMN as their sensory processing should not entail mismatch detection (hypothesis 1).

A recent series of experiments by Billings et al analyzed CAEPs to spoken syllables in continuous background noise. The authors established negative and positive relationships between noise level and N1-P2 amplitude and latency, respectively [66–68] (with similar evoked response patterns being reported by [40, 56, 57, 69–72]). Furthermore, interactions with regard to the spectro-temporal properties of different types of signals (tone, envelope signal, syllable) and background noise (continuous, modulated/interrupted, babble) were revealed [58, 73, 74]. Attempts have been made to separate evoked responses to continuous background noise from those to the speech signal by presenting the noise one second prior to the signal [57]: Indeed, different CAEP components occurred at noise onset and signal onset. This problem of overlapping evoked responses has been circumvented in studies 1 and 2, each of which applied signal-correlated instead of background noise [17, 18]. Assuming generalization of the aforementioned relationships involving background noise to signal-correlated noise used in studies 1 and 2, increased latencies and reduced amplitudes of N1 and P2 would be expected for noisiness-impaired versus high-quality stimuli (hypothesis 2).

Study 1 used only one word, /haus/, as stimulus material, while study 2 used two words, /haus/ and /maus/. Acoustic variation due to different voicing of the initial phoneme (i.e. voiceless glottal fricative /h/ versus voiced bilabial nasal /m/) had already manifested in behavioral task performance [18]: Presentations of noisiness-impaired /maus/ (N-/maus/) as targets provoked significantly faster and more accurate behavioral responses compared to the other stimuli. The timing and morphology of P1-N1-P2 were therefore expected to deviate between the two spoken words following a similar pattern: Auditory evoked responses to stimulus onset of /maus/ versus /haus/ would be faster and larger due to higher acoustic energy concentrated in the initial /m/ versus /h/ as seen in spectrograms of figure 2 (hypothesis 3).

Because of differences in stimulus material, induced speech quality degradations and oddball tasks, electrophysiological data from each study were analyzed independently, which is why no statistical comparisons across studies were conducted. Nonetheless, several sources of variability in CAEPs between the two studies could be anticipated. Presumably the most important factor, voice gender (female vs. male speaker) would be expected to cause variation in spectro-temporal voice characteristics (e.g. fundamental frequency) [75, 76]. For instance, a former ERP study on voice gender categorization found female and male voices to trigger larger P2 and N1 components, respectively [75]. Although stimuli were generated with the same quality impairment methods, different parameter settings were used, causing variation in impairment magnitude and perceived degradation intensity (see table 1; discussed in section 2.2). Consequently, CAEPs to different speech quality degradations might deviate between the two studies in other ways difficult to predict a priori.

4. Results and discussion

Based on visual inspection of grand average ERP waveforms shown in figures 5, 6 and 7, pronounced P1-N1-P2 complexes were elicited by all levels of quality (HQ = high-quality, N = noisiness-impaired, C = coloration-impaired). In addition, ACCs consisting of second N1-P2 components occurred within the ongoing stimuli, similar to earlier studies [35, 70]. As in prior uses of standard trials from active oddball tasks, the initial N1 and the second N1 stood out as prominent negative components [58].

Figure 5. Grand average ERP waveforms at electrode position Cz in study 1 (N = 28), split up by quality (HQ = black-dashed, N = purple-solid, C = green-dotdash). Gray areas mark time windows for CAEP parameter extraction. Error bands represent 95% confidence intervals.

Figure 6. Grand average ERP waveforms at electrode position Cz in study 2 (N = 34), split up by word (/haus/, /maus/) and quality (HQ = black-dashed, N = purple-solid, C = green-dotdash). Gray areas mark time windows for CAEP parameter extraction. Error bands represent 95% confidence intervals.

Figure 7. Grand average ERP waveforms at electrode position Cz in study 2, split up by stimulus (HQ-/haus/, N-/haus/, C-/haus/, HQ-/maus/, N-/maus/, C-/maus/), with line type coding for quality (HQ = dashed, N = solid, C = dotdash) and color coding for word (/haus/ = red, /maus/ = blue). Gray areas mark time windows for CAEP parameter extraction.

Beyond the CAEP, collapsed localizer grand average ERP waveforms (see figure 4) demonstrated a late negative shift in baseline activity, possibly reflecting a sustained potential component triggered by auditory stimuli of longer duration [5, 77, 78]. Besides, an involvement of the so-called processing negativity [89, 80] might be considered, which is functionally related to selective attention, i.e. the selection of a subset of the available sensory information (acoustic stimulus features) for in-depth perceptual and cognitive processing [80], and is even observable after behaviorally irrelevant stimuli [81]. Both late components might have been enhanced by oddball task requirements, which demanded that participants actively concentrate on sensory detection and perceptual discrimination of targets from standards over the course of the test sessions.

Statistically significant results from CAEP parameter analyses for study 1 and 2 are summarized in tables 3 and 4, with corresponding results from post hoc analyses listed in appendix tables A1 and A2; mean plots of significant effects for study 1 and 2 are depicted in appendix figures A1 and A2, respectively.

Table 3. Effects of quality on CAEP parameters in study 1 (N = 28), extracted from different time windows. All listed effects are statistically significant with αSID = 0.0028. Effect indices (#) point to correspondingly numbered subtables in appendix table A1 and mean plots in appendix figure A1.

| # | Extraction window (ms) | CAEP parameter | F[2, 14994] | p |
|---|---|---|---|---|
| 1 | 90–140 (N1) | Peak amplitude | 11.07 | < 0.0001 |
| 2 | 90–140 (N1) | Mean amplitude | 17.93 | < 0.0001 |
| 3 | 90–140 (N1) | Peak latency | 7.62 | < 0.001 |
| 4 | 140–170 (P2) | Mean amplitude | 6.54 | 0.0015 |
| 5 | 140–170 (P2) | Peak latency | 24.40 | < 0.0001 |
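
The three parameter types listed in table 3 (peak amplitude, mean amplitude, peak latency) can be illustrated with a minimal Python sketch. It assumes a single averaged waveform sampled at known time points; the function name `extract_window_parameters`, the polarity argument and the toy waveform are hypothetical and do not reproduce the analysis code used for studies 1 and 2.

```python
from statistics import mean

def extract_window_parameters(times_ms, voltages_uv, start_ms, end_ms, polarity):
    """Return peak amplitude, mean amplitude and peak latency within a window.

    polarity is -1 for negative components (e.g. N1) and +1 for positive
    components (e.g. P2). All names here are illustrative assumptions.
    """
    window = [(t, v) for t, v in zip(times_ms, voltages_uv) if start_ms <= t <= end_ms]
    if not window:
        raise ValueError("time window contains no samples")
    # The peak is the most extreme sample in the direction of the component's polarity.
    peak_time, peak_voltage = max(window, key=lambda tv: polarity * tv[1])
    return {
        "peak_amplitude": peak_voltage,
        "mean_amplitude": mean(v for _, v in window),
        "peak_latency": peak_time,
    }

# Toy waveform (10 ms steps, made-up values) with a negative deflection near 110 ms.
times = list(range(80, 160, 10))  # 80, 90, ..., 150 ms
volts = [0.5, -1.0, -3.0, -4.0, -2.5, -1.5, 0.0, 1.0]
n1 = extract_window_parameters(times, volts, 90, 140, polarity=-1)
```

In this toy case the peak amplitude depends on a single sample, whereas the mean amplitude pools all samples in the window, which is in line with the footnote's remark that mean amplitude is more robust to varying noise level.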

Table 4. Effects of quality and word on CAEP parameters in study 2 (N = 28), extracted from different time windows. All listed effects are statistically significant with αSID = 0.0028. Effect indices (#) point to correspondingly numbered subtables in appendix table A2 and mean plots in appendix figure A2.

| # | Extraction window (ms) | CAEP parameter | Effect | $df_{\rm n}$ | $df_{\rm d}$ | F | p |
|---|---|---|---|---|---|---|---|
| 1 | 125–187.5 (P1) | Peak latency | Quality | 2 | 20689 | 7.78 | < 0.001 |
| 2 | 187.5–260 (N1) | Peak amplitude | Word | 1 | 20689 | 9.36 | 0.0022 |
| 3 | 187.5–260 (N1) | Mean amplitude | Word | 1 | 20689 | 11.24 | < 0.001 |
| 4 | 260–330 (P2) | Peak amplitude | Quality | 2 | 20689 | 7.46 | < 0.001 |
| 5 | 260–330 (P2) | Mean amplitude | Quality | 2 | 20689 | 10.34 | < 0.0001 |
| 6 | 260–330 (P2) | Peak latency | Word | 1 | 20689 | 31.30 | < 0.0001 |
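
The significance threshold αSID = 0.0028 quoted in the captions of tables 3 and 4 can be reproduced with a short calculation, assuming that the subscript SID denotes a Šidák correction of a family-wise α of 0.05. The family size of 18 tests used below is an assumption chosen because it yields the reported value; it is not stated in this section.

```python
def sidak_alpha(family_alpha, n_tests):
    """Per-test significance level under the Šidák correction."""
    return 1.0 - (1.0 - family_alpha) ** (1.0 / n_tests)

# 18 tests is an assumed family size, not a figure taken from the article.
alpha_sid = sidak_alpha(0.05, 18)
```

Rounded to four decimals, `alpha_sid` equals 0.0028 under these assumptions.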

In study 1, auditory evoked responses to the coloration-impaired stimulus (C) were systematically shifted towards higher voltage negativity in relation to high-quality (HQ), being most apparent for N1 amplitude (see effects # 1, 2 in table 3 and figure 5). Simultaneously, N1-P2 responses to C were delayed across the CAEP component structure (see effects # 3, 5, ibid.). These findings confirmed hypothesis 1, which had originally been derived from exploratory observations in studies 1 and 2. The functional significance of this N1 modulation might lie in varying sensory encoding and analysis of spectrally distorted compared to clean speech (see conceptual framework in figure 1, discussed in section 1.2). Likewise, Obleser and Kotz found faster and stronger N1 responses for higher versus lower magnitudes of spectral distortion in spoken sentences induced through noise-vocoding [82] (a manipulation known to affect subjective quality ratings [83]). Interestingly, another study by Kong et al found the reverse pattern with smaller and delayed N1 responses for noise-vocoded versus intact continuous speech [84]. This discrepancy in N1 modulation patterns could be explained by different distortion magnitudes used in both studies, which might have altered sensory processing in listeners.

Unexpectedly, N1 and P2 amplitudes did not differ between the noisiness-impaired and high-quality conditions (see effect # 4 in table 3 and figure 5), and peak latencies decreased for P2 (see effect # 5, ibid.); hypothesis 2 was therefore rejected for study 1. It appears that the signal-correlated noise influenced CAEP components differently than the various types of background noise studied before [66–68]. Further support for such differential processing by the auditory system comes from the neuromagnetic counterpart of N1, the N1m of the auditory evoked field, measured by means of magnetoencephalography [10]. Application of the signal-correlated distortion type, in the form of bit mode reduction to transiently masked spoken vowels, exerted a positive effect on N1m amplitude, which was stronger in the right hemisphere [85, 86]. The authors traced this lateralized enhancement of N1m back to an 'increase in spectral processing caused by the addition of noisy harmonic frequencies to the signal spectrum' (p. 10) [86].

Based on visual inspection, the evoked neural response pattern seemed to deviate between the initial N1-P2 and the ACC (i.e. the second N1-P2). Possibly the induced noise distorted the initial speech sound and the one within the stimulus to different degrees; specific response patterns for initial and second N1-P2 could reflect differences in sensory processing of initial and inner portions of the speech signal, respectively. Besides, whereas acoustic change evoking the initial P1-N1-P2 consisted of a transition from silence to speech signal, the one evoking the second N1-P2 was a speech sound transition within the ongoing signal, thus implying a 'reduced absolute acoustic level change' [68] (p. 2) for the latter.

In study 2, CAEPs differed between spoken words, especially the initial N1 and second N1 being more distinctive to /maus/ and /haus/, respectively (see effects # 2, 3 in table 4; see negative deflections of initial and second N1 at around 200 and 400 ms in figures 6 and 7). Moreover, a slightly faster P2 response was observed to /haus/ versus /maus/ (see effect # 6, ibid.). These findings partly confirmed hypothesis 3 (with regard to amplitude) and could be explained by higher acoustic energy concentration in the initial phoneme /m/ versus /h/ (compare initial, low-frequency portions of the spectrograms in figure 2c versus 2b); N1 amplitude enhancement in /maus/ would be expected to cause higher neural refractoriness, which in turn visibly attenuated the second N1 amplitude [68].

Based on the obtained results, hypothesis 1 had to be rejected for study 2, since presentations of coloration-impaired stimuli did not enhance N1. Despite that, the amplitude of the P2 (and visibly also the amplitude of the second P2) was significantly reduced by coloration-impaired stimuli (see # 4, 5 in table 4 and figure 6), much more distinctly in /maus/ (see figure 6, right plot) than in /haus/ (see left plot). This negative deflection of P2, combined with the second N1-P2, might reflect enhanced neural activity to detect the transition of speech sounds in /maus/; by contrast, presentations of coloration-impaired /haus/ apparently had no clear effect on P2.

Once more, hypothesis 2 was rejected for study 2 because noisiness-impaired stimuli did not modulate N1 and P2 compared to high-quality.

Unlike earlier studies on speech-in-noise [66], distinct P1 components emerged in the resultant ERP waveforms. In study 2, P1 was significantly delayed by coloration-impaired stimuli compared to high-quality and noisiness-impaired stimuli (see effect # 1 in table 4 and figure 6). As P1 components in healthy adults are usually small-sized [32], their occurrences might be attributed to a higher number of aggregated trials, improving the signal-to-noise ratio of grand average ERPs. Still, possible effects on early sensory filtering processes (gating) could not be precluded.
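
The link between the number of aggregated trials and the signal-to-noise ratio mentioned above can be sketched with a small simulation. It assumes zero-mean Gaussian noise that is independent across trials, in which case the residual noise in the average shrinks roughly with the square root of the trial count; the function name, sample counts and seed are illustrative choices, and real EEG noise will generally deviate from this idealization.

```python
import random
from statistics import pstdev

def residual_noise_sd(n_trials, n_samples=2000, noise_sd=1.0, seed=0):
    """Standard deviation of noise left after averaging n_trials noisy trials.

    Assumes independent Gaussian noise per trial (an idealization).
    """
    rng = random.Random(seed)
    averaged = []
    for _ in range(n_samples):
        trials = [rng.gauss(0.0, noise_sd) for _ in range(n_trials)]
        averaged.append(sum(trials) / n_trials)
    return pstdev(averaged)

sd_25 = residual_noise_sd(25)    # theory: 1 / sqrt(25)  = 0.2
sd_100 = residual_noise_sd(100)  # theory: 1 / sqrt(100) = 0.1
```

Under these assumptions, quadrupling the number of trials halves the residual noise, which would make small components such as P1 easier to distinguish in grand averages.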

Observed variability in CAEPs between studies 1 and 2 underlined the potential role of talker gender and voice characteristics. This is particularly evident where the same spoken word was used, that is, /haus/ uttered either by a female (study 1) or a male voice (study 2). It might also be speculated that individual talker voice characteristics interacted with the kind of speech quality degradation [68].

Repeated presentations of physically identical standard stimuli might have made the auditory evoked responses susceptible to effects of neural refractoriness and/or habituation [49, 68, 87–89]. Yet, since standard sequences were presented with jittered inter-stimulus intervals [90] as well as occasionally interrupted by oddballs, either form of response decrement should have been of negligible size and/or dissolved rather rapidly.

Returning to the initial question which motivated the present CAEP analyses in the first place: Were quality-related modulations in P3 characteristics discovered in previous ERP analyses (at least partially) resonating earlier modulations within the CAEP? The significant effects of quality described above seem to affirm this assumption: While in study 1 a strong negative deflection of N1 was observed, in study 2, the P2 was shifted in the negative voltage direction relative to high-quality (which was only clearly seen in /maus/). The previous analyses chose late-latency time windows to extract peak parameters from oddball-minus-standard difference waveforms. Accordingly, it could be claimed that early modulations were not directly affecting neural activity in these later time windows; moreover, the computation of condition-wise oddball-minus-standard difference waveforms would presumably have 'subtracted-out' CAEPs, thus leaving only late neural activity associated with perceptual and cognitive processing of oddball stimuli [91]. However, these claims implicitly assume that the CAEPs triggered in standard and oddball trials were not systematically deviating from each other. Also, regardless of whether the differencing procedure truly eliminated early exogenous portions of the ERP signal, quality-related modulations in standard trials might have propagated on to later time ranges occupied by endogenous components (P3a, P3b) in oddball trials. Some of the obtained ERP waveforms seemingly support this supposition (e.g. see C in figure 5, C-/maus/ in figure 6).
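
The assumption behind the differencing argument can be made concrete with a minimal Python sketch of an oddball-minus-standard difference waveform. The function names and all voltage values are hypothetical toy values; they only illustrate that early activity cancels in the difference when standard and oddball trials share the same CAEP, and survives when they do not.

```python
def average_waveform(trials):
    """Point-wise mean across equally long single-trial waveforms."""
    n = len(trials)
    return [sum(samples) / n for samples in zip(*trials)]

def difference_waveform(oddball_trials, standard_trials):
    """Oddball-minus-standard difference of the two averaged waveforms."""
    odd = average_waveform(oddball_trials)
    std = average_waveform(standard_trials)
    return [o - s for o, s in zip(odd, std)]

# Toy data: four samples per trial; the second sample stands in for N1,
# the last sample for a late P3-like positivity.
standard_trials = [[0.0, -2.0, 1.0, 0.0], [0.0, -2.0, 1.0, 0.0]]
oddball_same_caep = [[0.0, -2.0, 1.0, 3.0]]     # early response identical to standards
oddball_shifted_caep = [[0.0, -3.0, 1.0, 3.0]]  # early response differs from standards

diff_same = difference_waveform(oddball_same_caep, standard_trials)
diff_shifted = difference_waveform(oddball_shifted_caep, standard_trials)
```

In the first case only the late positivity remains in the difference; in the second case a residual early deflection is carried into the difference waveform, which is the scenario the paragraph above cautions against.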

5. Conclusion

The present study investigated whether auditory sensory processing of transmitted speech signals (spoken words) is already systematically affected by speech quality degradations ('noisiness' due to signal-correlated noise, 'coloration' due to bandpass filtering, see [29, 55]) in comparison to high-quality. Electrophysiological data from two previous studies using active oddball paradigms were examined [17, 18]. Overall, analyses of cortical auditory evoked potentials (CAEPs) yielded the following main results:

First, ERP waveforms with distinctive CAEP portions were obtained by grand averaging standard trials from active oddball tasks; acoustic change complexes (ACCs) comprising second N1 and P2 components manifested within the ongoing word stimuli.

Second, CAEPs differed between the two spoken German words (/haus/, /maus/) used as stimulus material (study 2).

Third, a faster and enhanced initial N1 was evoked by coloration-impaired versus high-quality stimuli (study 1).

Fourth, a reduced initial P2 was evoked by coloration-impaired versus high-quality stimuli (study 2); this effect appeared to be dependent on the particular spoken word stimulus (/haus/ vs. /maus/).

These results imply an influence of spectral distortions (induced through bandpass-filtering) on sensory processing, possibly altering processes of sensory encoding and analysis of acoustic stimulus features, as indicated by N1-P2.

6. Future outlook

The existence of significant quality-related modulations in N1 and P2 components makes potential carry-over effects of early CAEP modulations on later P3 modulations more probable. Hence, a first step has been taken towards identifying the 'locus of effect' of experimental manipulations of speech transmission quality along the internal information processing chain [59]. As a next step, revised analyses of P3 and its subcomponents (P3a, P3b) will be necessary, this time controlling for sensory contributions reflected by the P1-N1-P2 complex and ACC (second N1-P2). Specific differencing procedures originally developed for the analysis of the MMN [6, 32, 33] could be adopted to verify previously inferred effects of perceptual quality references on P3 response characteristics [17, 18]. For instance, the so-called 'flip-flop' control procedure, used for subtracting-out auditory evoked activity in ERP waveforms [6, 32, 33], has already been successfully tested on a subset of the electrophysiological data from study 2 [92].
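
As a rough illustration of the flip-flop logic mentioned above, the following Python sketch contrasts a conventional oddball-minus-standard difference with a same-stimulus difference. It assumes that each stimulus was recorded once as standard and once as oddball in blocks with reversed roles; the dictionary layout, the function name and all voltage values are hypothetical toy values, not data from study 2.

```python
def flip_flop_difference(averages, stimulus):
    """Difference between the same physical stimulus as oddball and as standard."""
    oddball = averages[(stimulus, "oddball")]
    standard = averages[(stimulus, "standard")]
    return [o - s for o, s in zip(oddball, standard)]

# Averaged waveforms keyed by (stimulus, role); values are made up.
averages = {
    ("HQ", "standard"): [0.0, -2.0, 1.0, 0.0],  # block 1: HQ standard, C oddball
    ("C", "oddball"): [0.0, -3.0, 0.5, 3.0],
    ("C", "standard"): [0.0, -3.0, 0.5, 0.0],   # block 2: roles reversed
    ("HQ", "oddball"): [0.0, -2.0, 1.0, 2.0],
}

# Conventional difference: C oddball minus HQ standard (different physical stimuli).
conventional = [o - s for o, s in zip(averages[("C", "oddball")], averages[("HQ", "standard")])]
# Flip-flop difference: C oddball minus C standard (identical physical stimulus).
controlled = flip_flop_difference(averages, "C")
```

In this toy example the conventional difference retains early stimulus-driven deflections, while the same-stimulus difference leaves only the late activity tied to the oddball role.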

The question of whether auditory evoked responses were also modulated by talker voice characteristics—a female and male voice uttered the speech stimuli in study 1 and 2, respectively—or even interacted with the induced quality impairments, could only be loosely tackled by visually comparing ERP waveforms across the two studies. Future experiments will be needed to rigorously test for effects of voice (gender) on the indication of speech quality degradations by the P1-N1-P2 complex. It is assumed that the validity, reliability and sensitivity with which speech quality change can be tracked at early time ranges will critically depend on voice characteristics as well as variability in transmitted speech content (due to variation in acoustic surface form of the speech signal, not in its semantic meaning [1, 93]).

Systematic investigation of the CAEP might further be continued by varying the magnitude of quality impairments during speech transmission (e.g. different levels of signal-correlated noise [14]), which is perceived as varying degradation intensity. This manipulation would allow CAEP component parameters to be correlated directly with commonly used subjective quality ratings [94]. It might further be tested whether and to what degree auditory evoked responses are capable of capturing quality-impaired stimuli below individual thresholds for conscious perception (as determined by psycho-acoustic techniques) [95, 96], even though prior evidence suggested that CAEP-range neural activity might actually be less sensitive to sub-threshold impairment magnitudes than late ERP components [14]. Future analyses of CAEP parameters might be extended to single-subject and single-trial levels by following an analytic approach proposed by Porbadnigk et al [14], potentially offering higher reliability and sensitivity.
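
One simple way to relate CAEP parameters to subjective ratings across impairment levels would be a Pearson correlation. The sketch below is an assumption about how such an analysis might start, not a method proposed in the article; the function name and both value lists are invented purely for illustration and carry no empirical meaning.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for four impairment levels (made-up numbers).
n1_mean_amplitude_uv = [-2.0, -2.5, -3.1, -3.8]
mean_quality_rating = [4.5, 3.9, 3.0, 2.2]
r = pearson_r(n1_mean_amplitude_uv, mean_quality_rating)
```

With real data, repeated measurements per participant would call for a model that accounts for within-subject dependence rather than a single pooled correlation.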

Besides, in an attempt to increase the ecological validity of testing scenarios, more complex speech stimuli like longer words or sentences could be presented [13], which should evoke multiple acoustic change complexes (ACCs to speech sound transitions, silence period onsets/offsets) within the ongoing stimuli. The speech signals might be overlaid with environmental sounds or different types of ambient noise [1] to approximate real-life adverse listening conditions. Finally, speech quality impairments more common in modern telecommunication networks based on the Internet Protocol (IP) could be induced (e.g. packet/frame loss [3, 17]).

Testing the utility of neurophysiological measures for non-intrusive assessment of immersion factors (e.g. sound source localization due to spatial audio [97], perceived naturalness due to fidelity manipulation [23, 24, 98]) and their interrelations with perceived quality might open up promising new research directions. In view of immersive visual 3D technology, for example, Avarvand et al [99] recently proposed the visual P1 component as an indicator of vertical disparity, an important influencing factor on the perceived quality of stereoscopic images [100]. An analogous identification of neurophysiological measures in the auditory modality, with the P1-N1-P2 complex as potential candidate, could prove useful for future psychophysiological evaluation of immersive audio and speech communication technologies [101].

Methodologically, the present study offers an alternative analysis approach for ERP-based assessment and evaluation of speech transmission quality. Due to being triggered by rare stimulus events, late endogenous ERP components like the mismatch negativity (MMN) and P3 typically require long test sessions in order to gather enough oddball trials and achieve a sufficient signal-to-noise ratio for the ERP signal [102], whereas CAEPs can be extracted in much shorter durations. Several studies have examined speech-in-noise effects on the CAEP in the context of active oddball paradigms [40, 56–58, 69, 71] as well as during passive listening to repeated presentations of the same stimulus, the so-called homogeneous paradigm [40, 58, 66, 70, 71]. A comparison of standard and homogeneous stimulus trials revealed systematic differences in timing and morphology of the resulting CAEPs [58], which accentuates the importance of measurement (incl. task) context. Active oddball tasks demand perceptual discrimination of target and non-target stimuli and hence involve more or less effortful cognitive processing (e.g. controlled attention, perceptual categorization, behavioral response initiation). Although primarily exogenous in nature, CAEP components might to some degree be influenced by top-down factors, variation in sustained attention and arousal level due to long-lasting task engagement [103, 104]. N1 modulations by speech-in-noise were obtained even at noise levels too low to allow for perceptual discrimination (as indicated by N2/P3 responses and behavioral task performance) [40, 56, 69–71]. Thus, tasks without the requirement to perceptually discriminate quality change, such as passive oddball tasks or homogeneous stimulus presentations, might be employed in future studies and tested against the more established active oddball paradigms.

A parallel development towards more efficient and easier-to-implement protocols for psychophysiological quality assessment has been initiated in the domain of visual and audio-visual media (image, video with/without sound) [105]. This line of research started out similarly by measuring the visual P3 component in active oddball paradigms [94, 96, 106]. More recently, a specific type of visual evoked responses called steady-state visual evoked potentials (SSVEPs) came into focus, whose amplitude features demonstrated close correlations with image/video quality impairments [102, 107, 108]. SSVEPs are visual evoked responses triggered by repetitive, fast flickering visual stimulation at certain frequencies and functionally associated with sensory processing in visual cortical areas [109, 110]. The present study closes a corresponding methodological gap in the domain of transmitted speech, by introducing neurophysiological measures extracted from the CAEP to gauge the impact of speech quality degradations at the sensory processing stage of the auditory system.

As pointed out in previous research on image and video quality [96, 102, 107], the triggering of P3 (and MMN) requires oddball events (i.e. transitions from frequent to rare stimuli), which is why it actually reflects sensorily detected and/or perceptually discriminated quality change rather than perceived quality per se. Since cortical visual and auditory evoked potentials can already be triggered by homogeneous sequences of physically identical stimuli, their usage would remove the need to adapt oddball paradigms for the purpose of quality assessment [14]. Furthermore, active oddball paradigms require participants to discriminate standard-oddball transitions, which is taxing on cognitive processing resources [7]. It might be argued that these cognitive load effects confound sensory processing of stimulus (quality) features, which is more immediately tied to perceived quality [102]. However, theoretical opinion still diverges on the exact role and weight of cognitive processing in internal quality formation, defining 'quality' as either an evaluative attribute of conscious perception [17] or as only the outcome of a later cognitive judgment process [2, 3, 20] (see section 1.1).

Aiming at an improvement of psychophysiological speech quality assessment in the field, the combination of (1) data collection with a minimal EEG recording setup (e.g. besides electrodes for referencing, grounding and tracking of ocular activity, placing only a single electrode at Cz), (2) usage of the homogeneous paradigm for stimulus presentation, and (3) data processing and analysis of the CAEP promises shorter testing durations and higher efficiency, although this presupposition as well as essential measurement criteria (validity, reliability, sensitivity, diagnosticity) need to be thoroughly elaborated in future studies (see [6, 46] for test-retest reliability of the P1-N1-P2 complex; see [18] for a related discussion centering around P3-based indication of speech transmission quality).

Acknowledgments

This work was supported by the strategic partnership program between Technische Universität Berlin, Germany, and the Norwegian University of Science and Technology (NTNU) in Trondheim, Norway. It was conducted during a one-year research stay of author SU at NTNU. We would like to thank the anonymous reviewers for their valuable comments, which helped improve the final version of the article.

Authors' contributions

SU, AP and DMB set the scope of the study, leading to more detailed research goals and hypotheses. SU implemented and carried out the analyses. SU wrote the draft article, with AP and DMB providing feedback. All authors read and gave their final approval of the article.

Appendix

Figure A1. Mean plots for effects of quality on CAEP parameters in study 1. For N1, absolute amplitude values are displayed to improve readability. Error bars represent 95% confidence intervals. Plot numbers point to corresponding effect indices (#) in table 3.

Figure A2. Mean plots for effects of quality on CAEP parameters in study 2. For components with negative polarity (N1, CN1), absolute amplitude values are displayed to improve readability. Error bars represent 95% confidence intervals. Plot numbers point to corresponding effect indices (#) in table 4.

Table A1. Post hoc contrasts for effects of quality on CAEP parameters in study 1. Asterisks indicate statistically significant contrasts ($p < 0.05$). Subtable numbers correspond to effect indices (#) in table 3.

| Subtable | Contrast | Estimate | SEM | t | p |
|---|---|---|---|---|---|
| (1) | HQ - N | −0.28 | 0.24 | −1.14 | 0.25 |
| (1) | HQ - C | 0.91 | 0.24 | 3.73 | < 0.001* |
| (1) | N - C | 1.19 | 0.27 | 4.45 | < 0.0001* |
| (2) | HQ - N | −0.41 | 0.22 | −1.82 | 0.07 |
| (2) | HQ - C | 1.01 | 0.22 | 4.50 | < 0.0001* |
| (2) | N - C | 1.41 | 0.24 | 5.77 | < 0.0001* |
| (3) | HQ - N | 0.66 | 0.34 | 1.93 | 0.05 |
| (3) | HQ - C | −0.81 | 0.35 | −2.34 | 0.04* |
| (3) | N - C | −1.47 | 0.38 | −3.90 | < 0.001* |
| (4) | HQ - N | −0.50 | 0.26 | −1.94 | 0.09 |
| (4) | HQ - C | 0.52 | 0.26 | 2.03 | 0.09 |
| (4) | N - C | 1.02 | 0.28 | 3.62 | < 0.001* |
| (5) | HQ - N | 1.05 | 0.21 | 4.96 | < 0.0001* |
| (5) | HQ - C | −0.54 | 0.21 | −2.54 | 0.01* |
| (5) | N - C | −1.60 | 0.23 | −6.84 | < 0.0001* |

Table A2. Post hoc contrasts for effects of quality on CAEP parameters in study 2. Asterisks indicate statistically significant contrasts ($p < 0.05$). Subtable numbers correspond to effect indices (#) in table 4.

| Subtable | Contrast | Estimate | SEM | t | p |
|---|---|---|---|---|---|
| (1) | HQ - N | 0.73 | 0.36 | 2.02 | 0.04* |
| (1) | HQ - C | −0.90 | 0.36 | −2.51 | 0.02* |
| (1) | N - C | −1.63 | 0.42 | −3.92 | < 0.001* |
| (4) | HQ - N | −0.14 | 0.39 | −0.37 | 0.71 |
| (4) | HQ - C | 1.35 | 0.39 | 3.49 | < 0.01* |
| (4) | N - C | 1.50 | 0.45 | 3.35 | < 0.01* |
| (5) | HQ - N | −0.08 | 0.38 | −0.22 | 0.82 |
| (5) | HQ - C | 1.59 | 0.38 | 4.19 | < 0.0001* |
| (5) | N - C | 1.67 | 0.44 | 3.82 | < 0.001* |

Footnotes

  • Another advantage of analyzing multiple ERP amplitude parameters lies in the fact that peak amplitude strongly depends on the noise level of the ERP signal [60, 61]. Mean amplitude, on the other hand, is more robust to varying noise level and thus offers an alternative amplitude measure with which to contrast experimental conditions.

  • http://psychtoolbox.org/.

  • https://cran.r-project.org/web/packages/nlme/.

  • https://cran.r-project.org/web/packages/multcomp/.
