Methods and evaluation of physiological measurements with acoustic stimuli — a systematic review

Objective. The detection of psychological loads, such as stress reactions, is receiving greater attention and social interest, as stress can have long-term effects on health (O'Connor, Thayer and Vedhara 2021 Ann. Rev. Psychol. 72, 663–688). Acoustic stimuli, especially noise, are investigated as triggering factors. The application of physiological measurements in the detection of psychological loads enables the recording of a further quantitative dimension that goes beyond purely perceptive questionnaires. Thus, unconscious reactions to acoustic stimuli can also be captured. The numerous physiological signals and possible experimental designs with acoustic stimuli may quickly lead to a challenging implementation of the study and an increased difficulty in reproduction or comparison between studies. An unsuitable experimental design or processing of the physiological data may result in conclusions about psychological loads that are not valid anymore. Approach. The systematic review according to the preferred reporting items for systematic reviews and meta-analysis standard presented here is therefore intended to provide guidance and a basis for further studies in this field. For this purpose, studies were identified in which the participants' short-term physiological responses to acoustic stimuli were investigated in the context of a listening test in a laboratory study. Main Results. A total of 37 studies met these criteria and data items were analysed in terms of the experimental design (studied psychological load, independent variables/acoustic stimuli, participants, playback, scenario/context, duration of test phases, questionnaires for perceptual comparison) and the physiological signals (measures, calculated features, systems, data processing methods, data analysis methods, results). The overviews show that stress is the most studied psychological load in response to acoustic stimuli.
An ECG / PPG system and the measurement of skin conductance were most frequently used for the detection of psychological loads. A critical aspect is the numerous different methods of experimental design, which prevent comparability of the results. In the future, more standardized methods are needed to achieve more valid analyses of the effects of acoustic stimuli.


Introduction

Listening tests
Listening tests are a popular tool for evaluating the sensations of listening to different sounds. On the one hand, there are the physical quantities, such as the sound pressure level, which can be used to describe a sound. However, these purely physical quantities are not sufficient to describe how the respective sound affects the listener (Zwicker and Fastl 2013). This aspect, on the other hand, can only be addressed by involving the listener, namely in listening tests. In these experiments, different sounds are presented to the participants under controlled conditions with the intent to gain information about sensation-related attributes of the sounds. Subsequently, statistical tests help to evaluate the test results and allow conclusions about the effects of the sounds on the listener.
In order to obtain meaningful results in a listening test, the design of the test must be adapted to the respective research question. An important aspect is the stimuli, for which the duration and especially the way of presentation have to be determined. Here, different playback setups can be used, ranging from headphones to complex loudspeaker arrays. For ideal reproduction quality, these sound setups might be equalized to reduce unwanted distortions. In addition, different established measurement methods are available for a listening test (Parizet et al 2005). These include, for example, the paired comparison, in which two sounds are compared directly with each other. This offers the advantage of detecting small differences between sounds. Another method is the semantic differential, in which several attributes of a sound are evaluated on multiple scales to obtain more information about a single sound. Within each method, the sounds are evaluated with regard to different sound attributes. The choice of attributes depends entirely on the application of the listening test. For example, in the design of a product sound using a listening test, sound quality could be used as an attribute, whereas in the field of noise research, attributes such as perceived disturbance or annoyance are the main focus (Genuit 2010, Pierrette et al 2015).
Another aspect to be mentioned is the inclusion of context. Listening tests range from presenting the target sounds only to presenting the target sounds embedded within other environmental sound sources, to create realistic sound environments (Fiebig 2015). These are the scenarios in which the sounds occur in the real world. This can be a specific everyday life environment or outdoor scenarios that are presented in virtual reality, for example (Schoeffler et al 2015). At a higher level, context is included by assessing sounds in the real world, outside the laboratory. A method for this kind of listening test is, for example, the soundscape approach according to ISO Standard DIN ISO/TS 12913-3:2021-06 (DIN ISO/TS 12913-3:2021-06 2019).
For a better overview, table 1 describes the previously listed criteria of the methodology of a listening test according to Otto et al (2001). The selection of the stimuli and the analysis of the results are also part of a listening test but are not counted as part of the methodology here.
In order to deal more precisely with the research questions that influence the experimental design, it is necessary to look at the various fields of application of listening tests. On the one hand, listening tests are used in industry to optimize products or product sounds (Genuit 2010). On the other hand, listening tests are used in basic research. Topics that are investigated include the development of psychoacoustic parameters that are intended to describe the relationship between the human perception of sounds and their physical quantities (Zwicker and Fastl 2013). Further topics are the investigation of speech intelligibility (Reimes 2021) or the influence of sounds on cognitive processes, for example, in noise research. For the latter topic in particular, a specific listening test design is often used (Schlittmeier 2021). Here, the participants complete cognitive tests while being presented with different sounds. The cognitive tests are used to investigate whether certain sounds have an influence on cognitive performance.

Physiological measurements in listening tests
The previously mentioned methods in listening tests are all based on perceptual assessments, i.e. on the participants' responses. There are also methods designed specifically to capture participants' reactions, such as stress responses, through self-reports. Standardized questionnaires are used for this purpose. These include, for example, the Positive and Negative Affect Schedule (PANAS), which allows ten attributes of positive and negative feelings to be assessed. In addition, Cohen's Perceived Stress Scale is another questionnaire that directly addresses the perceived stress response. Since such questionnaires can take more time to complete, simpler scales can also be used, such as the visual analog scale (VAS), where the participant indicates the perceived stress level on an unmarked visual scale 100 mm in length. Even though there is a good understanding of how to implement such assessments, for example, from the fields of psychology and sociology, biases in perceptual ratings cannot be eliminated completely (Zielinski et al 2008). It is known that in ratings of sounds on scales, scale effects can occur that distort the results. Especially in the case of stress responses, it has been shown that participants are often not able to describe these responses adequately. For example, in our previous research (Laufs et al 2021) we showed that the self-assessed stress response can deviate from the physiologically measured one. Also, Spreng et al (2000) showed that such physiological reactions can take place without the conscious awareness of the participants and thus are not necessarily captured by questionnaires.
To counteract this, physiological measurements can be used in the experimental design. They are able to record conscious as well as unconscious physiological reactions and allow conclusions to be drawn about the physiological mechanisms and influence of sounds on the human body (Charles and Nixon 2019). At the same time, physiological measurements allow early detection of stress responses and also provide information about the duration of the response, since they are recorded continuously throughout the experiment. Questionnaires, in contrast, yield single-point information about the listener's sensations. Likewise, physiological measurements do not interrupt the course of the experiment, as questionnaires might do after each stimulus.

Physiological measurements
Acoustic stimuli have the potential to evoke physiological reactions. These reactions are reflected in changes in brain activity and especially in the autonomic nervous system (ANS) (Henry 1993). The ANS is composed of the sympathetic nervous system (SNS) and the parasympathetic nervous system (PSNS) (Brodal 2004). While the SNS activates the human body and can put it into an alarm state, the PSNS is responsible for rest and regeneration of the body. External influences, such as sounds, can result in activations of the SNS and the PSNS.
Here, not only one of the two is active; rather, the two nervous systems interact constantly. Activations of the two nervous systems in turn affect physiological functions, such as the heartbeat or breathing. Acoustic stimuli perceived as unwanted noise can, for example, cause a stress reaction in the body, activate the SNS and lead to an increase in heart rate or sweat production in the palms of the hands (Laufs et al 2021). Physiological measurements enable the recording of these physiological changes with the help of sensors, which are attached to the participant during an experiment. By evaluating the signals from the sensors appropriately, conclusions can be drawn about the activation of the nervous systems or brain activity. These in turn provide information about possible psychological loads experienced by the participant (Charles and Nixon 2019). Karthikeyan et al (2011) indicated in their review that various stress-inducing stimuli, such as the Stroop colour word test or the cold pressor test, induce changes in physiological parameters. These stimuli resulted in increases in skin conductance or in a decreased heart rate variability (HRV). Charles and Nixon (2019) and Marinescu et al (2018) related increased mental workload to changes in pupil diameter measured during the processing of different cognitive tasks.

Purpose and scope of this review
For the reasons stated above, numerous studies rely on the use of physiological measures to investigate acoustic stimuli. The many different physiological sensors and different methods of listening tests result in a large number of possible experimental designs for studying physiological responses to sounds. However, this large number of methods may cause difficulties in developing a suitable experimental design. A structured overview of the individual methods and physiological signals along with their evaluation therefore seems necessary to provide a basis and orientation for further investigations in this field. This review is thus intended to contribute to the future development of experimental designs that allow valid conclusions to be drawn about the measured responses. Its purpose is to present clearly the current state of research on physiological measurements used to assess the influence of acoustic stimuli, to discuss it critically, and to identify the resulting research gaps. On the one hand, the focus is on the physiological sensors used and on which ones have proven to be particularly suitable for certain research questions. This also includes the processing and analysis of the respective physiological signals. On the other hand, methods and experimental designs will be presented which allow the simultaneous acquisition of physiological data. Here, the use of physiological measurements requires certain changes in methods and in the design. In this review, different research questions from past studies will be presented and it will be shown with which sensors, experimental methods, algorithms, and analysis methods these questions could be answered in a valid way.

Methods
The literature search conducted in this work and the subsequent reporting of the results were based on the preferred reporting items for systematic reviews and meta-analysis (PRISMA) standard, which also predefines the structure of this review (Page et al 2021). A checklist of this standard specifies for each section of the review which aspects it must contain in order to improve the quality and transparency of reviews. Following the guidelines of this standard, papers were searched that addressed the use of physiological measurements in combination with a listening test or, more generally, with acoustic stimuli.

Search strategy
For this purpose, the databases PubMed of the National Center for Biotechnology Information (NCBI) and IEEE Xplore of the Institute of Electrical and Electronics Engineers (IEEE) were searched. The databases were searched in such a way that terms indicating the use of acoustic stimuli were combined via a logical AND with the different physiological signals. The final search query is shown in figure 1.
After an initial brief screening, additional exclusion terms were defined that occurred in several records that did not belong in the shortlist of this review paper. These include the terms 'heart/cardiac sound', 'lung sound', 'ultrasound', 'cough sound', or 'stutter'. The search was conducted on 31 August 2022. For the IEEE Xplore database search, the term noise was replaced with several more specific noise terms related to acoustics, such as noise exposure, traffic noise, or background noise. Searching for the term noise alone would have produced 3478 results here, many of which were in the domain of noise as an artefact in physiological signals. Accordingly, the search was adjusted for IEEE Xplore, whereas it remained unchanged for the PubMed database. In each case, databases were searched only for records published within the last 15 years, i.e. 2007 and later. In addition, terms indicating the use of acoustic stimuli were searched for only in the title of the records. The search in the PubMed database was limited to records on humans with the help of filters. Using these filters, studies in infants and young children were also excluded from the search.
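The search logic described above (acoustic terms ANDed with physiological signal terms, minus the exclusion terms) can be sketched as a small query builder. This is only an illustration: the acoustic and physiological term lists below are placeholders, since the actual terms are given in figure 1, and only the exclusion terms are taken from the text. The generic `AND`/`OR`/`NOT` syntax is an assumption and would need adapting to each database, including the title-only restriction.

```python
def or_group(terms):
    """Join terms into a parenthesised OR group, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(acoustic_terms, physiological_terms, exclusion_terms):
    """Combine the groups: acoustic AND physiological, minus exclusions."""
    query = f"{or_group(acoustic_terms)} AND {or_group(physiological_terms)}"
    if exclusion_terms:
        query += f" NOT {or_group(exclusion_terms)}"
    return query


# Placeholder terms (assumptions); the real lists are given in figure 1.
acoustic = ["noise", "sound", "acoustic stimuli"]
physiological = ["heart rate", "skin conductance", "EEG"]
# Exclusion terms named in the text.
exclusions = ["heart sound", "cardiac sound", "lung sound",
              "ultrasound", "cough sound", "stutter"]

query = build_query(acoustic, physiological, exclusions)
print(query)
```

Keeping the three term groups separate makes it straightforward to swap the generic term noise for more specific noise terms in one database while leaving the other groups unchanged.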

Selection process and eligibility criteria
The results of the search were collected using the literature management program JabRef (JabRef e.V., Germany). Duplicates were removed in the program, and the remaining records were screened for the final selection of reports. During screening, the following exclusion criteria were applied: records were screened out if they were unrelated to acoustics or if they only discussed physical responses or health effects without addressing physiological measurements. In addition, records had to be written in English. Records were also excluded if they used stimuli other than acoustic stimuli, examined only diseased (for example, hearing-impaired) participants, or presented only algorithms without applying them in a study. Studies examining responses to sounds with an fMRI system or in field studies were also not eligible for the final selection of this work. With all these exclusion criteria, the search results were narrowed down to reports that examined participants' short-term physiological responses to acoustic stimuli in the context of a listening test in a laboratory study.
The exact criteria were defined in detail prior to screening in order to minimize the risk of bias and to ensure that records could be excluded from this review paper based on an objective and reproducible decision.

Data collection process
The next step in the PRISMA standard included a screening for data items in all reports that met the eligibility criteria. Using these data items, information was systematically extracted from the individual reports. For this purpose, the following data items regarding the experimental design and the physiological data were defined with the help of the overview of listening test methods in table 1 and based on the aim and scope of this review:
• Experimental design: studied psychological load, independent variables/acoustic stimuli, participants, playback, scenario/context, duration of test phases, questionnaires for perceptual comparison.
• Physiological signals: measures, calculated features, systems, data processing methods, data analysis methods, results.
According to the data items mentioned here, information from the reports was identified and summarized. If a paper lacked information on a certain data item, it was not included in the evaluation of that data item.

Screening process
An overview documenting the screening process according to the PRISMA standard in its individual steps can be seen in figure 2.
The search in the two databases revealed a total of 2771 results. After verification by the literature management program, three results were identified as duplicates, so that 2768 records were accepted for the screening process. Based on the titles of the results, a further 2558 records could be excluded in the screening process. This was done according to the criteria mentioned above. The remaining 210 records were screened based on their abstracts. From the resulting 51 full text reports, a further 14 were excluded where a more detailed description of the experimental design indicated that the study did not meet the selection criteria for this review paper. Among the 210 remaining records, studies were eliminated that used physiological measurements to investigate the effects of noise on humans but did so in the form of long-term studies. These 54 long-term measurements were made in the everyday life of the participants, at their home or at their workplace. In addition, 44 studies were screened out that examined the physiological response to acoustic stimuli by comparing diseased participants with a healthy control group. For example, some studies recruited participants who suffered from hearing impairment and used a hearing aid or a cochlear implant. Twenty-four studies used an EEG to examine the response to auditory stimuli, focusing on the analysis of auditory evoked potentials. They were screened out since they primarily observed mechanisms in the brain and drew no conclusions about the participants' psychological load, such as stress or attention. Further, 17 sleep studies, 11 studies on music therapy, and 23 studies in which the acoustic stimuli were not the focus of the work were screened out. Thus, the final selection adopted in this paper is narrowed to 37 studies.
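The counts reported for the screening process can be cross-checked with a few lines of arithmetic. The sketch below assumes that the six exclusion reasons listed in the text together cover both the abstract stage and the full-text stage; the text does not state this explicitly, but the numbers are consistent with that reading.

```python
identified = 2771          # results from both databases
duplicates = 3
excluded_by_title = 2558

screened = identified - duplicates                 # records entering screening
abstracts_screened = screened - excluded_by_title  # records screened by abstract

# Exclusion reasons reported in the text (assumed to span the abstract
# and full-text stages together).
exclusion_reasons = {
    "long-term studies": 54,
    "diseased vs. healthy comparison": 44,
    "EEG auditory evoked potentials": 24,
    "sleep studies": 17,
    "music therapy": 11,
    "acoustic stimuli not the focus": 23,
}
excluded_later = sum(exclusion_reasons.values())
included = abstracts_screened - excluded_later

# Independent check via the full-text stage.
full_text_reports = 51
excluded_full_text = 14
assert full_text_reports - excluded_full_text == included

print(screened, abstracts_screened, excluded_later, included)
# prints: 2768 210 173 37
```

Both routes arrive at the same 37 studies, which suggests that the reported counts are internally consistent.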

Physiological measures
For a better understanding of subsequent overviews, the various physiological measures of the 37 studies were first collected. A list of all measures can be found in table 2, which also lists the features calculated from the respective physiological signals. The third column describes the measurement systems and where the sensors are usually placed on the participants' bodies.
Table 3 lists all 37 studies identified by the screening process, together with the respective physiological measures and their features. As can be seen in figure 3, an ECG or PPG with the measurement of heart rate and HRV parameters was most frequently used to investigate the physiological response to acoustic stimuli (15 times). Measurements of EDA were applied in 14 of the 37 studies. EEG measurements with the subsequent calculation of the band power of the individual EEG frequency bands were used in nine studies, pupillometry in seven studies. The other measures, such as EMG, blood pressure, respiration, and temperature, were used only sporadically (5).

Psychological load
In order to understand the motivation with which the physiological measurements were used in the studies, attention was paid especially to the respective psychological load investigated in the studies. Here, the physiological measurements are used to draw conclusions about the intensity of the psychological load and thus to obtain information about the relationship between the load and the acoustic stimuli. The third column of table 3 provides an overview of the loads. Psychological loads that have been examined in the 37 studies include Stress, Recovery, Annoyance, Emotion, Attention, and Mental Workload or Listening Effort. The following is a brief explanation of each term:
▪ Stress refers to an activation of the SNS that puts the body in a state of alert and prepares certain bodily functions for a possible fight or flight from the stressor (Henry 1993).
▪ Recovery, on the other hand, refers to the time after a stress response and describes how the body comes to a state of rest.
▪ Annoyance is used to describe the sensation of unpleasantness due to disruptive stimuli (Zwicker and Fastl 2013).
▪ Emotions are more complex mental states that can have positive as well as negative characteristics (Lee et al 2010). The term describes a variety of states, of which joy, sadness, anger, fear, surprise, and disgust are commonly researched.
▪ Attention is used to explain concentration on a particular stimulus, with distractions being blocked out (Alvarsson et al 2010).
▪ Mental Workload describes the mental effort required to successfully perform a specific task (Kristiansen et al 2009). Listening Effort refers specifically to the mental effort required to understand speech.
The frequency with which each psychological load was studied can be seen in figure 4. The measurement of stress reactions to acoustic stimuli stands out, as it was examined in 19 of 37 studies.
The third column lists the results of the study, with significant changes in features listed here. Features not listed among the results do not show significant changes in relation to the independent variables of the studies.
In addition to table 3, figure 5 shows a histogram of the years in which the studies were published.

Acoustic stimuli
Further details on the respective independent variables can be found in table 4. This table lists which acoustic stimuli were used and which physical properties the stimuli were characterized by. Seventeen studies used real environmental recordings as acoustic stimuli. Of these, eight studies used traffic noise and two studies used … Another six studies examined the effect of background speech on the physiological response. For the playback of the acoustic stimuli, 17 studies used loudspeakers, 16 studies used headphones, and four studies did not specify in which way the stimuli were presented. Five of the studies that used loudspeakers additionally made recordings with an artificial head to determine or verify the physical and psychoacoustic properties of the stimuli at the participant's position. In the study by Trimmel et al (Paunović et al 2014), the loudspeakers were placed outside the experimental room so that the participants did not know that the stimuli were part of the experiment.
Table 4 also describes the particular scenario in which the acoustic stimuli were presented. A comprehensive summary of these scenarios is presented in figure 6. In 27 studies, the scenario consisted of cognitive tasks that the participants had to perform while the stimuli were presented to them. Of these 27 studies, 18 used a standardized cognitive test. Among them, a visual working memory task or a visual recall task was used four times. The N-Back task (Kirchner 1958) was used in a total of three studies and mental arithmetic tasks in two studies.
For a quicker overview, the last column of table 4 again shows the psychological load that was studied and the physiological measures with which the respective acoustic stimuli were investigated.

Measurement systems
Different sensor systems were used in the individual studies to record the physiological data. Table 5 lists the systems that were used in more than one study. In contrast to the stationary devices, two studies used wearable devices that are attached to the wrist of the participants. Here, the Fitbit Charge HR2 (Fitbit Inc., San Francisco, USA) was used as PPG (Wang et al 2022) and the E4 Wristband from Empatica (Cambridge, USA) was used to measure EDA (Książek et al 2021).

Data processing and experimental design
The screening process also revealed information about the processing of the physiological data and about the respective experimental design. Table 6 provides an overview describing these attributes of the individual physiological measures.
Only the most frequently used measures were included in the list, since the overviews of the other measures would have been based only on a small number of studies. The second column describes the different programs or toolboxes used by the studies for signal processing. The third column first lists the common processing steps used in the studies for each physiological signal. The steps range from the raw signal to the calculation of the respective features. Subsequently, further processing steps are listed that were applied only in individual studies. The fourth column provides information on the experimental design by summarizing the durations of the experimental phases of the individual studies. This includes information on the duration of the baseline measurement, how long a stimulus was presented, and the duration of the rest periods between presentations.
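To illustrate the last step of such a processing chain, the sketch below computes heart rate and two time-domain HRV features from a series of RR intervals. It assumes that R-peak detection and artefact correction have already been done, and the choice of SDNN and RMSSD is an assumption made here for illustration; the features actually used by the individual studies are those listed in table 2.

```python
import math
import statistics


def hrv_time_domain(rr_ms):
    """Return mean heart rate (bpm), SDNN (ms) and RMSSD (ms) from RR intervals in ms."""
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")
    mean_rr = statistics.fmean(rr_ms)
    heart_rate = 60000.0 / mean_rr            # beats per minute from mean RR
    sdnn = statistics.stdev(rr_ms)            # sample standard deviation of RR intervals
    # Successive differences between neighbouring RR intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    rmssd = math.sqrt(statistics.fmean([d * d for d in diffs]))
    return {"hr_bpm": heart_rate, "sdnn_ms": sdnn, "rmssd_ms": rmssd}


# Hypothetical RR intervals (ms) from a short stimulus phase.
rr_intervals = [800, 810, 790, 805, 795]
features = hrv_time_domain(rr_intervals)
print(features)
```

In practice, such features would be computed separately for the baseline, stimulus, and rest phases so that the phases can be compared.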
To gain further information on the experimental design, the number of participants in each trial was extracted. In this regard, a mean of 36 (SD: 28) participants took part in the trials. The numbers of participants ranged from 10 to 130 participants. To ensure that healthy hearing was present in the participants, 16 of the 37 studies performed audiometry prior to the start of the experiment. In the study by Sim et al (Holube et al 2016), the health of the auditory system was additionally checked with an otoscopy. In six studies the participants were divided into groups according to the between-subject design (Paunović et

Perceptual comparison
The physiological measurements provide conclusions about the psychological load of the participants. In addition, the studies also record the loads via perceptive surveys in order to obtain a comparison for the physiological data. This comparison allows conclusions to be drawn about the extent to which the physiological measurements reflect the consciously perceived loads. Furthermore, it can be verified in which cases psychological loads are only detected by physiological measurements. Most frequently, a VAS was used in the studies of this review, in which the respective psychological load is assessed by the participant on a line. For example, the line ranges from 'no stress' to 'maximum stress' and participants indicate their current stress level by clicking on that line. A total of six studies made use of a VAS for perceptual assessment (Cacioppo et … (Hart 2006). The NASA TLX was originally intended to measure load in work environments such as a cockpit. It consists of six questions that capture different dimensions of load, and participants answer on scales that range from 'low' to 'high'. As an alternative to the questionnaires, two studies used verbally answered scales to determine perceptual load (Walker et al 2016, Ellermeier et al 2020).
In addition to recording the perceptual load, four studies used the Weinstein Noise Sensitivity Scale (WNSS) before the start of the actual experiment. Here, sensitivity to different types of noise is recorded via 16 questions (Worthington 2017). The questions are each answered on a five-point scale. From these answers, a sensitivity score can be calculated for each participant.

Table 4. Overview of the independent variables and acoustic stimuli of the 37 studies together with the scenario in which the stimuli were presented. For a better overview, the investigated psychological load and the physiological measure used are shown again (Stress (Str.), Attention (Att.), Annoyance (An.), Emotion (Em.), Listening Effort (LE), Recovery (Re.), Cognitive Load (CL)).

Physiological measures
The results of the literature search indicate that all studies investigate the extent to which differences in the psychological load caused by acoustic stimuli are reflected in the physiological measures. The measures in table 2 which were used for this purpose represent all physiological measures that are also used in other literature reviews for the analysis of physiological responses. For example, Alberdi et al (2016) summarize studies dealing with stress detection in offices and list the same measures highlighted in this review. Likewise, Palumbo et al (2017) concluded that the measures presented in this review represent the most commonly used measures. Their systematic review summarizes studies that use physiological measurements to investigate interpersonal autonomic physiology. Only the measurement of skin temperature was listed as a further frequently used parameter in the reviews above, whereas in conjunction with acoustic stimuli it was used only once. However, the lack of analysis of skin temperature revealed in this review is consistent with the findings by Alberdi et al (2016). In their review, studies showed that skin temperature does not provide much information about participants' emotions or stress. Not only the type of measures is comparable to other reviews, but the frequency of use of each measure is also similar. In this work, ECG/PPG (n = 15, 40.5%) and EDA (n = 14, 37.8%) are the most used measures; they are also the most frequently used ones in the review by Charles and Nixon (2019) (ECG/PPG: 38.6%) for the investigation of mental workload and in the review by Alberdi et al (2016) (ECG/PPG: 39.4%, EDA: 27.3%).

Figure 6. Overview of the selected scenarios in which the acoustic stimuli are presented, dividing the studies into those that use a cognitive task and those where participants just listen to the sounds.

Table 5. Overview of the measurement systems that were used in more than one study.

Table 6. Overview of programs and toolboxes that were used for the processing of the physiological data together with common steps of the data processing for each physiological measure. According to the experimental design, durations of the different test phases are listed.

Tools
In addition to the information on the use of measures, this systematic review also summarizes the results of the individual studies, presents them in table 3, and relates them to the physiological measures. In the following, a comparison of these results presents an overview of which features showed significant changes in response to manipulation by the acoustic stimuli. To this end, however, it should first be clarified that the comparability of the results is limited. Since the context or methods in each of the studies differ greatly, the comparison of results should be viewed with caution. Thus, the type of features may only be one of many factors that play their part in determining whether acoustic stimuli elicit significant differences in these features.
Among the various features, a significant difference was most frequently observed in the tonic (slowly changing) features of the EDA measurements. These include skin conductance level (SCL) and frequency of SCRs (fSCRs), which showed significant differences in n_sign = 7 studies. In total, these features were analysed in n_used = 9 studies, so a significant difference was observed in 77.8% of these studies. Likewise, significant differences were frequently observed in features of HRV measurement in the frequency domain (n_sign = 6/n_used = 10, 60%). In contrast, the features of HRV measurement in the time domain were analysed less often (n_sign = 2/n_used = 3, 66.7%). Heart rate measurement and analysis were performed more frequently, but in relative terms, significant differences were observed less frequently (n_sign = 4/n_used = 11, 36.7%). In the EEG measurements, the power of the alpha band was most frequently examined (n_sign = 4/n_used = 8, 50%).
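The relative frequencies quoted above follow from dividing n_sign by n_used for each feature group. The short sketch below recomputes them from the counts given in the text; the dictionary keys are labels chosen here for illustration. Note that 4/11 evaluates to roughly 36.4%, so the 36.7% quoted for heart rate may reflect rounding or a transcription difference in the underlying counts.

```python
# (n_sign, n_used) per feature group, as quoted in the text.
feature_counts = {
    "EDA tonic (SCL, fSCRs)": (7, 9),
    "HRV frequency domain": (6, 10),
    "HRV time domain": (2, 3),
    "Heart rate": (4, 11),
    "EEG alpha power": (4, 8),
}


def significance_rate(n_sign, n_used):
    """Share of studies (percent, one decimal) that report a significant change."""
    return round(100.0 * n_sign / n_used, 1)


rates = {name: significance_rate(*counts) for name, counts in feature_counts.items()}

# List the feature groups from most to least frequently significant.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate}%")
```

Because the denominators are small (between 3 and 11 studies), a single study shifts these percentages considerably, which is one more reason to read the ranking with caution.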
The results mentioned above are in line with the findings of Giannakakis et al (2019), who summarized studies dealing with general detection of stress by physiological measurements in their review. They emphasize that the measurement of EDA is one of the most consistent indicators of stress. Likewise, the analysis of alpha band power is shown to be the best indicator in the field of EEG analysis. However, heart rate analysis is also listed there as a consistent indicator of stress, which is contrary to the results of this review. One reason for this may be that heart rate, while a good measure of whether stress is present or not, is less likely to show differences between different gradations of stress (Hossain et al 2019). Precisely such gradations were frequently investigated in the studies of this review. The measurement of HRV parameters, for example to capture stress responses, has been shown to be a valid method in other reviews. Kim et al (2018) conclude that HRV is influenced by stress and that it is suitable for the objective assessment of mental health and stress.
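To make the time-domain HRV features named in this review (SDNN, RMSSD, pNN50) more concrete, the following sketch computes them from a series of RR intervals using their standard textbook definitions. The function name, the example intervals and the choice of the sample standard deviation for SDNN are assumptions made for illustration; the reviewed studies relied on their own toolboxes and preprocessing.

```python
import math
import statistics


def hrv_time_domain(rr_ms):
    """Return SDNN and RMSSD (in ms) and pNN50 (in %) for RR intervals given in ms."""
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")
    # Differences between successive RR intervals.
    diffs = [later - earlier for earlier, later in zip(rr_ms, rr_ms[1:])]
    # SDNN: standard deviation of all intervals (sample SD is an assumption here).
    sdnn = statistics.stdev(rr_ms)
    # RMSSD: root mean square of the successive differences.
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    # pNN50: share of successive differences larger than 50 ms.
    pnn50 = 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs)
    return {"SDNN": sdnn, "RMSSD": rmssd, "pNN50": pnn50}


# Made-up RR intervals in ms, for illustration only.
example_rr = [800, 810, 790, 860, 800, 805]
features = hrv_time_domain(example_rr)
print(features)
```

In practice, artefact correction and a sufficiently long recording window would precede such a computation.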
However, one aspect that must be mentioned as a limitation of many studies in this field is the lack of consideration of inter-individual differences in physiological responses. The physiological parameters often vary strongly between individual participants, making it necessary, for example, to normalize the parameters prior to analysis. In addition, there is a lack of approaches that explain this often large variance with the help of further recorded individual data. A few studies already try to counteract this, for example by using a questionnaire such as the WNSS. In this way, further information about the psychological condition of the participants can help to explain individual differences in the analysis.
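One common way to carry out the normalization mentioned above is to standardize each participant's values against that participant's own mean and standard deviation. The sketch below shows this within-participant z-scoring; the method, the function name and the example values are assumptions for illustration, since the text does not prescribe a particular normalization.

```python
import statistics


def zscore_within_participant(values_by_participant):
    """Standardize each participant's values against their own mean and SD."""
    normalised = {}
    for participant, values in values_by_participant.items():
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)
        if sd == 0:
            # A constant series carries no within-person variation.
            normalised[participant] = [0.0 for _ in values]
        else:
            normalised[participant] = [(v - mean) / sd for v in values]
    return normalised


# Made-up skin conductance levels (in microsiemens) for three conditions.
scl = {
    "P01": [2.0, 3.0, 4.0],
    "P02": [10.0, 12.0, 14.0],
}
scl_z = zscore_within_participant(scl)
print(scl_z)
```

The two hypothetical participants differ strongly in absolute level but show the same relative pattern across conditions, and after normalization their values coincide. Other options, such as subtracting or dividing by a resting baseline, follow the same idea.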
When listing the measurement systems used in these studies, it is noticeable that, with regard to acoustic stimuli, only two studies used wearable sensors, for example in the form of a wristband. Although challenges in the signal quality of sensors on the wrist are often listed as a reason, algorithms for reconstructing the signal can be used to overcome this (Iadarola et al 2021b). For research with acoustic stimuli, it would be desirable to have more studies that use such less obtrusive sensors, as this difference might be relevant for studies which aim for a high degree of context and immersion. An even further step would be the use of measurements that have no contact with the participant's skin. Reviews show that some of the parameters listed in this review can also be measured without contact, i.e. unobtrusively, using cameras (Bruser et al 2015, Antink et al 2019).

Psychological loads
The psychological load most frequently investigated in studies of this review is the stress response. It is also a very common response investigated with physiological measures in other literature reviews (Karthikeyan et al 2011, Alberdi et al 2016, Kim et al 2018), since it is strongly associated with activation of the SNS and therefore often produces well-detectable changes in physiological parameters (Henry 1993). Furthermore, listening effort was examined as a cognitive load in eight studies. Of these eight studies, five used pupillometry to detect listening effort. Again, this is consistent with reviews showing that pupillometry provided meaningful parameters for detecting cognitive workload in past studies (Charles and Nixon 2019, Winn et al 2018).
However, when considering the various psychological loads, it is important to note that changes in physiological parameters are not always specific to a particular type of psychological load. An example of this is provided by the distinction between annoyance and stress or between stress and arousal. Especially for the latter comparison, the definitions of the terms show how much their usage overlaps, although they should be distinguished. The bi-modal model of emotions according to Russell (1980) shows that arousal can be accompanied by stress, but depending on its 'valence', which describes the intrinsic attractiveness, it can also result in positive excitement at higher valence.
Since physiological measurements often do not indicate a single psychological load, further indications about the stimuli, for example, from databases such as the extended version of international affective digitized sounds (IADS-E) (Yang et al 2018), help to define the psychological load more precisely.
A perceptual comparison can also provide further information about the psychological load. For this purpose, VASs which directly ask about the psychological load were primarily used, in a total of six studies. One reason for the frequent use of VASs might be that they are intuitively understandable for the participants. In addition, the quick responses do not interfere with the flow of the experimental procedure as much as longer questionnaires do. These slightly longer questionnaires include, for example, the six-question NASA TLX, which has also been frequently used in studies of this review. In contrast to the VAS, the NASA TLX provides more comprehensive information about the psychological load experienced by the participant (Hart 2006).
In summary, bringing together prior information about the auditory stimuli, perceptual results, and analysis of the physiological data can create a more comprehensive picture of the psychological load.

Experimental design
On the one hand, it can be seen from the studies that the physiological parameters allow a comparison between the individual studies, since the choice of parameters for the respective signals as well as the methodology of their calculation are very consistent among the studies, as described before. On the positive side, it should be noted that the studies primarily make use of the evaluation procedures commonly used in the literature. On the other hand, this is contrasted by the numerous different methods used in the experimental design of the individual studies. These prevent the results of the individual studies from being comparable and from being analysed in a more structured manner. Because of the different scenarios, playback methods and durations of the individual test phases, the results are often only valid for the particular experimental case. As a consequence, it is difficult to obtain more valid statements on specific acoustic stimuli, for example in the form of meta-analyses.
Among the acoustic stimuli used as independent variables in the studies of this review, traffic noise was prominent and was used in eight studies. This might be motivated by the fact that traffic noise is also considered to be the main cause of modern noise-related health problems (World Health Organization 2018). In addition, background speech was used in six studies to investigate the physiological response of participants. This is also the most commonly used stimulus in the review by Schlittmeier (2021) to investigate the effects of noise on cognitive performance.
In conclusion, these two types of stimuli have already been used to investigate general noises in everyday life and at workplaces. This also includes the use of office noise as a stimulus. However, although noise at the workplace is responsible for many health problems due to often long exposure times (World Health Organization 2018), there is a lack of studies in the field presented here on numerous specific workplace noises. Studies such as the investigation of noises inside an operating room by Salzman et al (Wang et al 2022) or the investigation of responses to industrial noises by Lu et al (Hossain et al 2019) are taking a first step in this direction.
In addition to the approach of identifying potentially stress-inducing sounds, some studies take the opposite approach by investigating which sounds can bring the body back to a resting state after stress responses. Again, the results of this review indicate that only a few studies investigate this aspect by measuring the physiological response to nature sounds (Trimmel et al 2012, Abbasi et al 2020). Here, in conjunction with, for example, the soundscape approach (DIN ISO/TS 12913-3:2021-06 2019), further insights into the restorative effects of sounds could be obtained in the future.
The commonly used cognitive tests extracted in this review, the visual recall task and the n-back test, are also consistent with those in the review by Schlittmeier (2021). They concluded that the visual recall task is most frequently used for testing cognitive performance under noise. As their review shows, the n-back test is also used in basic research reports on this topic. The fact that these cognitive tests have already been used successfully in combination with acoustic stimuli in the field of psychology explains their use in combination with physiological measurements.

Limitations
As mentioned before, the comparisons and analyses in this review should be taken with a certain degree of caution, as a wide variety of experimental designs and methods were used in the studies. Thus, this review cannot claim to highlight the one perfect experimental design, but rather to introduce different methods and show which combinations of methods worked well in each specific case.
It should also be noted that the search was limited to two databases and that only laboratory studies were included in the final selection. A review on the use of physiological measurements with acoustic stimuli in field studies, together with this review, might create an even more comprehensive picture of the investigation of physiological responses to sounds.

Conclusion
The use of physiological measurements to capture physiological responses to acoustic stimuli is far from trivial. To provide guidance in this area, this review extracts the various factors that influence whether the conclusions drawn from a study are valid. New studies in this field can consult this review for help in planning the study, since this review lists important factors and suggests possible methods that have been used successfully in past studies.
Here, the crucial factor is the psychological load that is to be investigated. With the help of table 3, suitable measures and resulting features can then be selected. Together with the stimuli to be investigated, these features in turn provide information on how the experimental design should be arranged. A suitable scenario can be selected with the help of table 4 and the individual experimental phases can be determined with the suggestions from table 6. For a comprehensive evaluation, table 6 can again help to find suitable signal processing and analysis methods for the features. Perceptual data can be used to complement this, as discussed at the end of section 3. In the future, this guidance could enable many more studies to be conducted on the effects of sounds on humans. Above all, the knowledge gained in this way can help to better assess the health consequences of noise. Further investigations of the types of stimuli already studied will contribute to this assessment as much as investigations of more specific, new stimuli. With the help of these further investigations, the conclusions regarding the response to noise can be strengthened and become more valid.

Figure 1. Search query, each of which connects an acoustic term to one of the physiological signals with a logical AND. The commas represent a logical OR on both sides. The written search query thus contains: ('noise' OR 'listen*' OR ...) AND ('physiological response' OR 'emotional response' OR ...).
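The structure of this query, two OR-groups joined by a logical AND, can be sketched in a few lines. The helper names below are hypothetical, and only the terms spelled out in the caption are used; the full term lists are elided there and are not reconstructed here.

```python
def build_query(acoustic_terms, physiological_terms):
    """Join each term list with OR and connect the two groups with AND."""

    def or_group(terms):
        # Quote every term and combine the list into one bracketed OR-group.
        return "(" + " OR ".join(f"'{term}'" for term in terms) + ")"

    return f"{or_group(acoustic_terms)} AND {or_group(physiological_terms)}"


# Only the terms named in the caption; the remaining terms are elided there.
acoustic_terms = ["noise", "listen*"]
physiological_terms = ["physiological response", "emotional response"]
query = build_query(acoustic_terms, physiological_terms)
print(query)
```

The exact quoting and wildcard syntax depends on the database being searched, so the resulting string would need to be adapted accordingly.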

Figure 2. Report of the screening process with the individual steps according to the PRISMA standard. On the right, records and reports that were excluded from this review paper are listed.

Figure 3. Overview of the individual physiological measures and how frequently they were used in the 37 studies.

Figure 4. Overview of the individual psychological loads and how frequently they were investigated in the 37 studies.

Figure 5. Overview of the years in which the 37 studies of the final selection were published.

Table 1. Criteria which find their application in the methodology of a listening test (Otto et al 2001). Only a few examples are given for the attributes since every possible property of a sound can be used here.

Table 2. Overview of used physiological measures with their derived features and descriptions of the sensor systems.

Table 3. Overview of the 37 studies which resulted from the screening process in chronological order. For each study the used physiological measures and the investigated psychological load are listed. Nine studies did not use real sounds, but synthetic white noise or other broadband noise.
Study | Measures | Psychological load | Results
[...] | EEG: freq. bands power, Higuchi fractal dimension (HFD) and microstates | Attention | Sign. increase in HFD with binaural beats, sign. increase in duration and coverage of microstate D / decrease in microstate A (microstates: transient patterns in EEG signal)
Fan et al (2022) | EEG: freq. bands power; ECG: HRV (SDNN, RMSSD, pNN50); PM: fixation duration, saccade amplitude/velocity, pupil diameter | Stress | Features at rest unaffected by noise; with cognitive task: sign. higher SDNN and pNN50 under higher noise intensity, sign. decrease in pupil diameter from 85 to 80 dB, sign. decrease of saccade velocity with increased workload
Sadeghian et al (2022) | EEG: freq. bands power | Annoyance | Sign. increased activity of the θ and β bands and decreased activity of the α band with higher tone
[...] | ECG: HRV (LF, HF, N-N, RMSSD, SDNN); BP | Stress | Sign. decrease in HRV (LF and RMSSD) during low- and high-frequency noise exposure
Mackersie and Calderon-Moultrie (2016) | EDA: SCL; ECG: HRV (HF) | Listening effort | Sign. increase in SCL and a decrease in HF with increased speaking rate
Sim et al (2015) | ECG: HR, HRV (LF, HF, LF/HF, SDNN, total power (TP), physical stress index (PSI)) | Stress | Sign. difference after background noise in SDNN, PSI, TP, HF, and the LF/HF, sign. decrease in LF at background speech compared to noise
Paunovic et al (2014) | Thoracic electrical bioimpedance: cardiac parameter, HR, HRV (HF, LF, LF/HF)

al 2007, Trimmel et al 2012, Alhanbali et al 2021, Alyan et al 2021, Radun et al 2021, Salzman et al 2021). Five other studies used the NASA Task Load Index (TLX) instead (Zhou et al 2014, Waterland et al 2016, Salzman et al 2021, Li et al 2022, Ríos-López et al 2022). It was designed to assess participants' cognitive load