Multiple sound source localization using gammatone auditory filtering and direct sound component detection

To study multiple sound source localization under room reverberation and background noise, we analyze the shortcomings of the traditional broadband MUSIC method and of ordinary auditory-filtering-based broadband MUSIC, and then propose a new broadband MUSIC algorithm with gammatone auditory filtering under frequency component selection control and with detection of the ascending segment of the direct sound component. The proposed algorithm restricts the frequency components to the band of interest in the multichannel bandpass filtering stage. Detecting the direct sound component of each source to suppress room reverberation interference is also proposed; its merits are fast computation and avoidance of more complex de-reverberation algorithms. In addition, the pseudo-spectra of the different frequency channels are weighted by their maximum amplitudes for every speech frame. Both simulations and experiments in a real reverberant room show that the proposed method performs well. Dynamic multiple sound source localization results indicate that the average absolute azimuth error of the proposed algorithm is smaller and that its histogram has higher angle resolution.


Introduction
Sound source localization (SSL) is a key research topic in the field of speech signal processing. It usually uses an array of microphones to receive the acoustic signal and applies a series of signal processing techniques to estimate the direction of arrival (DOA) of the active sound sources. It plays a significant role in many application scenarios, such as sound source separation, speech enhancement and recognition, speech de-noising and echo cancellation, robot and human-computer interaction, remote video conferencing, smart home monitoring, and so on.
According to the relevant research results, most state-of-the-art sound source localization methods fall into three categories. The first category is localization based on the time difference of arrival (TDOA) between different microphone pairs. It is easy to implement and has low complexity; TDOA-based methods can meet the real-time requirements of single-source azimuth estimation, but fail when multiple active sound sources exist [1]. The second category is beamforming, a frequency-domain technique; this kind of localization suffers from larger angle ambiguity and high complexity when multiple active sound sources are present. The last category is the subspace decomposition technique, which is capable of handling multiple sources. The typical approach is the well-known multiple signal classification (MUSIC) algorithm. Broadband MUSIC, a variant of the MUSIC algorithm, can estimate the azimuths of multiple sound sources simultaneously and relies on the eigenvalue decomposition of the spectral covariance matrix. This method has high angle resolution.
The traditional MUSIC approach is usually applied to radar antenna array signal processing, which involves stationary far-field narrowband signals. Broadband MUSIC variants have gradually been put forward for different wideband application situations [2,3]. Although broadband MUSIC was developed to estimate the DOA of wideband signals, its positioning accuracy is not high, because it divides the whole frequency band into several equal-length sub-bands or frequency bins, and some of these bins do not satisfy the narrowband assumption. To be specific, bins in the low-frequency region cannot be treated as narrowband components: their center frequency (denoted f_0) is comparatively small while the bandwidth (denoted B) is fixed, so the narrowband approximation condition B/f_0 << 1 cannot be met. This broadband MUSIC approach therefore has an inherent estimation error. Auditory-filtering-based broadband MUSIC (Gammatone-BroadbandMUSIC) was presented to localize two sound sources in a simulated environment without reverberation [4]. Gammatone auditory filtering is a multichannel bandpass filtering technique with variable bandwidths, the bandwidth being smaller when the center frequency is low. However, the pseudo-spectra of all frequency channels must be calculated, causing a heavy computational load. Moreover, this algorithm is only suitable for reverberation-free environments.
Considering the above shortcomings, this paper proposes a new broadband MUSIC algorithm with gammatone auditory filtering under frequency component selection control and with detection of the ascending segment of the direct sound component, aiming to estimate the DOA of multiple active sound sources in a reverberant room. This paper assumes the far-field model. The proposed algorithm restricts the frequency components to the band of interest in the multichannel bandpass filtering stage, because the dominant frequencies of speech generally range from 200 Hz to 3000 Hz. In addition, detecting the direct component of the sound source suppresses the room reverberation, which enhances accuracy and reduces computing time owing to the smaller number of sound frames to be processed.

Gammatone filtering based MUSIC algorithm
The gammatone filter is a bank of bandpass filters with different center frequencies and bandwidths: the lower the center frequency, the narrower the corresponding bandwidth. The impulse response of the k-th filter is as follows [5,6,10]:

g_k(t) = c_k t^(n-1) e^(-2π b_k t) cos(2π f_k t + φ_k),  t ≥ 0, 1 ≤ k ≤ C  (1)

where C is the total number of channels, c_k is the gain coefficient, n is the filter order, b_k is the decay coefficient, f_k is the center frequency of the filter and φ_k is the phase. The frequency response of the fourth-order, 64-channel gammatone filterbank used in all simulations and experiments in this paper is shown in Figure 2. The decay coefficient b_k can be calculated by the following equation [6]:

b_k = 1.019 ERB(f_k),  ERB(f_k) = 24.7 (4.37 f_k / 1000 + 1)  (2)

where ERB is short for the equivalent rectangular bandwidth, a psychoacoustic measure of the auditory filter. This paper assumes the far-field model. Suppose that the number of sound sources is M and the uniform linear microphone array has N elements; the schematic diagram is shown in Figure 1. After the q-th speech frame x_n(q) received by the n-th microphone is filtered by the gammatone filterbank with C channels, the filtered signal can be denoted g_n^k(q, l), where q is the frame index, k is the channel index, l is the intra-frame offset, and L is the length of each frame.
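As a minimal sketch of the gammatone impulse response and the ERB-based decay coefficient above, assuming a fourth-order filter with unit gain (c_k = 1) and zero phase; all names and parameter values here are illustrative:

```python
import numpy as np

def erb(f):
    # Equivalent rectangular bandwidth (Eq. (2)), f in Hz
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fk, fs, n=4, duration=0.025):
    # Impulse response of one gammatone channel (Eq. (1)),
    # with gain c_k = 1 and phase phi_k = 0 for simplicity
    bk = 1.019 * erb(fk)                 # decay coefficient b_k
    t = np.arange(int(duration * fs)) / fs
    return t**(n - 1) * np.exp(-2 * np.pi * bk * t) * np.cos(2 * np.pi * fk * t)

# Example: one channel centered at 1 kHz, 16 kHz sampling rate
h = gammatone_ir(1000.0, 16000)
```

Filtering a frame with one channel is then a convolution of the frame with `h`; a full filterbank repeats this for C center frequencies.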
for 1 ≤ n ≤ N. The k-th frequency channel of the q-th frame received by the n-th microphone, g_n^k(q, l), is then converted into a complex signal by the Hilbert transform:

z_n^k(q, l) = Hilbert(g_n^k(q, l)),  1 ≤ l ≤ L

where l is the intra-frame offset. The subscript GH below denotes the result after gammatone filtering and Hilbert transform processing.
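The conversion to an analytic (complex) signal can be sketched with an FFT-based discrete Hilbert transform, a standard construction; the frame data here is illustrative:

```python
import numpy as np

def analytic(x):
    # Analytic signal z = x + j*H{x} via the FFT: keep the DC and
    # Nyquist bins, double the positive frequencies, zero the rest
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    if N % 2 == 0:
        h[N // 2] = 1.0
        h[1:N // 2] = 2.0
    else:
        h[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

# Illustrative gammatone-filtered real frame
rng = np.random.default_rng(0)
g = rng.standard_normal(512)
z = analytic(g)
```

For a real input the real part of `z` recovers the original frame, and the imaginary part is its Hilbert transform.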
Then the k-th frequency-channel signal of the q-th frame received by the microphone array can be written as:

X_GH(q, f_k) = A(f_k) S(q, f_k) + N(q, f_k)

where A(f_k) is the direction (steering) matrix of the whole sound source system at frequency f_k [3,4]. By computing the covariance matrix of X_GH(q, f_k) and carrying out its eigenvalue decomposition, we obtain the orthogonal signal subspace U_S and noise subspace U_N. The pseudo-spectrum of the k-th frequency channel of the q-th frame is then given by:

P(q, θ, f_k) = 1 / ( a^H(θ, f_k) U_N U_N^H a(θ, f_k) )

where a(θ, f_k) is the steering vector toward direction θ and the superscript H denotes conjugate transpose.
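The per-channel MUSIC step can be sketched for a uniform linear array as follows; the array geometry, the single test source, and the noise level are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def steering(theta_deg, fk, N, d, c=343.0):
    # Steering vector a(theta, f_k) of an N-element ULA with spacing d
    tau = d * np.sin(np.deg2rad(theta_deg)) / c   # inter-element delay
    return np.exp(-2j * np.pi * fk * tau * np.arange(N))

def music_spectrum(X, fk, M, N, d, thetas):
    # X: N x L analytic snapshots of one gammatone channel
    R = X @ X.conj().T / X.shape[1]               # covariance matrix
    w, V = np.linalg.eigh(R)                      # eigenvalues ascending
    Un = V[:, :N - M]                             # noise subspace U_N
    P = []
    for th in thetas:
        a = steering(th, fk, N, d)
        P.append(1.0 / np.abs(a.conj() @ Un @ Un.conj().T @ a))
    return np.array(P)

# Single 1 kHz source at +40 degrees, 8-mic ULA with d = 0.06 m
N, d, fk, M = 8, 0.06, 1000.0, 1
a_true = steering(40.0, fk, N, d)
s = np.exp(2j * np.pi * fk * np.arange(256) / 16000.0)
rng = np.random.default_rng(1)
X = np.outer(a_true, s) + 0.01 * (rng.standard_normal((N, 256))
                                  + 1j * rng.standard_normal((N, 256)))
thetas = np.arange(-90, 91)
P = music_spectrum(X, fk, M, N, d, thetas)
```

The pseudo-spectrum `P` peaks sharply at the true azimuth because the steering vector of the source direction is orthogonal to the noise subspace.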
2.2 Frequency selection control and detection of ascending segment of direct sound component

As mentioned above, the pseudo-spectra of all frequency channels must be calculated when using the Gammatone-BroadbandMUSIC algorithm. Besides its high complexity, some frequency channels probably contain only noise interference. We therefore restrict filtering and calculation to the frequency band of interest; in this paper, the band is limited to 200-3000 Hz. This essentially reduces the error interference and also reduces the processing time. The article [4] mentioned that the Gammatone-BroadbandMUSIC algorithm is only suitable for reverberation-free environments. However, reverberation in a room is unavoidable and usually very strong, which causes poor results. In fact, the voice signal encounters tables, chairs, wall absorption and multiple reflections during transmission, so the signal received by a microphone is a mixture along multiple propagation paths. It mainly consists of three parts: the direct sound component, early reflections and the reverberation component.
This paper proposes a novel method based on local-maximum judgment to detect the ascending segment of the direct sound component. The search terminates when K consecutive local-maximum points (K was set to 10 in this paper) all have amplitudes smaller than the currently scanned maximum point; that point is then regarded as the end point of the speech ascending segment. Although the maximum point detected by this method is not the exact boundary, the effect of late reflections and reverberation is greatly reduced. Figure 4 shows the detection result for the ascending segment of the direct sound. The simulation results in the next section also show that this detection method is valid and improves the SSL accuracy. It is worth mentioning that we found the estimation results of different frequency channels to be discrepant: the pseudo-spectrum of the lower-frequency region had larger peaks and a more reasonable number of peaks, whereas many pseudo peaks of relatively small amplitude appeared in the higher-frequency region. A weighting technique is therefore proposed according to the maximum value of each pseudo-spectrum.
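The local-maximum search described above can be sketched as follows, with K = 10 as in the text; the synthetic envelope (a fast rise followed by a decaying reverberant tail) and all other details are assumptions:

```python
import numpy as np

def ascending_segment_end(env, K=10):
    """Return the index ending the direct-sound ascending segment:
    the first local maximum followed by K consecutive local maxima
    of smaller amplitude (a sketch of the search in the text)."""
    # indices of local maxima of the amplitude envelope
    peaks = [i for i in range(1, len(env) - 1)
             if env[i] > env[i - 1] and env[i] >= env[i + 1]]
    for j, p in enumerate(peaks):
        later = peaks[j + 1:j + 1 + K]
        if len(later) == K and all(env[q] < env[p] for q in later):
            return p                  # end point of the ascending segment
    return None

# Synthetic envelope: linear rise for 50 samples, then decaying tail
t = np.arange(400)
env = np.where(t < 50, t / 50.0, np.exp(-(t - 50) / 60.0)) * (1 + 0.2 * np.sin(0.9 * t))
end = ascending_segment_end(env)
```

On this envelope the detected end point lands near the transition from the rise to the decaying tail, so the frames after it (dominated by reflections) can be discarded before localization.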
Then the final pseudo-spectrum of each speech frame is obtained by weighting each channel with the maximum of its own pseudo-spectrum and summing over the channels:

P(q, θ) = Σ_k w_k P(q, θ, f_k),  w_k = max_θ P(q, θ, f_k)
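Assuming the max-amplitude weighting described above, the combination across channels can be sketched as follows (the per-channel pseudo-spectra here are illustrative random data):

```python
import numpy as np

# Pseudo-spectra of C = 64 frequency channels over a 181-point
# theta grid (illustrative random data; rows = channels)
rng = np.random.default_rng(3)
P = np.abs(rng.standard_normal((64, 181)))

# Weight each channel by the maximum of its own pseudo-spectrum,
# then sum over channels to get the frame's final pseudo-spectrum
w = P.max(axis=1)                       # w_k = max_theta P(q, theta, f_k)
P_final = (w[:, None] * P).sum(axis=0)  # final pseudo-spectrum of the frame
```

This emphasizes the low-frequency channels, whose peaks the paper observes to be larger and more reliable, while down-weighting high-frequency channels full of small pseudo peaks.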

Simulation and experiment
In this section, the performance of the proposed method is shown from the perspective of both computer simulations and real localization experiments. We compare three performance indicators of the proposed method with those of broadband MUSIC and the Gammatone-BroadbandMUSIC method.

Performance indicators
The first performance indicator is the Frame Accuracy Rate (FAR), used to evaluate the per-frame estimation accuracy after framing the speech from a given position:

FAR = N_c / N_t × 100%

where N_c is the number of frames whose azimuth is estimated correctly and N_t is the total number of frames used in the calculation. The second indicator is the Mean Absolute Estimated Error (MAEE), which evaluates the absolute deviation of the estimated result from the reference angle for each speech frame [7]:

MAEE = (1/N_t) Σ_{q=1}^{N_t} |θ̂_q − θ|

where θ̂_q is the estimate of frame q and θ is the reference angle. The third indicator is the Histogram Absolute Estimated Error (HAEE), the absolute error of the histogram of the estimated results over all frames. Since HAEE is a statistical average of all results, it is smaller than the corresponding MAEE value.
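As a sketch, the FAR and MAEE indicators can be computed as follows; the correctness tolerance used for FAR is an assumption, since the paper does not specify it:

```python
import numpy as np

def far(est, ref, tol=5.0):
    # Frame Accuracy Rate: percentage of frames whose estimate lies
    # within tol degrees of the reference angle (tol is an assumption)
    est = np.asarray(est, dtype=float)
    return 100.0 * np.mean(np.abs(est - ref) <= tol)

def maee(est, ref):
    # Mean Absolute Estimated Error over all frames
    est = np.asarray(est, dtype=float)
    return np.mean(np.abs(est - ref))

est = [38.0, 41.0, 40.0, 52.0]   # per-frame azimuth estimates (degrees)
ref = 40.0                        # reference angle
```

With this data, three of the four frames fall within the tolerance (FAR = 75%) and the mean absolute error is 3.75 degrees.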

Simulation results
The simulation is based on Lehmann's improved image source method [8,9]. The simulated microphone array is a uniform linear array with 8 elements (N = 8) and inter-element spacing d = 0.06 m. The room size is 6.50 m × 7.80 m × 2.90 m. The array is located at (3.0 m, 1.5 m, 1.5 m). The microphone array and the sound sources are at the same height of 1.5 m, in the same horizontal plane, which helps reduce the estimation error. The sampling frequency is 200 kHz. The reflection coefficient of the four walls is 0.7, while the floor and the ceiling have 0.8 and 0.6, respectively. The acoustic velocity is 343 m/s. The three sound sources used in the simulation are a recorded Chinese pronunciation of "Tsinghua University" (referred to as sample 1), TV snow noise (referred to as sample 2) and an English pronunciation of the digits 1 to 10 (referred to as sample 3). Three typical azimuths are -25 degrees, +5 degrees and +40 degrees. The first simulation illustrates the effectiveness of the direct sound detection method. Sound source sample 1 was placed at +40 degrees at a distance of 2 m. The estimated results before and after direct sound detection were compared while changing the reverberation time T_60 from 200 to 600 ms. The specific results are shown in Figure 5 and Table 1. Both show that the improved Gammatone-BroadbandMUSIC algorithm with ascending-segment detection performs well, since all three performance indicators are better even under strong reverberation. This indicates that the proposed direct sound detection method is effective.
To compare the proposed method with broadband MUSIC (algorithm 1) and ordinary Gammatone-BroadbandMUSIC (algorithm 2), we performed a second simulation: the position of the microphone array was fixed, the reverberation time was T_60 = 200 ms, and the distance between the array and the sound source was varied from 1 m to 5 m in 1 m steps; each location was simulated for three different directions using the three different sound source signals, giving 45 groups of simulation data in total. The three algorithms were run on every group, and their average performance indicators are presented in Table 2. It is clear that the proposed algorithm has a higher frame accuracy rate and a smaller absolute estimation error.

Experimental results
This section presents two sets of experiments: single sound source localization and dynamic multiple sound source localization. All experiments were carried out in a real meeting room of approximate size 6.50 m × 7.80 m × 2.90 m. The room environment was complicated: there were many chairs, several sofas and a big meeting desk, which gave rise to strong reflections and reverberation and worsened the localization results.

Figure 6: Different positions for sound source acquisition

Figure 6 shows the 36 different positions on the same side of the microphone array. Their azimuth angles range from -70 degrees to +70 degrees in the same horizontal plane. We collected the three sound source samples at each position, a total of 108 groups of sound data. For each group of data, the three sound source localization algorithms were used for azimuth estimation; the results are shown in Table 3. As can be seen from Table 3, the proposed algorithm performs better than the other two algorithms, and the frame estimation accuracy for the three speech samples is close to 70%. On the other hand, to illustrate the effectiveness and extensibility of the algorithm, multiple sound source localization experiments were continued in the same indoor environment. The dynamic sound source experiment is as follows: three sound source positions, corresponding to the red rectangles (numbered 1, 2, 3) in Figure 9, were selected; their azimuths are approximately -21 degrees, +58 degrees and 0 degrees, respectively. The three sound sources were activated in turn, followed by two mixed voices and then all three sources simultaneously, so the number of active sound sources was variable. The experimental results of the dynamic estimation and the histograms of the three different algorithms are shown in Figure 7.
Table 4 compares the average HAEE indicators of the different algorithms; the proposed algorithm has the highest estimation accuracy. The dynamic estimation results of the different algorithms can be seen in Figure 7(a1)-(c1). The proposed algorithm not only outperforms the other two methods in the dynamic estimation of each frame, but also has better angle resolution in the statistical histogram.

Conclusion
In this paper, a new broadband MUSIC algorithm with gammatone auditory filtering under frequency component selection control and with detection of the ascending segment of the direct sound component is proposed. Based on the ordinary gammatone-filtering broadband MUSIC algorithm, three improvements are made. First, frequency component selection is controlled during gammatone filtering and limited to the band of interest. Second, the ascending segment of the direct sound component is detected, which effectively suppresses reflections and reverberation. Third, the pseudo-spectra of the different frequency channels of each frame are weighted according to the maximum value of each pseudo-spectrum. Under the three performance indicators given in this paper, the simulation and experimental results show that the proposed algorithm performs better. This paper only gives the azimuth estimates of multiple sound sources; in follow-up work, we will focus on estimating the corresponding distances.