Research on Sound Source Localization of Multiple Fixed Targets Based on Machine Learning and Distributed Arrays

This paper studies a sound source localization method for multiple fixed targets based on machine learning and distributed arrays. In an outdoor open field, a three-line array was used to collect array data and compute time delay features. Multiple classification models were then established and trained. Finally, the locations of the sound source points were predicted by those models, among which the support vector machine (SVM), the k-nearest neighbor (KNN), and the naive Bayes model achieved 100% localization accuracy. Compared with conventional methods, this method has three significant advantages. First, it does not rely on the microphone channel order and needs no advance calibration, which simplifies the localization process. Second, it can fulfill high accuracy requirements and is especially suitable for scenes with multiple fixed targets. Third, it supports incremental learning: as the number of localizations grows, the training set is continuously enriched and the localization results become more precise.


Introduction
For sound source localization based on microphone arrays, beamforming algorithms based on the time difference of arrival (TDOA) are widely used. However, they depend heavily on the geometric relationship of the microphone arrays, and measurement error in the array positions strongly affects the localization result. As localization accuracy requirements rise, the number of microphones increases, and so does the number of sound acquisition channels. To prevent channel disorder from affecting localization, the channels and microphones must be matched one by one, which increases the difficulty of detection in large experiments. The requirement for correct channel order and precise calibration of the array shape impedes the development and application of sound localization technology.
Machine learning algorithms use classification to locate sound sources, which overcomes some shortcomings of traditional methods. They not only display enhanced robustness but can also remain effective when a microphone cannot receive the direct sound. In recent years, more and more researchers have applied algorithms incorporating SVMs and neural networks to sound source localization. In 2006, Jwu-Sheng Hu et al. proposed a localization method based on the Gaussian mixture model (GMM) [2]. In 2009, Huawei Chen et al. achieved indoor sound source localization using the least squares SVM method, which did not require advance calibration of the array shape [3]. In 2010, Byoung-gi Lee et al. extended the mutual angle correlation function from the cross-correlation function and used the k-means++ algorithm [4]. In 2016, Christoph Beck et al. used a heuristic pulse neural network to build a sound source localization system composed of 8 electromechanical microphones, which achieved very high localization accuracy [5]. In the same year, Daniele Salvati et al. proposed using an RBF-kernel support vector machine to construct a weighted minimum variance distortionless response beamformer, which could effectively handle single sound source localization in the near field [6].
This paper proposes a sound source localization method based on machine learning algorithms and distributed arrays. The method does not depend on the array shape or the microphone channel order. It is not only fast and accurate but also has the advantage of incremental learning, which means that the results become more accurate as the number of localizations increases.

Design of a distributed acoustic localization system for multiple fixed targets
The distributed [7] acoustic localization system comprises two parts: an acoustic measurement site and a data fusion center. An acoustic measurement site is composed of a microphone array and a host machine. The data fusion center refers to the laptop or upper computer equipped with signal processing software suitable for the device.
In practice, multiple acoustic measurement sites are deployed. Different sites can lay out arrays of different shapes and accomplish joint localization, making comprehensive use of various array shapes: for example, a combination of one line array and one cross array, three line arrays, or four cross arrays.

A theoretical description of the machine-learning methods
The sound signal samples corresponding to a sound source at different locations have distinct characteristics. By extracting this feature and establishing a classification model, the location of the burst point can be predicted; this is the principle of localization using machine learning methods. For high-SNR acoustic targets such as explosion sounds and gunshots, SVM, KNN, Naive Bayes [8] and other algorithms have achieved good localization results. For brevity, we only describe the SVM algorithm.
The SVM algorithm is essentially designed to solve the problem of finding the optimal hyperplane [9].
Given the training set T = {(x_1, y_1), (x_2, y_2), ⋯, (x_l, y_l)}, y_i ∈ {−1, +1}, where x_i is the input vector and y_i is the category to which the input vector belongs, the discriminant function of the training set is expressed by the following linear equation:

f(x) = w · x + b    (1)

where w is the normal vector, which determines the direction of the hyperplane, and b is the displacement constant, which determines the distance between the hyperplane and the origin. Mark the hyperplane as H. If the plane correctly classifies the training set, then

w · x_i + b ≥ +1 for y_i = +1, and w · x_i + b ≤ −1 for y_i = −1.    (2)

This can also be written as

y_i(w · x_i + b) ≥ 1, i = 1, ⋯, l.    (3)

The problem of finding the optimal hyperplane can then be described as

min_{w,b} (1/2)‖w‖²  subject to  y_i(w · x_i + b) ≥ 1, i = 1, ⋯, l.    (4)

The sound source location problem to be solved in this paper is described as follows. There are M microphones, which can be arranged into any shape, and N sound source points whose positions are fixed. The goal is to accurately locate the sound source position when signals are transmitted. The whole process is as follows. Firstly, collect data and obtain samples and labels: the samples are the sound signals, and the labels are the sound source location labels. With M microphones and N targets each sounding K times, N × K signal samples with N labels can be obtained, and each label contains K samples. The second step is feature extraction: TDOAs are chosen as the features to train the classification models. Select one of the microphones as a reference and calculate the TDOAs of all the microphones relative to it; M delays are then obtained (one of which is 0), forming the set of characteristic values of the label.
In the third step, the samples are divided into a training set and a test set. Generally, 60–70% of the sample data is selected as the training set and the remaining 30–40% as the test set. The fourth step is to train the classification model: the training set data, with labels and delays, are imported into the MATLAB Classification Learner to create the SVM classification model and train the relevant parameters.
The fifth step is to test the classification model. The test set data are imported into the trained model, the predictions are compared with the test set labels, and the accuracy is calculated.
In the sixth step, the trained and tested classification model is applied for localization. Select an arbitrary point from the N sound source points to transmit the sound signal, and calculate the TDOAs. Feed them into the trained classification model to obtain the corresponding label, which indicates the position of the target.
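The six steps above can be sketched in Python. The paper uses the MATLAB Classification Learner; scikit-learn is used here only as an illustrative stand-in, and the delay features, source counts, and noise levels are synthetic placeholders rather than measured data:

```python
# Sketch of the six-step localization pipeline described above.
# scikit-learn stands in for the MATLAB Classification Learner;
# the TDOA features below are synthetic placeholders, not measured data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_sources, n_mics, n_shots = 4, 12, 20   # hypothetical N sources, M mics, K shots each

# Steps 1-2: each fixed source has a characteristic TDOA vector (one entry per
# microphone relative to a reference mic, so one entry is always 0);
# each shot adds small measurement jitter around that vector.
true_tdoas = rng.uniform(-2e-3, 2e-3, size=(n_sources, n_mics))
true_tdoas[:, 0] = 0.0                    # reference microphone
X = np.vstack([t + rng.normal(0, 1e-5, size=(n_shots, n_mics)) for t in true_tdoas])
y = np.repeat(np.arange(n_sources), n_shots)

# Step 3: split into training and test sets (roughly 60/40 as in the paper)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

# Steps 4-6: train the SVM classifier and predict source labels for new shots
clf = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(f"test accuracy: {accuracy:.2f}")
```

With fixed sources and small delay jitter, the classes are well separated in TDOA space, which is why simple classifiers suffice in this setting.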

Field experiments and data processing
The experiment is conducted outdoors, with firecrackers serving as the sound source and a sampling frequency of 25.6 kHz. Four target areas are delineated in the field, a single pop is recorded in each of the four areas, and the same operations are repeated to construct the dataset.
As shown in Fig. 1, three line arrays are deployed, each having 12 microphones with a spacing of 0.65 m. Four burst points are uniformly arranged in the center of the array. GPS mapping is performed for each array as well as for the burst points to facilitate later inspection. Figs. 2 and 3 show the scene of the field experiment.
The machine-learning-based explosion sound localization process is divided into three steps: feature value extraction, classifier training, and test set prediction and evaluation.

Feature extraction
The generalized cross-correlation (GCC) algorithm [10] is applied to calculate the delay of each channel, which serves as the feature value. There are 36 channels in total, so 36 feature values can be obtained for each pop. The label of each explosion sound is the coordinate of the corresponding explosion point. Of the 75 explosion sound samples obtained, 30 are discarded for obvious errors, leaving 45 complete and valid samples, each yielding 36 feature values. Of these, 18 sample signals are selected as the training set and the remaining 27 are used as the test set. In this experiment there are three arrays, each with 12 microphones. For each array, the first microphone is selected as the reference, so the three arrays yield a total of 36 delays, three of which are 0. Fig. 4 shows the time delay features of the four burst points, and Table 1 lists the delays calculated from the received signals of one array.
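A minimal sketch of a GCC delay estimate with the common phase transform (PHAT) weighting for one channel pair follows; the signals here are synthetic impulses, and the exact variant of the GCC algorithm used in [10] may differ:

```python
# Sketch of generalized cross-correlation with phase transform (GCC-PHAT)
# for estimating the delay of one channel relative to the reference channel.
# Signals here are synthetic; real data come from the 36 recorded channels.
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Return the estimated delay (seconds) of sig relative to ref."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Synthetic check: a pulse delayed by 40 samples at fs = 25.6 kHz
fs = 25600
ref = np.zeros(1024); ref[100] = 1.0
sig = np.zeros(1024); sig[140] = 1.0       # arrives 40 samples later
print(gcc_phat_delay(sig, ref, fs) * fs)   # ≈ 40 samples
```

Repeating this for every channel against its array's reference microphone yields the 36-element delay vector used as the feature for each pop.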

Create and train the classifiers
In the MATLAB Classification Learner, multiple models are created, and the training accuracy is assessed by cross-validation. The confusion charts and ROC curves of several successful models are shown in Fig. 5. The rows of the confusion matrix correspond to the true class and the columns to the predicted class; diagonal and off-diagonal cells correspond to correctly and incorrectly classified observations, respectively. The number in the center of each cell indicates the number of samples for that tag. In confusion charts (a), (b) and (c), only the diagonal cells are colored, which means all the targets are correctly classified. Tags 1, 2, 3 and 4 contain 3, 5, 6 and 4 samples respectively, 18 samples in total, which is consistent with expectation. In ROC charts (a), (b) and (c), the red point marks the present classifier, where the false positive rate is 0 and the true positive rate is 1, so the area under the curve (AUC) is 1. In conclusion, the training accuracy of the three classification models (SVM, KNN, and Naive Bayes) reaches 100%.
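The cross-validated comparison of the three model families can be sketched as follows; scikit-learn again stands in for the MATLAB Classification Learner, and the delay features, sample counts, and noise levels are synthetic placeholders:

```python
# Sketch: cross-validate SVM, KNN, and naive Bayes on delay features,
# mirroring the model comparison done in the MATLAB Classification Learner.
# Features are synthetic placeholders for the 36 measured delays.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
n_sources, n_channels = 4, 36
centers = rng.uniform(-2e-3, 2e-3, size=(n_sources, n_channels))
X = np.vstack([c + rng.normal(0, 1e-5, size=(6, n_channels)) for c in centers])
y = np.repeat(np.arange(n_sources), 6)   # 24 labelled samples, 6 per burst point

models = {
    "SVM": SVC(kernel="rbf", gamma="scale"),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "NaiveBayes": GaussianNB(),
}
scores = {name: cross_val_score(m, X, y, cv=4).mean() for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: mean CV accuracy = {s:.2f}")
```

Because each fixed source occupies its own compact region of delay space, all three model families can reach perfect cross-validation accuracy, matching the behavior reported for the field data.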

Prediction and evaluation by using the test set
Apply the trained SVM model to the test set. Fig. 6 shows the result of using the test set to predict the sample labels. The blue line represents the labels of the actual explosion sounds, and the red line represents the predicted labels. To facilitate observation, the predicted line is shifted upward; in fact, the two lines completely coincide. Explosion points 1, 2, 3 and 4 correspond to 11, 5, 7 and 4 pops respectively. The accuracy of the model thus reaches 100%, which is consistent with reality.

Advantages
Compared with traditional sound source localization methods, the proposed machine learning method has the following advantages.

Lower complexity of signal processing and improved stability.
Conventional TDOA-based sound source localization algorithms need GPS to measure the precise position of each element in advance, then calculate each element's distance relative to the reference element, and then calculate the time delay, according to which delay compensation is conducted. After compensation, the delay of each channel should be 0. The data of each channel are then summed and energy integration is performed; the peak position of the integration indicates the location of the sound source. Because time-delay compensation is required for each microphone, an accurate microphone position and a correct correspondence between microphone and signal channel are demanded.
Compared with the traditional method, the machine-learning-based fixed targets localization method provided by this paper does not require the delay compensation, but takes the delay as the characteristic value.Therefore, the precise position of the microphone and strict channel-microphone correspondence are not required.
Traditional localization methods are affected by temperature, wind speed, pressure, array position measurement accuracy, and microphone channel order. In contrast, the machine learning method provided in this paper can constantly adapt to the changing environment through model training and updates, thus eliminating the complex analysis of the influence of environmental parameters and demonstrating higher stability and adaptability.

Meeting high accuracy requirements.
The core of machine-learning-based localization is classification, which utilizes historical sample data and prior information: once the model has been trained on the sample data, inputting new data directly yields the localization result. The traditional method is based on the joint localization of multiple arrays (the experiment in this paper uses three arrays), with the localization results computed in real time from the experimental data. The two localization methods are essentially different, and the machine-learning-based method achieves higher accuracy when targeting fixed sound sources.

Incremental learning advantages.
The machine-learning-based method for explosion sound localization of multiple fixed targets provided in this paper makes full use of historical data. After the detection array for the fixed targets is determined, each localization result can be added to the sample library once its label is verified correct. As the sample data grow, the localization accuracy becomes higher and higher. In addition, there is no delay compensation, integral operation, or other complex step, so localization is faster.
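The incremental-learning loop can be sketched as follows; refit-on-append is one simple realization of the idea, and the data, function names, and verification step are illustrative assumptions rather than the paper's implementation:

```python
# Sketch of the incremental-learning idea: each verified localization result
# is appended to the sample library, and the classifier is refit on the
# enlarged set. Data and names here are illustrative, not from the paper.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
centers = rng.uniform(-2e-3, 2e-3, size=(4, 36))   # 4 fixed targets, 36 delays

def shots(label, k):
    """Hypothetical helper: k jittered delay vectors for one target."""
    return centers[label] + rng.normal(0, 1e-5, size=(k, 36))

# Initial library: a few shots per target
X = np.vstack([shots(lbl, 3) for lbl in range(4)])
y = np.repeat(np.arange(4), 3)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Each new pop: predict its label, and if the label is verified correct,
# append the sample so future predictions draw on a richer library.
for true_label in (0, 2, 1, 3):
    sample = shots(true_label, 1)
    pred = int(clf.predict(sample)[0])
    if pred == true_label:                 # label verified correct
        X = np.vstack([X, sample])
        y = np.append(y, pred)
        clf.fit(X, y)                      # refit on the enlarged library
print(f"library size after updates: {len(y)}")
```

Because the target set is fixed, each correctly labelled pop simply densifies its class in delay space, which is what makes accuracy improve as the number of localizations grows.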

Shortcomings
Firstly, training the model requires historical data and prior information, and the localization accuracy suffers in the absence of sufficient data. In addition, only targets in the labeled area can be located. Lastly, only one label can be predicted at a time, so the method is not suitable for scenarios where two signals are completely aliased. However, such a situation seldom occurs in the localization of pulsed sound signals, and multi-source localization within a short time is feasible as long as the TDOAs can be calculated.

Conclusion
In this paper, machine learning is successfully applied to the sound source localization of multiple fixed targets, demonstrating that the SVM-based sound source localization method is well suited to the fixed-source scenario. Compared with the traditional multi-array localization method, it has the three main advantages summarized in the previous section.
However, the method is not applicable in scenarios with no prior information, insufficient historical data, or off-label targets. Introducing semi-supervised and unsupervised pattern recognition methods would help address this issue by delving deeper into the relationship between feature values and sound source locations, obtaining more universal rules, and achieving the localization of sound sources outside the label set. Future field experiments will also include more source locations.

Figure 1. Schematic Diagram of the Experimental Scene

Figure 4. Time Delay Feature Diagram of the Four Burst Points

Figure 2. Physical Diagram of the Linear Array 1
Figure 3. Physical Diagram of the Linear Array 2

Figure 5. Confusion Chart and ROC Curve of Several Successful Models

Figure 6. Comparison of the Prediction and Reality

Table 1. Time-delay calculation results