Sound classification with time-frequency features in forest environment

The study of forest sound classification has drawn increasing attention recently due to its potential for monitoring illegal activities and natural disasters. Based on the forest sound classification dataset (FSC22), a dataset specific to sounds that may occur in a forest, five classification methods are used to investigate how recognition accuracy depends on the number of acoustic features and on the number of target classes. The results confirm that the extreme random forest is the best method for forest sound classification, with an accuracy of around 70% when the number of target classes is above 20. Furthermore, Mel-frequency cepstral coefficients are the critical feature for sound classification, while fuzzy labels in the dataset may reduce the recognition success rate.


Introduction
Environmental sound classification (ESC) plays a significant part in automatic environmental monitoring [1] and disaster prevention [2]. In natural environmental sound monitoring, a common application scenario is the recognition of possible man-made and natural forest threats. After sounds are captured by detectors distributed in the forest, the acoustic data need to be processed and classified into different sound classes. When a sound is possibly related to a threat, the relevant department can respond rapidly, reducing potential economic and labor losses.
One key element of sound recognition is the classification approach, and many efforts have been made in this field. Machine learning (ML) models, including K-nearest neighbors (KNN), Gaussian mixture models (GMM), support vector machines (SVM), random forests (RF), and XGBoost, as well as deep learning (DL) models such as convolutional neural networks (CNN), have been widely applied to sound recognition. Compared to DL, ML usually requires less execution time and lower computational power but depends more heavily on features accurately identified by humans [3]. To achieve good classification performance, each of these algorithms needs a significant amount of labeled data.
Another factor influencing recognition performance is the quality of the dataset. A large dataset covering all kinds of sound in a specific environment facilitates classification learning. Datasets used in prior studies of forest sound classification fall mainly into two classes. One class is animal sound collections for forest species recognition, for example, the xeno-canto Archive [4]. The other class is general environmental sound datasets such as ESC-50 [5] and FSD50K [6]. These datasets contain many types of environmental sound, including some that are rare in forest environments, and such sounds with low correlation to the forest environment may reduce classification accuracy. Since neither class of dataset reflects the real forest acoustic environment, the FSC22 dataset [7] was created, containing six classes of sounds that possibly exist in a forest environment.
Typically, sound signal classification has two steps: sound feature extraction and classification. The computational complexity of classification increases with the number of features. However, combining many individual features can lead to a lower classification rate, even though the combination may describe the sound more thoroughly. Therefore, it is helpful to exclude non-critical features to balance accuracy and computational complexity. This paper estimates the effects of different feature combinations and five classification methods on classification accuracy.
The rest of the paper is organized as follows: Section 2 briefly introduces the chosen classification methods; Section 3 presents the FSC22 dataset and the acoustic features to be extracted; Section 4 discusses the impact of these features and classification methods on the results; finally, Section 5 draws a brief conclusion.

Classification methods
Classification is a supervised data-processing task: models are trained on a set of labeled training data and evaluated on test data before being used to make predictions on new, unseen data. Many methods have been developed for classification problems. Five methods, K-nearest neighbors, decision tree, random forest, extremely randomized trees, and support vector machine, are introduced below.

K-nearest neighbor classifier
The K-nearest neighbor (KNN) algorithm is straightforward: given a training dataset and an unknown input instance, the algorithm finds the K subjects in the training dataset closest to the instance (the K neighbors) and assigns the instance to the class held by the majority of these K neighbors. It estimates the conditional probability by the following formula:

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j),$$

where K is an integer, $x_0$ is a test observation, $\mathcal{N}_0$ stands for the K nearest neighbors, and $I(y_i = j)$ is an indicator that equals one when the i-th neighbor belongs to class j.
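As a concrete illustration, the estimate above fits in a few lines of NumPy. This is a minimal sketch for exposition, not the paper's implementation; the function name and the choice of Euclidean distance are our assumptions.

```python
import numpy as np

def knn_class_probabilities(X_train, y_train, x0, k):
    """Estimate Pr(Y = j | X = x0) as the fraction of the K nearest
    training points (here, by Euclidean distance) that belong to class j."""
    dists = np.linalg.norm(X_train - x0, axis=1)  # distance from x0 to every training point
    neighbors = np.argsort(dists)[:k]             # indices of the K nearest points (N0)
    labels = y_train[neighbors]
    return {j: float(np.mean(labels == j)) for j in np.unique(y_train)}
```

The predicted class is then simply the one with the highest estimated probability, e.g. `max(probs, key=probs.get)`.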

Decision tree classifier
Decision trees, as a typical predictive model, reflect a mapping between object attributes and object values. Each test on an attribute is abstracted into an internal node, each test outcome into a branch, and each leaf node represents a category. All data eventually fall into leaf nodes, so the model can be used for both classification and regression, as shown in Figure 1.
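This node/branch/leaf structure can be inspected directly with scikit-learn; the snippet below is a toy sketch on the Iris data, not on the FSC22 features.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Each internal node tests one attribute, each branch is a test outcome,
# and each leaf holds the predicted category.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))  # textual view of the nodes, branches, and leaves
```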

Random forest classifier
Using the bootstrap method, a random forest (RF) builds several decision trees, collectively known as a forest, over samples drawn from the training set. When building each decision tree, RF selects the best split at each node. This procedure is repeated on different subsets of the data with various attributes until an appropriate number of trees, referred to as sub-models, have been built. A voting approach is used for classification: the category that receives the most votes across the sub-models is the chosen category. Figure 1 compares the decision trees produced by the random forest algorithm with a single decision tree, which uses a branching strategy to represent all potential outcomes of a decision.
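A sketch of this build-and-vote idea with scikit-learn follows; note that `RandomForestClassifier` actually averages the trees' class probabilities, which usually coincides with a hard majority vote. The synthetic data are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for extracted acoustic feature vectors.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each sub-model (tree) casts a vote; the majority class is taken as the answer.
votes = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
majority = [np.bincount(col.astype(int)).argmax() for col in votes.T]
print(majority)           # hard majority vote over the sub-models
print(rf.predict(X[:5]))  # scikit-learn's probability-averaged prediction
```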

Extremely randomized tree classifier
The extremely randomized tree classifier is an ensemble learning method that produces classification results by combining the outputs of several de-correlated decision trees that form a forest. Each decision tree in an extremely randomized forest is grown on the original training sample. At each test node, each tree draws a random sample of k features and must split the data on the best of these features according to some mathematical criterion. This random sampling of features yields a collection of de-correlated decision trees.
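The difference from a random forest lies mainly in how splits are drawn. A brief comparison sketch is shown below; the synthetic data are arbitrary (the 46 features loosely mirror the 7 common features plus 39 MFCCs used later), not the paper's experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Arbitrary synthetic stand-in for the extracted feature vectors.
X, y = make_classification(n_samples=1000, n_features=46, n_informative=20,
                           n_classes=5, random_state=0)

# Extra-trees grows each tree on the whole original sample (no bootstrap by
# default) and draws candidate split thresholds at random, which further
# de-correlates the trees compared with a random forest.
for model in (ExtraTreesClassifier(n_estimators=100, random_state=0),
              RandomForestClassifier(n_estimators=100, random_state=0)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```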

Support vector machine
A support vector machine (SVM) uses supervised learning to draw decision boundaries between classes of data, with the maximum-margin hyperplane computed from the training samples serving as the decision boundary. It is especially effective on linearly separable data. The fundamental concept is to convert the problem into a convex quadratic programming problem. The SVM method looks for the classifier that maximizes the classification margin, i.e., the distance (which can be defined by Euclidean geometry or cosine similarity) between the hyperplane and the nearest data points. The larger this margin, the better the decision plane, so the decision plane with the largest spacing is regarded as the optimal solution that SVM seeks. As Figure 2 displays, the sample points on both sides that determine this solution, which can be connected into straight lines, are defined as "support vectors".
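A minimal sketch of a linear maximum-margin classifier with scikit-learn; the two-blob data are purely illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two roughly separable clusters; the fitted hyperplane maximizes the margin.
X, y = make_blobs(n_samples=60, centers=2, random_state=0)
svm = SVC(kernel="linear", C=1.0).fit(X, y)

# The "support vectors" are the training points closest to the decision boundary.
print(svm.support_vectors_)
```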

Database and feature extraction
As explained, the project aims to identify and categorize environmental sounds in a specific natural environment, including sounds of human activity. To do that, a well-organized database containing quality audio data is necessary. Rather than a popular generic environmental sound dataset, we used the recently published Forest Sound Classification dataset (FSC22).

Forest sound classification dataset
FSC22 was created as a benchmark forest sound dataset. It comprises the six most prevalent types of stationary and non-stationary sound sources: machine, animal, environment, vehicle, forest threat, and human voice. Each type is further divided into subclasses to prevent ambiguous labeling. In total, 27 scenario-specific low-level classes were created from the 2,025 audio samples, with a normalized sample rate of 44,100 Hz.

Acoustic features
Generally, multiple acoustic features are combined for sound classification, since a single acoustic feature can hardly describe a sound fully. To characterize the sound data from multiple perspectives, seven acoustic features, spectral centroid, bandwidth, contrast, flatness, roll-off, zero-crossing rate (ZCR), and root-mean-square (RMS) energy, are used to describe the audio clips. Besides these common features, Mel-frequency cepstral coefficients (MFCCs), which represent the short-term power spectrum of a sound on a nonlinear Mel scale of frequency via a linear cosine transform, are another popular feature in automatic speech and sound recognition. Together, these features give a general description of the sound.
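As a sketch of how such features could be extracted with librosa (our assumption; the paper does not name its toolchain), each frame-level feature can be averaged over time into one fixed-length vector per clip. The aggregation by time-averaging is also an assumption, since the paper does not state how frame-level features are summarized.

```python
import librosa
import numpy as np

def extract_features(path, n_mfcc=39):
    """Summarize one audio clip by time-averaged acoustic features."""
    y, sr = librosa.load(path, sr=44100)
    feats = [
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr, n_bands=6),  # 7 sub-bands
        librosa.feature.spectral_flatness(y=y),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.rms(y=y),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc),
    ]
    # Average each feature over time frames and concatenate into one vector.
    return np.concatenate([f.mean(axis=1) for f in feats])
```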

Experiment and results
To evaluate the effects of different acoustic feature groups on classification rates, the features extracted from the FSC22 dataset are divided into three groups: the spectral contrast of seven sub-bands of the spectrum, the 39 MFCCs, and the other common acoustic features introduced in the previous section. These groups are referred to as SC, MFCC, and CAF, respectively. After feature extraction, five classification approaches are applied to the dataset to compare their impact on classification accuracy. In addition, since FSC22 contains 27 sub-classes of sound, the relation between the number of target sub-classes and accuracy is also investigated. We first chose sub-classes and the corresponding audio data with target class numbers ranging from 2 to 27 at intervals of 5. For each number, sub-classes are chosen randomly, and classification is run five times with different sub-class combinations; the reported result is the accuracy averaged over the five runs (a sketch of this protocol is given at the end of this section). All five approaches are applied to clarify their impact on accuracy.

The results are shown in Figure 3. All methods succeed when only two sub-classes of data are considered. However, the classification rate of SVM drops steeply when the number of sub-classes increases to 3, and its accuracy stays below 20% when the class number is above 5. This reflects the limitation that SVM is oriented toward binary classification problems. In contrast, RF and ERF give the best recognition for all sub-class combinations. ERF reaches an accuracy of 66% when all classes of sound are considered and exceeds 95% when fewer than seven target sound classes are used; moreover, the accuracies of RF and ERF remain above 0.6 regardless of the target class number. DT and KNN perform worse than RF and ERF on multi-class tasks, but much better than SVM.

For the comparison of feature combinations, all 27 sub-classes are considered during classification. Figure 4 shows the influence of the feature combinations on the final accuracy. CAF and SC, which together comprise the seven common acoustic features, both lead to accuracies below 0.4, meaning they do not provide effective guidance for classification. On the contrary, the result based on the MFCCs alone is similar to the result based on all features, with a maximum deviation below 10%. This indicates that the MFCCs are the main feature of the sound.

Considering all features and all sub-classes, the per-class accuracy of ERF can be expressed as a confusion matrix, shown in Figure 5. The classification accuracies are above 70% for nine types of sound, while those of seven types are around 30%. Besides the limitations of the ERF approach, the similarity of some audio clips and their acoustic features is also responsible for the low accuracy. For example, only 26% of generator sounds are correctly recognized, while 21% of them are classified as vehicle engine sounds and 16% as speaking. From one perspective, the sound sources of the generator and the vehicle engine overlap, and the structures of their MFCCs are similar, as reflected by their Mel-spectrograms in Figure 6.
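The following is a sketch of the evaluation protocol described above, using ERF as the classifier. The 80/20 stratified split is an assumption, since the paper does not state its train/test partition; `X` and `y` stand for the extracted feature matrix and the sub-class labels.

```python
import random
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def mean_accuracy(X, y, n_classes, repeats=5, seed=0):
    """Average test accuracy over `repeats` random combinations of `n_classes` sub-classes."""
    rng = random.Random(seed)
    labels = sorted(set(y))
    scores = []
    for r in range(repeats):
        chosen = rng.sample(labels, n_classes)  # random sub-class combination
        mask = np.isin(y, chosen)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[mask], y[mask], test_size=0.2, stratify=y[mask], random_state=r)
        model = ExtraTreesClassifier(n_estimators=100, random_state=r).fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return float(np.mean(scores))
```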

Conclusion
Environmental sound classification in forest ecosystems focuses on identifying natural and artificial phenomena and can satisfy the demand for monitoring illegal activities and disasters. In this paper, a classification of the FSC22 audio dataset based on five machine learning approaches is undertaken. The investigation showed that ERF provided the best recognition even as the number of target sound classes grew, with a classification accuracy of 66% on the full dataset. Furthermore, among all the extracted features, the MFCCs are the main contributors to high recognition accuracy. However, none of the models can effectively distinguish sounds with similar Mel-spectra. In future work, we plan to apply hierarchical sound signal classification to ESC, as it may make full use of the high accuracy of ERF at low target class numbers.

Figure 3. The dependence of classification accuracy on the target class number.

Figure 4. The classification accuracies with different feature groups.

Figure 5. Confusion matrix of the extreme random forest.