Effectiveness of cepstral coefficients for gunshot classification

This paper analyzes the efficiency of various frequency cepstral coefficients (FCC) in a non-speech application, specifically in classifying acoustic impulse events, namely gunshots. Various methods for the identification of such events are available. The majority of them are based on time-domain or frequency-domain algorithms; however, both of these domains have their limitations and disadvantages. In this article, FCC, which combine the advantages of both the frequency and time domains, are presented and analyzed. These originally speech-oriented features have shown potential not only in speech-related applications but also in other acoustic applications. A comparison of the classification efficiency based on features obtained using four different FCC, namely mel-frequency cepstral coefficients (MFCC), inverse mel-frequency cepstral coefficients (IMFCC), linear-frequency cepstral coefficients (LFCC), and gammatone-frequency cepstral coefficients (GTCC), is presented. An optimal frame length for the FCC calculation is also explored. Various gunshots from handguns and rifles of different calibers are used, together with multiple acoustic impulse events similar to gunshots that represent false alarms. More than 600 recordings of acoustic events have been acquired and used for training and validation of two designed classifiers, a support vector machine and a neural network. Accuracy, Recall and the Matthews correlation coefficient measure the classification success rate. The results reveal the superiority of GTCC over the other analyzed methods.


Introduction
With the growing prevalence of shootings in public areas and the rise in gun ownership among citizens around the world, the demand for autonomous systems capable of detecting and identifying dangerous events like a gunshot is growing not only in public spaces, e.g. hospitals, campuses, parks, or government buildings, not to mention traditional military applications. Existing surveillance technologies, like cameras or drones, in combination with physical security measures, such as guards or patrol officers, offer limited protection and may prove inadequate in certain situations. In such scenarios, an automatic acoustic-based detection system holds an advantage over the described methods. Its advantage lies in the ability to detect, localize and effectively classify the type of acoustic event and, in the case of a gunshot, even estimate the caliber of the gun. Acoustic surveillance systems have the potential to identify dangerous events in various environments and, when strategically placed, to track the source of continuous threats. Such a system has been introduced and described by the authors in [1]. The crucial inputs for the autonomous acoustic surveillance system to correctly identify an acoustic event, such as a gunshot, are the features extracted from the recorded acoustic impulse signal and used for classification. The acoustic pattern of a gunshot can be characterized by two phenomena: the muzzle blast, caused by an explosive charge creating hot, high-pressure gases that expand as acoustic energy from the gun barrel and produce a pattern similar to the letter N, and, in the case of a supersonic projectile, the shock wave, caused by combinations of compression and expansion shocks. The shock wave, also called the N-wave for its specific shape, propagates in a conic shape along a trajectory perpendicular to and away from the bullet trajectory. An example of a muzzle blast and a shock wave is shown in figure 1.

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
In figure 1(a), there is the 'N-pattern' characteristic of a muzzle blast, which lasts about 5 ms. The gunshot was taken from a CZ BREN 2 rifle of caliber 5.56 NATO with SD ammunition at a velocity of approx. 320 m s⁻¹. Figure 1(b) shows an 'N-wave' corresponding to a shock wave that lasts less than 3 ms. The gunshot was fired from a CZ 75 SP-01 Phantom (9 mm Luger) with a shell velocity of approx. 380 m s⁻¹. A more detailed explanation of gunshot theory can be found in [2-5]. The acoustic response of a gunshot depends on many factors: the barrel caliber, barrel length, type of bullet and its velocity, the chemical properties of the propellant, and environmental conditions such as air temperature, air humidity, wind speed, and wind direction. All these factors affect the resulting pattern. In addition, the environment where the shot was fired may also affect the shape of the recorded pattern. In dense areas, reflections and diffractions caused by objects or the ground itself can significantly alter the pattern [6]. Since gunshot characteristics are a complex problem determined by many factors, commonly used methods for feature extraction from acoustic impulse signals based on the time or frequency domain cannot be reliably applied. Considering the gunshot as an impulse signal, traditional frequency-domain signal processing methods such as spectral analysis provide the acoustic characteristic of the surrounding environment rather than the pattern of the gunshot [7]. The characteristic N-pattern of a gunshot is also affected by many phenomena that cause it to change, for example, nonlinear dispersion effects. In the case of a supersonic projectile, the shock wave completely dissipates once the projectile velocity drops below the speed of sound or the projectile hits an obstacle. In dense areas where multiple acoustic reflections may occur, this can further prevent reliable application of wavelet methods [8, 9]. Furthermore, examining the power spectra of gunshots of different calibers recorded at the same distance precludes the use of a simple filtering method [10].
This article directly follows the authors' work presented in [1], where the complete system for the detection, localization and classification of impulse acoustic events was introduced. It has been revealed that, due to its complexity, the gunshot pattern cannot be reliably detected using common time- or frequency-domain methods. Therefore, the authors introduced a novel two-stage detection algorithm. First, the recorded audio signal is pre-processed in the time domain with an algorithm based on a median filter, which is able to detect an impulse in the signal down to a signal-to-noise ratio of 5 dB. The second stage performs the calculation of the mel-frequency cepstral coefficients (MFCC), which are used as features for the classification of the impulse event, more specifically a gunshot, into individual calibers. The presented system is fully automated, capable of localizing and identifying the source of the event, and does not require a human operator; therefore, its operational costs are incomparably lower than those of commercially available systems [11-14].
In [1], it has also been shown that the signal processing method based on MFCC, originally used in other acoustic applications, can be reliably applied to gunshot recognition. The method was tested on various gunshots recorded at the same military shooting range, with promising results.
In this article, the authors compare the effectiveness of various frequency cepstral coefficients (FCC) in gunshot detection and recognition applications and test the optimal length of the frame of the acoustic signal to be processed with the FCC. The emphasis has been put not only on the variety of impulse acoustic events but also on the diversity of the environments where the events were recorded. The introduced methods are MFCC, inverse MFCC (IMFCC), linear-frequency cepstral coefficients (LFCC), and gammatone-frequency cepstral coefficients (GTCC). These FCC have shown very good performance in speech recognition tasks, but recently they have also been used in various acoustic applications such as music genre classification, animal sound classification, and environmental sound recognition.
Since the recorded audio signal is not a stationary waveform, framing the signal must be performed as the first step of every feature extraction method. This enables the representation of audio waveforms, non-stationary by their nature, as stationary frames of the signal. In commonly used methods for audio signal processing, such as speech, music genre, animal, or ambient sound recognition, the typical frame length is from 10 ms to 40 ms [15-19].
Therefore, the optimal length of the frame for feature extraction using cepstral analysis is also part of the experiment. Taking into account the time duration of all gunshot phenomena, including reflections and the time delays between muzzle blast and shock wave for the measured bullet speeds and distances, and with regard to the variety of false alarms, three different frame lengths of the acoustic signal have been tested: 15 ms, 30 ms and 50 ms.
The presented experiments use the features extracted by the four different FCC methods to identify and classify various acoustic impulse signals into false alarms and gunshots and, in the case of a gunshot, into the individual calibers. As classifiers, a support vector machine (SVM) and a neural network (NN) are used, and the metrics for their success evaluation include Accuracy, Recall and the Matthews correlation coefficient.
The paper is organized as follows. Section 2 introduces the selected methods for feature extraction. Section 3 explains the data acquisition and processing; examples of the impulse events are shown, and a detailed description of the methods is presented and discussed. Section 4 contains the results of the experiment, followed by section 5, where the results are discussed. The conclusion and future work directions are outlined in section 6.

Feature extraction method
The analysis of the gunshot pattern described in the Introduction revealed its complexity. It has been shown that conventional methods for feature extraction may not work reliably. Therefore, more complex processing offering the advantages of both the time and frequency domains points the way to reliable identification of a gunshot and its classification into individual calibers. This can be done using methods based on cepstral analysis. Cepstral analysis combines both the time and frequency domains, and cepstral features possess several advantages, including source-filter separation, conciseness, and orthogonality, making them convenient for training machine learning algorithms [20-22]. They have already proved their effectiveness in feature extraction in another acoustic field, speech-oriented applications.
The best-known and most commonly used method for feature extraction based on cepstral coefficients, which applies the fast Fourier transform (FFT), filters, and power spectra, is MFCC. MFCC has a wide range of uses in audio spectral characteristics applications focusing on speech and speaker identification, music genre classification, or classification of sounds in general [23-28]. These cepstral coefficients are based on a mel-frequency spacing of the filter bank and its energies, which mimics the audio perception of the human ear. The filter bank applied for feature extraction can use the base mel-frequency filter bank or its modifications, the linear-frequency (LFCC) or inverse mel-frequency (IMFCC) filter banks [29]. Another two types of cepstral coefficients, derived from linear prediction coefficients, are linear prediction cepstral coefficients (LPCC) and perceptual linear prediction coefficients (PLPC). LPCC is mainly used in noise elimination [30], music genre classification [31], and speech recognition [32]. PLPC involves critical-band spectral resolution, the equal-loudness curve, and the intensity-loudness power law [20]. These cepstral coefficients are mainly used for animal sound classification [29], emotion identification [30], and speech recognition systems [33]. The last two types of cepstral coefficients are both derived from the human auditory response and the analysis performed in the cochlea [34]. Greenwood function cepstral coefficients are primarily used for animal and bird sound identification and classification [20], while GTCC are considered the most noise-robust features in audio recognition applications. The extraction of GTCC is similar to MFCC but is based on the gammatone filter bank, whose output is a frequency-time domain representation of the acoustic signal. Unlike MFCC, GTCC are more robust against noise. The main applications of GTCC are environmental sound identification and speech recognition [35].
To summarize the above, the most efficient cepstral coefficients in environmental and general ambient sound recognition and classification are MFCC with its modifications and GTCC. In these applications, some studies favor GTCC, while others favor MFCC over GTCC [34, 36, 37].
This study focuses on a comparison of the efficiency of these methods and their modifications in gunshot identification and classification. It also explores different sizes of the frame used for feature calculation. All the cepstral coefficient feature extraction methods use the same approach, as shown in figure 2.
The input audio signal is first split into short time frames. The frames are usually set to be from 10 ms up to 50 ms long, and the frame length can affect the resulting effectiveness of the extracted features for classification [38-44]. To minimize discontinuities at the frame edges, each frame is windowed by multiplying it by a Hamming window, and the FFT is then calculated. The spectra are filtered by a filter bank, a nonlinear rectification function (a log or power function) is applied to the filtered signal, and, finally, the discrete cosine transform is performed. As a result, the cepstral coefficients, which serve as the features extracted from the audio signal for later classification, are obtained.
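As an illustration, the steps above can be sketched in a few lines of Python. The paper's implementation is in MATLAB; the function and parameter names below are illustrative only, and the filter bank is a placeholder:

```python
import numpy as np
from scipy.fft import dct

def cepstral_features(signal, fs, frame_len_s, filter_bank, n_coeffs=13):
    """Generic FCC pipeline from figure 2: framing -> Hamming window ->
    FFT power spectrum -> filter-bank energies -> log -> DCT."""
    n = int(frame_len_s * fs)                     # samples per frame
    n_frames = len(signal) // n
    frames = signal[:n_frames * n].reshape(n_frames, n)
    windowed = frames * np.hamming(n)             # suppress edge discontinuities
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2
    energies = power @ filter_bank.T              # filter_bank: (n_filters, n//2 + 1)
    log_energies = np.log(energies + 1e-12)       # nonlinear rectification
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_coeffs]

# Usage with a 30 ms frame at 48 kHz and a random placeholder filter bank
fs = 48_000
rng = np.random.default_rng(0)
sig = rng.standard_normal(fs)                     # 1 s stand-in signal
fb = rng.random((26, int(0.03 * fs) // 2 + 1))    # 26 placeholder filters
feats = cepstral_features(sig, fs, 0.03, fb)
print(feats.shape)                                # (33, 13)
```

Swapping in a mel, inverse-mel, linear, or gammatone filter bank for `fb` yields the MFCC, IMFCC, LFCC, or GTCC variant, respectively.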

Data acquisition and processing
To assess the effectiveness of the proposed feature extraction methods, multiple acoustic impulse events, including gunshots and false alarms whose patterns are similar to gunshots, have been recorded and processed. For the purpose of verifying the correct separation of gunshots from other impulse environmental sounds, a diverse set of impulse acoustic events with a pattern similar to a gunshot has been selected to provide the basis for false alarms. The emphasis has been placed on selecting events that can be mistaken for a gunshot.
All recorded events have been captured by a PreSonus PRM1 electret microphone and digitized by a Rubix 44 USB audio interface with a 24-bit analogue-to-digital converter at a sampling frequency of f_s = 48 kHz. The distance between the acoustic event and the microphone was approx. 20 m to 60 m in the case of gunshots and much closer, approx. a few meters, in the case of false alarms, in order to achieve an amplitude intensity similar to that of the gunshots. About 500 impulse acoustic events have been recorded and processed to test the individual feature extraction methods based on cepstral coefficients. An overview of all events and their counts is given in table 1. In the case of gunshots, the bullet muzzle velocity v is also presented.
A simple impulse-type pre-detection algorithm based on a modified median filter preprocesses the recorded and digitized acoustic signal and splits it into frames of adjustable length. More details about the algorithm implementation can be found in [1]. Examples of the measured acoustic patterns for all four tested gunshots are presented in figure 4. The figure shows the detail of a filtered acoustic signal where the peak was detected, and its near vicinity with no reflections.
The selected false alarms are represented by hand claps, bubble wrap popping, hand slams, door slams and other kinds of slams; in general, events with an acoustic response similar to that of a gunshot. All of these other sounds have been recorded by the authors in the same way as the gunshots.
An example of the similarity between a gunshot pattern and a selected false alarm is presented in figures 5 and 6. Figure 5 shows a recorded acoustic pattern corresponding to subsonic shots (5.56 NATO SD ammunition) and its details: muzzle blasts at times of approx. 0.05 s (taken at a distance of 50 m) and 0.4 s (taken at a distance of 30 m). Figure 6 depicts the acoustic pattern of a door slam recorded from a much closer distance of approx. 4 m. An N-shaped pattern typical for a gunshot is clearly visible at a time of approx. 0.2 s.
Upon comparing both figures, it becomes evident that the acoustic pattern produced by a gunshot and that of a door slam could be mistaken for each other if we rely solely on detecting the N-shaped pattern. Last but not least, figure 7 shows an example of a recorded gunshot taken in a noisy environment. Again, it can be seen that, without particular signal processing, it is not possible to correctly detect a gunshot.
As a part of the experiment, different lengths of the frames containing the detected acoustic peak are considered. Given the complexity of a gunshot, the speeds of the bullets used in the experiment (from the speed of sound at a temperature of 30 °C, v_30°C ≈ 350 m s⁻¹, up to a maximum supersonic bullet speed of approx. 380 m s⁻¹) and the distances between the shooter and the recording device (up to 60 m), the shortest frame size has been set to 15 ms. Taking into account the reflections and the length of the acoustic patterns of false alarms, two additional frame sizes of 30 ms and 50 ms have been tested to compare the effectiveness of the presented feature extraction methods.
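For orientation, the tested frame lengths translate into the following frame sizes in samples at the 48 kHz sampling rate; even the shortest 15 ms frame (720 samples) comfortably spans a muzzle blast of about 5 ms (roughly 240 samples, cf. figure 1):

```python
FS = 48_000                                            # sampling frequency (Hz)
frame_samples = {ms: FS * ms // 1000 for ms in (15, 30, 50)}
print(frame_samples)                                   # {15: 720, 30: 1440, 50: 2400}
```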
After framing the signal, the frames are multiplied by a Hamming window, and the FFT is performed. From the FFT output, by applying a filter bank, the individual FCC are calculated as shown in figure 2. Previous experiments with acoustic signals [1, 38, 42, 45, 46] showed that the optimal number of filters is about 20-30. Therefore, a filter bank with 26 filters was designed and applied. The number of filters was chosen as a trade-off between classification accuracy and filter bank calculation complexity. The bandwidth of the filter bank starts at f_min = 1 Hz and ends at f_max = 24 kHz for all four methods, MFCC, IMFCC, LFCC, and GTCC. The implemented filter banks based on the mel frequency are shown in figure 8. The frequency characteristic of the gammatone filter bank is in figure 9.
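A minimal sketch of how a mel-spaced filter bank with the above parameters (26 filters, f_min = 1 Hz, f_max = 24 kHz, a 1440-sample frame at 48 kHz) can be constructed is shown below; the paper's actual implementation is in MATLAB, and the names here are illustrative:

```python
import numpy as np

def mel_filter_bank(n_filters=26, fs=48_000, n_fft=1440,
                    f_min=1.0, f_max=24_000.0):
    """Triangular filters spaced evenly on the mel scale, applied to an
    rfft power spectrum of length n_fft // 2 + 1 (MFCC-style; IMFCC
    reverses the spacing, LFCC uses linear spacing instead)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_filters triangles need n_filters + 2 edge frequencies
    edges_hz = inv_mel(np.linspace(mel(f_min), mel(f_max), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges_hz / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fb[i, k] = (k - lo) / max(mid - lo, 1)   # rising edge
        for k in range(mid, hi):
            fb[i, k] = (hi - k) / max(hi - mid, 1)   # falling edge
    return fb

fb = mel_filter_bank()
print(fb.shape)   # (26, 721)
```

The triangular shape is what the Discussion section contrasts with the smoother, more overlapping gammatone magnitude response.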
The implementation of all the functions and algorithms described above has been done by the authors in the MATLAB software. More details about the implementation can be found in [47, 48].
The calculated 26 cepstral coefficients are then used as features for an SVM classifier with a second-order polynomial kernel and an NN with 20 layers (an input layer, 18 hidden layers, and an output layer). The layers were trained using the Levenberg-Marquardt training method and the MATLAB NN training tool [49]. Both designed classifiers were tested to compare the effectiveness of the four different FCC in acoustic impulse event detection.
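A rough scikit-learn analogue of this setup is sketched below. This is an approximation only: the paper trains the NN with Levenberg-Marquardt in MATLAB, whereas scikit-learn's MLPClassifier uses different optimizers, and the toy data merely stands in for the 26 cepstral features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Toy stand-in data: 600 samples x 26 "cepstral" features, 5 classes
rng = np.random.default_rng(0)
y = np.repeat(np.arange(5), 120)
X = rng.standard_normal((600, 26)) + y[:, None]   # class-dependent shift

# ~75 % training/validation, ~25 % testing, as in the experiment
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

svm = SVC(kernel='poly', degree=2).fit(X_tr, y_tr)   # 2nd-order polynomial kernel
nn = MLPClassifier(hidden_layer_sizes=(20,) * 18,    # 18 hidden layers
                   max_iter=2000, random_state=0).fit(X_tr, y_tr)

print('SVM accuracy:', svm.score(X_te, y_te))
print('NN accuracy:', nn.score(X_te, y_te))
```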

Results
As shown in table 1, four types of gunshots and various types of false alarms with patterns similar to a gunshot have been recorded, processed and classified to test the efficiency of the individual cepstral coefficients and the optimal frame length of the input data. In this way, five classes for the multi-label classification task are set: 0 (false alarms), 1 (9 mm), 2 (5.56 NATO SD), 3 (7.62 mm Tokarev), and 4 (.22). The 26 features extracted by the four FCC methods for each sample have been divided into training and testing data sets.
Approx. 75% of the samples from each class (four gunshot classes and the false alarm class) were used for training and validation, and the rest (≈25%) were used for testing. The training process was performed using the MATLAB Neural Fitting tool, where the division of the data into training, validation, and testing groups was random. The Levenberg-Marquardt optimization algorithm was selected for training because of the relatively limited amount of data. Mean squared error was used as the performance measure, and a MEX function was used for the calculations [50]. The resulting NN was trained after 370 iterations. An overview of all sample numbers used for testing and training is in table 2. The experiment results for all the feature extraction methods and both classifiers (SVM and NN) are presented in the following figures and tables.
Figure 10 depicts confusion matrices of all four FCC feature extraction methods for the SVM classifier and for all three frame lengths of 15 ms, 30 ms and 50 ms.
Based on the confusion matrices calculated above, some patterns of the methods are clearly visible. The mel-frequency-based cepstral coefficients have similar results. On the other hand, the gammatone-frequency cepstral coefficients show a much more promising score. It is also visible that the classification is more successful with increasing frame length. The resulting confusion matrices for the NN classifier are in figure 11.
These confusion matrices confirm the results obtained with the SVM classifier. They show the dominance of GTCC over the mel-frequency-type cepstral coefficients. They also reveal that the NN tends to classify better than the SVM for some of the classes in the performed tests.
Using proper metrics calculated from the confusion matrices, it is possible to examine the results more closely and to understand the overall effectiveness of the presented feature extraction methods and the different frame lengths. This reveals which methods and parameters are more efficient in our case of usage. The used metrics are defined as

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Recall = TP / (TP + FN),
MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where T stands for true, F stands for false, P stands for positive, and N stands for negative. Tables 3 and 4 present results for the first two classes, 9 mm and 5.56 NATO SD ammunition, which have velocities close to the speed of sound and patterns similar to each other (see figure 4).
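The metrics can be computed directly from the binary confusion-matrix counts; a small Python helper is shown below (the counts are illustrative, not taken from the paper's tables):

```python
from math import sqrt

def metrics(tp, tn, fp, fn):
    """Accuracy, Recall and Matthews correlation coefficient (MCC)
    from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, recall, mcc

# Illustrative counts only
acc, rec, mcc = metrics(tp=45, tn=140, fp=5, fn=10)
print(round(acc, 3), round(rec, 3), round(mcc, 3))  # 0.925 0.818 0.808
```

Unlike Accuracy, MCC stays informative on imbalanced classes, which is why it is reported alongside Accuracy and Recall here.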
It is evident that all three mel-frequency-type cepstral coefficients have very similar results. In comparison, GTCC achieved a much better score in almost all cases. It is also visible that classification accuracy improves with increasing frame length. On the other hand, a longer frame could contain more reflections typical for the place where the sensor is located and would therefore include the acoustic impulse response of the local reverberant environment. This would cause the frame to be more likely to contain information about the acoustic surroundings than about the gunshot itself. The achieved results also favor the NN over the SVM classifier in most cases.
Tables 5 and 6 show the results for another two resembling gunshot patterns, the 7.62 mm Tokarev and .22 ammunition. The results presented in tables 5 and 6 confirm the superiority of the GTCC over the other FCC, the slightly better classification success of the NN, and the advantage of a longer frame length.
It can also be seen that, if there is no variability in the environment and the same gun is used, the results are significantly better than in the cases of the 9 mm and 5.56 NATO SD classes, which were recorded with multiple firearms.
The overall worst results of all five classes are shown in table 7. This is caused by the significant variability of the recorded false alarms and their intended similarity to gunshot patterns of all types. Nevertheless, the same scenario as in the gunshot-type classes is apparent. The GTCC gives the best results of all four feature extraction methods, confirming its superiority over the mel-frequency-based methods. Again, longer frames are more successful in the classification. Since most of the false alarms have been recorded in closed spaces, their recordings contain many reflections; therefore, longer frames include more extracted patterns for a classifier.
In case of a need for the classification of individual false alarms, such as distinguishing glass breaking or a human scream from other impulse events like hand claps or various slams, a significantly larger dataset containing a sufficient number of each of these individual events would have to be recorded and used for proper training of a classifier. However, results similar to those achieved for gunshot classification can be expected for the individual cepstral coefficients. On the other hand, the optimal frame length can vary, since some of these events can last for a longer period of time; from this, it can be expected that longer frames should offer better results.
However, this article focuses on the identification and classification effectiveness of various cepstral coefficients for gunshots only, and thus the classification success into individual false alarm classes is not the subject of this study.

Discussion
All five tested classes indicate the superiority of the GTCC feature extraction method over the mel-frequency-type feature extraction methods. The mel-frequency-type triangular magnitude response filter cuts the frequencies outside the filter, in contrast with the smooth shape of the gammatone magnitude filter, which has an increased overlap of the filters and includes more spectral information about the acoustic signal. The Matthews correlation coefficient is well over 0.90 for GTCC in almost all cases for both classifiers. This is confirmed by very high Recall values, which indicate a low number of missed detections (false negatives).

The classification algorithm also tested the optimal length of the frame with the detected acoustic impulse event for the given distances between the sensor and the event/gunshot. The results revealed that longer frames performed better across both classifiers. The optimal frame length for acoustic gunshot detection depends on the distance between the sensor and the shooter (up to ≈100 m); for bullet speeds around the speed of sound, the frame should be about 30-50 ms long. Frames that are too long can cause the extracted features to more likely contain information about the acoustic surroundings than about the gunshot itself. On the other hand, a frame that is too short can lose some of the critical information about the event, skipping some of the phenomena produced by a gunshot. This is, again, confirmed by the resulting Recall and MCC values.
Finally, the NN classifier produced better results than the SVM classifier across almost all the frame lengths and all four types of feature extraction algorithms.
In general, an SVM finds linear, or in our case nonlinear (second-order polynomial kernel), decision boundaries that separate the classes based on the support vectors, which may not capture as much complexity. On the other hand, an NN can learn complex decision boundaries that capture intricate relationships within the data. Another possible reason is that NNs have a greater capacity to learn from noisy data and to generalize well to unseen examples. It should be noted that the classification success of a classifier depends on the size and variability of the training set; however, under the same conditions, the NN slightly surpasses the SVM classifier.

Conclusion
The article introduced the classification of acoustic impulse signals, namely gunshots, and compared the classification success of four feature extraction methods based on cepstral coefficients. The tested methods were MFCC, IMFCC, LFCC, and GTCC, which have primarily been used for speech or ambient sound recognition applications. Various frame lengths were also tested as input parameters for the feature extraction methods: three different frame lengths of 15 ms, 30 ms, and 50 ms were compared to find the optimal one. Finally, two machine learning classification algorithms were implemented: an SVM and an NN. The experiment included four different gun calibers, 9 mm, 5.56 NATO SD, 7.62 mm Tokarev, and .22, and a set of false alarms consisting of impulse events with acoustic patterns similar to gunshots. These false alarms included slams, slaps, bubble wrap popping, etc. The experiment results revealed the dominance of GTCC in classification accuracy compared to the other, mel-frequency-type cepstral coefficients. Moreover, the results showed the optimal frame length with the detected acoustic impulse event to be 30-50 ms. Last but not least, the NN outperformed the SVM classifier in almost all cases. In conclusion, it can be stated that GTCC gives the best results in a non-speech recognition application such as gunshot detection.
In the future, the authors will focus on a greater variability of weapon calibers, a greater variety of environments where the shooting will be carried out, and an improved NN design.

Figure 2 .
Figure 2. The block diagram of the feature extraction algorithms.

Figure 3 .
Figure 3. The scheme of the shooting range where the experiments were taken.

Table 1. An overview of all recorded events.

Figure 4. An example of measured patterns of gunshots.

Figure 5 .
Figure 5. Acoustic pattern corresponding to 5.56 NATO SD ammunition.

Figure 6 .
Figure 6. Acoustic pattern corresponding to a door slam.

Figure 7 .
Figure 7. Signal corresponding to a gunshot taken by a .22 caliber in a noisy environment.

Figure 10 .
Figure 10. Confusion matrices for the SVM classifier and all three frame sizes.

Figure 11 .
Figure 11. Confusion matrices for the NN classifier and all three frame sizes.

Table 2 .
An overview of samples.

Table 3 .
Classification results for a 9 mm class.

Table 4 .
Classification results for a 5.56 NATO SD class.

Table 5 .
Classification results for a 7.62 mm Tokarev class.

Table 6 .
Classification results for a .22 class.

Table 7 .
Classification results for the false alarm class.