Feature Extraction of Speech Signals Based on MFCC (Mel-Frequency Cepstral Coefficients)

A smart power plant establishes a modern energy and power system to achieve safe, efficient, green, and low-carbon power generation. Its defining characteristic is that the production process can be optimized autonomously: the relevant systems collect, analyze, judge, and plan their own behavior, and intelligently and dynamically optimize equipment configuration and parameters. This paper focuses on the optimal recognition configuration of MFCC features for smart power plants. We propose varying the number of filters and the order of the MFCC and observing the expressive effect of the resulting MFCC parameters, using the recognition accuracy of a neural network as the evaluation index. The network is built in a Python programming environment, and comparative experiments are used to analyze the influence of each parameter on how well the MFCC parameters express the speech information.


Introduction
The wide application of intelligent voice control technology in the electric power industry is an inevitable trend in developing intelligent power plants. In the coming era of the Internet of Things, the interactive mode of "speaking a demand, getting feedback" will be further extended. All electrical appliances will be able to "listen" or even "speak"; voice control will become an important means of building smart power plants.
One of the most important links in speech recognition is the extraction of speech feature parameters. A good speech feature parameter should have five attributes [1]: universality, uniqueness that is not easy to imitate, stability over time, robustness against noise interference, and convenience of extraction.
The MFCC speech signal feature was first proposed by Davis and Mermelstein [2]. The core step of this feature is realized in the frequency domain of the speech signal: the frequency axis is converted to the Mel frequency scale, and a cepstral operation then converts the result into the cepstral domain.
Regarding feature combination of MFCC, Guo [3] used the combined feature parameters of MFCC and ΔMFCC with a weighted VQ algorithm for identification; this combined feature was first proposed by D. Reynolds in 1996. Zhang [4] introduced a new feature parameter, LPMCC (the Mel transform of LPCC), which is a fusion of LPCC and MFCC, and also proposed a new extraction algorithm based on the Fisher criterion within the MFCC extraction algorithm. Li [5] proposed a dual acoustic feature integration method based on the x-vector architecture: unlike previous methods that directly concatenate the two acoustic features, it lets the two different feature parameters first enter two different branches of the neural network, integrating them in the layer before the statistical pooling layer.
In the study of endpoint detection, Wen [6] took the lead in using an improved double-threshold method that calculates the mean amplitude P of the first T frames to eliminate noise, while using an improved AMDF (average magnitude difference function) method for pitch detection, which improved the accuracy of endpoint detection. Bai et al. [7] added the first three components of MFCC to obtain MFRE (Mel-frequency energy ratio) as the speech endpoint detection measure; finally, a fuzzy C-means clustering algorithm delineates the double threshold.
Regarding the study of MFCC denoising, Li et al. [8] compared the different speech characteristics of MFCC and LPCC in 2010 and concluded that MFCC was better than LPCC in both noise masking and noise resistance. Zhou [9] proposed a kind of spectral subtraction on the representative vector to eliminate the influence of musical noise in MFCC parameter extraction; experiments verified that this method can better eliminate burr noise. Hu et al. [10] reconstructed the spectrum in the FFT step of MFCC extraction by applying a noise-compensation reconstruction method to the harmonic characteristics of the voiced spectrum. Jin et al. [11] combined a speech-denoising WaveNet with MFCC, adding a fully connected layer and a convolution layer to the dilated convolution layers of the original model.
Regarding algorithmic improvements to MFCC extraction, Wen et al. [12] proposed a new MidMFCC parameter in the field of emotion recognition. This parameter combines the part of IMFCC in the 0–2000 Hz band with the part of MFCC in the 2000–4000 Hz band. Through this feature, the recognition rate of the system can be improved to a certain extent.

The feature extraction of the speech signals
The extraction process of MFCC is as follows: first, the extracted speech data are preprocessed; then a fast Fourier transform (FFT) is applied; the frequency scale is converted in the Mel filter bank; and finally the logarithm and a discrete cosine transform (DCT) are taken [13]. The specific process is shown in Figure 1.

Figure 1 Flow chart of MFCC extraction
The speech set used in this paper is the THCHS-30 speech corpus released by Tsinghua University, recorded from 2000 to 2001. It records 10,893 sentences from 30 male and female speakers for training. All samples in the corpus were recorded through a carbon microphone in a quiet office environment. The speech format is WAV, the sampling frequency is 16 kHz, the sample size is 16 bits, and the language is Chinese.

Preprocessing
The preprocessing process is shown in Figure 2.
Figure 2 The basic process of speech signal analysis and processing
Digital processing: The voice signal first passes through a band-pass filter and is then sampled at fixed intervals, so that the original continuous time-domain signal becomes a discrete sequence. Finally, A/D conversion turns the voice signal into binary digital codes: the sampled amplitude values are quantized, and amplitudes are assigned to the same level according to the quantization step.
(1) Pre-emphasis: A first-order digital filter is used to boost the power of the high-frequency components above 800 Hz, so that the spectrum keeps the same signal-to-noise ratio and the speech signal remains flat in subsequent processing. Its transfer function is

H(z) = 1 - a z^(-1),

where a is the pre-emphasis coefficient (here a = 0.97, as chosen in the pre-processing results below).
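As a minimal sketch (not the paper's own code), the pre-emphasis step can be implemented in NumPy as follows; the coefficient a = 0.97 matches the value chosen later in the paper:

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """First-order pre-emphasis: y(n) = x(n) - a*x(n-1), i.e. H(z) = 1 - a*z^(-1)."""
    return np.append(x[0], x[1:] - a * x[:-1])

x = np.ones(5)         # a constant (purely low-frequency) signal
y = pre_emphasis(x)    # the DC content is strongly attenuated after the first sample
```

Applied to real speech, this boosts the high-frequency part of the spectrum while leaving low frequencies nearly untouched.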
(2) Windowing and framing: The Hamming window is selected as the window function. The Hamming window is defined as

w(n) = 0.54 - 0.46 cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1,

where N is the window length. Framing is carried out together with windowing: the window length is determined according to the sampling period, the voice signal is divided into segments, and each segment is weighted by the selected window:

s_w(n) = s(n) · w(n).
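The windowing-and-framing step can be sketched as below; the frame length and shift of 400 and 160 samples (25 ms and 10 ms at 16 kHz) are illustrative assumptions, not values stated in the paper:

```python
import numpy as np

def frame_signal(signal, frame_len, frame_shift):
    """Split the signal into overlapping frames, then weight each by a Hamming window."""
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(num_frames)])
    # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * window

sig = np.arange(16000, dtype=float)                        # 1 s of "speech" at 16 kHz
frames = frame_signal(sig, frame_len=400, frame_shift=160) # 98 frames of 400 samples
```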
(3) Endpoint detection: Short-time energy and short-time zero-crossing rate methods are adopted to remove high-frequency ambient noise and low-frequency interference and to carry out endpoint detection. Let the n-th preprocessed speech frame be x_n(m); its short-time energy E_n is calculated as

E_n = Σ_{m=0}^{N-1} x_n(m)².
(4) The short-time zero-crossing rate counts the number of times the signal waveform crosses the zero level; in practice it is computed on the discrete signal. Let the short-time zero-crossing rate of the speech frame x_n(m) be Z_n, defined as

Z_n = (1/2) Σ_{m=1}^{N-1} |sgn[x_n(m)] - sgn[x_n(m-1)]|,

where sgn[·] is the sign function:

sgn[x] = 1 for x ≥ 0, and sgn[x] = -1 for x < 0.

After digital processing, pre-emphasis, windowing, and framing, the time-domain voice signal x_n(m) of each frame is obtained.
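A minimal NumPy sketch of the two endpoint-detection measures (treating sgn(0) as +1, one common convention); the two test frames are synthetic illustrations:

```python
import numpy as np

def short_time_energy(frame):
    # E_n = sum over m of x_n(m)^2
    return np.sum(frame ** 2)

def zero_crossing_rate(frame):
    # Z_n = 1/2 * sum |sgn(x(m)) - sgn(x(m-1))|
    s = np.where(frame >= 0, 1, -1)
    return 0.5 * np.sum(np.abs(np.diff(s)))

fs, n = 16000, np.arange(400)
voiced = np.sin(2 * np.pi * 100 * n / fs)          # low-frequency, high-energy
noise = 0.01 * np.sin(2 * np.pi * 6000 * n / fs)   # high-frequency, low-energy
```

Voiced speech tends to have high energy and a low zero-crossing rate; high-frequency noise shows the opposite pattern, which is what the double measure exploits.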

Fast Fourier transform (FFT)
The preprocessed speech time-domain signal x_n(m) undergoes a fast Fourier transform (FFT), which converts the time-domain signal into the frequency domain, giving the spectrum X_n(k) and its amplitude spectrum |X_n(k)|. The DFT of the voice signal is

X_n(k) = Σ_{m=0}^{N-1} x_n(m) e^(-j2πmk/N), 0 ≤ k ≤ N - 1.
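The FFT step on one windowed frame can be sketched as follows; the 512-point FFT size and the 1 kHz test tone are illustrative assumptions:

```python
import numpy as np

fs, N = 16000, 400
n = np.arange(N)
frame = np.sin(2 * np.pi * 1000 * n / fs)   # a 1 kHz test frame

N_FFT = 512                                  # zero-pad the frame to the FFT size
X = np.fft.rfft(frame, n=N_FFT)              # X(k), k = 0 .. N_FFT/2
mag = np.abs(X)                              # amplitude spectrum |X(k)|

peak_bin = int(np.argmax(mag))
peak_hz = peak_bin * fs / N_FFT              # the peak lands at 1000 Hz
```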

Mel filter bank
The resulting amplitude spectrum |X_n(k)| is passed through the Mel filter bank H_m(k), and the log energy output S(m) of each filter is

S(m) = ln( Σ_{k=0}^{N-1} |X_n(k)|² H_m(k) ), 0 ≤ m < M,

where M is the number of Mel filters.
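The triangular Mel filter bank H_m(k) can be constructed as below, using the common Mel mapping Mel(f) = 2595·log10(1 + f/700); the 512-point FFT and 26 filters are illustrative assumptions:

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, fs=16000):
    """Triangular filters equally spaced on the Mel scale."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):              # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):             # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

fb = mel_filterbank()   # shape: (26 filters, 257 frequency bins)
```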

Discrete cosine transform
Finally, a discrete cosine transform (DCT) converts the parameters S(m) into the cepstral domain, yielding the Mel cepstral coefficients C(n), which are the MFCC feature parameters to be extracted:

C(n) = Σ_{m=0}^{M-1} S(m) cos(πn(m + 0.5) / M), n = 0, 1, 2, ..., P - 1, (10)

where P is the MFCC order.
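Equation (10) is a type-II DCT and can be computed directly; the 26 hypothetical log filter energies below are illustrative, not data from the paper:

```python
import numpy as np

def mfcc_from_logmel(S, order=13):
    """C(n) = sum_{m=0}^{M-1} S(m) * cos(pi*n*(m+0.5)/M)  (DCT-II),
    keeping only the first `order` coefficients (the MFCC order P)."""
    M = len(S)
    m = np.arange(M)
    return np.array([np.sum(S * np.cos(np.pi * p * (m + 0.5) / M))
                     for p in range(order)])

S = np.log(np.linspace(1.0, 10.0, 26))   # hypothetical 26 log filter energies
c = mfcc_from_logmel(S, order=13)        # 13th-order MFCC vector
```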

Pre-processing results
The Python code reads the original speech of the selected samples, and the energy change over time is presented as a waveform, as shown in Figure 3. As can be seen from Figure 3, part of the unprocessed speech signal has very high energy. A pre-emphasis operation is required here so that the same signal-to-noise ratio can be used when performing the spectrum conversion of the speech signal; the coefficient a is set to 0.97. After pre-emphasis, the overlapping-frame method is used to window and frame the voice signal. The scheme is shown in Figure 4, where N is the frame length and M is the frame shift.
Figure 4 The overlapping frame method

The CNN network model construction
To make better use of the above data set for training and testing, a one-dimensional CNN is used to build the model. There are four convolution layers, with kernel lengths of 1, 3, 3, and 5; the activation function is ReLU, the optimizer is Adam, the classifier is softmax, and the evaluation criterion is accuracy. The specific network structure is shown in Figure 5.
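The paper builds the network in a deep-learning framework; the framework-free NumPy sketch below only illustrates the basic operations (1-D convolution with the stated kernel lengths, ReLU, softmax) on a single hypothetical feature vector, not the trained model itself. The input length of 40 and the 30 output classes are assumptions for illustration:

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D convolution (cross-correlation), as in a CNN layer."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(40)            # one MFCC feature vector (length assumed)
for k in (1, 3, 3, 5):                 # the four kernel lengths from the paper
    x = relu(conv1d(x, rng.standard_normal(k)))
logits = rng.standard_normal(30)       # e.g. 30 speaker classes
probs = softmax(logits)                # class probabilities summing to 1
```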

Exploring the influence of C0 and data set division in MFCC
In order to better display the specific feature form of MFCC, the preprocessed samples selected in Figure 4 are visualized in 3D after extracting the MFCC feature parameters, where the abscissa represents the MFCC order, the ordinate represents the frame number, and the vertical coordinate represents the cepstral value. As shown in Figure 7 (left), the first-dimension value C0 of MFCC is very large and is generally not used as part of the feature parameters; usually, C0 is replaced with the log value of the energy. The MFCC parameters after replacement are shown in Figure 7 (right). It can be seen that the MFCC after replacing C0 is more consistent with the distribution of the cepstral values. In order to explore the influence of C0 substitution and data set division on the identification effect of the test set, the data set is divided into four groups, with training-to-test ratios of 1:1, 2:1, 3:1, and 4:1. The experiment compares the two MFCC feature parameters, and the experimental data are shown in Table 1. As can be seen from Table 1, after replacing C0 with the log energy value, the test-set recognition rate is better than without replacement, because the direct-current component C0 is otherwise too large and affects the identification efficiency.
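The C0 substitution can be sketched as the hypothetical helper below (the function name and the frame/order dimensions are assumptions for illustration): the first cepstral coefficient of each frame is overwritten with the log of that frame's short-time energy.

```python
import numpy as np

def replace_c0(mfcc, frames):
    """Replace C0 of each frame with the log of the frame's short-time energy.
    mfcc: (num_frames, order), frames: (num_frames, frame_len) windowed frames."""
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)  # avoid log(0)
    out = mfcc.copy()
    out[:, 0] = log_energy
    return out

rng = np.random.default_rng(1)
mfcc = rng.standard_normal((98, 13))     # 98 frames of 13th-order MFCC (dummy data)
frames = rng.standard_normal((98, 400))  # matching windowed frames (dummy data)
mfcc2 = replace_c0(mfcc, frames)
```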

Analysis of the influence of the number of Mel filters and the MFCC order
The information-expression performance of the extracted MFCC speech feature parameters is related to the number of band-pass filters M in the filter bank and to the MFCC order P set when performing the DCT. Therefore, the 10,000 speech samples are divided into a training set of 7,500 and a test set of 2,500; under different M and P conditions, the data set is imported into the model for training, the trained model is then used to identify the test set, and the total accuracy over the 25 speakers in the sample is recorded. The experimental results are presented in Table 2.

As can be seen from Table 2, when the order P of the MFCC remains unchanged, changing the number of filters M in the Mel filter bank causes only a small change in accuracy, with no substantial upward or downward trend. However, when the number of filters M is fixed and the MFCC order P varies, the recognition rate of the test set increases considerably with the growth of P. This shows that the last step of MFCC extraction, the discrete cosine transform (DCT), plays a decisive role in the expressiveness of MFCC. The DCT removes the correlation of the original feature vector, but at the same time it truncates it; after the DCT, the MFCC achieves the minimum mean square error, letting the extracted feature vector retain the energy of the original feature vector to the greatest extent. With increasing order P, less of the original feature vector is truncated during the DCT, and more of the speech information contained in the original feature vector can be retained, which is why the test-set recognition rate increases with the order P.
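The truncation argument can be illustrated numerically: with an orthonormal DCT (which preserves total energy by Parseval's theorem), keeping the first P coefficients retains a monotonically increasing share of the feature vector's energy. The 40-dimensional random vector below is a hypothetical stand-in for a feature vector, not data from the paper:

```python
import numpy as np

def dct2_ortho(s):
    """Orthonormal DCT-II applied to s; total energy is preserved."""
    M = len(s)
    n = np.arange(M)[:, None]
    m = np.arange(M)[None, :]
    C = np.sqrt(2.0 / M) * np.cos(np.pi * n * (m + 0.5) / M)
    C[0, :] /= np.sqrt(2.0)
    return C @ s

rng = np.random.default_rng(2)
s = rng.standard_normal(40)              # hypothetical 40-dim feature vector
c = dct2_ortho(s)
total = np.sum(s ** 2)
retained = np.cumsum(c ** 2) / total     # energy kept by the first P coefficients
# retained grows with P and reaches 1 when no coefficient is truncated
```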
Meanwhile, Figure 8 shows that the recognition rate increases with the MFCC order P, but the curve changes very little and remains nearly constant after P = 24. When the MFCC order reaches a certain level, although the total test-set recognition rate still increases with the order, the improvement becomes relatively small. This means that once the truncated part of the original feature vector is short enough, the retained MFCC already includes essentially all of the speaker's unique information; continuing to increase the order P to reduce the truncated part can still raise the recognition rate, but only very slightly.
In realizing voiceprint recognition, in addition to the total recognition rate, the running speed of the model itself also deserves sufficient attention. The above experiments show that the recognition rate is affected only slightly by the number of filters but greatly by the MFCC order, so the following experiment keeps the number of filters M fixed at 30 and changes the MFCC order P, exploring the change in model training and testing time. The specific results are given in Table 3. As the MFCC order P grows, the running time of the model increases roughly linearly; from the data in Table 3, the growth factor can be calculated to be about 20. That is to say, as the MFCC order grows beyond a certain point, the recognition rate still increases but its rate of change decreases, while the running time required for model training and testing continues to grow at this factor.
Considering the above, the optimal choice is an MFCC order of P = 24. When the MFCC order is 24, the accuracy is at a relatively high level. When the order is less than 24, the recognition rate still changes considerably as the order grows, and the gap between the recognition rates of two adjacent orders is obvious. When the order is greater than 24, the accuracy difference between two different orders is small, but the gap in running time is obvious.

Analysis of MFCC dynamic properties
To test how much effect the differences of MFCC have on MFCC expressiveness in voiceprint recognition, and how MFCC and its two differences perform in terms of noise resistance, artificial noise is added to the original clean speech, and speech sets with SNRs of 20, 15, 10, 5, and 0 dB are used for testing. The experimental results are shown in Table 4; the previously selected MFCC parameter configuration was used for testing in this section.
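Mixing noise into clean speech at a prescribed SNR can be sketched as below (a minimal illustration, assuming Gaussian noise and the SNR definition of Equation (11); the 440 Hz test tone is a stand-in for real speech):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_signal / P_noise) = snr_db, then mix."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    g = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + g * noise

rng = np.random.default_rng(3)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=20)

# verify the achieved SNR of the mixture
achieved = 10 * np.log10(np.mean(speech ** 2) /
                         np.mean((noisy - speech) ** 2))
```

Setting snr_db = 0 makes the signal and noise powers equal, matching the hardest test condition in Table 4.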
As can be seen from Equation (11), the SNR decreases as the effective power of the noise increases. Therefore, for a segment of speech, a higher SNR means smaller noise, purer sound, and higher sound quality. When SNR = 0 dB, the signal power and the noise power are equal.
In order to observe the anti-noise performance of each MFCC feature-parameter combination more clearly, Table 4 is drawn as a bar chart, shown in Figure 9. As can be seen from Figure 9, with the decrease of SNR, the recognition rate of the test set under the three feature-parameter combinations gradually decreases; but at SNR = 0 dB, when the signal energy equals the noise energy, the lowest recognition rate still reaches 85.86%, and the average recognition rate exceeds 90%, indicating that the anti-noise performance of MFCC itself is excellent. However, after adding the first- and second-order differences to MFCC, although the test-set recognition rate is higher than when using MFCC alone as the identification feature, the downward trend of the recognition rate does not slow down as the artificial noise power increases. Therefore, it can be concluded that MFCC has excellent noise resistance, and although adding differences to MFCC can increase the recognition rate of the speaker recognition model, it does not further improve the noise-resistance performance of MFCC.

Figure 3 Waveform map of the speech signal

Figure 5 The network structure model for voiceprint recognition

Identification model testing based on MFCC and CNN
In order to test whether the constructed MFCC extraction program and the CNN speaker recognition program run normally, an MFCC test experiment was performed first. In this experiment, 26 Mel filters were set in the MFCC extraction program, and the final output was the 13th-order MFCC feature parameters. After the samples in the 10,000-utterance speech data set are preprocessed and the MFCC features extracted, they can be imported into the built CNN voiceprint recognition model. The CNN recognition program is the same as that in Figure 5, and 100 training batches of training-set samples (including the validation set) are set. The final test results are shown in Figure 6.

Figure 6 Sample model accuracy (left panel) and loss (right panel)
As seen from the curves in Figure 6, the accuracy and loss of the model changed greatly during training. At the beginning of training, the accuracy of the model rose almost in a straight line; then, as the training batch increased, the rate of change gradually leveled out. The loss likewise drops almost straight down at first and then gradually flattens. After 20 training epochs, the accuracy and loss no longer increased or decreased substantially, indicating that the model reached its maximum recognition capability. As the number of training batches increases further, the final training-set accuracy and loss converge and do not differ much. From these two graphs, we can determine that the recognition model and the extracted feature parameters used in this paper work normally.

Figure 9 Effect of MFCC differences and noise on the recognition rate

Table 1 Effect of C0 and the data set division on the test set accuracy

Table 2 Effect of the number of Mel filters and the MFCC order on the test set accuracy
According to the data in the above table, the influence of different numbers of Mel filters and of changing the MFCC order on the total recognition rate of the test set is drawn in Figure 8.

Table 3 The model operation time changes with the MFCC order

Table 4 Effect of MFCC differences and noise on the test set recognition rate
When a speech signal is mixed with environmental noise, the signal-to-noise ratio (SNR) is usually used to evaluate the purity of a segment of the speech signal. SNR refers to the ratio of signal to noise in an electronic device. Its unit is often expressed in dB, and it is defined as

SNR = 10 lg(P_signal / P_noise), (11)

where P_signal represents the effective power of the speech signal and P_noise represents the effective power of the noise.