Implementation of Speech Recognition Using the Mel-Frequency Cepstrum Coefficient (MFCC) and Support Vector Machine (SVM) Methods Based on Python to Control a Robot Arm

This paper describes an implementation of speech recognition for picking and placing an object with a robot arm. The Mel-Frequency Cepstrum Coefficient (MFCC) method is used to extract features from the speech signal, and the Support Vector Machine (SVM) method is used to learn the speech database; the algorithms are implemented in Python 2.7. The training data used in the SVM process consist of 12 features per utterance. The system, tested with both trained and untrained data, shows good agreement in identifying the spoken commands. The speech recognition system has been implemented to control a 5 DoF robot arm based on an Arduino microcontroller to perform the task of picking and placing an object.


Introduction
Speech control, usually called speech recognition, is a method of controlling a device by human voice. This method is often used in robotic systems, for example to assist people with disabilities. Developing speech recognition requires methods to identify the speech signal, namely feature extraction and machine learning.
This study describes voice signal processing using the Mel-Frequency Cepstrum Coefficient (MFCC) and Support Vector Machine (SVM) methods implemented in Python 2.7. Finally, the system is implemented to control a 5 Degree of Freedom (DoF) robot arm, based on an Arduino microcontroller, to pick and place an object. The paper is organized as follows. Section 2 describes the theoretical background of MFCC and SVM in detail. Section 3 describes the method and system design. Section 4 describes the hardware design of the robot arm. Section 5 describes the speech recognition application in detail. Finally, Section 6 gives the concluding remarks.

Feature extraction using Mel Frequency Cepstrum Coefficient (MFCC) method
Mel Frequency Cepstrum Coefficient (MFCC) is a feature extraction method for voice signals. Feature extraction is the process of determining a value or vector that can be used as the identity of an object or individual. MFCC is the most widely used method in the voice processing field because it is considered to represent the signal well [12].
The features are the cepstral coefficients, which take into account the perception of the human hearing system. MFCC works on the range of frequencies that can be captured by the human ear, so the sound signal is represented much as humans perceive it. The MFCC process block diagram can be seen in Figure 1. After the sampling process, the speech signal is passed through a pre-emphasis filter. The purpose of this filtering is to obtain a smoother spectral shape of the speech signal frequency; in other words, it reduces noise introduced during sound capture, since the spectrum has relatively high values in the low-frequency region and tends to fall sharply above 2000 Hz [18]. The pre-emphasis filter is based on the input/output relationship in the time domain expressed in Equation 1,

$y(n) = x(n) - a\,x(n-1)$,   (1)

where a is the pre-emphasis filter constant, usually 0.9 < a < 1.0.
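As an illustration, a minimal pre-emphasis filter following Equation 1 can be written in Python with NumPy; the default coefficient a = 0.97 is an assumed value, not one specified in the paper.

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    """Apply the first-order pre-emphasis filter y(n) = x(n) - a*x(n-1)."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```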

Frame blocking.
In this process, the sound signal is segmented into multiple overlapping frames so that no part of the signal is lost. The process continues until every sample belongs to one or more frames, as illustrated. Voice analysis is done by short-time analysis: the long voice signal x[n] is divided into a number of frames, where one frame contains N voice data samples and adjacent frames overlap each other by M samples. The value of M is smaller than N; typically N = 2M, i.e. a 50% overlap.
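A short sketch of frame blocking under the 50% overlap assumption described above; the frame length of 400 samples (25 ms at 16 kHz) and the hop of 200 samples are illustrative values only.

```python
import numpy as np

def frame_blocking(signal, frame_len=400, hop=200):
    """Split the signal into overlapping frames of N samples, advancing by N - M samples each time."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
```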

Windowing.
Windowing is a process for analyzing long sound signals by taking a sufficiently representative section. Windowing is a Finite Impulse Response (FIR) digital filter approach. This process reduces the spectral leakage caused by the discontinuities at the edges of the signal pieces; these discontinuities are a result of the frame blocking process. If we define the window as w(n), 0 ≤ n ≤ N - 1, where N is the number of samples in each frame, the result of windowing is the signal

$y(n) = x(n)\,w(n)$,   (2)

where y(n) is the windowed signal and x(n) is the input signal to be multiplied by the window function. The window w(n) is usually the Hamming window, which has the form

$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right)$.   (3)
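A minimal sketch of applying the Hamming window of Equation 3 to every frame produced by the frame blocking step:

```python
import numpy as np

def apply_hamming(frames):
    """Multiply each frame by the Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return frames * np.hamming(frames.shape[1])
```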

Fast Fourier Transform (FFT).
A periodic function can be expressed as a Fourier series, and the Fourier transform is used to convert a bounded time-domain signal into a frequency spectrum. Each frame that has undergone the windowing process is therefore converted into a frequency spectrum. The FFT is a fast algorithm for the Discrete Fourier Transform (DFT) that converts each frame of N samples from the time domain into the frequency domain; it reduces the repeated multiplications contained in the DFT.
The transform is given by Equation 4,

$X[n] = \sum_{k=0}^{N-1} x[k]\, e^{-j2\pi kn/N}$,   (4)

where n = 0, 1, 2, ..., N-1 and j = sqrt(-1). X[n] is the n-th frequency component generated by the Fourier transform and x[k] is the signal of a frame. The result of this stage is usually called the spectrum or periodogram.
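The spectrum of each windowed frame can be computed with NumPy's FFT; the 512-point FFT length in this sketch is an assumption, not a value reported in the paper.

```python
import numpy as np

def power_spectrum(frames, nfft=512):
    """Compute the periodogram |X[n]|^2 / NFFT of each windowed frame (Equation 4)."""
    return np.abs(np.fft.rfft(frames, n=nfft)) ** 2 / nfft
```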
The frequency spectrum is then mapped onto the Mel scale,

$F_{mel} = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$,   (5)

where Fmel is the frequency on the Mel scale and f is the frequency in Hz. One approach to representing the frequency spectrum on the Mel scale, imitating the way the human ear works as a filter, is the filter bank. If the spectrum F[N] is the input of this process, the output is the spectrum M[N], i.e. the modified F[N] spectrum containing the power output of these filters. The number of Mel spectrum coefficients is denoted by K and is typically set to 20.
In Mel-frequency wrapping, the FFT result is grouped into these triangular filters. Each FFT value is multiplied by the corresponding filter gain and the results are summed, so each group contains a certain weight of signal energy, denoted m1...mp. The wrapping of the signal in the frequency domain is done using Equation 6,

$X_i = \log_{10}\!\left(\sum_{k=0}^{N-1} \lvert F(k)\rvert\, H_i(k)\right)$,   (6)

where i = 1, 2, 3, ..., M (M is the number of triangular filters) and Hi(k) is the value of the i-th triangular filter at acoustic frequency k.
Finally, the Mel spectrum is converted back with a discrete cosine transform to obtain the cepstral coefficients, as shown in Equation 7,

$C_j = \sum_{i=1}^{M} X_i \cos\!\left(j\,\left(i - \tfrac{1}{2}\right)\frac{\pi}{M}\right)$,   (7)

where Cj is the j-th MFCC coefficient, Xi is the Mel-frequency power spectrum from Equation 6, j = 1, 2, 3, ..., K (K is the number of desired coefficients) and M is the number of filters.
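A sketch of how Equations 5-7 could be combined to turn the periodogram frames into 12 MFCC coefficients; the sampling rate, FFT length and the 20 triangular filters are assumptions consistent with the text, not values reported in the paper.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    """Equation 5: convert a frequency in Hz to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power(power_frames, sample_rate=16000, nfft=512, n_filters=20, n_ceps=12):
    """Apply a triangular Mel filter bank (Equation 6) and a DCT (Equation 7) to each frame."""
    # Filter centre frequencies, equally spaced on the Mel scale, mapped back to FFT bins.
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)

    # Equation 6: log filter-bank energies; Equation 7: DCT to obtain the cepstral coefficients.
    energies = np.log10(np.dot(power_frames, fbank.T) + 1e-10)
    return dct(energies, type=2, axis=1, norm='ortho')[:, :n_ceps]
```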

Machine learning using Support Vector Machine (SVM) method
The Support Vector Machine (SVM), first introduced by Boser et al., is a popular kernel-based discriminative classification algorithm. The concept of SVM can be explained simply as the search for the best hyperplane that separates two classes in the input space. SVMs have been used for various machine learning tasks, such as object recognition, speech recognition, handwritten character recognition, speaker recognition and language recognition. SVM is a binary classification algorithm whose decision function is a sum of kernel functions k(xi, xj) [20]:

$f(x) = \sum_{i=1}^{N} \alpha_i\, k(x, x_i) + b$

After multiple iterations on the training and test data, the optimal hyper-parameters and the regularization constant C are selected for the SVM.
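A minimal sketch of training and tuning such a classifier in Python with scikit-learn; the RBF kernel and the parameter grid are assumptions for illustration, not the settings reported in the paper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_svm(features, targets):
    """Fit an SVM on the MFCC feature vectors and select C and gamma by cross-validation."""
    grid = GridSearchCV(SVC(kernel='rbf'),
                        param_grid={'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1.0]},
                        cv=5)
    grid.fit(features, targets)   # features: (n_samples, 12) array, targets: 0 or 1
    return grid.best_estimator_
```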

Methods
The main tools and components used in this research are a robot arm, a microphone, a personal computer, an Arduino microcontroller, connecting cables, and others. The algorithms are written in the Arduino IDE and Python 2.7. Figure 3 shows the overall process of speech recognition for picking and placing an object with the robot arm, based on Python 2.7.
As Figure 3 shows, once the system starts and records speech, the processing is divided into two stages. The first stage builds the training data: features are extracted using MFCC and the SVM method classifies the speech commands "Ambil" (pick) and "Simpan" (place) in Bahasa Indonesia. The second stage is the testing process: MFCC features are extracted from a new utterance and matched against the trained data, and the matching result gives the speech classification. Based on this classification, the robot arm moves to pick or place an object as commanded. All processes work in real time based on Python 2.7 and the Arduino microcontroller.
Figure 4 shows the design and realization of the robot arm used in this research, which consists of five servo motors (5 DoF) connected to the Arduino microcontroller. As shown in Figure 5, each servo has its own 5 V, 100 mA battery supply in order to obtain better performance, and the ground of each servo must be connected to the ground of the Arduino microcontroller. The servos are divided by function: Servo 1 acts as the base and rotates horizontally, connected to pin 8; Servo 2 acts as the shoulder and rotates vertically, connected to pin 9; Servo 3 acts as the elbow and rotates vertically, connected to pin 10; Servo 4 acts as the wrist and rotates horizontally, connected to pin 11; Servo 5 acts as the gripper that grasps and releases the object, connected to pin 12.
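A sketch of how the classification result could be forwarded from Python to the Arduino over a serial link using pySerial; the single-byte protocol, port name and baud rate are assumptions, since the paper does not specify them.

```python
import serial

# Hypothetical single-byte protocol: b'1' = Ambil (pick), b'0' = Simpan (place).
arduino = serial.Serial('/dev/ttyACM0', 9600, timeout=1)   # port name is an assumption

def send_command(label):
    """Send the SVM class label to the Arduino, which drives the five servos accordingly."""
    arduino.write(b'1' if label == 1 else b'0')
```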

Feature extraction database using MFCC
In this section, to obtain a robot system that can understand human speech commands, the first step is to build the speech feature extraction database. The speech commands used to control the robot arm are "Ambil" (pick) and "Simpan" (place) in Bahasa Indonesia. In this paper, the database is built from 12 extracted features per utterance, and each command is recorded 10 times. Table 1 shows an example of the speech recognition feature extraction database. The database consists of the 12 extracted features and the target value: the features serve as the identity of each utterance, while the target value becomes the input label for the SVM method used to control the robot arm. Target "0" is the value for the command "Simpan" (place the object), while target "1" is the value for the command "Ambil" (pick the object). The collected database is classified by the SVM method and is then called the trained data.
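As an illustration, the database could be assembled in Python as below; the librosa library, the averaging of frame-level MFCCs into a single 12-element vector per recording, and the file naming convention are all assumptions, since the paper does not state how the 12 features per utterance were obtained.

```python
import glob
import numpy as np
import librosa   # assumed here as a convenient MFCC implementation

def build_database():
    """Build the (features, targets) database: 12 MFCCs per recording; 1 = Ambil, 0 = Simpan."""
    features, targets = [], []
    for label, pattern in [(1, 'ambil_*.wav'), (0, 'simpan_*.wav')]:   # hypothetical file names
        for path in glob.glob(pattern):
            y, sr = librosa.load(path, sr=None)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)         # shape (12, n_frames)
            features.append(mfcc.mean(axis=1))                         # one 12-element vector
            targets.append(label)
    return np.array(features), np.array(targets)
```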

Speech recognition system test
Before the system test, an interface based on Python 2.7 was built to make the speech recognition system user friendly, as shown in Figure 6. The interface consists of a menu (Rekam/Record and Keluar/Exit) to operate the program, a shell window to monitor the result of the speech recognition, and a graphic window to display the waveform of the recognized speech.
(a) Waveform for the "Ambil" command; (b) waveform for the "Simpan" command.
After the speech database is built into the trained data, the trained data are tested by trained respondents for verification. The test results in Table 2 show that the average accuracy of speech recognition for trained respondents (in the database) is 80%, while for untrained respondents (outside the database) the accuracy is 70%.

Speech recognition implementation on the robot arm
Once the trained speech data are successfully identified and classified, the system is implemented on the 5 DoF robot arm to perform the task of picking (Ambil) and placing (Simpan) the object. In testing, the speech recognition control of the robot arm works well to pick and place the object, as shown in Figure 7.
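A sketch of one recognition cycle as it might run during this implementation, assuming the trained classifier and serial connection from the previous sketches; the sounddevice library, the 2-second recording window and the sampling rate are assumptions for illustration.

```python
import librosa
import sounddevice as sd   # assumed here for microphone capture

def recognize_once(clf, arduino, fs=16000, seconds=2):
    """Record one utterance, extract 12 MFCCs, classify it and forward the command to the robot arm."""
    audio = sd.rec(int(seconds * fs), samplerate=fs, channels=1, dtype='float32')
    sd.wait()
    mfcc = librosa.feature.mfcc(y=audio[:, 0], sr=fs, n_mfcc=12).mean(axis=1)
    label = int(clf.predict(mfcc.reshape(1, -1))[0])    # 1 = Ambil (pick), 0 = Simpan (place)
    arduino.write(b'1' if label == 1 else b'0')
    return label
```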

Conclusion
This study has presented the development of a robot arm controlled by speech recognition to pick and place an object. The speech recognition system, based on Python 2.7 and using the MFCC and SVM methods, works successfully according to the speech commands. The resulting speech recognition has a high average accuracy rate: 80% for trained respondents and 70% for untrained respondents. The system, implemented on a 5 DoF robot arm based on an Arduino microcontroller, works effectively to pick and place an object. Future work will focus on combining speech recognition with a social robot for human-robot interaction.