Evaluation of Short-Term Cepstral Based Features for Detection of Parkinson’s Disease Severity Levels through Speech Signals

Parkinson’s disease (PD) is a progressive neurodegenerative disease, classed as a motor system disorder, that is caused by the death of dopamine-generating cells in a region of the human midbrain. PD normally affects people over 60 years of age and currently affects a large part of the worldwide population. Recently, many researchers have shown interest in the connection between PD and speech disorders. Studies have revealed that speech signals may be a suitable biomarker for distinguishing people with Parkinson’s (PWP) from healthy subjects, so speech signals can be considered for the early diagnosis of PD. In this research, speech data are acquired and speech behaviour is used as the biomarker for differentiating PD severity levels (mild and moderate) from healthy subjects. The feature extraction algorithms applied are Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), and Weighted Linear Prediction Cepstral Coefficients (WLPCC). For classification, two types of classifiers are used: k-Nearest Neighbour (KNN) and Probabilistic Neural Network (PNN). The experimental results demonstrate that the PNN and KNN classifiers achieve best average classification performances of 92.63% and 88.56% respectively under 10-fold cross-validation. The suggested techniques therefore have the potential to become promising new tools for PD detection with high performance.


Introduction
Speech is a complicated task that involves parallel and sequential control of numerous mechanisms and systems in a highly detailed and refined way. In the anatomy of the human speech system, the lungs are the primary source of speech production: they produce sufficient airflow through the glottis to set the vocal folds vibrating. When the vocal folds vibrate, they generate a source excitation signal that carries the properties of the pressure wave expelled from the lungs. The source signal then passes through the vocal tract, which filters it and shapes its spectral envelope to form a speech signal [1, 2]. As the population ages, Parkinson's disease (PD) ranks as the second most common progressive neurodegenerative disorder in the world, and it mainly affects people above 60 years old. According to statistics from the World Health Organization (WHO), an estimated seven to ten million people worldwide currently live with Parkinson's (PWP). This number is predicted to double in the coming years, as age is the leading risk factor for the onset of PD [3][4][5][6]. At present, the cause of PD is still unidentified and there is no cure, although the disease can be managed with drugs such as levodopa. The characteristic features of PD include various motor and non-motor deficits such as changes in behaviour and sensation, resting tremor, bradykinesia, akinesia, postural instability, and hypokinetic dysarthria, which is a speech impairment [7]. Existing techniques for the diagnosis of PD rely on the patient's medical history and a neurological examination conducted either through observation by a neurologist or with technological tools such as electromyography (EMG), electroencephalography (EEG), or brain imaging modalities such as magnetic resonance imaging (MRI) and computed tomography (CT) scans [8][9][10].
Vocal impairment is one of the earliest indicators of the onset of PD; it is estimated that nearly 90% of PWP show speech deficiencies. Typical symptoms of speech impairment include difficulty pronouncing words, decreased voice tone, dysphonia (failure to produce normal vocal sounds) and dysarthria (inability to articulate speech normally). Speech recording is simple and non-invasive, which is the core benefit of introducing it as a biomarker for the assessment of PD. Detecting voice changes in PWP would allow earlier intervention, before the onset of disabling physical symptoms. Changes in voice might also serve as an easily assessable and objective proxy for determining PD severity and monitoring the early trajectory of PD progression. Little et al. have proposed several dysphonia measures for differentiating the speech signals of PWP from those of healthy subjects. These measures have the potential to enhance current speech-based techniques [11][12][13]. Previous work is summarised in Table 1, which covers voice data acquisition, feature extraction, classification and performance evaluation [14][15][16][17][18][19][20][21][22]. These studies show that speech signals can be used to differentiate between PWP and healthy subjects. However, the evaluation of signal processing algorithms for a speech analysis platform that distinguishes PD patients with different severity levels from healthy subjects is still in its infancy. Therefore, since more in-depth research on speech abnormalities in PWP is encouraged, this study proposes a computational framework for the objective analysis of speech, with the aim of differentiating PWP with mild or moderate severity from healthy subjects.

Methodology
The overall block diagram of the proposed approach is shown in Figure 1.

Database
In earlier research, several vocal tests have been used to evaluate vocal impairment in PWP. These include (a) sustained vowel phonation, where the speaker is asked to hold a vowel steadily for as long as possible, and (b) running speech, where the speaker is asked to read simple standard sentences [12]. Running speech can capture some vocal deficiencies, for instance in combinations of consonants and vowels, and may be considered a more realistic assessment of daily speech, but its analysis is more complicated than that of sustained phonation. Sustained vowel phonation, on the other hand, is able to provoke speech impairment symptoms, and dysphonia is best identified in the absence of the confounding effects of the articulatory or linguistic components of running speech. Both clinical practice and extensive studies have shown that assessment through sustained vowel phonation is adequate for many speech test applications, particularly for PD speech assessment. In general, the sustained vowel phonation test is easy to conduct and gives satisfactory results in discriminating PWP from healthy subjects [11][12][13]. Therefore, this research focuses on sustained phonation tests.
Speech data were collected from 30 subjects (14 males and 16 females, with an average age of 65 years). Of these, 20 participants (8 males and 12 females) are PWP and 10 participants are healthy subjects. Each subject was rated by a neurologist using the Hoehn and Yahr scale, with symptom ratings ranging from 0 to 4 (0 = normal, 1 = mild, 2 = moderate, 3 = severe and 4 = unintelligible). Out of the 30 subjects, 10 were rated '0', 10 were rated '1', and 10 were rated '2'. The experiment was conducted in an auditory room of the Neurology Department, Penang General Hospital. A short briefing about the test was given to each subject before the experiment to ensure that the subject fully understood the recording process and could co-operate fully. The voice phonations were recorded using a headset (DW Pro2 series) fitted on the subject's head, with the microphone approximately 5 cm from the mouth, while the subject performed the speaking task. The speech files were recorded using Matlab Simulink Tools at a 44.1 kHz sampling rate with 16-bit resolution and stored as '.wav' files. The speech database comprises sustained phonations of the vowels /a/ and /o/: each speaker was requested to produce the /a/ phonation three times and the /o/ phonation three times in a random sequence. For every assessment, the speaker was asked to sustain the phonation for as long as possible while attempting to maintain a steady frequency and amplitude.

Pre-processing (End-Point Detection Technique)
In speech signals, silent segments have a much lower amplitude than spoken segments, so the silent segments carry less energy than the spoken ones. End-point detection exploits this difference to separate speech segments from silence segments [24]. In this research, each voice signal is segmented by extracting its stable portion. As a sustained phonation is normally most stable in the middle of the recording, taking the central portion eliminates the silent periods at the beginning and the end of the signal. A 2 s segment from the middle, stable portion of each speech signal was chosen for the subsequent acoustic analysis in order to avoid problems during the onset and offset of phonation.
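The central-segment selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `central_segment` and its handling of recordings shorter than 2 s are assumptions, and no energy-based end-point detector is included.

```python
def central_segment(signal, sample_rate, duration_s=2.0):
    """Return the middle `duration_s` seconds of `signal` (a sequence of samples).

    Hypothetical helper: keeps the stable central portion of a sustained
    phonation and discards the onset and offset.
    """
    n_keep = int(round(duration_s * sample_rate))
    if n_keep >= len(signal):
        # Assumption: a recording shorter than the window is kept whole.
        return list(signal)
    start = (len(signal) - n_keep) // 2
    return list(signal[start:start + n_keep])


# Usage: a dummy 5 s recording at 44.1 kHz reduced to its middle 2 s.
recording = [0.0] * (5 * 44100)
segment = central_segment(recording, 44100)
```

Centring the window on the recording is the simplest choice when every phonation is sustained for well over 2 s, as the task instructions intend.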

Feature Extraction
An important step towards a better recognition result is extracting the parameters that best represent the audio signals. The speech signals were first down-sampled from a sampling frequency of 44.1 kHz to 16 kHz. Then, the signals were pre-emphasized with a coefficient α equal to 0.97. This process boosts the energy of the signal at higher frequencies. Equation (1) is used for building the pre-emphasis filter:
$$C(n) = B(n) - \alpha B(n-1) \qquad (1)$$
where C(n) is the pre-emphasized signal, B(n) is the original signal, and α is the pre-emphasis coefficient. After pre-emphasis, framing was conducted to divide the speech signal into smaller segments; overlapping frames are required to capture the subject-specific features in the speech data. Windowing is performed on each frame to smooth the abrupt discontinuities at the end points of the frame and to suppress the undesirable frequencies they introduce. In this research, the pre-processed speech signal is divided into fixed frames of 20 ms with an overlap of 50%. The Hamming window is chosen as the window shape in consideration of the next block in the feature extraction chain, as it integrates all the closest frequency lines [25].
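The pre-emphasis, framing and windowing steps can be sketched as below. This is a minimal illustration under stated assumptions: the pre-emphasis filter is taken to be the usual first-order form C(n) = B(n) - αB(n-1), the first sample is passed through unchanged, trailing samples that do not fill a whole frame are dropped, and all function names are hypothetical. Down-sampling to 16 kHz is assumed to have been done already.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """First-order pre-emphasis: C(n) = B(n) - alpha * B(n - 1)."""
    if not signal:
        return []
    # Assumption: the first sample has no predecessor and is kept as-is.
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def frame_signal(signal, sample_rate=16000, frame_ms=20, overlap=0.5):
    """Split a signal into fixed-length frames with fractional overlap."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 320 samples at 16 kHz
    hop = int(frame_len * (1 - overlap))            # 160 samples at 50% overlap
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def hamming(n_points):
    """Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def windowed_frames(signal, sample_rate=16000):
    """Pre-emphasize, frame, and apply a Hamming window to each frame."""
    frames = frame_signal(pre_emphasis(signal), sample_rate)
    if not frames:
        return []
    window = hamming(len(frames[0]))
    return [[x * w for x, w in zip(frame, window)] for frame in frames]


# Usage: a 2 s segment at 16 kHz gives 320-sample frames with a 160-sample hop.
frames = windowed_frames([1.0] * 32000)
```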

Mel Frequency Cepstral Coefficients (MFCC).
MFCC is based on human auditory perception, in which the perceived pitch of a tone does not scale linearly with its frequency above about 1 kHz. The Mel filter bank therefore uses filters that are spaced linearly in frequency below 1000 Hz and logarithmically above 1000 Hz. A subjective pitch is represented on the Mel frequency scale to extract the significant characteristics of the speech signal. MFCC is widely used in audio feature extraction due to its high sensitivity properties and the total spectral slope of the high-order cepstral coefficients. After the windowing step, the Discrete Fourier Transform (DFT) is applied to convert each frame of N time-domain samples into the frequency domain. This conversion is performed to obtain the desired frequency resolution on the Mel scale. After the DFT of the windowed signal is computed, the resulting spectrum is squared to obtain the power spectrum, which is passed through the Mel filter bank to get the Mel-scaled filter energies. Each filter in the bank has a triangular bandpass frequency response, and its output is the weighted sum of the spectral components it passes. Equation (2) converts a frequency f in Hz to the Mel scale; in its commonly used form it is Mel(f) = 2595 log10(1 + f/700). Lastly, the log Mel spectrum is transformed back into the time (cepstral) domain using the Discrete Cosine Transform (DCT). The result of the DCT is called the MFCC [25][26][27]. For this research, MFCCs with 40 coefficients were generated to study their impact on the classification accuracy of sustained vowel phonations.
Linear Predictive Coefficients (LPC).
The second technique chosen for extracting useful features is linear predictive analysis. Its coefficients reflect differences in the biological structure of the human vocal tract, and it provides an accurate, reliable and robust method for estimating the parameters that characterise the linear time-varying system representing the vocal tract. The essential concept of linear prediction is that each speech sample can be predicted as a linear combination of the past p samples, where p denotes the order of the LPC analysis and is fixed at 13 [28][29][30]. The LPC analysis is implemented with the autocorrelation method, in which the autocorrelation coefficients are converted into LPC parameters using the Levinson-Durbin recursive algorithm. The LPC parameters are then converted into Linear Prediction Cepstral Coefficients (LPCC) by a recursion from the LPC parameters to the cepstrum of the LPC spectrum. LPCC are the coefficients of the Fourier transform representation of the logarithmic magnitude spectrum, and they have been shown to be more reliable and more robust to noise than LPC. In this work, the number of LPCC features representing each frame, Q, was determined from the equation Q = (3/2)p, which for p = 13 corresponds to the 19 LPCC features used here. Next, a standard weighting is applied to the cepstral coefficients to reduce the sensitivity of the low-order coefficients to the overall spectral slope and of the high-order coefficients to noise. Weighted Linear Prediction Cepstral Coefficients (WLPCC) are obtained simply by multiplying the LPCC by the weighting formula [28,30]. Finally, the coefficients are normalised between 0 and 1 before proceeding to the classification stage.
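The MFCC chain for a single windowed frame can be sketched as below. This is a minimal illustration, not the authors' implementation: the Mel formula is the commonly used Mel(f) = 2595 log10(1 + f/700), the number of filters (40), the small constant added before the logarithm and the unnormalised DCT-II are assumptions, and a naive DFT is used only to keep the example free of external libraries.

```python
import cmath
import math


def hz_to_mel(f_hz):
    """Commonly used Hz-to-Mel conversion."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Squared magnitude of a naive DFT for bins 0..N/2."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_frame(frame, sample_rate=16000, n_filters=40, n_coeffs=40):
    """MFCCs of one windowed frame: power spectrum, Mel energies, log, DCT."""
    power = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Assumption: 1e-10 guards against log(0) for empty filters.
    log_energy = [math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
                  for row in bank]
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energy))
            for c in range(n_coeffs)]


# Usage: 40 coefficients from one 20 ms frame (320 samples at 16 kHz).
coeffs = mfcc_frame([math.sin(0.3 * i) for i in range(320)])
```

In practice the filter bank would be built once and reused for every frame; it is rebuilt here only to keep the function self-contained.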
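The LPC, LPCC and WLPCC computations for one frame can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: LPC is obtained with the autocorrelation method and the Levinson-Durbin recursion, the LPC-to-cepstrum recursion is the standard one for an all-pole model, the weighting is assumed to be the widely used raised-sine lifter w(m) = 1 + (Q/2) sin(πm/Q), and the min-max normalisation is applied to whatever list of values is passed in. The source does not give the weighting formula or the normalisation axis, so both are hypothetical.

```python
import math


def autocorrelation(frame, order):
    """Autocorrelation lags r[0..order] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..p], where x(n) is approximated by
    sum_k a[k] * x(n - k)."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # degenerate (e.g. silent) frame: stop early
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= 1.0 - k * k
    return a[1:]


def lpc_to_lpcc(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] from LPC coefficients a[1..p]."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]


def weight_cepstrum(c):
    """Assumed raised-sine lifter: w(m) = 1 + (Q/2) * sin(pi * m / Q)."""
    q = len(c)
    return [(1.0 + (q / 2.0) * math.sin(math.pi * m / q)) * cm
            for m, cm in enumerate(c, start=1)]


def min_max_normalise(values):
    """Scale a list of values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Usage with p = 13 and Q = int(1.5 * p) = 19 on a deterministic dummy frame.
p = 13
frame = [((i * 7919) % 101) / 50.0 - 1.0 for i in range(320)]
lpc = levinson_durbin(autocorrelation(frame, p), p)
lpcc = lpc_to_lpcc(lpc, int(1.5 * p))
wlpcc = min_max_normalise(weight_cepstrum(lpcc))
```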

Classification
The two techniques chosen as classifiers for this research are the Probabilistic Neural Network (PNN) and k-Nearest Neighbour (KNN). PNN is a feed-forward neural network derived from the Bayesian network. It can be described as an application of a statistical algorithm called kernel discriminant analysis, in which the operations are organised into four layers: an input layer, a hidden layer, a summation layer and an output layer. The advantages of PNN are a fast training procedure, an inherently parallel structure, and guaranteed convergence to an optimal classifier as the size of the representative training set increases; training samples can also be added or removed without extensive retraining. The PNN classifier learns more quickly than many other neural network models and has produced favourable results in many applications. For these reasons, PNN is a supervised neural network that is suitable for use in this PD classification system [31,32]. The accuracy of PNN depends strongly on a suitable smoothing parameter or spread factor (ŋ). Through experimental investigation, appropriate values of ŋ were found to lie between 0.01 and 0.1.
KNN is a supervised learning algorithm in which a new query instance is assigned to the class held by the majority of its k nearest neighbours. It is a very popular and fundamental classification algorithm that demonstrates excellent performance with a short training time. For a given query point, the k training points closest to the query point are found, and the class label of the query is determined by majority voting among the classes of these k neighbours. The effect of the neighbourhood size on the classification results was studied by varying k from 1 to 10 [33].
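The PNN decision rule can be sketched as below. This is a minimal illustration, assuming a Gaussian kernel with a single spread value shared by all pattern units and features already normalised to [0, 1]; the function name `pnn_predict` and the toy data are hypothetical.

```python
import math


def pnn_predict(train_x, train_y, query, spread=0.09):
    """Classify `query` with a Gaussian-kernel probabilistic neural network.

    Pattern layer: one Gaussian unit per training vector.
    Summation layer: average activation per class.
    Output layer: the class with the largest average activation.
    """
    sums, counts = {}, {}
    for x, label in zip(train_x, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, query))
        activation = math.exp(-sq_dist / (2.0 * spread ** 2))
        sums[label] = sums.get(label, 0.0) + activation
        counts[label] = counts.get(label, 0) + 1
    return max(sums, key=lambda label: sums[label] / counts[label])


# Usage on toy two-dimensional vectors labelled 0 (healthy) and 2 (moderate).
train_x = [[0.10, 0.10], [0.15, 0.12], [0.80, 0.90], [0.85, 0.80]]
train_y = [0, 0, 2, 2]
predicted = pnn_predict(train_x, train_y, [0.12, 0.10])
```

Because the training vectors are stored directly as pattern units, adding or removing a training sample only changes the lists passed in, which is why no iterative retraining is needed.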
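The majority-vote rule of KNN can be sketched as below. This is a minimal illustration, assuming Euclidean distance and breaking ties in favour of the class encountered first among the nearest neighbours; the function name `knn_predict` and the toy data are hypothetical.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Squared Euclidean distance preserves the neighbour ordering.
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Usage: the effect of k can be studied by repeating the call for k = 1..10.
train_x = [[0.10, 0.10], [0.15, 0.12], [0.80, 0.90], [0.85, 0.80]]
train_y = [0, 0, 2, 2]
predictions = {k: knn_predict(train_x, train_y, [0.12, 0.10], k) for k in (1, 3)}
```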

Results and Discussion
Four feature sets were used in this research: 40 MFCC features (MFCC-40), 13 LPC features (LPC-13), 19 LPCC features (LPCC-19) and 13 WLPCC features (WLPCC-13). 10-fold cross-validation was performed on the extracted features, with the data randomly divided into 10 equal pieces. Each piece in turn is used as the test set, with training done on the remainder of the data, so the procedure is repeated 10 times (folds) and each subsample is used exactly once as the validation data. The advantage of this technique over conventional validation is that all observations are used for both training and validation, and each observation is used for validation exactly once. Herein, 10 subjects per group, with 3 trials and 2 sustained vowels per subject, resulted in feature matrices of 60 × 40 MFCCs, 60 × 13 LPCs, 60 × 19 LPCCs and 60 × 13 WLPCCs. The mean classification results for the 3 severity levels, where '0' represents healthy, '1' represents mild and '2' represents moderate, together with the average of the 3 classes, are tabulated in Table 2 and Table 3. Table 2 and Table 3 show that PNN with a smoothing parameter ŋ equal to 0.09 performed slightly better than KNN for all the extracted features. As seen in Table 2, the mean accuracy achieved for LPC-13 features using the KNN classifier is 85.67% for class '0', 84.90% for class '1' and 85.20% for class '2', and the average of the 3 classes is 85.26%. On the other hand, as seen in Table 3, when LPC-13 features are used with the PNN classifier, the mean precision obtained is 89% for class '0', 88.07% for class '1' and 86.91% for class '2', and the average of the 3 classes is 88%.
For LPCC-19 features, as seen in Table 2, the KNN classifier obtained an average accuracy of 86.47% for the healthy class, 85.26% for the mild class and 85.57% for the moderate class, and the average of the three classes is 85.77%. On the other hand, as seen in Table 3, when LPCC-19 features are used with the PNN classifier, the mean precision achieved is 90.13% for the healthy class, 89.53% for the mild class and 87.01% for the moderate class, and the average of the 3 classes is 88.89%. For WLPCC-13 features, as seen in Table 2, the KNN classifier obtained an average accuracy of 86.98% for class '0', 85.57% for class '1' and 85.81% for class '2', and the average of the three classes is 86.25%. However, as seen in Table 3, when WLPCC-13 features are applied to the PNN classifier, the highest average accuracy obtained is 93.11% for class '1', followed by 91.94% for class '0' and 90.27% for class '2'. The average of the 3 classes for PNN with WLPCC features is 91.78%. These results show that the PNN classifier performed better than the KNN classifier with the LPC based features, namely LPC, LPCC and WLPCC.
In addition, for MFCC-40 features, the PNN classifier also performed better than the KNN classifier, with average accuracies of 92.46%, 90.94% and 89.04% for classes '0', '1' and '2' respectively. The same holds for the combination of all LPC based features, which showed a mean accuracy of 93.97% for class '0', 92.81% for class '1' and 91.90% for class '2'. Among all the features, the combination of all LPC features gave the highest mean classification result averaged over the 3 classes: 92.63% using the PNN classifier and 88.56% using the KNN classifier. It is difficult to make a direct comparison between existing approaches and our proposed techniques, as every approach is unique in terms of signal analysis, feature extraction and classification framework; nevertheless, a summary of the performance of various approaches to the diagnosis of PD is given in Table 1. As shown in Table 1, previous work on speech signals has only presented results on distinguishing between healthy controls and PWP through dysphonia features. However, for the classification of healthy subjects against PWP with mild and moderate levels of PD, to the best of the authors' knowledge, very few analogous methodological frameworks have been published. In the present study, we applied two widely used short-term cepstral-based feature types in audio feature extraction, namely MFCC and LPC based features (LPC, LPCC and WLPCC), to differentiate PWP with mild and moderate PD severity from healthy subjects; the KNN classifier achieved a highest mean accuracy of 89.80% and the PNN classifier achieved a highest mean accuracy of 93.97%.
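The 10-fold cross-validation protocol used for these results can be sketched as below. This is a minimal illustration, not the authors' evaluation code: the random seed, the round-robin assignment of shuffled indices to folds, and the placeholder one-nearest-neighbour predictor `nearest_label` are assumptions, and the toy data only mimic the shape of three classes of 20 vectors each.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices); each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(features, labels, predict, n_folds=10):
    """Mean accuracy of `predict(train_x, train_y, query)` over the folds."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(features), n_folds):
        train_x = [features[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        correct = sum(predict(train_x, train_y, features[i]) == labels[i]
                      for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


def nearest_label(train_x, train_y, query):
    """Placeholder classifier: label of the single nearest training vector."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    return train_y[best]


# Usage on toy data: 3 well-separated classes with 20 vectors each.
features = [[float(c), c + 0.01 * i] for c in (0, 1, 2) for i in range(20)]
labels = [c for c in (0, 1, 2) for _ in range(20)]
mean_accuracy = cross_validate(features, labels, nearest_label)
```

Any classifier with the same `predict(train_x, train_y, query)` signature can be passed in place of `nearest_label`.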

Conclusion
PD is ranked as the second most common neurological illness after Alzheimer's disease. Research has shown that voice signals may be useful for PD diagnosis, based on clinical evidence suggesting that the majority of PWP show some form of vocal disorder. In this study, two types of classifier, KNN and PNN, were applied to a number of features extracted from voice signals in order to distinguish PWP with mild and moderate severity levels from healthy subjects. The experimental results showed that the best mean performance was reached with the combination of LPC based features (LPC, LPCC and WLPCC): 93.97% for the PNN classifier and 89.80% for the KNN classifier. The proposed approach is therefore a good starting point for efficiently classifying PD severity levels (mild and moderate) against healthy subjects through speech features. The present work can be further enhanced by extending the PD severity levels to mild, moderate and severe, by increasing the number of subjects in each class, and by improving the signal processing and classification techniques.

Acknowledgment
All the authors would like to acknowledge the neurologists from the Penang General Hospital for their support and assistance in providing the PD patients and healthy subjects throughout the research.