An Arabic Voice System to Help Illiterate or Blind People Use a Computer

Speech recognition is a technology that enables a computer to recognize spoken words captured through a microphone and convert them into written text. This paper proposes a system that helps illiterate and blind people open applications with their voices. The proposed system includes two parts: the first part is for training, while the second is used for testing. The system consists of seven steps. The first step is the recording of voices, and the second is voice preprocessing. The third step is feature extraction using the MFCC (Mel Frequency Cepstral Coefficients) method, which itself involves seven steps. The fourth step is voice classification: 1400 voice samples are used for training with a Naïve Bayesian classifier. The fifth step is the matching step using the Correlation Coefficient, with 200 voice samples used for testing. The sixth step converts voice into text, and the seventh step executes one of 20 commands. The Naïve Bayesian algorithm gives an accuracy of 100% in the training phase, while the Correlation Coefficient gives an accuracy of 98% in the testing phase.


Introduction
Human beings interact with each other in many ways, including spoken words, hand gestures, facial expressions, etc. Speech is considered the most important of these, because it promotes contact and is the means most commonly used by speakers [1].
Speech is a meaningful utterance composed of several words which, in turn, consist of several letters accompanied by voiced sounds [2]. Researchers have built on this in advancing artificial intelligence, which helps create more versatile methods of operating a computer and enables the user to communicate and share information without using conventional input/output devices such as the keyboard [1].
Most worldwide voice recognition systems are not available in the Arabic language due to its particular challenges. The research effort on Arabic Automatic Speech Recognition (ASR) is sadly still insufficient, despite its importance. Arabic is one of the most widely spoken languages: about 420 million people speak Arabic worldwide, making it one of the six official languages of the United Nations (UN). Three different forms of the Arabic language can be distinguished: Classical Arabic, Modern Standard Arabic (MSA), and Colloquial Arabic [3].

Speech Recognition
Speech recognition is the mechanism by which human speech is fed into a device in analog form and the computer translates it into digital form to make it processable.
It is the method of translating a speech signal, through an algorithm implemented as a computer program, into a sequence of words (i.e., spoken words into text). The job is to get a machine to understand spoken language. By "understand" we mean responding properly and translating the input speech into another form [4]. This is also known as "Automatic Speech Recognition" (ASR), "machine speech recognition", or simply "speech to text" (STT) [1]. Figure (1) displays the block diagram of a speech recognition system [5].
The authors of [6] developed a system that allows the computer to convert voice requests and dictations into text using MFCC and Vector Quantization (VQ) techniques: feature extraction is performed with MFCC and feature matching with VQ. The extracted features are stored in a (.mat) file. A distortion measure based on minimizing the Euclidean distance is used when matching the unknown speech against the speech signal database. Jaffar Alkhier (2017) [7] proposed three speech recognition systems that differ from each other in the feature extraction method used: the first system used the MFCC algorithm, the second the LPCC algorithm, and the third the PLP algorithm. All three systems used HMM as the classifier.
Dr. Ch. Raja (2020) [8] tested automatic detection of emotions using standard Mel-Frequency Cepstral Coefficients (MFCCs) and pitch-related features derived from a speech segment. The acquired characteristics are then categorized using the Naïve Bayes classifier. The recognition accuracy of these characteristics is considered because they mimic the perception of the human ear.

Proposed System
The proposed system includes two parts, as shown in Figure (2): the first part is for training, while the second is used for testing. The system consists of seven phases. The first phase is recording the voices, the second is voice preprocessing, and the third is feature extraction using the MFCC method, which involves seven steps. The fourth phase is classification, using 1400 samples for training with a Naïve Bayesian method.
The fifth phase is the matching phase using the Correlation Coefficient, with 200 samples used for testing. The sixth phase converts voice into text, and the seventh phase executes one of 20 commands.

Recording Commands Voice
We collected our database by recording voice samples from different persons (males and females). There are 1600 samples recorded for 20 commands, used for training and testing.
All the voice signals were obtained under different conditions, such as the length of the recording and the sound amplitude level. Each command was recorded in (.wav) format at a 16000 Hz sampling rate.
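As an illustration, a command recorded in this format could be loaded in Python as follows. Only the 16 kHz mono 16-bit (.wav) format comes from the setup above; the function name and the normalization to [-1, 1] are our own conventions.

```python
import wave

import numpy as np

def load_command(path):
    """Load a recorded command (.wav, 16 kHz mono, 16-bit PCM) as floats in [-1, 1]."""
    with wave.open(path, "rb") as wf:
        # Recordings in this database are assumed to use a 16 kHz sampling rate.
        assert wf.getframerate() == 16000, "commands are recorded at 16 kHz"
        raw = wf.readframes(wf.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16)
    # Normalize 16-bit PCM to the [-1, 1] range for later processing.
    return samples.astype(np.float64) / 32768.0
```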

Speech Preprocessing
Preprocessing is done before feature extraction to remove noise from the voice signal and to improve recognition accuracy. Preprocessing operations such as pre-emphasis, framing, windowing, and endpoint detection must be conducted before processing and analyzing the voice signal, since they affect the extraction of speech features.
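The pre-emphasis, framing, and windowing operations above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the frame length of 400 samples and hop of 160 samples (25 ms / 10 ms at 16 kHz) and the pre-emphasis coefficient 0.97 are common defaults we assume here, and endpoint detection is omitted.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing of a 1-D voice signal."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split the signal into overlapping frames of frame_len samples.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack(
        [emphasized[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    # A Hamming window on each frame reduces spectral leakage in the later FFT.
    return frames * np.hamming(frame_len)
```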

Feature Extraction Using MFCC
At this phase, the MFCC technique was used to extract the characteristics of a voice signal that distinguish its content from that of other signals; it is one of the best and most widely used feature extraction methods for speech recognition. These characteristics were used in the training and testing stages of the classification and matching algorithms. MFCC applies to the input signal the steps shown in Figure (
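The remaining MFCC steps (FFT, mel filterbank, logarithm, and DCT, yielding 13 coefficients per frame) can be sketched as follows. The FFT size of 512 and the 26-filter mel bank are common defaults we assume, not values reported in the paper; only the 13 output coefficients match the text. The input is assumed to be the windowed frames produced by preprocessing.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel scale mapping.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frames, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    """MFCCs from windowed frames: FFT -> mel filterbank -> log -> DCT."""
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log of the filterbank energies (small constant avoids log(0)).
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the energies; keep the first n_ceps coefficients.
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_filters))
    return log_energy @ dct.T
```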

Naïve Bayesian Classification
The Naïve Bayesian algorithm is a classical classification algorithm that has proven its simplicity and efficiency in different applications [15]. It is a simple and very efficient probabilistic classification method whose performance can be comparable to neural networks and decision tree learning in some areas. On large data sets, Naïve Bayes shows acceptable speed and accuracy; on small data sets the effect is noticeably weaker, yet its classification performance remains strong even with a limited sample set [16].
Naïve Bayesian classifiers presume that the effect of an attribute value on a given class is independent of the other attribute values. This assumption is called class-conditional independence. It is made to simplify the computations involved, and the method is called "naïve" in this sense [15]. Bayes methods are popular in classification because they are optimal.
To apply Bayes methods, the prior probabilities and the class-conditional distributions of the patterns must be known. A pattern is assigned to the class with the highest posterior probability [17]. The Bayesian theorem can be stated as follows: let 'X' be a tuple from a database 'D' and let 'H' be the hypothesis that 'X' falls under a particular class. Then P(H|X) = P(X|H) P(H) / P(X), where "P(H|X)" and "P(X|H)" are the conditional probabilities of 'H' given 'X' and of 'X' given 'H', respectively, while "P(H)" and "P(X)" are prior probabilities and are unconditional, meaning they do not depend on the other quantities [18].
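A minimal Gaussian Naïve Bayes classifier built directly on Bayes' theorem and the class-conditional independence assumption can be sketched as follows. This is a generic illustration, not the paper's exact implementation: it models each feature as an independent Gaussian per class, and works in log space to avoid numerical underflow.

```python
import numpy as np

class GaussianNaiveBayes:
    """P(H|X) ∝ P(H) * Π_i P(x_i|H), each P(x_i|H) a Gaussian per class."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors, self.means, self.vars = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.priors[c] = len(Xc) / len(X)        # P(H): class prior
            self.means[c] = Xc.mean(axis=0)          # per-feature Gaussian mean
            self.vars[c] = Xc.var(axis=0) + 1e-9     # variance floor for stability
        return self

    def predict(self, X):
        scores = []
        for c in self.classes:
            # log P(H) + sum_i log N(x_i; mean, var); logs avoid underflow.
            ll = -0.5 * np.sum(
                np.log(2 * np.pi * self.vars[c])
                + (X - self.means[c]) ** 2 / self.vars[c],
                axis=1,
            )
            scores.append(np.log(self.priors[c]) + ll)
        # Assign each pattern to the class with the highest posterior score.
        return self.classes[np.argmax(scores, axis=0)]
```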

Matching using Correlation Coefficient
The correlation coefficient is a statistical measure representing the degree of linear correlation between two variables, with a value range of [-1, 1]. The higher the absolute value, the stronger the correlation; the lower it is, the weaker the correlation [19]. The correlation approach is commonly used in statistical analysis, pattern recognition, and image processing.
The Correlation principle is used to recognize the spoken word perfectly [20].
Cross-correlation is a measure of the similarity between two waveforms, used in communication and signal processing as a function of a time lag applied to one of them. It is typically used to search for a short, known feature within a longer signal, and it also has applications in pattern recognition [21]. It is described by the following equation:
r = Σ(xi − xm)(yi − ym) / sqrt( Σ(xi − xm)² · Σ(yi − ym)² )
where xi is the intensity of the i-th pixel in voice 1, yi is the intensity of the i-th pixel in voice 2, xm is the mean intensity of voice 1, and ym is the mean intensity of voice 2.
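The correlation-based matching step can be sketched as follows: compute the Pearson correlation between the test features and each stored command's features, and pick the command with the strongest correlation. The `best_match` helper and the template-dictionary layout are illustrative assumptions, not the paper's exact data structures.

```python
import numpy as np

def correlation_coefficient(x, y):
    """Pearson correlation between two equal-length vectors, in [-1, 1]:
    r = sum((x-xm)(y-ym)) / sqrt(sum((x-xm)^2) * sum((y-ym)^2))."""
    xm, ym = x.mean(), y.mean()
    num = np.sum((x - xm) * (y - ym))
    den = np.sqrt(np.sum((x - xm) ** 2) * np.sum((y - ym) ** 2))
    return num / den

def best_match(test_vec, templates):
    """Return the stored command name whose template correlates most
    strongly with the test feature vector (templates: name -> vector)."""
    return max(
        templates, key=lambda name: correlation_coefficient(test_vec, templates[name])
    )
```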

Database Recording
The recorded voices comprise 20 commands, recorded in the Arabic language.

Results
The results proceed from preprocessing through command execution, as explained in the following steps:

Speech Pre-processing
The voices first go through the preprocessing step, which removes noise by filtering the voices; segmentation is also used to divide each voice into frames. The results of preprocessing for the command (افتح المستندات, "Open the Documents") are shown in Figure (

Features Extraction
At this phase, we extracted the features of the 20 commands for 1400 samples using the MFCC algorithm in the training stage and stored the results in a file in (.mat) format, as shown in

Training by Naïve Bayesian Classification
In the training stage, the Naïve Bayesian classifier was used to train the classes of the 20 commands, and the results were stored in a database to be compared with new voices in the testing stage. The result of applying the Naïve Bayesian algorithm, using the accuracy rate that compares the features of the voices against the database of 1400 samples, is shown in

Converting Voice into Text
In this stage, the voice is converted into text after the matching phase has been completed. This text represents the command that is sent to the Python program, where the command required by the person is executed, as shown in Figure(

Execute Voice Commands
In this stage, the final result (text) from the system appears, and the system executes the voice command required to open the requested application on the computer screen, as shown in

Conclusion
The aim of this paper is to build a system that may help illiterate or blind persons use the computer, using their voices to invoke the most important commands without using their hands. We find that the signal preprocessing stage is the most important stage before feature extraction and pattern matching, which are the main elements of a speech recognition system. The feature extraction stage gave high accuracy when we used the MFCC algorithm, where 13 important features are extracted for each voice signal. In the training phase, we found that Naïve Bayesian classification has proven to be a simple and efficient method of probabilistic classification that shows good performance. In the testing phase, we found that the Correlation Coefficient is a measure of the similarity between two waveforms, and it was chosen because it can be used to search for a short, known feature within a longer signal.