Word Embeddings Evaluation on Indonesian Translation of Al-Quran and Hadiths

Word vectors are an important part of machine learning: they are numerical representations of text data. One way to convert text into numerical form is word embeddings. The word embedding algorithms that researchers often use are Continuous Bag of Words (CBOW), Skip-Gram, and FastText. This paper discusses the transformation of textual data from Islamic knowledge domain documents into numerical form using these three algorithms, and then evaluates the resulting word vectors using intrinsic and extrinsic evaluation techniques. We conduct the intrinsic evaluation by selecting the words to be evaluated, then checking for synonyms, antonyms, related words, and derived words among each word's nearest neighbours in the vector space. We also use the word vectors to solve word analogy problems. The best word vectors in the extrinsic evaluation are those of the CBOW algorithm integrated with Binary Relevance and a Multilayer Perceptron, with an accuracy of 77.56% and a hamming loss of 8.14%.


Introduction
Word embeddings are vectors that represent words as points in a multidimensional semantic space [1]. Words that have semantic relations (synonymy, antonymy, word similarity, word relatedness, etc.) have vectors that lie close to each other. Word embeddings avoid the sparse matrix form of word representation that arises when TF-IDF is used to represent a word [2]. That is, word embeddings produce dense vectors with a predefined number of dimensions. Dense vectors are also more effective when used as input features for machine learning algorithms and help prevent overfitting. Dense vectors are also better at capturing synonyms than sparse vectors [1].
Based on our searches, there are currently no pre-trained word embeddings for Islamic domain documents, and there is no research on evaluating word embeddings built from such documents. Evaluating word embeddings is important because they can affect the performance of machine learning algorithms. We hope that this word embeddings evaluation research can be used as a reference in choosing the right word embeddings algorithm for Islamic domain text.
IOP Publishing, doi: 10.1088/1757-899X/1077/1/012025
In this paper, we extract word embeddings from Indonesian text (i.e. Al-Quran translation, Al-Quran interpretation, and hadith translation) using the Continuous Bag of Words (CBOW), Skip-Gram, and FastText algorithms. The word embeddings are evaluated using intrinsic and extrinsic evaluation methods. Intrinsic evaluation measures the quality of word embeddings through experiments in which the embeddings are compared to human judgments on word relations (synonym, antonym, etc.). Extrinsic evaluation measures the quality of word embeddings by the performance of a machine learning algorithm or other NLP task that uses the embeddings as feature vectors. The main objective of this paper is to find the best word embeddings model for textual data from Islamic domain documents in the Indonesian language.
The main contributions of this paper are: (1) developing word embeddings for the Indonesian translation of the Quran and Hadiths, and (2) providing a thorough evaluation of the developed word embeddings.

Literature Review
Nooralahzadeh et al. [15] evaluated domain-specific word embeddings built from oil and gas textual data. The CBOW and Skip-Gram algorithms were trained on about 8 million texts in a hyperparameter tuning experiment. They conducted two kinds of evaluation, intrinsic and extrinsic. For the intrinsic evaluation, they randomly chose 100 unique words from the oil and gas domain and retrieved the 10 most similar words for each of them from the word embeddings. To evaluate the tuning experiment, they assessed the word embeddings by looking for synonyms, antonyms, and alternative forms among the 10 most similar words. After finding the best parameters, they analyzed the errors in the 10 most similar words using several categories: spelling variant, alternative form, reference synonym, human-judged synonym, antonym, hypernym, hyponym, co-hyponym, holonym, meronym, related, and unrelated/unknown. The most frequent errors occurred in the related, hyponym, and co-hyponym categories. For the extrinsic evaluation, the word embeddings were used as the features of a multi-label classification task. They ran the basic CNN by Kim (2014) and several modified CNN models in the experiment. The best result was obtained by CNN models with two embedding layers (a random embedding vector and a domain-specific vector) integrated with the retrofitting method and Out of Vocabulary (OOV) handling.
Similar to the work of Nooralahzadeh et al., Sarma et al. [14] trained a modified word embeddings algorithm on a "Substances User Disorders" (SUDs) dataset to produce high-quality word embedding vectors. They combined large-scale corpus embeddings and domain-specific embeddings using linear Canonical Correlation Analysis (CCA) or nonlinear kernel CCA (KCCA). They found that the CCA/KCCA combined word embeddings improve substantially over the generic embeddings. They evaluated the combined word embeddings in an extrinsic evaluation on binary sentiment classification with logistic regression. These architectures were tested on four public datasets: Yelp, Amazon, IMDB, and A-Chess.
In another work, Roy et al. [16] evaluated their proposed method on two cybersecurity text corpora: a malware description corpus and a "Common Vulnerability and Exposure (CVE)" corpus. The authors developed a novel Annotation and Word Embedding (AWE) algorithm. They stated that the types of domain knowledge are diverse. To overcome this, the authors built the text annotations in the form of predicate-argument structures. AWE is based on the Word2Vec system: its input is the text together with the annotations, and its output is a vector representation of both words and annotations.

Methodology
We define the steps of this research in this section. Then we describe the technical implementation steps of the algorithms CBOW, Skip-gram, and FastText. We use the Word2Vec and FastText libraries in the Python programming language.

Methodology
The research methodology of this study can be seen in Fig. 1. We began by collecting text data from the Indonesian translation of the Al-Quran, the Indonesian interpretation (tafseer) of the Al-Quran, and the Indonesian translation of the hadith in the book of Sahih Al-Bukhary. After collecting the text data, we pre-processed the text with the following steps: case folding, sentence parsing, removal of numbers and punctuation, checking for repeated words (for example "buku-buku" (books)), and word tokenisation. From the sentence parsing step, we obtained 157,870 sentences with 45,327 unique words. In Bahasa Indonesia, plural words are formed by repeating a noun and placing a hyphen between the two words [17], for example "buku-buku", which means "many books". It is therefore necessary to check for repeated words so that their meaning does not change and they are not split into two tokens.
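The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `preprocess`, the regular expressions, and the sample sentence are hypothetical and are not taken from the original pipeline, and sentence parsing is assumed to have happened already.

```python
import re

def preprocess(sentence):
    """Case-fold, drop digits and punctuation, keep reduplicated words whole."""
    text = sentence.lower()                               # case folding
    text = re.sub(r"\d+", " ", text)                      # remove numbers
    # Keep a hyphen only when it joins two letters, as in "buku-buku".
    text = re.sub(r"(?<![a-z])-|-(?![a-z])", " ", text)
    text = re.sub(r"[^a-z\- ]", " ", text)                # remove remaining punctuation
    return text.split()                                   # whitespace tokenisation

print(preprocess("Buku-buku itu dibaca 3 kali, lalu disimpan."))
```

The reduplicated word survives as the single token `buku-buku`, while a free-standing hyphen is treated as punctuation.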
Specifically for the hadith text, we applied Named Entity Recognition (NER) to recognize the names of narrators. We use a rule-based NER based on writing patterns: words that indicate the name of a narrator are written in square brackets ([...]). This step ensures that the words of a narrator's name are not separated during tokenization. Table 1 shows an example of the original text and the NER result.
NER result: "Telah menceritakan kepada kami abdullah_bin_yusuf berkata, telah mengabarkan kepada kami malik dari hisyam_bin_urwah dari bapaknya dari aisyah ……………."
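A minimal sketch of such a bracket-based rule is given below. The function name `join_narrators` and the bracketed input string are hypothetical: the input is our guess at what a source sentence might look like before NER, not a quotation from the dataset.

```python
import re

def join_narrators(text):
    """Replace each [bracketed name] with one lower-cased, underscore-joined token."""
    def to_token(match):
        words = match.group(1).lower().split()
        return "_".join(words)
    # Match the shortest bracketed span that contains no nested brackets.
    return re.sub(r"\[([^\[\]]+)\]", to_token, text)

raw = "Telah menceritakan kepada kami [Abdullah bin Yusuf] berkata"
print(join_narrators(raw))
```

Because each name becomes one token, a whitespace tokenizer applied afterwards will not split it.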

Implementation
There are two global architectures in Word2Vec [18,19]: CBOW and Skip-Gram. We implemented both of them, as well as the FastText algorithm [20], and then chose the best one.
Mikolov et al. [19] created Word2Vec to reduce the computational complexity of the Neural Network Language Model (NNLM) developed by Bengio et al. (2013). In NNLM, a probability distribution over all words in the vocabulary is computed from the hidden layer (with V as the vocabulary set and |V| as the number of words in the vocabulary), producing an output of dimension |V| [19]. Mikolov et al. tried two architectures for building word vectors: Continuous Bag-of-Words (CBOW) and Continuous Skip-Gram. The two architectures differ in their prediction objective. CBOW predicts the target word w_i from its context (the words surrounding w_i), whereas Skip-Gram predicts the surrounding words given the word w_i.
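The difference between the two objectives can be made concrete by listing the training examples each one would draw from a sentence. The sketch below is a simplification under our own assumptions: it uses a fixed window and plain Python lists, the helper names are hypothetical, and it omits details of real Word2Vec training such as negative sampling.

```python
def context_windows(tokens, win=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - win):i]
        right = tokens[i + 1:i + 1 + win]
        yield target, left + right

def cbow_examples(tokens, win=2):
    # CBOW: the whole context is the input, the target word is the output.
    return [(context, target) for target, context in context_windows(tokens, win)]

def skipgram_examples(tokens, win=2):
    # Skip-Gram: the target word is the input, each context word is one output.
    return [(target, c)
            for target, context in context_windows(tokens, win)
            for c in context]

tokens = ["allah", "maha", "pengasih"]
print(cbow_examples(tokens, win=1))
print(skipgram_examples(tokens, win=1))
```

CBOW yields one example per position, while Skip-Gram yields one example per (target, context word) pair, so it produces more training examples from the same text.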
Joulin et al. built the FastText architecture for extracting word embeddings from texts [5]. FastText is similar to the CBOW model, except that the middle word is replaced by a label and a bag of n-grams is used as features to capture partial information about local word order.
First, we build the word embeddings architectures with default values for the hyper-parameters, i.e. dim = 100, win = 5, min_count = 5, and neg = 5, where dim is the dimensionality (size) of the vectors, win is the size of the window for collecting context words, min_count is the minimum frequency of a word, and neg is the negative sampling size. The Gensim library [21] was used to implement the CBOW, Skip-Gram (SG), and FastText architectures. Table 2 shows that the CBOW algorithm has the shortest training time of the three algorithms.
For the evaluation, we used the evaluation model defined by Bakarov [25], which divides word embeddings evaluation into two major parts, namely extrinsic and intrinsic evaluation. Extrinsic evaluation methods are based on the ability to use word embeddings as the feature vectors of supervised machine learning algorithms or in another NLP task. The performance of the supervised model serves as a measure of the quality of the word embeddings.
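The default hyper-parameters listed above (dim, win, min_count, neg) could be passed to Gensim roughly as follows. This is a sketch rather than the training script used in the study: it assumes the Gensim 4.x argument names (`vector_size`, `window`, `min_count`, `negative`, `sg`; Gensim 3.x used `size` instead of `vector_size`), and the names `BASE_PARAMS`, `MODEL_PARAMS`, and `train` are hypothetical.

```python
# Shared defaults from the text: dim=100, win=5, min_count=5, neg=5.
BASE_PARAMS = {"vector_size": 100, "window": 5, "min_count": 5, "negative": 5}

MODEL_PARAMS = {
    "cbow": {**BASE_PARAMS, "sg": 0},      # sg=0 selects CBOW in Word2Vec
    "skipgram": {**BASE_PARAMS, "sg": 1},  # sg=1 selects Skip-Gram
    "fasttext": dict(BASE_PARAMS),         # other FastText settings left at library defaults
}

def train(name, sentences):
    """Train one model on a list of token lists (requires Gensim 4.x)."""
    from gensim.models import FastText, Word2Vec  # imported lazily
    model_class = FastText if name == "fasttext" else Word2Vec
    return model_class(sentences=sentences, **MODEL_PARAMS[name])
```

Calling `train("cbow", sentences)` with the tokenised corpus would return a model whose vectors are available under its `wv` attribute.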

Intrinsic Evaluation
The intrinsic evaluation method evaluates word embeddings based on human (expert) judgment. In this evaluation, human evaluators assessed the related words produced by each word embedding architecture. In [15], three kinds of word relations were used to evaluate word embeddings: synonym, antonym, and alternative form. We replaced the alternative form with related and derivative forms. Furthermore, we used the Indonesian thesaurus from Kateglo.com as the reference for synonyms (e.g. adil → jujur), antonyms (lemah → kuat), related words (nabi → rasul), and derivative forms (iman → keimanan).
For each word in the glossary (call it g), we extract the 10 most related words (w_n) and assign each of them to one of the following categories: synonym, antonym, related word, or derived word. If g and w_n match in one of the categories, the pair is marked as true. Table 3 shows an example of the 10 related words for "zhalim" (harsh) and the category of each related word.
We use Equation (1) to calculate the percentage of true pairs for each word embeddings algorithm, where Syn_i, Ant_i, Rel_i, Der_i ∈ {0, 1} and N is the number of rows. Table 4 describes the results of the evaluation of each algorithm. The algorithms with the best results are Skip-Gram and FastText, for which 52.53% of the most related words could be categorized.
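One way to write the true-percentage computation is sketched below. The exact form is an assumption reconstructed from the symbol definitions above. It presumes that each (g, w_n) row is assigned to at most one category, so that the four indicators of a row sum to 0 or 1.

```latex
\text{True}\,(\%) = \frac{1}{N} \sum_{i=1}^{N} \left( Syn_i + Ant_i + Rel_i + Der_i \right) \times 100\%
```

Under this reading, a glossary of 10 words with 10 retrieved neighbours each gives N = 100 rows, and 52 categorized rows would correspond to 52%.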

Word Analogy
The second most popular intrinsic evaluation is the word analogy method (some researchers also call it linguistic regularities, analogical reasoning, or word semantic coherence) [25]. The idea of word analogy is to solve an arithmetic operation whose operands are word vectors. Given a pair of words a and a* and a third word b, the relation between a and a* can be used to predict the word b* that stands in the same relation to b. For example, in car:wheel :: bird:b*, the wheel is "a part of" the car, so b* is "a part of" the bird (wings, beak, claws, etc.) [26].
To predict the word b*, we first need its vector. The vector of b* can be obtained with a simple vector operation, as described in Equation (2); we can then search for the word whose vector is closest to it [27].

$$w_{b^*} \approx w_{a^*} - w_a + w_b \qquad (2)$$

In the study of English linguistics, it is easier to find a public dataset to serve as the ground truth for analogy problems. Gao et al. [28] proposed the WordRep dataset, which is divided into 26 semantic classes. WordRep was developed from the "Google Analogy" data by Mikolov et al. [19], which combines morphological and semantic elements in pairing words. The situation is different for Bahasa Indonesia, for which supporting resources are still very rare. To the best of our knowledge, no analogy dataset has yet been created for Indonesian. Therefore, to test with this analogy approach, we created our own dataset. It consists of 38 pairs in 6 categories, namely antonyms, synonyms, hyponyms, hypernyms, derivative words, and related words.
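The vector operation in Equation (2) can be illustrated on toy data. The three-dimensional vectors below are invented for the illustration and are not taken from the trained embeddings; the function names are hypothetical. Excluding the three query words from the candidates is a common convention that we assume here.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, a_star, b):
    """Return the word closest to w_a* - w_a + w_b, excluding the query words."""
    target = [x - y + z for x, y, z in zip(vectors[a_star], vectors[a], vectors[b])]
    candidates = [w for w in vectors if w not in (a, a_star, b)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors (made up): the third axis stands in for the "derived form" direction.
TOY = {
    "budaya": [1.0, 0.0, 0.0],
    "kebudayaan": [1.0, 0.0, 1.0],
    "bangsa": [0.0, 1.0, 0.0],
    "kebangsaan": [0.0, 1.0, 1.0],
    "buku": [1.0, 1.0, 0.0],
}

print(solve_analogy(TOY, "budaya", "kebudayaan", "bangsa"))
```

In this toy space the target vector is exactly [0, 1, 1], so the nearest remaining word is `kebangsaan`.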
Following the word analogy setup, given the words a, a*, and b, we determine the word b* by applying Equation (2) and looking for the words in the vocabulary whose vectors are most similar to the vector of b*. Similar words are determined with the "3CosMul" method [29] from the Gensim library, as described in Equation (3):

$$b^* = \arg\max_{b' \in V} \frac{\cos(w_{b'}, w_b)\,\cos(w_{b'}, w_{a^*})}{\cos(w_{b'}, w_a) + \varepsilon} \qquad (3)$$

where V is the vocabulary and ε = 0.001 is used to prevent division by zero.
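A small, self-contained sketch of the 3CosMul scoring is given below. The toy vectors are invented for the illustration and the function names are hypothetical. We also assume, following the usual formulation of 3CosMul, that cosines are shifted from [-1, 1] to [0, 1] before they are multiplied and divided. In Gensim the corresponding call is `KeyedVectors.most_similar_cosmul(positive=[a_star, b], negative=[a])`.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def shifted(u, v):
    # Map cosine from [-1, 1] to [0, 1] so products and ratios stay non-negative.
    return (cosine(u, v) + 1.0) / 2.0

def three_cos_mul(vectors, a, a_star, b, eps=0.001):
    """Return the candidate b* that maximises the 3CosMul score."""
    def score(word):
        v = vectors[word]
        numerator = shifted(v, vectors[b]) * shifted(v, vectors[a_star])
        return numerator / (shifted(v, vectors[a]) + eps)
    candidates = [w for w in vectors if w not in (a, a_star, b)]
    return max(candidates, key=score)

# Toy vectors (made up) for the analogy budaya:kebudayaan :: bangsa:b*.
TOY_VECTORS = {
    "budaya": [1.0, 0.0, 0.0],
    "kebudayaan": [1.0, 0.0, 1.0],
    "bangsa": [0.0, 1.0, 0.0],
    "kebangsaan": [0.0, 1.0, 1.0],
    "buku": [1.0, 1.0, 0.0],
}

print(three_cos_mul(TOY_VECTORS, "budaya", "kebudayaan", "bangsa"))
```

Both candidates are equally close to `bangsa` and `kebudayaan`, but `buku` is also close to `budaya`, which enlarges its denominator and lowers its score.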
The results of the analogy test are shown in Table 5. Overall, the three embedding algorithms can solve analogy problems, especially in the classes of synonyms, antonyms, derivative forms, and related words. According to Table 5, the algorithm that solves word analogy problems best is Skip-Gram. However, it failed on the hyponym and hypernym classes, because hypernyms and hyponyms rarely appear together as context and target words. In the derivative class, the FastText algorithm reaches 100% accuracy, because during training it breaks each word into subwords based on character n-grams. This allows FastText to retrieve words that are morphologically similar, for example for the pair budaya:kebudayaan :: bangsa:b* (cultural:culture :: nation:b*).
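The effect of subword n-grams can be seen by comparing the character n-grams of two derived forms. The sketch below assumes the n-gram range 3 to 6 with `<` and `>` as word boundary markers, which are the usual FastText defaults; the settings used in the study are not stated. The second word, kebangsaan, is our own choice of a plausible derived form and the function name is hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }

shared = char_ngrams("kebudayaan") & char_ngrams("kebangsaan")
print(sorted(shared))
```

The shared n-grams come from the prefix "ke-" and the suffix "-an", which is how words carrying the same affixes end up with related vectors.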

Extrinsic Evaluation
The next evaluation uses an extrinsic approach: we used the word vectors from the CBOW, Skip-Gram, and FastText algorithms as the features for multilabel text classification. For this experiment, we use the dataset from [30], which contains the Indonesian translation of the Shahih Bukhary hadiths. This dataset consists of 1064 rows, where each row has 3 labels [suggestion, prohibition, information], each represented by a binary value {0, 1}. We split the dataset into three parts: 45% for training, 22% for validation, and 33% for testing. We applied pre-processing steps to the text, including case folding, word normalization, punctuation removal, and tokenization.
In this evaluation, we solve the Multilabel Classification (MLC) task using the Binary Relevance (BR) problem transformation approach, which treats each label as a separate single-label classification problem. We paired BR with two different learning algorithms, Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM), and implemented the classifiers using the Scikit-Multilearn library [31].
We performed hyperparameter tuning for the MLP with the following parameter values: alpha = {0.00001, 0.000001} and hidden_layer_sizes = {(100,), (100, 10), (100, 10, 3)}. For the SVM, we tried three different kernels: Radial Basis Function (RBF), sigmoid, and linear. The input to the learning algorithm is the average of the word vectors of each document. Table 6 shows the optimal parameters for each scenario. After finding the optimal parameters, we re-trained the learning algorithm with them. We used accuracy, hamming loss, F1-micro average, and F1-macro average as the evaluation metrics.
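The feature construction and the Binary Relevance transformation described above can be sketched as follows. The function names, the toy vectors, and the handling of out-of-vocabulary words (they are skipped, and a document with no known word gets a zero vector) are our own assumptions; the text does not say how such cases were treated.

```python
def document_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim          # assumption: empty documents map to zeros
    return [sum(column) / len(known) for column in zip(*known)]

def binary_relevance_targets(label_rows):
    """Split an (n_documents x n_labels) matrix into one target list per label."""
    return [list(column) for column in zip(*label_rows)]

toy_vectors = {"shalat": [1.0, 3.0], "wajib": [3.0, 5.0]}
print(document_vector(["shalat", "itu", "wajib"], toy_vectors, dim=2))

# Label order: [suggestion, prohibition, information]
print(binary_relevance_targets([[1, 0, 1], [0, 1, 0]]))
```

Each list returned by `binary_relevance_targets` is the target of one independent binary classifier trained on the same document vectors.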
Hamming loss is based on the symmetric difference between the predicted label set and the ground truth: it counts the mismatched labels and divides by the total number of label assignments. F1-macro averaging computes the F1-measure independently for each label from that label's true positives, false positives, and false negatives, and then averages the per-label scores. F1-micro averaging instead pools the counts over all labels and computes the F1-measure once [32].
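The three metrics can be written out directly from their definitions. The sketch below is a plain-Python illustration on made-up predictions; the function names and the toy label matrices are ours, and in practice library implementations would normally be used.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label assignments that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / total

def label_counts(y_true, y_pred, j):
    """True positives, false positives and false negatives for label j."""
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def f1_macro(y_true, y_pred):
    # One F1 per label, then the unweighted mean.
    n_labels = len(y_true[0])
    scores = [f1_from_counts(*label_counts(y_true, y_pred, j)) for j in range(n_labels)]
    return sum(scores) / n_labels

def f1_micro(y_true, y_pred):
    # Pool the counts over all labels, then compute F1 once.
    n_labels = len(y_true[0])
    counts = [label_counts(y_true, y_pred, j) for j in range(n_labels)]
    tp, fp, fn = (sum(c[k] for c in counts) for k in range(3))
    return f1_from_counts(tp, fp, fn)

# Made-up labels in the order [suggestion, prohibition, information].
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
print(hamming_loss(y_true, y_pred), f1_macro(y_true, y_pred), f1_micro(y_true, y_pred))
```

On this toy data 3 of the 12 label assignments are wrong, so the hamming loss is 0.25, while the macro and micro F1 scores differ slightly because the labels are not equally easy.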
According to Table 7, the average testing accuracy of Multilabel Classification (MLC) is 76% and the average F1-macro score reaches 89%. This result suggests that the word vectors produced in this study are promising. Based on testing accuracy and hamming loss, the best word vectors in the extrinsic evaluation are those of the CBOW algorithm integrated with Binary Relevance and Multilayer Perceptron.

Conclusion
This paper has discussed intrinsic and extrinsic evaluations of word embeddings trained on a corpus from the specific domain of religious texts. Based on our results, CBOW is the fastest algorithm to train on the data we use. In the intrinsic test, the best algorithms in categorizing related words are Skip-Gram and CBOW, with an accuracy rate of 52.53%. We also tested the word vectors extracted from the text on word analogy problems, where the best performance was achieved by the Skip-Gram algorithm with an average accuracy rate of 70.25%. Apart from intrinsic testing, we also performed extrinsic testing, in which the word vectors serve as features for the Multilabel Classification (MLC) task on hadith text. The best algorithm for this MLC task is CBOW combined with Binary Relevance and Multilayer Perceptron (MLP). The accuracy obtained by CBOW+BR+MLP is 77.56%, with a hamming loss of 8.14%.
From this study, we conclude that special pre-processing (such as NER) is needed for certain data. For example, in the hadith text, it is necessary to separate the narrators from the content of the hadith (matn). For further research, we plan to fine-tune the word embeddings architectures so that the resulting word vectors become more optimal.