Semantic text relatedness on Al-Qur’an translation using modified path based method

Abdul Baquee Muhammad [1] built a corpus covering the Al-Qur'an domain, WordNet, and a dictionary, which initiated the development of knowledge about the Al-Qur'an and about the relatedness between its texts. The path-based measurement method proposed by Liu, Zhou and Zheng [3] has never been applied to the Al-Qur'an domain. In this research, that method is tested on an Al-Qur'an translation dataset to obtain similarity values and to measure their correlation. This study also proposes modifying the path-based method with a degree value. The degree value is the number of links owned by the lowest common subsumer (lcs) node in a taxonomy; these links represent the semantic relationships that the node has in the taxonomy. Modifying the path-based method with the degree value is expected to increase the correlation value obtained. In experiments with the proposed method, the measured relatedness correlates well with 200 word pairs taken from the noun POS of SimLex-999. The correlation value obtained is 93.3%, which indicates a very strong correlation. For parts of speech (POS) other than nouns, the WordNet vocabulary is incomplete, so many word pairs have a similarity value of zero and the correlation value is low.


Introduction
Semantics is the study of the meaning of words in a language. It concerns the relations between word meanings, such as synonymy, antonymy, and hyponymy. Morris and Hirst (2004) [2] stated that semantic relatedness describes the strength of the semantic relationship between two words or concepts. It covers various relationships among concepts, including classical relationships such as hypernymy, hyponymy, meronymy, antonymy, and synonymy, as well as other 'non classical relations'.
Weeds [2] defined semantic similarity as a specific case of relatedness, in which the strength of relatedness depends on the 'degree of synonyms' and is normally explained by classical relations. A taxonomy or lexical hierarchy is generally formed from hypernyms and hyponyms. A hypernym is a word whose meaning is more general than that of other words. Hyponyms are words whose meaning is more specific than that of the hypernym, and each hyponym is a member of the group named by its hypernym. For example, for the hypernym sport, the hyponyms include football, basketball, tennis, and running.
Semantic similarity between words is often represented by the similarity between the concepts associated with those words. Concepts can have different meanings but remain semantically related, such as sports and balls. Antonyms are another example: smart and stupid are semantically related, yet they are dissimilar. One way to find the similarity value is to calculate the shortest path length between two words. The similarity value is also influenced by the depth of the lowest common subsumer (lcs) node of the two concepts. The similarity value between two concepts increases as the depth of their subsumer in the taxonomy increases.
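The two quantities named above, the shortest path length and the depth of the lcs, can be illustrated on a small hand-made taxonomy. The following Python sketch is only an illustration under stated assumptions: the PARENT table and all function names are hypothetical, the data is not taken from WordNet, and the toy hierarchy is a tree, whereas WordNet allows a synset to have several hypernyms.

```python
# Toy is-a taxonomy (child -> parent). Hypothetical data, not WordNet.
PARENT = {
    "entity": None,
    "activity": "entity",
    "sport": "activity",
    "football": "sport",
    "tennis": "sport",
    "artifact": "entity",
    "ball": "artifact",
}


def ancestors(node):
    """Return the chain from node up to the root, starting with node itself."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def depth(node):
    """Number of edges between the root and node (the root has depth 0)."""
    return len(ancestors(node)) - 1


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):  # walks upward, so the first hit is the deepest
        if node in seen:
            return node
    raise ValueError("nodes do not share a root")


def shortest_path_length(a, b):
    """Number of edges on the path a -> lcs -> b in the tree."""
    lcs_depth = depth(lowest_common_subsumer(a, b))
    return (depth(a) - lcs_depth) + (depth(b) - lcs_depth)


if __name__ == "__main__":
    for pair in [("football", "tennis"), ("football", "ball")]:
        lcs = lowest_common_subsumer(*pair)
        print(pair, "lcs:", lcs, "depth:", depth(lcs),
              "path length:", shortest_path_length(*pair))
```

In this toy taxonomy, football and tennis share the lcs sport (depth 2) at path length 2, while football and ball meet only at the root (depth 0) at path length 5, which matches the intuition that a deeper subsumer and a shorter path indicate higher similarity.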

Related Work
In the study by Liu, Zhou and Zheng [3], semantic similarity measured with a path-based method reached a correlation value of 92.6%. This correlation value can still be improved, because Liu, Zhou and Zheng [3] exploited only the hypernym and hyponym relationships. Semantic relatedness involves many other semantic relationships besides hypernym and hyponym, such as synonym, antonym, meronym and holonym. Taking advantage of all of these relationships is expected to increase the correlation value achieved previously. Abdul Baquee Muhammad [1] built a corpus covering the Al-Qur'an domain, WordNet, and a dictionary, which initiated the development of knowledge about the Al-Qur'an and about the relatedness between its texts. The path-based measurement method proposed by Liu, Zhou and Zheng [3] has never been applied to the Al-Qur'an domain. In this research, that method is tested in the Al-Qur'an domain to obtain similarity values and to measure their correlation. To obtain a better correlation value, this research proposes modifying the path-based method of Liu, Zhou and Zheng [3] with a degree value. The degree value is the number of links owned by the lowest common subsumer (lcs) node in a taxonomy. The links owned by a node represent the semantic relationships that the node has in the taxonomy. Modifying the path-based method of Liu, Zhou and Zheng [3] with the degree value is expected to increase the correlation value obtained.

Semantic Similarity
Weeds [2] defined semantic similarity as a specific case of relatedness, in which the strength of relatedness depends on the 'degree of synonyms' and is normally explained by classical relations. Concepts that are semantically related are not necessarily similar, such as car and fuel. Antonyms are another example: beautiful and ugly are considered semantically related, yet they are dissimilar.
The semantic similarity between words has been a widely studied topic in natural language processing and information retrieval for many years. It is used in a variety of applications such as sense disambiguation, detection and correction of malapropisms, multimodal document retrieval, and automatic hypertext linking [3].
Similarity between words is often represented by the similarity between the concepts associated with the words. Many methods have been developed to compute word similarity. Generally, these methods fall into two categories: edge-counting methods, which use only semantic links, and corpus-based methods, which combine corpus statistics with taxonomic distance. The information sources commonly used in similarity measures are the shortest path length between words and the depth of the subsumer in the taxonomy hierarchy [3].
The key to calculating semantic similarity lies in simulating human thinking. Semantic similarity between words is decided by processing first-hand information sources in the human brain, so a semantic similarity measure can simulate human judgement by using multiple information sources [3]. Humans tend to decide the similarity between concepts by comparing their common features and their different features. The common and different features of two words are therefore considered in order to simulate the process of human judgement. It is assumed that human judgement of semantic similarity is a process of comparing the equal attributes with the aggregate attributes, that is, the sum of the equal and different attributes [3]. The model of semantic similarity is shown in formula [1].
In this model, l denotes the shortest path length between w1 and w2, d denotes the depth of the subsumer in the hierarchical semantic net, and f is the transfer function for d and l [3]. The method of Liu, Zhou and Zheng [3] achieves the best correlation, 0.926, with the average human ratings of the 28 MC word pairs among all the methods compared. This means that the method of Liu, Zhou and Zheng [3] is more accurate than most individual human subjects. Human judgement of semantic similarity can therefore be simulated by the ratio of common features to the sum of common and different features between words.
Figure 1 shows the general stages of the system developed in this research. The system can perform both the relatedness measurement and the correlation measurement. The query is used as input to WordNet to find its Wordset. In the data preprocessing stage, the Wordset and the 60 verses of the Al-Qur'an are processed into a standard form. Using the taxonomy and the proposed method, the Wordset and the 60 verses are measured to find their relatedness scores. Using the correlation method, the relatedness scores and the SimLex scores are compared to find their correlation score.
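The ratio model described in this section, common features divided by the sum of common and different features, can be made concrete with a small numeric sketch. The Python code below is an assumption-laden illustration and not the formula of [3]: it takes the depth d as the measure of common features and the path length l as the measure of different features, and it uses a simple linear transfer function with hypothetical weights alpha and beta.

```python
def ratio_similarity(d, l, alpha=1.0, beta=1.0):
    """Ratio of common features to common plus different features.

    d     -- depth of the subsumer (stands in for the common features)
    l     -- shortest path length (stands in for the different features)
    alpha -- hypothetical weight of the linear transfer function for d
    beta  -- hypothetical weight of the linear transfer function for l
    """
    if d < 0 or l < 0:
        raise ValueError("d and l must be non-negative")
    if l == 0:
        # Convention chosen for this sketch: a word compared with itself
        # has no different features, so it is fully similar.
        return 1.0
    common = alpha * d
    different = beta * l
    return common / (common + different)


if __name__ == "__main__":
    print(ratio_similarity(d=3, l=1))  # deep subsumer, short path
    print(ratio_similarity(d=1, l=3))  # shallow subsumer, long path
```

With equal weights, a pair with a deep subsumer and a short path (d = 3, l = 1) scores 0.75, while a pair with a shallow subsumer and a long path (d = 1, l = 3) scores 0.25, which follows the behaviour the ratio model is meant to capture.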

Presentation of Wordset Dataset
A Wordset is the output of a query to WordNet for the semantic senses of a word, where the output takes the form of semantically related words collected into one set. A single input word can produce multiple outputs, so the WordNet outputs are collected into a set of words, or Wordset. The query words used in this experiment are taken from the Pillars of Faith in Islam: faith, Allah, god, angel, holy, messenger, prophet, death, disaster, destiny.

Presentation of 60 verses of AlQuran Dataset
The second dataset contains the translations of 60 verses of the Al-Qur'an. It must pass through a preprocessing stage so that the output is more accurate. Preprocessing consists of three steps, referred to as TSS: tokenization, stopword removal, and stemming. The 60 translated verses run from surah Al-'Asr verse 1 to surah An-Naas verse 6.
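The three preprocessing steps can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in this research: the stopword list, the suffix list, the function names and the sample sentence are all hypothetical, and the suffix stripping is a crude stand-in for a real stemmer.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "are", "by", "for"}

# Hypothetical suffix list; a real system would use a proper stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def stem(token):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Apply tokenization, stopword removal and stemming in that order."""
    return [stem(token) for token in remove_stopwords(tokenize(text))]


if __name__ == "__main__":
    # Made-up sample sentence, not a verse translation.
    print(preprocess("The messengers are calling to the believers"))
```

For the made-up sample sentence, the sketch yields the tokens messenger, call and believer, which is the kind of standard form that can then be looked up in a taxonomy.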

Presentation of SimLex-999 Dataset
Calculating the correlation value requires a reference dataset against which the similarity measurements are compared. This dataset is extracted from SimLex-999 [13], which contains 999 word pairs, each with a similarity score. The similarity measurement uses 200 of these word pairs. The 200 word pairs taken from SimLex-999 come with their similarity scores, and the same pairs are then used in the similarity computation with the proposed method.

The System Improvement
Keijo Ruohonen [16] defined the degree of a vertex as the number of edges with that vertex as an end vertex. The edges at a vertex are those that leave the vertex or lead to it. The degree value can therefore be used to count the edges at a vertex. Each vertex in the network has a degree, because each vertex has a relationship with another vertex.
WordNet has a word hierarchy, commonly called a taxonomy. The taxonomy contains relationships between synsets: synonym, antonym, hypernym, hyponym, meronym and holonym. These relationships connect the synsets through links to form a network of words or synsets. Each synset in the taxonomy can have multiple links, depending on the relationships it has with other synsets.
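Counting the degree of a node in such a network can be sketched as follows. The link list, relation labels and function names are hypothetical and do not come from WordNet, and the sketch assumes that each relationship is stored once as an undirected link, so an inverse pair such as hypernym and hyponym is counted as a single edge.

```python
from collections import defaultdict

# Hypothetical typed links (node, node, relation); not real WordNet data.
LINKS = [
    ("activity", "sport", "hyponym"),
    ("sport", "football", "hyponym"),
    ("sport", "tennis", "hyponym"),
    ("football", "goal", "meronym"),
]


def degree_table(links):
    """Count, for every node, the number of links that have it as an end."""
    counts = defaultdict(int)
    for first, second, _relation in links:
        counts[first] += 1
        counts[second] += 1
    return dict(counts)


def degree_of(node, links):
    """Degree of a single node; nodes without links have degree 0."""
    return degree_table(links).get(node, 0)


if __name__ == "__main__":
    for node, value in sorted(degree_table(LINKS).items()):
        print(node, value)
```

In this toy network the node sport has degree 3, because one link connects it to activity and two links connect it to football and tennis; if sport were the lcs of a word pair, 3 would be the degree value used for that pair.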
Therefore, the degree value can be obtained from the number of links that connect a node to other nodes through the hypernym, hyponym, meronym and holonym relationships. The degree value proposed to modify the formula is the number of edges at the lowest common subsumer (lcs) vertex. The formula of Liu, Zhou and Zheng [3] is modified with this degree value, and the changed formula is shown below.
In the modified formula, b is the degree value of the lowest common subsumer (lcs) node. The path-based method proposed by Liu, Zhou, and Zheng [3] uses only the hypernym and hyponym relationships. Semantic relatedness, however, involves more relationships: not only hypernyms and hyponyms but also synonyms, antonyms, meronyms and holonyms. These relationships are not exploited in the path-based method proposed by Liu, Zhou, and Zheng [3].
The proposed method uses the degree value as a variable, where the degree value is the number of links owned by a node. These links represent the semantic relationships that the node has in the WordNet taxonomy. By using the degree value, the proposed method can accommodate all existing semantic relationships.

Result
The experiments support the hypothesis: adding the degree value to the semantic relatedness method improves the performance of the relatedness measurement. To analyze the effectiveness of the proposed method, the correlation value is measured for each POS in WordNet. The results of the relatedness measurement and the correlation measurement follow.

Relatedness Value Measurement
In the tables shown in this research, Path denotes the path-based method used by Liu, Zhou and Zheng [3], and Proposed denotes the method proposed in this research.

Figure 3. Relatedness Value in General Domain
This experiment is run to find the highest relatedness value that each method can obtain in the Al-Qur'an domain and in the general domain. Figure 2 shows the experimental measurements in the Al-Qur'an domain, where the proposed method obtains better relatedness values than the path-based method proposed by Liu, Zhou and Zheng [3]. Figure 3 shows the experimental measurements in the general domain, where the proposed method again obtains better relatedness values than the path-based method [3]. In both domains, the proposed method obtains the highest relatedness values.

Analysis by Part of Speech of WordNet
This experiment is run to measure the effectiveness and reliability of the method. The analysis compares the relatedness values computed with the proposed method against the similarity scores of the word pairs from SimLex-999. The two sets of values are compared to check their correlation. The strength of correlation is interpreted following Jonathan Sarwono [15], as shown in Table 5. After the correlation values are computed for several POS datasets, the values for the three POS datasets are displayed and compared with each other. The comparison identifies which POS dataset allows the correlation value to reach its maximum and which does not. The correlation values computed with the three POS datasets are shown in Table 6 and Figure 4.
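The comparison between computed relatedness values and reference scores can be sketched with a correlation coefficient. The code below assumes the Pearson coefficient, which the text does not name explicitly, and the function name and the sample numbers are made up for illustration and are not results from this research.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        # The coefficient is undefined for a constant sequence; returning
        # 0.0 is a convention chosen for this sketch.
        return 0.0
    return covariance / math.sqrt(spread_x * spread_y)


if __name__ == "__main__":
    # Made-up values: computed relatedness (0..1) and reference scores (0..10).
    computed = [0.10, 0.45, 0.80, 0.95]
    reference = [1.5, 4.0, 7.5, 9.0]
    print(round(pearson(computed, reference), 3))
```

Because the coefficient is insensitive to scale, computed values in the range 0 to 1 can be compared directly with reference scores on a different scale. A sequence in which many pairs receive the same value, for instance zero, has little spread, which is one way to see why missing vocabulary can depress the correlation.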

Analysis of the Variables of the Proposed Method
The proposed method has three variables: d denotes the depth of the subsumer, l denotes the shortest path length between two concepts, and b denotes the degree of the lcs node, that is, the number of links at the lcs node. This experiment analyzes the variables to find which has the most influence, and which the least, on changes in the similarity value and the correlation value. The similarity and correlation values are computed with each of the variables d, l and b set to 1, and the results for the three variables are displayed and compared with each other. The comparison shows which variable has more influence on the similarity and correlation measurements and which has less. The comparison of similarity and correlation values is shown in Table 7 and Figure 5.

Discussion
The relatedness computation experiment shows that the proposed method obtains better relatedness values than the path-based method used by Liu, Zhou and Zheng [3]. The proposed method obtains better results because it includes the degree value in the path-based formula, which increases the relatedness value and the correlation value. The degree value is the number of links owned by a node. These links represent the semantic relationships that the node has in the WordNet taxonomy, so using the degree value lets the proposed method accommodate all existing semantic relationships.
The correlation computation experiment indicates that the dataset with the greatest influence on the correlation measurement is the noun POS dataset. With the noun POS dataset, the correlation value is 78% using the path method and 93.3% using the proposed method. This means that the relatedness computation results correlate with the noun POS dataset from SimLex-999 and that the correlation is very strong. The lowest correlation is with the adjective POS dataset from SimLex-999, where the correlation value is only 10.1% using the path method and 1.5% using the proposed method.
The noun POS obtains a high correlation value because WordNet has a more complete vocabulary for nouns, so word pairs rarely have a similarity value of zero. For POS other than nouns, the WordNet vocabulary is incomplete, so many word pairs have a similarity value of zero and the correlation value is low.
In the analysis of the proposed method, for the similarity value measurement the most influential variable is b, with a deviation value of 0.380, and the least influential is d, with a deviation value of 0.083. For the correlation value measurement, the most influential variable is l, with a deviation value of 0.267, and the least influential is d, with a deviation value of 0.027.
The variable b is the most influential in the relatedness value measurement because b is the number of links owned by a node in the taxonomy, and a node in the WordNet taxonomy has many semantic links, so setting b to its minimum (1) strongly affects the relatedness value obtained. The variable l is the most influential in the correlation value measurement because l is the distance, or the number of links, between the two words (concepts) measured in the taxonomy, and concepts in the WordNet taxonomy are far apart, so setting l to its minimum (1) strongly affects the correlation value obtained.

Conclusions
By including the degree value in the path-based formula, the proposed method obtains a better relatedness value. For example, for the query 'messenger' and verse 114:2, it reaches 98.7%, compared with 56% for the path-based method proposed by Liu, Zhou and Zheng [3].
With the proposed method, the relatedness measurement correlates well with 200 word pairs taken from the noun POS of SimLex-999. The maximum correlation score reached in this research is 93.3%, which indicates a very strong correlation.
The correlation measurements for the adjective POS and the verb POS do not reach results as high as the noun POS, because the WordNet vocabulary for POS other than nouns is incomplete. Many word pairs therefore have a similarity value of zero, which makes the correlation value low.
The correlation measurement is influenced by the relatedness values obtained from the calculation process and by the values obtained from an external source such as SimLex-999, which serves as the human judgment and as the comparator in the correlation calculation.
In the analysis of the proposed method, for the similarity value measurement the most influential variable is the degree value, with a deviation value of 0.380, and the least influential is the depth of the subsumer, with a deviation value of 0.083. For the correlation value measurement, the most influential variable is the shortest path length, with a deviation value of 0.267, and the least influential is the depth of the subsumer, with a deviation value of 0.027.
This study has several aspects that can still be developed to achieve better relatedness values and to assist research on text and the Al-Qur'an, both by Qur'anic and text researchers and by the academic community from all colleges.

Future Works
Other existing methods for measuring text relatedness, such as vector-based, information-content-based, and gloss-based methods, can be tried in future experiments to look for possible improvements in text relatedness research.
Using knowledge bases other than WordNet offers a good opportunity to increase the relatedness value. Because of limitations in this experiment, there was not enough time to test this. It opens opportunities for future work on semantic text relatedness in the Al-Qur'an domain.
The Al-Qur'an contains many verses, 6622 in total, that this research has not covered. Subsequent research can use additional verses or other sets of verses from the Al-Qur'an, which may increase the relatedness values obtained.
Because of several limitations in this study, there was not enough time to investigate the possibilities mentioned above, that is, the variations of the research and experiments that could remove the barriers that may hinder increases in the relatedness value.