A comparison of Neural Network and SVM on the multi-label classification of Quran verse topics in English translation

Indeed, the Quran is the main guideline for Muslims. An intriguing property of the Quran is that one verse can be classified into more than one topic of discussion; for example, a single verse may discuss both prayer and faith (arkanul Islam) and religions, which makes this a multi-label case. In this study, we compared multi-label Quran verse topic classification techniques using Naïve Bayes, ANN, and SVM. Based on the experiments conducted, combining the Naïve Bayes algorithm with Mutual Information or Information Gain gives the best performance when applied to construct the model, with a Hamming Loss of 0.0938.


Introduction
In Islam, the source of Islamic teachings and law will always be oriented to two sources, namely the Quran and the Sunnah (Hadith). The Quran is the word of God and the last revelation, made as a guide for mankind. An intriguing property of the Quran is that one verse can be classified into more than one topic of discussion; for example, a verse of Surah Al-Bayyinah (the 98th surah) can be classified both as a verse that discusses arkanul Islam and as one that discusses religions. Since one verse in the Quran can have more than one topic of discussion, this is called a multi-label case [1,2]. It contrasts with a single-label case, in which one verse can be mapped to only one topic of discussion or class.
To find out the topic of discussion of each verse in the Quran, we can ask someone who thoroughly understands the division of topics of discussion. However, the availability of such a place, time, and resources is limited. Therefore, digitalization is required for the multi-label classification of Quran verses based on their topics of discussion, using a machine learning approach in which a machine or computer is used to categorize the verses of the Quran. On the other hand, the structure of the Quran is different from other text representations [3,4].
Generally, in text classification, features are the terms or words contained in the text. A document or textual dataset usually contains a considerable number of words, which can cause high computational complexity and decrease accuracy because some attributes may be irrelevant [5,6].
To overcome these problems, feature reduction is required. One way to perform feature reduction is to use feature selection [7]. Feature selection is a process of selecting a subset of relevant attributes (words) to be used in constructing the model.

Multilabel Classification
Text classification is part of supervised learning and is used to group data into particular categories. There are two types of classification: single-label classification and multi-label classification. In single-label classification, each data point can be grouped into only one category, whereas in multi-label classification each data point can be grouped into more than one category.
There are two approaches that can be used in multi-label classification: problem transformation and algorithm adaptation. The problem transformation approach turns a multi-label problem into one or more single-label classification problems, while the algorithm adaptation approach classifies multi-label data using algorithms that have been designed to handle multi-label classification problems directly.
The method used in this paper is problem transformation, based on the results obtained in previous experiments [5,8,9]. The algorithms used in this research are Naïve Bayes (NB), Artificial Neural Network (ANN), and Support Vector Machine (SVM).
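As an illustration, the following is a minimal sketch of problem transformation via binary relevance using scikit-learn; the toy data, topic names, and base classifier are placeholders, not the paper's exact setup.

```python
# Binary relevance sketch: OneVsRestClassifier trains one binary
# classifier per topic, turning the multi-label problem into several
# single-label problems (illustrative toy data, not the paper's setup).
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

X_train = [[2, 0, 1], [0, 3, 0], [1, 1, 2]]          # toy count vectors
y_train = [{"prayer", "faith"}, {"religions"}, {"prayer"}]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)                 # one column per topic

clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X_train, Y_train)

Y_pred = clf.predict([[1, 0, 2]])                    # binary indicator row
print(mlb.inverse_transform(Y_pred))                 # back to topic sets
```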

Feature Selection
Feature selection in text classification is used to improve the effectiveness of the model as well as computation and classification efficiency [8]. Feature selection methods include Mutual Information, Information Gain, and Document Frequency, which produced the best results in previous experiments [7,10].
Mutual Information (MI) selects a subset of relevant attributes (words). The selected attributes are considered to be more informative and effective for the classification model. The output of MI is a matrix of size n x n, where n is the number of variables, and each value in the matrix represents the mutual information between two variables [11,12]. In this research, the authors used only the MI values between word and class variables. The calculation of MI is shown in (1).
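For reference, the mutual information between a term t and a class c can be written in its standard form as

I(t; c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(e_t, e_c) \log \frac{P(e_t, e_c)}{P(e_t) P(e_c)}    (1)

where e_t indicates the presence or absence of term t in a document and e_c indicates whether the document belongs to class c.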
Information Gain (IG) is used to measure the level of relevance between attribute A and class C: the higher the IG value, the higher the relevance between attribute A and class C [13]. The calculation of IG is shown in (2).
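For reference, a standard form of information gain for an attribute A with respect to the class C is

IG(A) = H(C) - H(C \mid A) = -\sum_{c \in C} P(c) \log P(c) + \sum_{a \in A} P(a) \sum_{c \in C} P(c \mid a) \log P(c \mid a)    (2)

where H(C) is the entropy of the class distribution and H(C | A) is the conditional entropy of the class given attribute A.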
Document Frequency (DF) thresholding is the simplest technique for dimension reduction. DF easily scales to very large corpora, with a computational complexity that is approximately linear in the number of training documents.
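A minimal sketch of DF thresholding, assuming documents are given as token lists (the function and variable names are illustrative, not the system's actual code):

```python
# Document-frequency thresholding: keep only terms that occur in at
# least `min_df` documents (illustrative sketch).
from collections import Counter

def df_threshold(docs, min_df):
    # For each term, count the number of documents containing it.
    df = Counter(term for doc in docs for term in set(doc))
    return {term for term, count in df.items() if count >= min_df}

docs = [["pray", "faith"], ["pray", "charity"], ["faith", "pray"]]
print(df_threshold(docs, min_df=2))   # {'pray', 'faith'}
```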

System Design
The system built in this research is capable of classifying the multi-label topics of Quran verses. Figure 1 visualizes the design of the system. Based on Figure 1, the flow of the system in this study can be described as follows.
(i) Dataset. The topic classification dataset used is the topic classification of the verses of the Al Quran Cordova English translation. The number of Quran verses used in this study is 6,236, divided into 15 classes. Table 1 lists the topics in the classification of the verses of Al Quran Cordova.
(ii) Preprocessing, we prepared the dataset to be used in the classification process by removing attributes that are not needed, so that the data becomes more structured. Figure 2 visualizes the design of the preprocessing stage.

• Tokenization
Tokenization turns the sentences in a document into tokens. In this study, we cut each sentence into individual words.

• Stopword Removal
It removes common words that are considered less relevant in order to reduce the number of words to be processed. In this study, the authors used a stopword list from WordNet.

• Lemmatization
Lemmatization is the process of returning each word in the token list to a common base form. Lemmatization groups together the inflected forms of a word so they can be analysed as a single item. A sketch of these preprocessing steps is given after this list.
(iii) Feature extraction, we used the bag-of-words representation. It counts, for each document, the occurrences of each word in the vocabulary list obtained from the training set (a sketch is given after this list).
(iv) Feature selection, we used MI, IG, and DF to select the features contained in the data. The results of feature selection are used to create the classification model.
(v) Training classifier, we created the classification models using NB, SVM, and ANN with the words that passed the feature selection process.
(vi) Classification, we classify the data using NB, SVM, and ANN with the problem transformation approach.
(vii) In the final stage, the performance of the system is evaluated using the Hamming Loss metric.
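The following is a minimal sketch of the preprocessing steps in item (ii) using NLTK; NLTK's English stopword list stands in for the WordNet-based list used in the study, and the sample sentence is illustrative.

```python
# Preprocessing sketch: tokenization, stopword removal, lemmatization.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")       # tokenizer model
nltk.download("punkt_tab")   # needed by newer NLTK versions
nltk.download("stopwords")   # stopword list (stand-in for the paper's)
nltk.download("wordnet")     # lemmatizer data

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    tokens = word_tokenize(sentence.lower())             # tokenization
    tokens = [t for t in tokens
              if t.isalpha() and t not in stop_words]    # stopword removal
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(preprocess("Indeed the believers have established the prayers"))
```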
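The bag-of-words extraction in item (iii) can be sketched with scikit-learn's CountVectorizer; the documents are placeholders, and the vocabulary is learned from the training set only.

```python
# Bag of words: learn the vocabulary from the training set, then map
# each document to a vector of word counts over that vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["establish prayer give charity", "believe faith prayer"]
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)    # fit on training set only
X_test = vectorizer.transform(["prayer and charity"])

print(vectorizer.get_feature_names_out())
print(X_test.toarray())
```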
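For the evaluation in item (vii), a standard definition of Hamming Loss over N examples and L labels is

\mathrm{Hamming\ Loss} = \frac{1}{N L} \sum_{i=1}^{N} |Y_i \,\triangle\, Z_i|

where Y_i is the true label set of example i, Z_i is the predicted label set, and \triangle denotes the symmetric difference; lower values indicate better performance.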
The scenarios used in this experiment build models by combining a feature selection method with a classification algorithm. The first scenario builds a classification model using all attributes (without feature selection). The second scenario applies a feature selection method before constructing the classification model.

Results and Discussion
The first scenario was conducted to determine which classification algorithm produces the classification models with the best results. The algorithms used are Multinomial Naïve Bayes, SVM with an RBF kernel and C = 1.0, and an ANN trained with the stochastic gradient descent or L-BFGS optimizer. Table 2 shows the experimental results of the models built without applying a feature selection method. From Table 2 it can be seen that SVM has the smallest Hamming Loss value, even though SVM can indeed suffer in high-dimensional spaces where many features are irrelevant [14].
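A minimal sketch of the three models with the settings stated above, each wrapped in binary relevance; the toy data is illustrative, and any hyperparameter not stated in the text is a scikit-learn default rather than a value reported here.

```python
# The three classifiers with the stated settings, each wrapped in
# OneVsRestClassifier for binary-relevance multi-label classification.
# Unstated hyperparameters are scikit-learn defaults (assumption).
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import hamming_loss

models = {
    "NB":  OneVsRestClassifier(MultinomialNB()),
    "SVM": OneVsRestClassifier(SVC(kernel="rbf", C=1.0)),
    "ANN": OneVsRestClassifier(MLPClassifier(solver="lbfgs")),  # or "sgd"
}

# Toy count vectors and binary label matrix standing in for the real data.
X = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 2], [0, 1, 1]])
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 1]])

for name, model in models.items():
    model.fit(X, Y)
    print(name, hamming_loss(Y, model.predict(X)))
```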
The second scenario was conducted to determine which feature selection algorithm produces the classification models with the best results. In this experiment, several thresholds were applied for feature selection, ranging from 0.1 to 0.9 (i.e. keeping 10% to 90% of the input variables). Figures 3, 4, and 5 show the performance of Naïve Bayes, SVM, and ANN after feature selection using DF, IG, and MI. It can be seen in Figures 3 and 5 that MI and IG give the best results on average, while in Figure 4 DF gives the best result on average.

Figure 3, which compares MI, IG, and DF using the Naïve Bayes algorithm, shows the influence of the number of input variables on classification performance. For the classification models using IG and MI, increasing the number of input variables tends to decrease the classification rate, but this does not apply to DF. It can be seen from Figure 3 that using 20% of the input variables produces the smallest Hamming Loss, i.e. the best result.

From Figure 4, it can be seen that the difference in the number of input variables has no significant effect. This is because SVM can cope with high-dimensional problems, so SVM can work well even without feature selection.

In Figure 5, the use of feature selection is quite influential on the classification rate. Based on Figure 5, the best results are obtained when using 70% of the input variables. The change in performance around this point is due to attributes that have little influence in the training data but do affect the testing data. Table 3 shows the best experimental results for each model.

The experiments using DF show results that are not significantly better than those without DF, because DF selects features only by counting the most frequent words and does not consider class relevance, so a feature can appear often yet have no relevance to the output class.
On the other hand, the Naïve Bayes model considers each word's probability independently of any other word. This model obviously makes several rough, information-discarding assumptions, but it works very well for text classification.
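Concretely, under this conditional independence assumption, the probability of class c given a document d containing the words w_1, ..., w_n factorizes, in its standard form, as

P(c \mid d) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)

so each topic's score is a product of per-word probabilities estimated from the training data.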

Conclusion
The multi-label classification conducted in this experiment gave the best performance with a Hamming Loss of 0.0938. This best result was achieved by a model using the Naïve Bayes algorithm together with Mutual Information or Information Gain, where Mutual Information and Information Gain produced similar classifier performance. This is because both Mutual Information and Information Gain can choose the most relevant attributes to be included in the classification model.