Hierarchical Rhetorical Sentence Categorization for Scientific Papers

Important information in scientific papers is often conveyed by rhetorical sentences belonging to certain categories. Extracting this information requires text categorization. Previous work on this task has employed word frequency, semantically similar words, hierarchical classification, and other techniques. This paper presents rhetorical sentence categorization for scientific papers that employs TF-IDF and Word2Vec to capture word frequency and word semantic similarity, combined with hierarchical classification. Every experiment is tested with two classifiers, Naïve Bayes and linear SVM. The results show that the hierarchical classifier outperforms the flat classifier with either TF-IDF or Word2Vec, although the improvement is small: from 27.82% with the flat classifier to 29.61% with the hierarchical classifier, an increase of almost 2 percentage points. The results also show that the hierarchical classifier makes it possible to build a different learning model for each group of child categories.


Introduction
Scientific papers are among the documents most frequently consulted by academics and researchers, who use them as references or points of comparison by collecting relevant information [2]. This information can be composed of rhetorical sentences structured into the categories that readers need [3]. A rhetorical sentence is a sentence in a document that is assigned a label from a set of defined rhetorical categories based on its text features; rhetorical structure is a connected structure of texts in which every segment of a text (i.e., a sentence) has a meaningful category within each section of the document [1].

Figure 1. Flat classification approach using a flat multi-class classification [10]

The rest of this paper is organized as follows. The next section reviews related work on rhetorical sentence categorization by Teufel and on the use of Word2Vec semantic representations and hierarchical classification. Our method is explained in Section 3, where we also define the rhetorical categories and the feature sets used for classifying sentences. Section 4 describes our experiments, and Section 5 analyzes the results. Finally, Section 6 presents the conclusion and future work.

Related Work
Rhetorical sentence categorization is commonly applied to scientific papers. Teufel and Moens [7] performed this task by assigning one of seven rhetorical categories to every sentence: "aim", "textual", "own", "background", "contrast", "basis", and "other". Using a Naïve Bayes classifier, they obtained F-measures ranging from 26% to 86%; the result later rose to 92.93% with Maximum Entropy [6]. These seven categories were subsequently expanded into 15 rhetorical categories [5]: "aim", "nov_adv", "co_gro", "othr", "prev_own", "own_mthd", "own_fail", "own_res", "use", "own_conc", "codi", "gap_weak", "antisupp", "support", and "fut". The distribution of the 15 categories over the original seven is shown in Table 1. Widyantoro et al. [8] implemented these 15 categories, plus "textual", by adapting Teufel's features and employing a multi-heterogeneous classifier, achieving an average F-measure of about 25%. Note that they used a different corpus from Teufel's [4]. This could affect the final rhetorical categorization model, because Teufel employed meta-discourse features whose grammars depend on word patterns that recur in the sentences of the corpus.

Second, Word2Vec groups words with similar meanings into a vector representation. The model was proposed by Mikolov [12] and released by Google in 2013. Word2Vec has two architectures: continuous bag-of-words (CBOW) and skip-gram. CBOW predicts the current word from its context, while skip-gram predicts the surrounding words given the current word [13]. Heffernan and Teufel [14] employed Word2Vec representations to identify problem statements in scientific text. They used 18,753,472 sentences from a biomedical corpus based on all full-text PubMed articles and built a model from the 200 words most semantically similar to "problem".
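The difference between the two architectures can be seen in the training pairs they derive from a sentence. The following sketch (illustrative only; window size and sentence are our own choices) shows how skip-gram produces (center, context) pairs while CBOW produces (context, center) pairs:

```python
# Sketch of the training pairs behind Word2Vec's two architectures:
# skip-gram predicts surrounding words from the current word, while
# CBOW predicts the current word from its surrounding context [13].

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context, center) pairs: the pooled context predicts the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        context = tuple(tokens[j]
                        for j in range(max(0, i - window),
                                       min(len(tokens), i + window + 1))
                        if j != i)
        pairs.append((context, center))
    return pairs

sentence = "we classify rhetorical sentences".split()
print(skipgram_pairs(sentence)[:3])
# → [('we', 'classify'), ('we', 'rhetorical'), ('classify', 'we')]
```

In a real Word2Vec implementation these pairs feed a shallow neural network whose hidden layer becomes the word-vector table; the sketch stops at pair generation.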
Their results showed that the Word2Vec model yielded a significant performance increase, because the Word2Vec attributes had the greatest information gain compared with the other features.
Finally, flat classification is the common approach to text categorization, but there is an alternative called hierarchical classification. Flat classification considers only the leaf nodes (child categories) and ignores parent categories [9][10], whereas hierarchical classification first assigns a parent category and then a child category under that parent, selecting more suitable categories from the hierarchy defined for the document [15]. Tegegnie et al. [16] classified a heterogeneous collection of Amharic news text using a hierarchical structure and showed that the hierarchical classifier outperforms the flat one: accuracy with a level-three classifier and the top 15 features reached 89.06%, a significant improvement of 29.42% over the flat classifier.

Methodology
In this research, we use the data set of scientific papers from Khodra et al. [3], which contains 75 annotated papers split into 10,880 sentences, and add 50 new papers from the ACL Anthology Reference Corpus (ACL-ARC), giving a total of 16,046 sentences. We use 100 papers (12,594 sentences) for training and 25 papers (3,452 sentences) for testing.
Every sentence in these papers has been annotated with one of 16 rhetorical categories from Teufel et al. [5]: "aim", "nov_adv", "co_gro", "othr", "prev_own", "own_mthd", "own_fail", "own_res", "own_conc", "codi", "gap_weak", "antisupp", "support", "use", "fut", and "textual". These categories are described in [8]. We adapt features from Teufel [4] and [3]: content, absolute location, explicit structure, sentence length, syntax, citation, and formulaic & agentivity (meta-discourse); see Table 2 for details. Our method consists of four main processes. The first is pre-processing, followed by constructing vector representations of the data set's vocabulary using TF-IDF and Word2Vec. The second is producing all the features. The third is building the classification model, employing the hierarchical classification technique. The last is testing, in which we measure the F-measure using two classifiers: Naïve Bayes [17] and linear SVM via the LibSVM library [18]. The overall process is shown in Figure 2.
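To make the classification step concrete, the following is a minimal multinomial Naïve Bayes sketch over bag-of-words counts. It is a toy stand-in, not the Weka Naïve Bayes [17] or LibSVM [18] used in the experiments, and the training sentences and labels below are purely illustrative:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns counts needed for prediction."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab, len(docs)

def predict_nb(model, tokens):
    """Pick the label maximizing log P(label) + sum log P(word | label)."""
    label_counts, word_counts, vocab, n = model
    best, best_lp = None, float("-inf")
    for label, count in label_counts.items():
        total = sum(word_counts[label].values())
        lp = math.log(count / n)
        for w in tokens:  # Laplace smoothing over the shared vocabulary
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Illustrative mini data set (not from the paper's corpus):
train = [("we propose a method".split(), "aim"),
         ("prior work studied parsing".split(), "othr"),
         ("our goal is to classify".split(), "aim")]
model = train_nb(train)
print(predict_nb(model, "we propose to classify".split()))  # → aim
```

The real pipeline feeds this step with the full feature vectors (Table 2) rather than raw tokens, but the decision rule is the same.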
The hierarchical classification technique classifies child categories after a first classification at the parent level, so there are two steps of flat classification: one for parent categories and one for child categories. In this research, we apply this technique to the categories "contrast", "own", "basis", and "other", because only these are divided into child categories [4][5]. The hierarchical rhetorical category schema is shown in Figure 3. As Table 1 shows, "nov_adv" and "fut" are not derived from any category, but in this paper we assume these two are elaborated from "own", since their definitions are most closely related to the definition of "own".
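The two-step scheme can be sketched as follows. The parent-to-child mapping is our reading of Table 1 and Figure 3 (which are not reproduced in this text), with "nov_adv" and "fut" placed under "own" as assumed above; the classifier functions are placeholders for the learned models:

```python
# Parent -> child categories, following our reading of Table 1 / Figure 3;
# "nov_adv" and "fut" are placed under "own" as the paper assumes, and the
# "basis"/"other" splits are assumptions, not quoted from the paper.
CHILDREN = {
    "contrast": ["antisupp", "gap_weak", "codi"],
    "own": ["prev_own", "own_mthd", "own_fail", "own_res",
            "own_conc", "nov_adv", "fut"],
    "basis": ["use", "support"],
    "other": ["co_gro", "othr"],
}

def classify_hierarchical(sentence, parent_clf, child_clfs):
    """Step 1: assign a parent category. Step 2: if that parent has
    children, delegate to its dedicated child-level model."""
    parent = parent_clf(sentence)
    if parent in CHILDREN:
        return child_clfs[parent](sentence)
    return parent  # e.g. "aim" or "textual" have no children

# Illustrative stand-ins for the learned models:
def parent_clf(tokens):
    return "own" if "we" in tokens else "aim"

child_clfs = {"own": lambda tokens: "own_mthd"}

print(classify_hierarchical("we apply SVM".split(), parent_clf, child_clfs))
# → own_mthd
```

The key property this structure buys, as discussed in the results, is that each entry of `child_clfs` can be trained with its own feature set and its own resampling strategy.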
We apply four pre-processing steps: case folding, tokenization, stemming, and stop-word elimination, implemented with the Apache Lucene library and Weka [19]. These steps are applied in all experiments. We then employ the Word2Vec implementation from the Medallia library, building the word-vector learning model from 20,157 scientific papers taken from the ACL Anthology Reference Corpus (ACL-ARC). After that, we use our annotated, pre-processed sentences to produce word vectors as word features. Word weights are calculated with TF-IDF and with the Word2Vec (skip-gram) model, and we compare the two. For Word2Vec, we set the layer size (vector length) to 300. Finally, all classification models are produced in the last process.
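The TF-IDF weighting used to score the bag of words follows the standard formula; the sketch below uses raw term frequency normalized by document length and a natural-log IDF, whereas the exact variant in Weka may differ (log base, smoothing):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> per-document {word: tf-idf weight}."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({w: (c / len(d)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights

# Illustrative two-document corpus:
docs = [["we", "propose", "a", "method"],
        ["we", "evaluate", "a", "classifier"]]
w = tfidf(docs)
print(round(w[0]["propose"], 3))  # → 0.173  (appears in 1 of 2 documents)
print(w[0]["we"])                 # → 0.0    (appears in every document)
```

Words occurring in every document get weight zero, which is why a corpus-wide bag-of-words TF-IDF can add little beyond the more document-specific content features (Cont-1 and Cont-3) discussed in the results.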
Specifically for hierarchical classification, we employ SMOTE [20] and SpreadSubsample from Weka to deal with the imbalanced data set. This paper differs from our previous work [11] in the following ways: it employs Teufel's meta-discourse feature; it does not use the previous label; it uses Naïve Bayes and SVM; it applies the hierarchical classification technique; and it adds data for the 16 rhetorical categories and for Word2Vec. We conduct four experiments, covering the implementation of TF-IDF, skip-gram Word2Vec, and hierarchical classification:
• Baseline. The features used are content, absolute location, explicit structure, sentence length, syntax, formulaic & agentivity (meta-discourse), and citation, adapted from [4] and [3];
• Scenario 1. All baseline features are used, and TF-IDF scores the bag of words from the data set;
• Scenario 2. All baseline features are used, with Word2Vec (skip-gram) scoring semantically similar words. This scenario compares the results of Word2Vec against TF-IDF;
• Scenario 3. All baseline features plus the Word2Vec representation are used, and we apply hierarchical classification from the seven rhetorical categories down to the 16 categories. In the second step of the hierarchical classification, we also employ SpreadSubsample and SMOTE.
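The core idea of SMOTE [20] is to oversample a minority class by interpolating between a minority example and one of its nearest minority-class neighbours. The toy sketch below uses the single nearest neighbour, whereas the real algorithm samples among k neighbours (k = 5 by default); the points are illustrative:

```python
import random

def smote_sample(minority, rng=random):
    """Create one synthetic minority example: pick a minority point, find
    its nearest minority neighbour, and interpolate between them."""
    x = rng.choice(minority)
    nn = min((p for p in minority if p is not x),
             key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(x, nn))

# Illustrative minority-class points (2-D feature vectors):
minority = [(1.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
print(smote_sample(minority))  # lies on a segment between two minority points
```

SpreadSubsample, by contrast, undersamples the majority classes to cap the class-size ratio; both are used here only in the second (child-category) classification step.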

Result and Analysis
First, we built the word-vector learning model from the 20,157 scientific papers taken from the ACL Anthology Reference Corpus (ACL-ARC). We used 100 scientific papers for training and 25 for testing, extracted all features according to the planned scenarios, and evaluated with two classifiers: Naïve Bayes and linear SVM. Table 3 shows the results of the first three experiments. After adding the bag-of-words TF-IDF feature in Scenario 1, the baseline using only Teufel's features still performs better. A likely reason is that co-occurrence of significant words via TF-IDF is already applied in the content features (Cont-1 and Cont-3), which use TF-IDF more specifically: the significant words produced for each document differ, whereas the bag-of-words TF-IDF is general across all documents and therefore has no significant impact on performance.
In the baseline, Naïve Bayes produced the highest F-measure, 21.40%, but in Scenarios 1 and 2 its performance keeps decreasing when the bag of words is used. This suggests that Naïve Bayes is not suited to the high-dimensional features introduced by the bag of words, whether weighted with TF-IDF or Word2Vec. The result differs with linear SVM: from the baseline through Scenario 2, the F-measure keeps increasing. This improvement arises because both TF-IDF and Word2Vec can capture important words that appear frequently in sentences but are absent from Teufel's meta-discourse (formulaic & agentivity) dictionary [4], and only Word2Vec can additionally capture words similar to those important words. Co-occurrence of these similar important words can affect which category a sentence is assigned. Examples include align, annotate, argument, aspect, concept, consist, context, data, depend, differ, direct, distribute, document, domain, embed, express, extract, feature, function, generate, grammar, include, inform, knowledge, label, language, learn, and many others.
As shown in Table 3, some categories do not improve at all with either TF-IDF or Word2Vec: "own_fail", "antisupp", and "codi". These categories may have too few sentences, with no related important words appearing often in them. The result differs for "textual", which improves significantly from the baseline through Scenario 2: there are important words that almost always appear in sentences of this category, and Scenario 2 captures them while the baseline cannot. Some other categories, such as "own_mthd", "aim", "fut", and "co_gro", show fairly stable performance from the baseline onward; sentences in these categories have recurring patterns in the baseline feature scores, which are already sufficient to identify them without TF-IDF or Word2Vec.

In Scenario 3, we employ only linear SVM, because the previous experiments showed it to be the best classifier so far for high-dimensional features. We then perform hierarchical classification for the categories "contrast", "own", "basis", and "other", because only these are divided into child categories [4][5], building a new learning model for the child categories of each: four models in total. The advantage of hierarchical classification is that each child-category learning model can be built with the features we select, so every model can differ. For the child categories of "own", "basis", and "other", we employ all features from Teufel [4] and [3], the Word2Vec representation, and SMOTE; SMOTE is used to test whether it can improve the categories whose F-measure was 0.00% in Scenario 2 with linear SVM. The learning model for the child categories of "contrast" employs all features from Teufel, the Word2Vec representation, and SpreadSubsample.
The result of Scenario 3 is shown in Table 4. No F-measure of 0.00% remains for "own_fail", "antisupp", or "codi": "own_fail" reaches 8.30% and "antisupp" 9.50%, and "codi" also rises above zero. As Figure 4 shows, however, the performance of some categories, such as "own_conc" and "gap_weak", decreases considerably, by about 10% to 11%. This can be attributed to employing SMOTE in the parent category "own" and SpreadSubsample in the parent category "contrast": within "own" we apply SMOTE only to "own_fail", "nov_adv", and "own_res", which have few sentences, and within "contrast" we apply SpreadSubsample only to "antisupp" and "codi". SMOTE and SpreadSubsample can thus decrease the performance of the categories to which they are not applied. Nevertheless, the comparison in Figure 4 between Scenario 2 (flat classification) and Scenario 3 (hierarchical classification) shows that the hierarchical classifier is better, although the difference is not significant: an increase of almost 2 percentage points, from 27.82% to 29.61%. The hierarchical classifier can do better because we can build a different learning model for each group of child categories using the features we select; for example, we employ SpreadSubsample only for the parent category "contrast", which is divided into "antisupp", "gap_weak", and "codi". In flat classification this is impossible, because only one learning model is built for all categories. For this reason, the hierarchical classification model is preferred.

Conclusion and Further Work
The implementation of Word2Vec improves the performance of rhetorical sentence classification for some categories over the baseline scenario, likely because the Word2Vec model captures related important words that are absent from the meta-discourse dictionary of [4]. The result differs for TF-IDF, which does not significantly affect performance, because co-occurrence of significant words via TF-IDF is already applied in the content features (Cont-1 and Cont-3); these apply TF-IDF more specifically, producing different significant words for each document. Word2Vec is also better than TF-IDF because TF-IDF captures only the co-occurrence of important words, not related words.
Furthermore, our work shows that hierarchical rhetorical sentence categorization improves performance over flat categorization, although the improvement is not significant. The hierarchical classifier is better because we can build a different learning model for each group of child categories, selecting the features suited to that group; in flat classification this is impossible, because only one learning model is built for all categories. In addition, we do not use the previous label in this paper, because in the second step of hierarchical classification we must build models for groups of child categories whose labels are no longer structured like the 16 rhetorical categories, which makes it difficult to predict the previous label and to define a "start" label as in [11]. For future work, we plan to examine every child category of the 16 rhetorical categories from Teufel [5], to determine whether child categories from the same parent category are highly similar to one another.