Identification of argumentative sentences in Russian scientific and popular science texts

In this study we analyze the applicability of specific machine learning algorithms to the task of detecting sentences containing argumentation in Russian texts. We employ a collection of scientific and popular science texts with manually annotated argumentation to evaluate the quality of identifying argumentative sentences in terms of precision, recall, and F-measure. The experiment involves three algorithms: MNB, SVM, and MLP. The bag-of-words model is used for representing texts. Lemmas of words in the analyzed sentences serve as features for the classification. We perform automatic selection of informative features in accordance with the Variance and χ2 criteria combined with weight-based filtration of lemmas (via TF*IDF and EMI). The training set includes around 800 sentences, while the test set contains 180. The MNB algorithm demonstrates the highest F-measure and recall scores on almost all feature sets (the maximal values reached are 68.7% and 89%, respectively), while the MLP algorithm shows the best precision for about half of the feature selection variations (the maximal value is 72.5%).


Introduction
Research on automatic extraction of arguments from Russian texts is not particularly developed (among the few works are [1,2,3]), which is caused by the prior absence of, first, sets of Russian-language texts with annotated argumentation and, second, convenient tools for performing this type of annotation. One way to create argumentation corpora in Russian is by translating texts annotated in other languages, as done in [3] on the argumentation corpus of micro-texts in English and German (ArgMicro); this approach does not require the use of a special annotation tool. The set of web tools for argumentative annotation of texts, developed in the IIS SB RAS [4], provides a means to create thematic corpora, to visualize argumentative statements and argumentation schemes, and to represent the argumentation structure of a text in the form of a graph with two node types: information nodes (statements) and scheme nodes (reasoning models). These structures can then be exported in the .json format for automatic processing.
The high labor-intensity of argumentative annotation (which is focused on deep textual structures) means that annotated texts are often insufficient for the efficient training of recognition systems even when annotated corpora exist (for example, for the English language). The presented work focuses on a small set of scientific and popular science texts, annotated primarily for analyzing specific reasoning models. In terms of genre characteristics, the chosen texts are distinguished by their considerable length and diversity of linguistic expression (which differentiates them from the news articles, online comments, essays, and micro-texts frequently used in the construction of argumentation corpora for the English language).
Many applications implementing argumentation analysis rely on the automatic recognition of argumentation structures for the following tasks [5]: 1) processing of product reviews (for analyzing justifications of the expressed sentiments); 2) understanding and directing debates; 3) making decisions in decision support systems; 4) assessing the quality of texts (for example, student essays, scientific articles), etc. In more than ten years of research on English corpora with argumentative annotation [6], the applicability of distinct machine learning algorithms has been evaluated in respect to specific subtasks of argumentation analysis, with the resulting scores becoming available for comparison.
The aim of the present work is 1) to analyze the applicability of certain machine learning methods (multinomial naive Bayes (MNB), support vector machine (SVM), and multilayer perceptron (MLP)) to Russian scientific and popular science texts for identifying sentences containing argumentation; 2) to conduct an experiment on a set of annotated texts and evaluate the quality of identifying argumentative sentences in terms of precision, recall, and F-measure.

Related works
Extraction of argumentation structures, both manual and automatic, is frequently performed in the following steps [7]: a) segmentation of the analyzed text into statements (clauses); b) classification of statements into argumentative and nonargumentative; c) identification of statement roles in the argumentation structure (the main thesis, claims, premises); d) identification of connections between argumentative statements in accordance with reasoning schemes and/or the topic structure of the text.
Step b) might be skipped for texts with the prevalence of argumentative statements (for example, texts of a specific genre or theme, such as discussions or legal texts). Popular science texts contain a significant portion of nonargumentative segments, which requires a preliminary extraction of the argumentative part for improving efficiency in solving subtasks c) and d).
At times, in classifying statements as argumentative or nonargumentative, researchers do not segment the text into clauses at the first step, but instead perform the analysis at the level of individual sentences. Argumentative sentences appear as search objects in works [8,9,11]. Frequently, the extraction focuses only on arguments supporting (or attacking) the theses directly [11,13,15]. Some works aim to identify argumentative sentences or clauses serving as claims or premises, for instance, [6,12,13].
Prognostic models are examined in works [16,17], which evaluate respectively the probability of a conference acceptance for a given paper and the quality of product reviews.
As a rule, the employed feature sets are based on several types of textual properties: structural, lexical, syntactic, and contextual, along with the usage of markers. The role of features might be taken by n-grams (n ≥ 1), word pairs, part-of-speech tags, grammatical tags (tense or person of a verb, etc.), the number of punctuation marks [8,9,15], positions in the sentence (inside the main or the subordinate clause), and discourse markers [10,11,12]. The noted features may be extended in accordance with the types of search objects (for example, types of arguments, types of supporting arguments) [11,15], the rhetorical [10] or topic [7] structure of the text, etc. The most efficient combinations of features and machine learning methods are often chosen empirically.
Features are frequently assigned binary values. At times they can be weighted, for instance, with the TF*IDF statistic [11,17]. In the present work, we employ a preliminary filtration of features based on TF*IDF and EMI weights, the Variance and χ2 criteria, and their various combinations.
Application of machine learning methods (MNB, maximum entropy, SVM) to legal texts for identifying argumentative sentences [8,9] demonstrates that the best results are reached by the MNB and maximum entropy algorithms trained on n-grams (n = 1, 2, 3), word pairs, distinct parts of speech, keywords, punctuation, and text statistics. Precision as the measure of quality depends on the training set and reaches 73% and 80% for legal texts. Notably, legal texts are written in highly formalized language, which is not the case for popular science texts.
In [3], a Russian argumentation corpus of micro-texts is used for the task of automatic identification of pro and contra statements (the highest F-measure reached is 69.9%).
In the present work, we focus on analyzing the applicability of three methods: MNB, SVM, and MLP. The first two are frequently mentioned in related works on argumentation mining.

The task of identifying argumentative sentences
A sentence is considered argumentative if it contains statements from the reasoning structure of the text (premises or claims) supporting or attacking its main theses. The identification of argumentative sentences in texts presupposes the classification of the set of sentences S = {si} from texts of the collection into two groups: C = {Ca, Cn}, where Ca labels the class of sentences containing a premise or a claim, while Cn marks the class of sentences without argumentation (Ca ∩ Cn = ∅). The training set is employed to train a classifier F′: S → C that should be as close as possible to the true function F defined by the annotation. Learning and testing are performed on the set of expertly annotated examples (with the values of the function F known): S = SL ∪ ST, where SL is the training set and ST is the test set. This enables the automatic evaluation of quality measures (precision P, recall R, and F-measure) on ST.
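The evaluation on the test set can be sketched in a few lines of Python (illustrative code, not the authors' implementation; the toy gold and predicted label lists are invented for the example), scoring the argumentative class as the positive one:

```python
# Precision, recall, and F-measure for the argumentative class (label 1),
# given gold and predicted labels over the test set.
def precision_recall_f1(gold, predicted):
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels: 1 = argumentative, 0 = nonargumentative.
gold = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f1(gold, predicted)
# Here tp = 2, fp = 1, fn = 1, so p = r = f = 2/3.
```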

Methods of the feature selection and classification
In our case, the argumentative classification presupposes performing the following stages:
- creating the collection of texts of the chosen genre and/or theme;
- performing the argumentative annotation of the texts;
- constructing the training and test sets of texts;
- preprocessing the texts, selecting and/or filtering features, and constructing text models;
- training the classifier with the chosen method;
- applying the classifier to the test set and evaluating the quality of the classification.
The popular science collection is created from the texts of online publications in the corresponding genre, while the scientific collection is based on articles on linguistics and computer technologies from the Ru-RSTreebank corpus [19] (properties of their rhetorical structures have not been considered during the argumentative annotation). For the correctness of the training procedure, we try to approximately equalize the amounts of argumentative and nonargumentative sentences in the training set while preserving the integrity of the source texts, which to a certain extent complicates the creation of the training set.
The preprocessing of texts includes lemmatization of words with the PyMorphy2 program and removal of words without Cyrillic symbols. The bag-of-words model is then used to represent texts as vectors of normalized unigrams.
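A minimal sketch of this preprocessing step (assumption: a tiny hand-made lemma dictionary stands in for PyMorphy2 so the example stays self-contained; the real pipeline would obtain normal forms from pymorphy2's MorphAnalyzer):

```python
import re

# Words made of Cyrillic letters only (hyphens allowed inside compounds).
CYRILLIC_WORD = re.compile(r'^[а-яё-]+$', re.IGNORECASE)

# Hypothetical stub for PyMorphy2; real code would call
# pymorphy2.MorphAnalyzer().parse(token)[0].normal_form instead.
LEMMA_STUB = {'аргументы': 'аргумент', 'содержатся': 'содержаться',
              'тексте': 'текст'}

def preprocess(sentence):
    tokens = re.findall(r'[^\s.,:;!?()"]+', sentence.lower())
    tokens = [t for t in tokens if CYRILLIC_WORD.match(t)]  # drop non-Cyrillic
    return [LEMMA_STUB.get(t, t) for t in tokens]           # normalize to lemmas

lemmas = preprocess('Аргументы содержатся в тексте (Fig. 2).')
# ['аргумент', 'содержаться', 'в', 'текст']
```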
As the linguistic selection of features is time-intensive, in the present work we employ an automatic filtration of features. It is based on weighing vector components via the TF*IDF and expected mutual information (EMI) measures, along with selecting unigrams by means of the Variance and χ2 criteria [20]. The TF*IDF measure reaches high values if the term frequently appears in a small set of documents, and low values if the term is encountered in many documents, which enables the filtration of function words and common words with generic meanings (while markers or their constituents might appear among these words, they are not likely to be as efficient in detecting argumentation). The EMI measure allows the filtration of lemmas that appear frequently in different classes, as its values correspond to the quantity of information provided by the presence or absence of the feature in the sentences of different classes; the maximal value is reached when the feature appears only in sentences of the same class.
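Our reading of the EMI measure can be sketched as follows (illustrative code with invented toy feature columns): the score sums, over feature values and classes, the pointwise mutual information weighted by the joint probability, and is maximal when a lemma occurs only in sentences of one class:

```python
import math

def emi(feature_values, labels):
    """feature_values[i] in {0, 1}: lemma present in sentence i;
    labels[i] in {0, 1}: class of sentence i (1 = argumentative)."""
    n = len(labels)
    score = 0.0
    for u in (0, 1):            # feature absent / present
        for c in (0, 1):        # nonargumentative / argumentative
            p_uc = sum(1 for fv, y in zip(feature_values, labels)
                       if fv == u and y == c) / n
            p_u = sum(1 for fv in feature_values if fv == u) / n
            p_c = sum(1 for y in labels if y == c) / n
            if p_uc > 0:
                score += p_uc * math.log2(p_uc / (p_u * p_c))
    return score

labels = [1, 1, 0, 0]
emi_discriminative = emi([1, 1, 0, 0], labels)  # lemma only in class 1 -> 1.0
emi_uninformative = emi([1, 0, 1, 0], labels)   # lemma spread evenly -> 0.0
```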
The Variance criterion (Var) enables the removal of features taking similar values (with the similarity defined by the chosen threshold) in all training sentences. The value of the χ2 criterion signifies the difference between the observed co-occurrence frequencies of the feature and the class and those expected under the hypothesis of their independence.
We perform the automatic filtration of features through the use of functions from the Scikit-learn library [21], with the threshold values chosen empirically in accordance with recognition results.
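A minimal sketch of such filtration with Scikit-learn (the binary matrix, labels, and percentile value below are toy stand-ins, not the paper's data or thresholds):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectPercentile, chi2

# Rows: sentences; columns: binary lemma features; label 1 = argumentative.
X = np.array([[1, 0, 1, 1, 1],
              [1, 1, 0, 1, 1],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 1]])
y = np.array([1, 1, 0, 0])

# Variance filter: drops feature 0, which is constant over all sentences.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Chi-squared filter: keeps the given percent of the features most strongly
# dependent on the class (the paper keeps the best 25% or 10%).
X_chi = SelectPercentile(chi2, percentile=50).fit_transform(X_var, y)
```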
For the classification of sentences, we have chosen the MNB and SVM methods [21], frequently mentioned in related articles solving analogous tasks, along with the MLP algorithm [21]. Deep tuning of algorithm parameters has not been performed, as the tuned values would be applicable only to the current dataset and would need to be reconfigured upon its planned expansion.
The MNB algorithm. To evaluate the probability of the sentence si belonging to the argumentative or the nonargumentative class, the MNB algorithm applies the Bayes theorem to the training set. Each sentence is represented as a multiset of lemmas (with binary values assigned to the chosen features), and its class is determined as the one with the highest posterior probability. Features (lemmas) are assumed to appear in the text independently of each other. When calculating probabilities, Laplace smoothing is used to compensate for the incompleteness of the training set.
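With Scikit-learn, this step can be sketched as below (toy binary feature vectors invented for the example; alpha=1.0 corresponds to the Laplace smoothing mentioned above):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Binary lemma features; label 1 = argumentative.
X_train = np.array([[1, 1, 0],
                    [1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 1]])
y_train = np.array([1, 1, 0, 0])

clf = MultinomialNB(alpha=1.0).fit(X_train, y_train)  # alpha: Laplace smoothing
pred = clf.predict(np.array([[1, 0, 0], [0, 0, 1]]))
# Lemma 0 occurs only in argumentative training sentences and lemma 2 only in
# nonargumentative ones, so pred = [1, 0].
```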
The SVM algorithm. In the document vector space, the algorithm locates the separating surface between the two classes that is the most distant from all points of the training set. The radial basis function RBF: exp(−γ‖x − x′‖²) is chosen as the kernel (in the experiment, the quadratic polynomial function achieves lower recognition quality). Two parameters influence SVM training with the RBF kernel: C (regulates the balance between classification errors and the simplicity of the decision surface) and γ (determines the significance of a single training example).
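A corresponding Scikit-learn sketch (toy two-dimensional points; C and gamma are illustrative default-like values, not the tuned ones):

```python
import numpy as np
from sklearn.svm import SVC

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

# RBF kernel exp(-gamma * ||x - x'||^2); C trades classification errors
# against the simplicity of the decision surface.
clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_train, y_train)
pred = clf.predict(np.array([[0.05, 0.1], [0.95, 1.0]]))
# Points near each training cluster receive that cluster's label: [0, 1].
```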
The MLP algorithm. The algorithm trains a neural network with the use of backpropagation. For the sentence si, the components of its vector (x1, x2, ..., xm) form the input layer. As the activation function we use f(x) = max(0, x). By default, the perceptron contains a single hidden layer of 100 neurons and optimizes neuron weights by a quasi-Newton method minimizing the logarithmic loss function. The classifier repeats iterations until reaching either convergence of the specified degree or the maximum number of iterations.
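The described configuration maps onto Scikit-learn's MLPClassifier roughly as follows (a sketch on invented toy data; solver='lbfgs' is the quasi-Newton option, and the hidden layer size matches the default mentioned above):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

clf = MLPClassifier(hidden_layer_sizes=(100,),  # one hidden layer, 100 neurons
                    activation='relu',          # f(x) = max(0, x)
                    solver='lbfgs',             # quasi-Newton optimizer
                    max_iter=500,
                    random_state=0).fit(X_train, y_train)
pred = clf.predict(X_train)
```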

The experiment on argumentative classification
The annotated set used in the experiment includes 96 texts: 10 scientific and 86 popular science. Each text has been annotated by one expert. A large part of the popular science texts has been annotated for the analysis of specific argumentation schemes. Annotations created for this purpose do not always reflect the argumentation structure of a text in a complete or coherent (with respect to the resulting graph) manner; at times the main thesis remains unmarked. Consequently, the experiment focuses on 22 annotated texts in which the noted tendencies are either absent or minimized via the contraction of the text to its annotated section (preserving the integrity of paragraphs).
We conduct the experiment to evaluate the identification of sentences that contain arguments with the MNB, SVM, and MLP methods. Precision, recall, and F-measure scores are calculated for assessing the performance of the algorithms, and the most prominent classification errors are discussed below. The experiment also examines possible ways of optimizing the feature space.

Experimental data
The training (SL) and test (ST) sets (based on 16 and 6 texts respectively) are formed with almost equal amounts of argumentative (class Ca) and nonargumentative sentences. Quantitative characteristics of the SL and ST sets are provided in table 1. During the preprocessing of the sets, all words formed by sequences of non-Cyrillic symbols are removed from the texts, while the others are normalized into their lemmas.

Evaluating the argument identification
At the training stage, we reduce the dimensionality of the feature space via weighting lemmas (with the TF*IDF and EMI metrics) and employing the Variance and χ2 criteria, along with their combinations. Classifiers are trained and tested on features (lemmas) selected in accordance with the chosen weights and criteria. After the filtration of features, they are assigned binary values on individual samples (sentences), with each value determined by the presence or absence of the lemma in the sentence.
Filtration condition | Thresholds | Number of features
χ2 (best) | 25% best | 825
χ2 (best) & TF*IDF > p1 | 25% best; p1 = 0.5 | 791
χ2 (best) & EMI > p2 | 25% best; p2 = 0.001 | 200
χ2 (best) & TF*IDF > p1 & EMI > p2 | 25% best; p1 = 0.5; p2 = 0.001 | 179
χ2 (best) | 10% best | 330
χ2 (best) & TF*IDF > p1 | 10% best; p1 = 0.5 | 317
χ2 (best) & EMI > p2 | 10% best; p2 = 0.001 | 80
χ2 (best) & TF*IDF > p1 & EMI > p2 | 10% best; p1 = 0.5; p2 = 0.001 | 71
Thresholds have been assigned empirically chosen values; in their neighborhood each classifier reaches about the same quality scores, which decrease on moving away from the chosen value. As each vector is filled with 99% zeroes on average (due to the small lengths of the sentences serving as samples), the TF*IDF value of the analyzed feature is compared against the threshold in every sentence, and the feature is recognized as informative if it exceeds the threshold at least once. Similarly, the sparseness of vectors influences the choice of a relatively low value for the EMI threshold: for each given feature, its EMI value is calculated over the whole vector in accordance with its frequencies in argumentative and nonargumentative sentences. The Variance threshold filters lemmas that have the same value in more than 99% of the samples (while for a given sentence, the average relative frequency of its lemmas across the training set equals 0.5%). The χ2 threshold is configured to select a fixed percent of the features most strongly dependent on the argumentation class. Threshold values remain the same when several filtration conditions are combined.
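The per-sentence TF*IDF condition can be sketched as follows (our reading of the procedure; the weight matrix and the transliterated lemmas are invented for the example):

```python
# A lemma is kept as informative if its TF*IDF weight exceeds the threshold
# in at least one training sentence.
def informative_lemmas(tfidf_matrix, lemmas, threshold):
    """tfidf_matrix[i][j]: TF*IDF weight of lemma j in sentence i."""
    return [lemma for j, lemma in enumerate(lemmas)
            if any(row[j] > threshold for row in tfidf_matrix)]

tfidf = [[0.6, 0.1, 0.0],
         [0.2, 0.1, 0.0],
         [0.7, 0.3, 0.1]]
kept = informative_lemmas(tfidf, ['poetomu', 'i', 'v'], threshold=0.5)
# ['poetomu']: only its weight exceeds 0.5 (in sentences 0 and 2)
```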
Practically all of the reasoning models that appear in the annotated texts are represented in both the training and test sets. Out of twenty-two distinct models, just two are absent from the training set, and five appear only in the training set (one such model, VerbalClassification, is used mostly in scientific texts, while the other six models are generally rare).

Recognition
Quality scores are calculated for each algorithm (MNB, SVM, MLP) across all feature sets (Vi). The results of the experiment are presented in table 3 (recognition scores are indicated as percent values). The bold font marks the two highest scores for each algorithm, separately for precision, recall, and F-measure. The third row of the table shows the results of identifying argumentative sentences without filtering features. The following five rows demonstrate scores reached by the separate application of each filter, while the rest show the results of combined filtration. The first column indicates the sizes of feature sets after filtration (filtration conditions are shown in table 2). For all algorithms on aggregate, using one filter separately improves precision in half of the cases, and recall and F-measure in a third. In terms of precision, MNB reaches the highest values with separate filtration through either TF*IDF or χ2 (10% best); in terms of recall and F-measure, with combined filters. Reducing the number of features yields, as a rule, a drop in recall; the exceptions are SVM with TF*IDF filtration and MNB with χ2 (10% best). The F-measure of MNB increases only with χ2 (10% best); SVM scores improve with TF*IDF filtration; MLP reaches higher scores when filtering through either χ2 (10% best) or EMI.
To summarize, the filtration of features improves the overall quality, yet it also requires configuring the filtration conditions for each algorithm and each score type.
It is important to note that, as a rule, each classifier reliably identifies either argumentative (MNB) or nonargumentative (SVM and MLP) sentences. Comparable values in identifying the sentences of both classes are rarely achieved. The highest precision score (72.50%) belongs to the MLP algorithm on 168 features, where it also reaches its best recall and F-measure. The highest recall (89.01%) and F-measure (68.70%) are reached by the MNB method on the sets of 71 and 179 features respectively.
While we have yet to address the efficiency of using linguistically motivated features, preliminary results demonstrate that, for example, employing lemmas of specified parts of speech (adjectives, verbs, and adverbs) improves SVM scores more prominently than those of MNB and MLP.
The analysis of classification errors demonstrates that the largest share of incorrectly classified sentences consists of sentences either without markers or with ambiguous markers (znachit; poskol'ku; govorit' o tom, chto; ...) which appear frequently in both argumentative and nonargumentative training samples. At the same time, argumentative sentences corresponding to reasoning models absent from the training set have been identified correctly in the test set, as they contain rhetorical and argumentative markers (mozhet byt'; tem samym; otsyuda mozhno naiti; okazyvaetsya, chto; dlya togo, chtoby; napisano, chto; ...).
The results of the conducted experiment are not directly comparable with the values presented in related works due to language and genre differences in the data, as well as the usage of different quality scores. At the same time, in terms of precision, our best result (72.5% with MLP on 168 features) approaches the value reached in [8] (73%).

Conclusion
In this work we analyze the applicability of three machine learning algorithms (MNB, SVM, MLP) for identifying argumentative sentences in scientific and popular science texts. We show that these methods can be employed even under a shortage of annotated training data and linguistic expertise.
The performance of the algorithms is evaluated across different feature sets constructed via the filtration of lemmas in accordance with the TF*IDF and EMI measures and the Variance and χ2 criteria. We demonstrate that, for all three algorithms on aggregate, the filtration of features improves precision in half of the cases, and recall and F-measure in a third, and we share observations on filtering features for each algorithm. The classification quality approaches the values reached in related works on larger collections of texts in the English language.
The inclusion of new annotated texts into the collection and the implementation of linguistic analysis of features should result in improved quality of identifying argumentative sentences. It will also enable a comparative study of argumentation in scientific and popular science texts.