Reliability assessment of the automatic plagiarism detection system for various editing patterns in documents containing complex mathematical notation

The paper presents the idea and implementation details of a novel tool for detecting plagiarism in the mathematical content of LaTeX files. The proposed algorithm compares LaTeX files and detects possible similarities between mathematical formulas, providing an objective similarity score. It offers two modes of comparison: a direct (symbolic) mode and a verbalized math mode, in which the mathematical expressions are first verbalized into spoken-language form and compared as such. The solution has been tested against various identified types of plagiarism specific to mathematical symbolic notation and LaTeX features, and proved to be an efficient tool for plagiarism detection.


The problem of plagiarism
Plagiarism is a key issue for science, technology and teaching. Although a reviewer's extensive knowledge of the given field may prevent unauthorised use of third-party work and results, the contemporary world produces an enormous amount of data across many publications and media.
It is therefore almost impossible to keep track of all these achievements in a reasonable time, even for a very experienced expert. As a result, the threat of plagiarism increases rapidly, especially in cases which require verifying many documents in a relatively short time, or when the topic is broad and multidisciplinary.
Automatic algorithms detecting repetitions and suspicious similarities between documents seem to be the most appropriate solution to the issues described above. Currently, such algorithms are well designed and tested for common literary texts. They help to prevent plagiarism in history and other social sciences, but for the majority of technical, mathematical and graphical documents the challenge remains.

Plagiarism in mathematics
Mathematical textbooks, research papers and theses often contain numerous equations, symbols and mathematical formulas, which usually constitute the real value of the mathematical document. This stands in contrast to documents from other disciplines, where the content is mainly given literally and may be checked for plagiarism using algorithms developed for plain-text documents.
The main problem in plagiarism detection for mathematical expressions is the fact that mathematical formulas and symbols usually require a specialized presentation form, or even a separate language, which determines the rules and syntax for mathematical content and may be realized in various ways. This issue becomes particularly important when designing an anti-plagiarism system for mathematics. One of the most popular editing systems for mathematical content is LaTeX, where all mathematical formulas and equations are encapsulated in dedicated, recognizable environments and defined by special commands corresponding to mathematical symbols and operands.
The design of a plagiarism detection tool for mathematics should also take into account the possible types of plagiarism of mathematical content. This issue is tightly related to the language used for mathematical elements. In our study we analyzed various types of plagiarism for documents written in LaTeX. As a result, we distinguished the following types:
• splitting a formula into a few shorter ones, or merging a few formulas into a single expression or equation environment,
• reordering expressions or symbols in a formula,
• spacing adjustment (white spaces),
• equation environment changes,
• transformation of formulas by application of mathematical identities,
• transcribing parts of formulas into a text-equivalent form,
• replacement of symbols by other symbols (e.g. y instead of x, α instead of a, etc.).
The above plagiarism schemes were identified by the referees of Master's and Bachelor's degree theses as the most popular types of plagiarism regarding mathematical content. At the same time, they are often not properly detected by general-purpose anti-plagiarism tools. Since theses and publications at mathematical departments of universities are usually written in LaTeX, there is a need for a plagiarism detection tool for .tex files which proves its effectiveness particularly when tested against the identified types of plagiarism in mathematics.

Existing algorithms of plagiarism detection
The papers [1,2,3,4] give a comprehensive survey of conventional methods of plagiarism detection. The availability of the reference corpus used by algorithms searching for plagiarized content is an essential complementary task, very often dependent on the national language and dialects [5,6]. The authors of [7,8,9] discuss the details of various text similarity measures in big-data analysis. The results of implementing dynamic programming (the Smith-Waterman algorithm) for plagiarism detection are described in [10].
Recently, there have been attempts to provide a contextual search interface for mathematical documents, which certainly provides essential technology for plagiarism detection [11,12]. A comprehensive overview of the methods for computing the similarity of mathematical content is given in [13]. A hybrid approach taking into account significant paraphrases and many parts of the documents, including mathematical equations compared via histograms of symbol frequencies, is described in [14]. The basic method of mathematical content retrieval and comparison is described in [15]. Details of mathematical-notation plagiarism variants are given in [16], with the conclusion that equations could be the core feature in recognizing plagiarized documents, regardless of the pure text content. Therefore the basic approach to plagiarism detection is to compare equations pairwise between the tested documents and evaluate the overall similarity based on the similarity of the particular pairs. This implies that the computational complexity of this kind of algorithm is of order O(n · m), where n and m are the numbers of equations in the tested documents, respectively. Most existing solutions use the above general model for plagiarism detection, but they differ in how the particular pairs are matched and compared. Hence the overall complexity of such a solution also depends on the choice of the equation similarity method. The fine-tuning of mathematical information retrieval using the analysis of particular identifiers is described in [17].
There is also ongoing research on similarity measures for other parts of documents, for example bar charts [18] or images [19], as well as the classification and categorization of environment scenes [20].
The main novelty of the anti-plagiarism solution for mathematical documents presented in this paper is the comparison of verbalized content instead of the pure symbolic notation of mathematical formulae. In what follows we show how verbalization improves the accuracy of plagiarism detection in various scenarios.

Plagiarism detection in LaTeX documents - tools and methods
LaTeX is a kind of page description language, providing text commands for every part of the document, including complex mathematical equations. It is widely used in the scientific community, and the majority of mathematics students' reports and theses are encoded in LaTeX. The essential part of these documents are the equations themselves, and plagiarism detection should be focused on them. In our experiments we consider documents whose LaTeX source code is available to the teacher or examination committee reviewing the particular student or scientific work.

Formula separation
Equations in LaTeX are surrounded by obligatory environment commands. Therefore the extraction of a proper data set for further analysis can be reduced to parsing the source code with respect to all allowed math environments, including the most popular \begin{math} and \begin{displaymath}, as well as dollar and double-dollar signs.
The length of an equation to be considered by the anti-plagiarism algorithm should exceed some minimal value. There are many one-letter identifiers used in mathematical and technical documents, which could be flagged as similar to other short expressions merely due to their limited length. The solution is to omit equations whose LaTeX description is shorter than a threshold parameter, which is set to 4 characters by default in our experiments.
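The extraction and length-filtering steps above can be sketched as follows. This is a minimal illustration (the function names and the simplified regular expressions are our own assumptions; a production parser would also have to handle nested environments, verbatim blocks and custom macros):

```python
import re

# Simplified patterns for the most popular math delimiters; order matters:
# environments first, then $$...$$, then $...$ (matched spans are blanked
# out so inner delimiters are not re-matched).
MATH_PATTERNS = [
    re.compile(r"\\begin\{(math|displaymath|equation\*?|eqnarray\*?|align\*?)\}"
               r"(.*?)\\end\{\1\}", re.S),
    re.compile(r"\$\$(.+?)\$\$", re.S),
    re.compile(r"\$(.+?)\$", re.S),
]

MIN_LENGTH = 4  # default minimal equation length used in our experiments

def extract_formulas(source: str, min_length: int = MIN_LENGTH):
    """Extract math fragments from LaTeX source, dropping very short ones."""
    formulas = []
    for pattern in MATH_PATTERNS:
        def grab(match):
            formulas.append(match.group(match.lastindex).strip())
            return " "  # blank out the match in the remaining source
        source = pattern.sub(grab, source)
    return [f for f in formulas if len(f) >= min_length]
```

For example, a source containing `$x$`, `$$x^2+y^2=z^2$$` and an `equation` environment yields only the two longer formulas, since the one-letter identifier falls below the default threshold.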

Comparison of formulae pairs
Having extracted the particular equations from the source code of the document, we need to compare them with one another. The comparison should avoid high computational cost because of the amount of data to be processed: a typical scientific manuscript may contain a few hundred equations, and a thesis considerably more. Given thousands of documents in the repository to be checked by the anti-plagiarism engine for duplicated or similar equations, the entire process may require several hundred thousand comparisons of equation pairs.
Taking into account the limitations specified above, we incorporated two main text comparison algorithms into our anti-plagiarism system.

The Levenshtein distance D_Le(α, β) is the minimum number of equally weighted single-character operations (removals, additions and replacements) required to change string α into string β. It can therefore be considered a measure of the difference between two text strings regardless of their lengths |α| and |β| [21]. The normalized Levenshtein distance D_LeN(α, β) given in Eq. 1 additionally places the resulting values in the range 0-100%:

D_LeN(α, β) = D_Le(α, β) / max(|α|, |β|) · 100%.   (1)

The second text comparison method is the (weighted) cosine similarity measure. First, two numerical vectors v = [v_1, ..., v_n] and w = [w_1, ..., w_n] are generated, whose coordinates are the counts of instances of particular keywords in the given strings. Then the cosine of the angle between v and w is calculated by means of the standard dot product of the two vectors, as presented in Eq. 2:

cos(v, w) = (v · w) / (‖v‖ ‖w‖) = Σ_{i=1..n} v_i w_i / (√(Σ_{i=1..n} v_i²) · √(Σ_{i=1..n} w_i²)).   (2)

Values close to 1 indicate high similarity of the vectors, while values close to 0 indicate large differences between v and w.
To increase the quality of the comparison, more important symbols can be emphasized by introducing weights. For example, since brackets are commonly used only for ordering the parts of an expression, they should have less impact on the comparison than a root or integral sign. In our experiments we used the highest weights for basic types of mathematical expressions (roots, fractions, integrals, sums, products and limits), medium weights for components, factors and powers, and the smallest weights for parentheses and brackets. The exact values of the weights were selected heuristically.
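The two measures can be sketched as follows. The Levenshtein routine is the classic dynamic-programming formulation; `levenshtein_similarity` reports 100% minus the normalized distance, and `weighted_cosine` applies per-keyword weights as described above. The function names, tokenization scheme and example weights are illustrative assumptions, not the paper's exact implementation:

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance with unit costs."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """100% for identical strings, 0% for maximally different ones."""
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b)))

def weighted_cosine(tokens_a, tokens_b, weights) -> float:
    """Weighted cosine similarity over keyword-count vectors.

    `weights` maps a keyword (e.g. a LaTeX command) to its heuristic weight;
    unlisted keywords get weight 1.0.
    """
    keys = set(tokens_a) | set(tokens_b)
    dot = norm_a = norm_b = 0.0
    for k in keys:
        w = weights.get(k, 1.0)
        va, vb = w * tokens_a.count(k), w * tokens_b.count(k)
        dot += va * vb
        norm_a += va * va
        norm_b += vb * vb
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / math.sqrt(norm_a * norm_b)
```

With a high weight on `\frac`, two fractions that differ only in a variable name still score close to 1, which matches the intent of emphasizing structural symbols over identifiers.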
Not all of the plagiarism schemes identified in Section 1.2 are LaTeX-specific; the last three, in particular, are of a more general type. For this reason, these kinds of mathematical plagiarism are more difficult to detect by a tool oriented toward LaTeX files. Therefore, in parallel with the comparison of pure LaTeX strings representing particular equations, we introduced a verbalization mode. In this mode we assume that the spoken versions of the equations can reveal similarities that are not clearly visible in the encoded representation.
The verbalization engine, based on the work initially described in [22] and enhanced in [23], is integrated into the similarity algorithm. The engine utilizes external definitions written in the Lua scripting language. Thanks to that, it was possible to provide both English and Polish verbalization modes without much effort. Certainly, this kind of comparison involves an external scripting language and a rather complicated analysis of the equation structure. Depending on implementation details, this can be a time-consuming task, requiring extra processing power or intensive caching, especially for large sets of source LaTeX documents.
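The actual engine relies on external Lua definitions [22,23]; the following is only a toy illustration of the idea of verbalization, with a handful of hypothetical English substitution rules (the rule set and function name are our own and cover only a few commands):

```python
import re

# Hypothetical English verbalization rules, applied in order. A real engine
# parses the equation structure rather than doing flat substitutions.
VERBAL_RULES = [
    (re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}"), r"\1 over \2"),
    (re.compile(r"\\sqrt\{([^{}]*)\}"), r"square root of \1"),
    (re.compile(r"\^\{?2\}?"), " squared"),
    (re.compile(r"\\alpha"), "alpha"),
    (re.compile(r"="), " equals "),
    (re.compile(r"\+"), " plus "),
]

def verbalize(formula: str) -> str:
    """Turn a LaTeX fragment into a spoken-language string."""
    for pattern, repl in VERBAL_RULES:
        formula = pattern.sub(repl, formula)
    return " ".join(formula.split())  # normalize whitespace
```

Note that after verbalization, `a^2 + b^2 = c^2` and `x^2 + y^2 = z^2` differ only in three one-letter words, which is exactly what makes symbol-replacement plagiarism easier to spot in verbalized mode.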

Similarity computation for the entire document
To compute the similarity measure for a given mathematical document, we calculate the similarity of every pair of equations taken from the candidate and from subsequent members of the repository. The similarity between documents is the ratio of the number of similar equations to the overall number of equations.
A pair of equations is considered similar if the chosen similarity measure gives a result above a limit, which can be adjusted through the user interface of the system. The value of this limit is set to 95% by default. The result of the analysis is a list of documents in the repository with the similarity values calculated with respect to the candidate. Due to the experimental nature of our system we do not filter these results, as different similarity measures and parameters can give slightly different absolute results, and observing their mutual relations can be valuable.
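The document-level score described above can be sketched as follows; the exact matching policy (here: a candidate equation counts as matched if any repository equation exceeds the threshold) is an assumption for illustration:

```python
def document_similarity(candidate_eqs, repo_eqs, similarity, threshold=95.0):
    """Percentage of candidate equations with a near-duplicate in a repo document.

    `similarity` is any pairwise measure returning a percentage, e.g. a
    normalized Levenshtein or a cosine score scaled to 0-100; the default
    threshold of 95% mirrors the system's default limit.
    """
    if not candidate_eqs:
        return 0.0
    matched = sum(
        1 for eq in candidate_eqs
        if any(similarity(eq, other) >= threshold for other in repo_eqs)
    )
    return 100.0 * matched / len(candidate_eqs)
```

This runs one pairwise comparison per candidate/repository equation pair in the worst case, which is the O(n · m) cost noted earlier for this class of algorithms.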
Additionally, we have introduced one more auxiliary measure, based on the comparison of the entire mathematics in the documents, concatenated into single strings. We discuss the results of this approach in Section 3.2.

Test application
The anti-plagiarism system was developed as a classic desktop application. It was compiled for the MS Windows operating system, and all tests were run under Windows 10 Professional. The test application provides a simple graphical interface for parameter setup and a results section (see Figure 1). The operator can choose one input file as the candidate or many files as the repository.
The following parameters can be adjusted:
(i) verbalization mode - additional verbalization of the LaTeX notation before comparison,
(ii) comparison function - the similarity measure to be used,
(iii) minimal length of an equation to be taken into consideration,
(iv) similarity level needed for an equation pair to be considered similar in the overall statistics,
(v) table switch - an option to show additionally formatted data at the end of the report,
(vi) log switch - an option to clear the results log before the actual comparison.

Experimental results
In order to test our application on the recognition of various types of LaTeX-specific plagiarism, a test data set containing a sample reference file and seven modified versions of it was generated. Each modified file corresponds to exactly one of the plagiarism schemes identified and listed in Section 1.2. Additionally, all test files were compared with other, unrelated LaTeX documents containing many mathematical expressions and formulas, to check the algorithm for coincidental plagiarism detection.

Sample files
The reference .tex file (1) contains a total of 21 mathematical formulae written in various environments and styles (e.g. in-line, separate one-line, separate two or more lines). Then the seven test files were prepared by modifying the reference file content, each according to exactly one of the plagiarism schemes. This way we obtained a set of eight test files (the enumeration below corresponds to the file numbering):

(ii) equation environment changes
There are several environment types available for exposing mathematical equations in a LaTeX file. This feature can be used to modify the content of a .tex file without actually modifying the resulting mathematical content.

(iii) spacing adjustment (white spaces)
There are a few ways of adjusting spacing in LaTeX documents. Apart from commands like \vspace and \hspace, one can force additional spacing in mathematical formulae by using \, \quad, \qquad and \\.

(iv) transformation of formulas by application of mathematical identities
This is a more general, non-LaTeX-specific type of plagiarism, where mathematical content is rephrased into an equivalent form by application of identities and laws. For this reason, this type is difficult to detect by automated algorithms.

(v) transcribing parts of formulas into text equivalent form
Similarly to transformation, this type is not LaTeX-specific and is also difficult to detect. However, it may be detected after proper verbalization of the mathematical content.

(vi) reordering expressions or symbols in a formula
This is another non-LaTeX-specific plagiarism scheme; it may refer to reordering whole formulas and environments, as well as single symbols within one environment, for example interchanging a^2 + b^2 = c^2 with b^2 + a^2 = c^2.

(vii) splitting a formula into a few shorter ones or merging a few formulas into a single expression or equation environment
This applies to splitting or merging mathematical expressions, for example by interchanging a single eqnarray environment with multiple consecutive equation environments, or vice versa.

(viii) replacement of symbols by other symbols
This is also a non-LaTeX-specific plagiarism scheme, which may be applied extensively in various types of documents. It includes renaming the constants or variables used, e.g. a^2 + b^2 = c^2 replaced by x^2 + y^2 = z^2.
The test files were compared pairwise with each other, i.e. not only the seven modified files were tested against plagiarism with the reference file, but also each of the seven files were checked against all the others.

Results and Discussion
All tests were run in both implemented modes (the LaTeX source mode and the verbalized mode) with the default similarity level and minimal equation length settings. In order to determine the efficiency of the discussed similarity measures for detecting specific kinds of plagiarism, the tests were performed for both measures. As the files by design contain exactly the same set of equations, in this test there are no "False Positives" or "True Negatives" when comparing particular equations, and the calculated overall similarity is the actual recall of the proposed detector at the equation level (i.e. the ratio of True Positives to the total of Actual Positives). Moreover, the specific design of the experimental document samples was aimed at verifying the detector's sensitivity to different kinds of plagiarism, which makes the precision metric irrelevant in this setting. The results are presented in four bubble charts in Figure 2, each corresponding to one of the two similarity measures and one of the two operation modes used in the tests. For each pair consisting of a reference file and a tested file, the size of the respective bubble shows the similarity of the two files. For reference we included the diagonal pairs, representing tests on a file compared with itself, with similarity 100%. Clearly, the general effectiveness of the algorithm is noticeably better when using the cosine measure. The most effective similarity detection for the majority of plagiarism schemes is observed in the verbalized mode combined with the cosine measure. As the reference file, (5) is poorly recognized as similar to the others, but this is not observed to such an extent when (5) is a test file. Since spoken language is more extensive than symbolic notation, verbalization significantly lengthens the compared strings and increases the number of constituent words; therefore, when they are compared by means of the cosine measure, similar or identical formulas contribute more to the similarity score.
This tendency is not observed in the case of the verbalized mode combined with the Levenshtein measure; however, the algorithm efficiency improved slightly with respect to the Levenshtein measure in pure LaTeX mode. Another observation is that the algorithm performance is significantly lower for certain types of plagiarism. This especially concerns types (5), (7) and (8) (for both measures used) and, additionally, (3) and (4) when files are compared using the Levenshtein measure. Nevertheless, in the other cases the overall similarity evaluated by our plagiarism detection algorithm was usually above 70% with the cosine measure and above 50% with the Levenshtein measure.
In order to understand the meaning of the above results, we performed additional tests comparing the reference file (1) with five unrelated mathematical documents, denoted by (i), (ii), (iii), (iv) and (v), containing 109, 24, 36, 26 and 96 mathematical expressions, respectively, among which there were no copies of expressions contained in the reference file (1). The results of the tests are presented in Table 1. For unrelated sources, the Levenshtein measure works better for distinguishing the files: especially in verbalized mode, where the cosine-measured similarity reaches nearly 90%, the Levenshtein similarity stays below 20%.