The analysis of the graphical representation of mathematical formulas for extending descriptions of physical effects

The aim of the study is to develop, and implement in software, a methodology for analyzing graphical representations of mathematical formulas in order to extend the descriptions of scientific and technical effects. To this end, existing systems for recognizing graphical representations of mathematical formulas were analyzed, and two algorithms were developed: one for editing a database of physical effects (PE) with integrated physical formulas, and one for recognizing formulas and their explanations. An automated system based on the developed methodology was implemented in software.


Introduction
In the database of physical effects (PE) [1], there are more than 1400 descriptions. These descriptions often lack formulas describing the physical processes of the effect.
Among the most frequently updated sources of information on physics are patent collections, in particular RosPatent, as well as physics journals, abstracts, and dissertations that contain descriptions of mathematical models.
The patent [2] gives an example of a graphical representation of a mathematical formula for a PE.
The following fragment of the patent text contains a formula, a description of the formula, and descriptions of the physical quantities used in it. The formula itself and some of the symbols are graphical and therefore do not appear in the quoted text: "… Calibrated magnetic measurements, are calculated by the following formula: where D mag is the magnetic displacement state matrix corresponding to nine displacements from the soft iron; is the magnetic displacement state vector corresponding to displacements from solid iron; and represents three-dimensional magnetic measurements ..." The HTML structure:

Materials and methods
Automating the extension of physical effect descriptions by filling in the topic "Mathematical model of PE" is therefore a relevant task. The purpose of this work is to develop, and implement in software, a method for analyzing graphical representations of mathematical formulas in order to extend the descriptions of physical effects.
Existing systems for recognizing graphical representations [3] of mathematical formulas were analyzed (Table 1).
Table 1. Comparison of recognition systems
| Software | a | b | c | d | e |
| --- | --- | --- | --- | --- | --- |
| Tesseract OCR | - | 0 | 5 | 1 | - |
| InftyReader | + | 2 | 6 | 5 | + |
| Mathematical Expression Recognition | + | 2 | 1 | 1 | + |
| MathType | + | 2 | 2 | 2 | + |
where:
a: accuracy of recognition of mathematical formulas ("+": the accuracy is sufficient for applied use; "-": the accuracy is insufficient for applied use);
b: API convenience (2: the API is convenient; 1: the API is inconvenient; 0: there is no API);
c: number of input image formats;
d: number of formats of recognized formulas;
e: ability to output recognition results in LaTeX format ("+": output in LaTeX format is possible; "-": output in LaTeX format is not possible).
The comparison showed that the MathType [4] system is the most convenient for extracting physical formulas from patent texts: it combines a convenient API with input image formats and output formats of recognized formulas that the authors consider practical.
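The selection logic behind Table 1 can be illustrated with a short sketch. It encodes the table rows and keeps the systems that meet the three hard criteria (sufficient accuracy, a convenient API, LaTeX output). The names `System`, `TABLE_1`, and `shortlist` are illustrative assumptions and not part of the described software.

```python
from collections import namedtuple

# One row of Table 1; field names are illustrative, not taken from the paper.
System = namedtuple(
    "System", "name accurate api_score input_formats output_formats latex"
)

TABLE_1 = [
    System("Tesseract OCR", False, 0, 5, 1, False),
    System("InftyReader", True, 2, 6, 5, True),
    System("Mathematical Expression Recognition", True, 2, 1, 1, True),
    System("MathType", True, 2, 2, 2, True),
]


def shortlist(systems):
    """Keep systems with sufficient accuracy, a convenient API, and LaTeX output."""
    return [
        s.name for s in systems
        if s.accurate and s.api_score == 2 and s.latex
    ]


candidates = shortlist(TABLE_1)
```

Under these criteria three systems remain on the shortlist, so the final choice is not determined by the table alone; it rests on the authors' judgment of which input and output formats are practical.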
As a result, an algorithm for editing a PE database with integrated physical formulas (Figure 1) and an algorithm for recognizing formulas and their explanations (Figure 2) were developed.
Based on the developed methodology, software was developed that provides the following functions:
• recognition of graphical representations of physical formulas from patent texts, with output in LaTeX format;
• extraction of explanations of the recognized formula from patent texts (descriptions of the physical quantities used in the formula);
• entry of physical formulas in LaTeX format into the descriptions of physical effects in the database.
The software is implemented in Python (version 3.6) and JavaScript. The Mathpix OCR API is used to recognize formulas from graphic images. The BeautifulSoup [5] library is used to find the explanation of the recognized formula in the patent text. The requests library is used to retrieve the patent web pages.
The architecture is shown in Figure 3.
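The recognition and explanation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The endpoint URL, header names, and the `latex_styled` format follow the public Mathpix documentation as recalled here and should be checked against the current API reference. The helper names `build_mathpix_request` and `extract_explanations` are hypothetical, as is the rule of splitting the text after "where" on semicolons. In a full pipeline the page would be fetched with `requests.get`, its text obtained with `BeautifulSoup(html, "html.parser").get_text()`, and the payload sent with `requests.post`; the sketch performs no network calls.

```python
import base64
import re

# Assumed endpoint; verify against the current Mathpix API reference.
MATHPIX_URL = "https://api.mathpix.com/v3/text"


def build_mathpix_request(image_bytes, app_id, app_key, mime="image/png"):
    """Build headers and JSON payload for one formula image (no network I/O)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    headers = {
        "app_id": app_id,
        "app_key": app_key,
        "Content-type": "application/json",
    }
    payload = {
        "src": "data:{};base64,{}".format(mime, encoded),
        "formats": ["latex_styled"],  # assumed name of the LaTeX output format
    }
    return headers, payload


def extract_explanations(text):
    """Split the clause after 'where' into (symbol, description) pairs."""
    match = re.search(r"\bwhere\b(.*)", text, flags=re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    pairs = []
    for clause in match.group(1).split(";"):
        # Drop a leading "and" and trailing punctuation from each clause.
        clause = re.sub(r"^\s*and\b", "", clause.strip()).strip(" .")
        parts = re.match(r"(.+?)\s+(?:is|represents|denotes)\s+(.+)", clause)
        if parts:
            pairs.append((parts.group(1).strip(), parts.group(2).strip()))
    return pairs


# Hypothetical sentence in the style of a patent description.
sample = ("The force is calculated by the following formula: "
          "where F is the force; m is the mass; and a is the acceleration.")
explanations = extract_explanations(sample)
```

In real patent pages the symbols are often images themselves, as in the fragment quoted in the Introduction, so a clause may carry a description without a textual symbol; such clauses are skipped by this simple rule and would need separate recognition.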

Results and discussion
The developed software displays (Figures 4, 5):
• information on the results of extracting formulas from patent texts;
• the extracted formulas with explanations of the physical quantities;
• descriptions of physical effects with the topic "Mathematical model" filled in with the extracted formulas.
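How an extracted formula might be written into the "Mathematical model" topic of a PE description can be sketched as follows. The database schema is not given in the paper, so the table `effects`, the column `math_model`, the `$$` delimiters, and the function `add_formula` are all hypothetical. An in-memory SQLite database stands in for the PE database.

```python
import sqlite3


def add_formula(conn, effect_id, latex, explanations):
    """Append a LaTeX formula and its explanations to an effect's math_model.

    Returns False, changing nothing, if the formula is already stored.
    """
    row = conn.execute(
        "SELECT math_model FROM effects WHERE id = ?", (effect_id,)
    ).fetchone()
    if row is None:
        raise KeyError("no physical effect with id {}".format(effect_id))
    current = row[0] or ""
    if latex in current:
        return False
    lines = ["$$" + latex + "$$"]
    lines += ["{}: {}".format(symbol, text) for symbol, text in explanations]
    updated = (current + "\n\n" + "\n".join(lines)).strip()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE effects SET math_model = ? WHERE id = ?",
            (updated, effect_id),
        )
    return True


# Hypothetical in-memory database standing in for the PE database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE effects (id INTEGER PRIMARY KEY, name TEXT, math_model TEXT)"
)
conn.execute("INSERT INTO effects (id, name) VALUES (1, 'Example effect')")
add_formula(conn, 1, r"F = m a",
            [("F", "force"), ("m", "mass"), ("a", "acceleration")])
```

Checking for an existing copy before the update keeps repeated processing of the same patent from duplicating formulas in a description.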

Conclusion
The theoretical value of this work lies in the developed methodology for analyzing graphical representations of mathematical formulas for expanding the descriptions of scientific and technical effects and the software created on its basis.