FMN: A two-stage fine-grained text detection method for web images based on a fully convolutional network

The Internet hosts large-scale collections of web images with widely varying sizes and resolutions. To identify fine-grained text in web images accurately and efficiently, this paper uses an FCN to semantically segment images, treating text and background as distinct detection targets, and proposes FMN, a two-stage fine-grained text detection method for web images. Because the output of the fully convolutional network alone is not fine enough, an MSER+NMS algorithm is applied to extract fine-grained features from the FCN output, and an ellipse-fitting method is introduced to capture character tilt and morphological differences, yielding the final text detections for the image. Compared with existing methods, the proposed FMN algorithm achieves measurable improvements in detail resolution and precision.


Introduction
In today's Internet era, web composite images are an important medium for transmitting information. Such images may contain complex typography, dense small text, multiple languages, or watermarked backgrounds. These characteristics pose new challenges for image text detection and recognition, mainly in the following respects: 1) Universality: the model must remain highly robust to complex layouts, multiple text directions and sizes, watermarked text, and similar conditions. 2) Fine granularity: a text detection method should, as far as possible, separate each character into an independent region rather than output a single rectangular frame around the entire text line, which simplifies segmenting individual characters for subsequent recognition.
This paper proposes FMN, a text detection method based on a fully convolutional network. The main contributions are as follows: 1) The Oval_MSER and Oval_NMS algorithms are used to detect text areas, capturing more image edge detail and thereby improving detection accuracy. 2) An ellipse-fitting method is used to obtain not only the size of each region but also the tilt angle of the text, making the detection results more accurate and the character regions better segmented.
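The tilt angle mentioned in contribution 2 is the orientation an ellipse fit recovers for a point region. As an illustration only (this is not the paper's Oval_MSER implementation, and the function name is ours), the same orientation can be computed from second-order central moments of the region's pixel coordinates:

```python
import math

def region_tilt_angle(points):
    """Tilt angle (radians) of a pixel region from second-order
    central moments: theta = 0.5 * atan2(2*mu11, mu20 - mu02).
    This is the orientation an ellipse fitted to the region has."""
    n = len(points)
    cx = sum(p[0] for p in points) / n      # centroid x
    cy = sum(p[1] for p in points) / n      # centroid y
    mu20 = sum((p[0] - cx) ** 2 for p in points) / n
    mu02 = sum((p[1] - cy) ** 2 for p in points) / n
    mu11 = sum((p[0] - cx) * (p[1] - cy) for p in points) / n
    return 0.5 * math.atan2(2 * mu11, mu20 - mu02)

# Pixels along a 45-degree diagonal yield an angle of pi/4
print(region_tilt_angle([(0, 0), (1, 1), (2, 2), (3, 3)]))
```

A horizontal run of pixels gives an angle of 0, so the value can be used directly to de-skew a character region before recognition.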

Related work
At present, among deep-learning-based text detection methods, approaches based on text region proposals and on image semantic segmentation are the most widely used. Segmentation-based text detection methods [1;2;3] regard text detection as a generalized "segmentation problem". This type of method usually uses a fully convolutional network (FCN) to perform pixel-level text/background annotation. Zhang Z et al. [1] were the first to use an FCN to process images at the pixel level. Zhou X et al. [2] proposed a simple and efficient text detection framework based on an FCN and non-maximum suppression (NMS), which first obtains dense predictions from the fully convolutional stage and then merges them with NMS.
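For reference, standard non-maximum suppression as used by such frameworks can be sketched as follows (a generic greedy NMS over axis-aligned boxes, not the paper's Oval_NMS variant):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over axis-aligned boxes.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the boxes that survive suppression.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of box i with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep
```

Given two boxes with IoU above the threshold, only the higher-scoring one is kept; well-separated boxes all survive.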
Semantic-segmentation-based methods largely avoid problems caused by text arrangement direction and the aspect ratio of text regions. However, their post-processing is comparatively complicated, which lowers the robustness of the overall detection method. Improving the robustness of the post-processing stage is therefore the key difficulty for segmentation-based methods.

FMN two-stage image text detection model
The FMN text detection model detects the position of text in an image in two stages. In the first stage, the original image to be detected is fed into the FCN for a forward pass to obtain the network's output image; in the second stage, the original image and the output image are further processed by the Oval_MSER and Oval_NMS algorithms to obtain the final text positions.

FCN text prediction network introduction
The FCN replaces the fully connected layers of a convolutional neural network with deconvolution (transposed convolution) layers and finally generates a predicted image by upsampling. Its network structure is shown in Figure 2.
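The upsampling step can be illustrated with a minimal single-channel transposed convolution in NumPy (a didactic sketch of the operation, not the paper's network; the bilinear kernel is an assumption matching common FCN initialization):

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Single-channel transposed convolution ("deconvolution"):
    each input pixel scatters a scaled copy of the kernel into the
    output, which is how an FCN upsamples coarse score maps."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros((stride * (h - 1) + kh, stride * (w - 1) + kw))
    for i in range(h):
        for j in range(w):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * kernel
    return out

# A 2x2 coarse score map upsampled with a bilinear-style 3x3 kernel
score = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
bilinear = np.array([[0.25, 0.5, 0.25],
                     [0.5,  1.0, 0.5 ],
                     [0.25, 0.5, 0.25]])
up = transposed_conv2d(score, bilinear, stride=2)
print(up.shape)  # (5, 5)
```

The overlapping scattered kernels blend neighboring scores, producing a smoothly interpolated, higher-resolution prediction map.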

FMN algorithm introduction
Algorithm 1: FCN-MSER-NMS text detection algorithm
Input: text image to be detected img; FCN network output image Pre_image
Output: set of image text regions K
1) Use the MSER algorithm on Pre_image to obtain a set of predicted text regions Q
2) For each region in Q:
3) Perform Oval_MSER detection on the corresponding area of the original image img to obtain Oval_Set
4) Use Oval_NMS to merge redundant detection frames in Oval_Set and eliminate them
5) K += Oval_NMS(Oval_Set)
6) End for
7) Return K
The terms appearing in the algorithm are defined as follows.
Definition 1 (Aspect Ratio): For the pixel coordinates (i, j) in an extremal stable region, let w = max(i) - min(i) denote the width of the region and h = max(j) - min(j) its height. The aspect ratio is
e = w / h (1)
Definition 2 (Boundary Density): The boundary density represents the boundary pixel intensity of an extremal stable region:
d = (Σ_{x,y} f(x, y)) / (height × width) (2)
where f(x, y) denotes the boundary image of the region, and height and width are the height and width of the stable region, respectively.
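The two region statistics from Definitions 1 and 2 are straightforward to compute; the sketch below assumes the aspect ratio is width over height and the boundary density is the mean boundary-image intensity, as the definitions state (function names are ours):

```python
import numpy as np

def aspect_ratio(points):
    """Aspect ratio e = w / h of an extremal stable region (Definition 1).

    points: iterable of (i, j) pixel coordinates of the region.
    """
    pts = np.asarray(points)
    w = pts[:, 0].max() - pts[:, 0].min()   # w = max(i) - min(i)
    h = pts[:, 1].max() - pts[:, 1].min()   # h = max(j) - min(j)
    return w / h

def boundary_density(f):
    """Boundary density of a region (Definition 2): the summed intensity
    of the region's boundary image f(x, y), normalized by height x width."""
    f = np.asarray(f, dtype=float)
    height, width = f.shape
    return f.sum() / (height * width)
```

Both statistics can be thresholded to discard MSER candidates that are too elongated or too weakly bounded to be characters.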

Data Set
1) MTWI-2018
2) MSRA-TD500 [8] The MSRA-TD500 data set contains 500 images; most of the text appears on roadside guide boards, and the text includes Chinese, English, and numerals.
3) ICDAR-2015 [9] ICDAR-2015 is a scene-based database released by the International Conference on Document Analysis and Recognition in 2015.

Experimental measures
Ground truth is compared with detections by the method shown in Fig. 4, where G and D denote the ground truth and the detection result, respectively. Because it is inconvenient to compute the coverage between G and D directly, G and D are rotated about their center points to the positions G′ and D′ shown in Figure 4.
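Once G and D have been rotated to the axis-aligned positions G′ and D′, their coverage reduces to a standard intersection-over-union between rectangles, which can be sketched as (an illustrative helper, not the paper's exact evaluation code):

```python
def iou(g, d):
    """Intersection-over-union of two axis-aligned rectangles
    (x1, y1, x2, y2), e.g. G' and D' after rotating the ground
    truth and the detection about their center points."""
    ix = max(0.0, min(g[2], d[2]) - max(g[0], d[0]))  # overlap width
    iy = max(0.0, min(g[3], d[3]) - max(g[1], d[1]))  # overlap height
    inter = ix * iy
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    area_d = (d[2] - d[0]) * (d[3] - d[1])
    return inter / (area_g + area_d - inter)
```

A detection is then counted as correct when its IoU with some ground-truth rectangle exceeds a fixed threshold (commonly 0.5).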

Conclusion
The FMN algorithm is applied to scene text detection. The MSER ellipse-fitting algorithm is introduced in the post-processing stage of the fully convolutional network output to handle tilted, non-axis-aligned text regions in the image, and the Oval_NMS algorithm based on the fitting results is used to eliminate redundant detection frames.