Complex Document Layout Segmentation Based on an Encoder-Decoder Architecture

In this work, we propose an end-to-end encoder-decoder network for complex document layout segmentation. A multi-scale feature extraction network with two parallel branches further processes the encoder feature maps: one branch enriches the multi-scale information of the feature maps by building feature pyramids, while the other captures the dependencies between different locations and integrates long-range context information. We then merge the outputs of the two branches to enhance the feature representation and further improve segmentation accuracy. Experimental results on the PubLayNet and DSSE-200 datasets demonstrate the effectiveness of the proposed method, which yields pixel-wise accuracy above 99%.


Introduction
Document layout analysis and understanding technology has effectively promoted the digital management of information and is widely used in daily life, for example in postal automation and ancient manuscript recognition. Document layout segmentation, the subject of this paper, is the most essential and critical step for all further processing, including optical character recognition (OCR); its goal is to split a document image into different regions: background, text, picture, list, etc. However, flexible and complex layouts, changeable element shapes, similar categories and large differences in region size mean that research on document layout segmentation algorithms still faces considerable challenges. In this work, we propose a method based on deep learning. Our main contributions are threefold.
- We use a new document image dataset named PubLayNet and achieve good performance.
- To improve the feature representation, we propose a new network, the multi-scale feature extraction network (MFN), which consists of a deconvolution pyramid pooling module (DPM) and a position attention module (PAM) and processes the feature map within an encoder-decoder structure. This novel network structure differs from previous ones.
- The proposed MFN contains two parallel branches, and their effective combination allows the network to fully utilize the global and local information of the image and improve segmentation accuracy.

Related works
In 2017, Wick et al. proposed a high-performance network based on the fully convolutional network (FCN) [1] for historical document segmentation; it learns directly from the original pixels, omitting pre-processing steps [2]. Yang et al. proposed a more efficient end-to-end network that uses both appearance information and semantic information to segment document images [3]. However, FCN-based methods share a common shortcoming: small objects are easily lost [4,5]. Many researchers consider a multi-scale strategy conducive to integrating context information and solving the local ambiguity problem [6]. Accordingly, Zhao et al. proposed the pyramid scene parsing network (PSPNet); its method of fusing local and global features for classification prediction significantly improved segmentation accuracy [7]. Moreover, the DeepLab models proposed by Chen et al. introduced dilated convolution, which has a larger receptive field while maintaining the same amount of computation as ordinary convolution [8,9]. To obtain multi-scale context information, later DeepLab versions introduced the ASPP module, which uses multiple sampling rates to capture context information at multiple scales [10,11]. In 2019, Fu et al. proposed the dual attention network (DANet), which improves classification accuracy by capturing rich contextual dependencies [12].
Combining the above methods with the characteristics of document layouts, we summarize the points that must be addressed when designing the network. Firstly, shallow detail information should be kept as complete as possible, so that small objects such as the bullets of a list are not lost during down-sampling. Secondly, the network should integrate long-range context information as richly as possible, to better understand complex layouts and improve discrimination between similar categories. Thirdly, attention should be paid to obtaining multi-scale information, so that performance on objects with large differences in size can be improved. Accordingly, we build our network around these three aspects to further enhance the feature representation and achieve high-precision document segmentation.

Proposed method
We propose a new method following a symmetrical encoder-decoder structure. Specifically, inspired by PSPNet [7] and DANet [12], we design the MFN, which integrates multi-scale information and global context information to further process the feature map. It is worth noting that, unlike previous document segmentation methods, our network requires no post-processing procedures to improve model performance, yet still achieves better results.

Network structure
The framework of our proposed method is illustrated in figure 1. In the encoder stage, we employ a pretrained ResNet101 [13], whose layers correspond to enc-N (N = 1, 2, 3, 4, 5) in our network. Note that we change the down-sampling stride of layer1 to 2 and introduce dilated convolution with dilation 2 in layer4. The feature map is then sent to the proposed MFN and the outputs of its two modules are fused: we first concatenate the outputs of the DPM and PAM to obtain a new feature map, and then perform an element-wise sum between it and the original feature map to obtain the final output of the MFN. In the decoding stage, we apply the bottleneck structure to build the decoder network following ResNet101, but with fewer layers. To achieve pixel-wise segmentation, we perform up-sampling to restore the feature map size. Furthermore, we adopt skip connections to keep the shallow detailed information.
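The fusion step inside the MFN can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: the two 1×1 convolutions stand in for the DPM and PAM branches, and the 1×1 reduction applied to the concatenated map (needed so its channel count matches the original feature map before the element-wise sum) is our assumption, as the paper does not specify how the dimensions are matched.

```python
import torch
import torch.nn as nn

class MFN(nn.Module):
    """Sketch of the MFN fusion: concat the two branch outputs,
    then element-wise sum with the original feature map."""
    def __init__(self, channels):
        super().__init__()
        self.dpm = nn.Conv2d(channels, channels, kernel_size=1)  # placeholder for DPM
        self.pam = nn.Conv2d(channels, channels, kernel_size=1)  # placeholder for PAM
        # assumed 1x1 conv to reduce 2C concatenated channels back to C
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        merged = self.fuse(torch.cat([self.dpm(x), self.pam(x)], dim=1))
        return merged + x  # element-wise sum with the original feature map

feat = torch.randn(1, 64, 16, 16)
out = MFN(64)(feat)
print(out.shape)  # torch.Size([1, 64, 16, 16])
```

The residual-style sum preserves the original features while letting the two branches contribute complementary multi-scale and positional context.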

Multi-scale feature extraction network
The MFN includes two branches, the DPM and the PAM. We propose the DPM, which extracts multi-scale features from different sub-regions by creating a feature pyramid. As shown in figure 2(a), the feature map highlighted in blue is extracted by ResNet101. In the DPM, pooling kernels of varying sizes perform pooling operations on the feature map; the resulting feature maps of varied sizes are then passed through a 1×1 convolution layer to reduce their dimension. We then up-sample and concatenate these feature maps to form a thicker feature map as the output of the DPM. The DPM is based on the PPM, illustrated in figure 2(b), with certain improvements. The difference between them is that the PPM applies bilinear interpolation to up-sample the feature maps, whereas the DPM uses deconvolution [14] instead, which continuously learns better parameters for restoring the features, so that the feature map contains richer semantic context information. In addition, compared with scene images, documents contain fewer categories, so for segmentation efficiency we modify the number of pyramid levels and use a 3-level pyramid with bin sizes of 1×1, 2×2 and 6×6. For the pooling operation, we choose average pooling over max pooling, because global average pooling has no parameters to optimize and overfitting is therefore avoided in these layers.

However, for pixel-wise segmentation, direct connections between the underlying pixels are also crucial for classification. Therefore, a second branch, the PAM proposed in DANet [12], is introduced. The PAM focuses on modeling the relationships between pixels, strengthening the relationship between pixels at different positions but of the same category. It selectively updates the features at each location by aggregating the features at all locations with a weighted summation; similar features produce similar responses regardless of distance, thus improving intra-class compactness and semantic consistency. The structure of the PAM is shown in figure 3.
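The DPM described above can be sketched as follows. This is an illustrative sketch under our assumptions: the per-level reduced channel count (in_ch divided by the number of levels) and the transposed-convolution kernel/stride choice (one kernel per bin cell, so each level is restored exactly to the input resolution) are not specified in the paper.

```python
import torch
import torch.nn as nn

class DPM(nn.Module):
    """Sketch of the deconvolution pyramid pooling module:
    3-level pyramid (1x1, 2x2, 6x6 bins), average pooling, 1x1 conv
    for dimension reduction, and a learnable transposed convolution
    (instead of bilinear interpolation) for up-sampling."""
    def __init__(self, in_ch, size=24, bins=(1, 2, 6)):
        super().__init__()
        red = in_ch // len(bins)  # assumed reduced channels per level
        self.levels = nn.ModuleList()
        for b in bins:
            stride = size // b  # size must be divisible by each bin
            self.levels.append(nn.Sequential(
                nn.AdaptiveAvgPool2d(b),               # pool to b x b sub-regions
                nn.Conv2d(in_ch, red, kernel_size=1),  # dimension reduction
                # learnable up-sampling back to size x size
                nn.ConvTranspose2d(red, red, kernel_size=stride, stride=stride),
            ))

    def forward(self, x):
        # concatenate the input with every up-sampled level -> "thicker" map
        return torch.cat([x] + [lvl(x) for lvl in self.levels], dim=1)

x = torch.randn(1, 48, 24, 24)
y = DPM(48)(x)
print(y.shape)  # 48 input channels + 3 levels x 16 channels = 96 channels
```

With kernel size equal to stride, each transposed convolution expands one pooled bin into one non-overlapping patch, so every pyramid level returns to the 24×24 input resolution before concatenation.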
Following DANet [12], the PAM applies 1×1 convolutions to the input feature map A to generate three new feature maps B, C and D, and uses matrix multiplication between B and C to model the relationship between any two positions. The resulting attention weight s_ji = exp(B_i · C_j) / Σ_{i=1..N} exp(B_i · C_j) represents the influence of the i-th position on the j-th position; if the two positions belong to the same category, s_ji takes a large value. The final output of the PAM is E_j = α Σ_{i=1..N} (s_ji · D_i) + A_j, where α is a learnable scale factor, initialized to 0 and gradually learning to assign more weight.
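A minimal PyTorch sketch of the PAM, following the formulation published with DANet (the channel reduction to in_ch // 8 for the query/key projections is the choice used in DANet, assumed here):

```python
import torch
import torch.nn as nn

class PAM(nn.Module):
    """Position attention module (DANet-style sketch)."""
    def __init__(self, in_ch):
        super().__init__()
        self.query = nn.Conv2d(in_ch, in_ch // 8, kernel_size=1)  # B
        self.key = nn.Conv2d(in_ch, in_ch // 8, kernel_size=1)    # C
        self.value = nn.Conv2d(in_ch, in_ch, kernel_size=1)       # D
        self.alpha = nn.Parameter(torch.zeros(1))  # scale factor, init 0

    def forward(self, x):
        n, c, h, w = x.shape
        b = self.query(x).view(n, -1, h * w)  # N x C' x HW
        k = self.key(x).view(n, -1, h * w)    # N x C' x HW
        d = self.value(x).view(n, -1, h * w)  # N x C  x HW
        # s[j, i]: influence of position i on position j
        s = torch.softmax(torch.bmm(b.transpose(1, 2), k), dim=-1)
        # weighted sum of D over all positions, then residual with input
        e = torch.bmm(d, s.transpose(1, 2)).view(n, c, h, w)
        return self.alpha * e + x

x = torch.randn(1, 32, 8, 8)
print(PAM(32)(x).shape)  # torch.Size([1, 32, 8, 8])
```

Because α starts at 0, the module initially behaves as an identity mapping and only gradually mixes in the attention-weighted features during training.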

Experiment
To evaluate the proposed method, we carry out a series of comprehensive experiments on the PubLayNet [15] and DSSE-200 [3] datasets; the experimental results demonstrate the effectiveness of the proposed method.

Dataset
PubLayNet is a large dataset for document layout analysis released in 2019, containing over 360 thousand document images. Typical document layout elements are assigned labels from the following set: text, title, list, table and figure. DSSE-200 is the dataset used in the MFCN paper, comprising 200 images that involve six foreground region classes and one background class.

Experimental detail
Our end-to-end network is trained on four GTX 1080 Ti GPUs with a mini-batch size of 8 and optimized by stochastic gradient descent (SGD). We adjust the learning rate dynamically with the iteration count instead of fixing it: the current learning rate equals the base learning rate multiplied by (1 - iter/max_iter)^power, with the base learning rate set to 0.005 and the power to 0.9. Momentum and weight decay are set to 0.9 and 0.0001 respectively. Performance improves with more iterations; the iteration number is set to 160K for the PubLayNet experiments and 1.5K for DSSE-200. All implementations are conducted with PyTorch 1.0.
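The "poly" learning-rate policy described above can be written as a small helper (a sketch with the paper's stated settings plugged in):

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Poly schedule: lr = base_lr * (1 - iter/max_iter) ** power."""
    return base_lr * (1 - iteration / max_iter) ** power

# with the paper's settings (base_lr=0.005, power=0.9, max_iter=160K)
print(poly_lr(0.005, 0, 160000))       # 0.005 at the start
print(poly_lr(0.005, 80000, 160000))   # decayed at the halfway point
print(poly_lr(0.005, 160000, 160000))  # 0.0 at the final iteration
```

The schedule decays the learning rate smoothly to zero over training, which is the standard policy used by PSPNet and the DeepLab models.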

Experimental results
In our work, we apply different feature extraction networks in experiments on the PubLayNet dataset. As the results in table 1 show, the feature extraction effect of ResNet101 is significantly better than that of VGG19 [16], because the deeper network structure helps to improve the feature representation. SegNet performs better than FCN because its symmetrical structure helps to restore the features, and PSPNet achieves the best performance, which can be attributed to the PPM. We therefore use SegNet with ResNet101 as the basic model for subsequent experiments. Firstly, as shown in table 2, we replace the PPM with our proposed modules; the results show that the MFN brings a 2.42% improvement over PSPNet, which proves the effectiveness of our method. Secondly, we apply the PPM, DPM, PAM and MFN to the encoder-decoder structure respectively, with the results shown in table 3. This structure with the PPM achieves a better result of 92.29%, which is 1.19% higher than the basic model; this confirms the view of previous work on the importance of multi-scale context aggregation for segmentation. We further evaluate the proposed DPM: employing the DPM outperforms the model with the PPM by 1.99% and brings a 3.18% improvement over the basic model. For the PAM, we observe a mIoU of 92.69%, 1.59% better than the basic model; this improvement demonstrates that integrating richer global context information is key to high-precision document segmentation. Finally, the best performance is obtained with the MFN: the model yields 95.42%, a 3.49% improvement over PSPNet. These solid and consistent improvements demonstrate the effectiveness of the proposed method.
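For reference, the mIoU metric used above can be computed as follows; this is a generic illustrative implementation on toy label arrays, not the paper's evaluation code.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes, from flat label arrays."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 0, 1, 2, 2, 2])
# per-class IoUs are 1.0, 0.5 and 2/3, so the mean is about 0.722
print(mean_iou(pred, target, 3))
```

Averaging per-class IoU rather than raw pixel accuracy prevents large regions (such as background) from dominating the score.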

Conclusion
In this paper, we have proposed an encoder-decoder structure for complex document layout segmentation. Within the network, the MFN is the key to effectively improving the feature representation: the DPM focuses on extracting multi-scale feature information, while the PAM integrates richer long-range context information at the global level. Experimental results on public datasets show that the proposed method achieves good performance. Finally, parameter compression remains a challenge that we must address in future work to improve the speed of document layout segmentation.