Chinese Text Summarization Generation Algorithm based on ERNIE-GEN with Few-shot Learning

To address the problems of exposure bias and insufficient annotated data in existing Chinese short-text summarization algorithms, a new pre-trained short-text summarization algorithm named ERNIE-GEN-CTS, based on ERNIE-GEN, is proposed. This paper uses a pre-trained language model to realize dual attention at the word level and the span level for short texts, and fills in sub-blocks carrying semantic information to mitigate the exposure bias problem. After pre-training on a large-scale Chinese corpus and fine-tuning on the downstream summarization task, the results improve on the LCSTS data set, verifying the validity and stability of the model.


Introduction
Different from extractive text summarization, which selects and combines important sentences, abstractive text summarization needs to analyse the structure and semantics of the original text and regenerate the summary text. With the evolution of deep learning and end-to-end generation models, the sequence-to-sequence model has become the core model of abstractive summarization in recent years, and its results are more prominent than those of extractive summarization. With the proposal of the transformer framework, the original sequence-to-sequence models built on RNN, LSTM, GRU and similar encoders and decoders have developed into transformer-based sequence-to-sequence models, whose architecture has gradually become the main framework of deep pre-trained models. The demand on training sets has also shifted from supervised learning to unsupervised learning, and performance has improved greatly even where sufficient labelled data are lacking. Aiming at the problem of insufficient training data for Chinese text summarization, and combining it with increasingly popular pre-trained models, this paper analyses the effect of a Chinese abstractive text summarization algorithm based on ERNIE-GEN [1] with few-shot learning.

Related work
At present, abstractive text summarization has been studied by domestic and foreign scholars to varying extents, and different improvements have been proposed for its common core problems, such as out-of-vocabulary (OOV) words, repetitive generation, long-range dependence and evaluation standards. See et al. [2] proposed the pointer-generator network (PGN), a copy mechanism. Since basic sequence-to-sequence models often fail to reproduce factual details accurately, the OOV problem is relieved by copying words from the source text through a pointer, which allows summary words to be generated by copying words of the source document while still sampling from a fixed vocabulary with a certain probability. The same work proposed a coverage mechanism, which uses the attention distribution to track the words already attended to; when attention points to the same content as in a previous step, it is penalized, alleviating the frequent repetitive fragments in sentences generated by attention-based sequence models. Celikyilmaz et al. [3] used LSTMs to extract semantic representations of sentences and deep communicating agents (DCA) to better gather information over long distances. Maximum likelihood estimation, semantic cohesion and sentence-by-sentence reinforcement learning strategies are used to improve the accuracy, consistency and abstraction of summaries, and self-critical training from reinforcement learning is used to optimize the non-differentiable ROUGE function. Yang et al. [4], addressing the limited input length of BERT, proposed applying inference to each sentence alone and then aggregating the sentence scores into document scores, which alleviates BERT's long-range dependence problem. Li et al. [5] applied SCST directly in a convolutional sequence-to-sequence framework, which alleviated the exposure bias problem and enabled direct optimization of the non-differentiable summary metric ROUGE. Xu et al. [6] combined a graph convolutional network model, using graphs to connect the parse trees of sentences in a document and stacked graph convolution networks to learn its syntactic representation; a selective attention mechanism extracts salient semantic and structural information to improve the generated summaries. Zou et al. [7] achieved results comparable to supervised methods by integrating an unsupervised training method with a self-built large-scale corpus.
From the deep-learning-based abstractive summarization models above, we can see that each model specializes in solving specific problems, with its own algorithmic core, encoder-decoder design, application scope, advantages and limitations. As a result, a model needs to be chosen according to the actual situation, and a better automatic text summarization model can then be built by integration. Pre-training work on natural language generation has seldom paid attention to the exposure bias [8] of downstream tasks: ground-truth words are used during training, while the words generated by the model itself, correct or not, are used at inference, so errors easily accumulate. ERNIE-GEN, an enhanced multi-flow sequence-to-sequence pre-training and fine-tuning framework, proposes a solution to this problem, which is very helpful for generation tasks such as summarization.

Data preprocessing
In order to unify the input, the data sets are processed uniformly: the same amount of data is intercepted for each experiment, and the training input is converted into a format more suitable for the ERNIE-GEN model before training, as follows: <serial number + '\n' + source text + '\n' + summary>.
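As a minimal sketch of this preprocessing step (the function name and field layout are illustrative, not from the original code), each (source, summary) pair can be serialized into the line-oriented format above:

```python
# Minimal preprocessing sketch (assumed helper name): turn (source, summary)
# pairs into the format  serial number '\n' source text '\n' summary
def to_ernie_gen_format(pairs):
    """pairs: iterable of (source_text, summary) tuples."""
    lines = []
    for i, (source, summary) in enumerate(pairs, start=1):
        lines.append(f"{i}\n{source}\n{summary}")
    return "\n".join(lines)

records = [("昨日沪深两市高开低走", "两市高开低走"),
           ("人工智能发展迅速", "AI发展迅速")]
print(to_ernie_gen_format(records))
```

The serial number is generated from the record position, so the raw corpus only needs to supply source/summary pairs.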

ERNIE-GEN pre-training model
Model architecture
The ERNIE-GEN model mainly includes a noise module, an attention module and an objective function; its framework is shown in figure 1. In the noise module, a hyperparameter $\rho$ controls the text replacement ratio in the two stages of pre-training and fine-tuning. The attention module is divided into word-level attention and span-level attention based on sub-blocks filled with semantic information. The word-level attention is consistent with that of pre-trained models such as BERT; span-level attention, as the innovation, improves the degree of semantic association within and between words. The proposed semantic-information-filled sub-block [ATTN] pushes the model to generate summaries at the inference stage based on the overall semantics, which avoids the influence of earlier output errors on the whole generated summary and mitigates the exposure bias problem of pre-trained models such as BERT. The loss function combines the losses of the word-level and span-level attention modules through a hyperparameter $\lambda$. The word-by-word generation flow is computed as

$$a_i^{(l)} = \mathrm{MH\text{-}Attn}\big(Q = a_i^{(l-1)},\ KV = [X^{(l-1)}, a_i^{(l-1)}]\big)$$

where $a_i^{(l)}$ indicates the i-th vector representation of the l-th layer for the artificial symbol sequence $A_W$. The continuous selection of spans is realized in two steps. In the first step, the set of n-grams is determined by a t-test; based on these statistics, 200,000 bigrams, 50,000 trigrams and all single words are filtered to form a span vocabulary. In the second step, all bigrams, trigrams and words are labelled in order based on the span vocabulary. Then a sub-block [ATTN] filled with semantic information is inserted after each input word processed by noise. Each [ATTN] is computed separately, but the [ATTN] symbols of words belonging to the same span are the same.
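The noise and [ATTN]-filling steps can be sketched as follows. This is an illustrative toy, not the official ERNIE-GEN code: the replacement ratio, vocabulary and helper names are assumptions.

```python
import random

# Toy sketch of the two input-side steps described above:
#   1) the noise module replaces a fraction rho of tokens with random
#      vocabulary tokens (controlled by hyperparameter rho);
#   2) an [ATTN] placeholder is inserted after each noised token; the
#      placeholders of tokens in the same span share one representation
#      inside the model (here they are just identical symbols).
def add_noise(tokens, vocab, rho, seed=0):
    rng = random.Random(seed)
    return [rng.choice(vocab) if rng.random() < rho else t for t in tokens]

def fill_attn(tokens):
    out = []
    for t in tokens:
        out.extend([t, "[ATTN]"])  # one placeholder per input token
    return out

toks = ["我", "爱", "自然", "语言", "处理"]
print(fill_attn(add_noise(toks, ["[MASK]"], 0.3)))
```

At rho = 0 the input passes through unchanged; at rho = 1 every token is replaced, which bounds the behaviour of the noise module.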

Contextual multi-flow attention mechanism
For the span-by-span generation flow, the attention of the i-th span is computed as

$$sa_i^{(l)} = \mathrm{MH\text{-}Attn}\big(Q = sa_i^{(l-1)},\ KV = [X^{(l-1)}, sa_i^{(l-1)}]\big)$$

where $sa_i^{(l)}$ denotes the vector representation of the i-th span.
Finally, by sharing the contextual stream in parallel computation, the fusion of the two generation flows is computed as follows:

$$X^{(l)} = \mathrm{MH\text{-}Attn}\big(Q = X^{(l-1)},\ KV = X^{(l-1)}\big)$$
$$A_W^{(l)} = \mathrm{MH\text{-}Attn}\big(Q = A_W^{(l-1)},\ KV = [X^{(l-1)}, A_W^{(l-1)}]\big)$$
$$SA^{(l)} = \mathrm{MH\text{-}Attn}\big(Q = SA^{(l-1)},\ KV = [X^{(l-1)}, SA^{(l-1)}]\big)$$

where X denotes the concatenation of S and T', $X^{(l)}$ is the vector sequence of the l-th layer for the contextual flow, and $A_W^{(l)}$, $SA^{(l)}$ are the vector sequences of the l-th layer for the word-by-word and span-by-span generation flows respectively.
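A toy single-head version of this shared-context fusion can be written in a few lines. The shapes, the absence of projection matrices and the lack of causal masking are all simplifications for illustration only:

```python
import numpy as np

# Toy single-head sketch of the multi-flow attention (no learned
# projections, no masking): the contextual flow attends to itself, and
# each generation flow attends to the shared contextual flow plus itself.
def attn(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # softmax over keys
    return w @ V

d = 8
X  = np.random.randn(5, d)   # contextual flow (concatenation of S and T')
AW = np.random.randn(5, d)   # word-by-word generation flow
SA = np.random.randn(2, d)   # span-by-span generation flow

X_next  = attn(X, X, X)                                      # context stream
AW_next = attn(AW, np.vstack([X, AW]), np.vstack([X, AW]))   # word flow
SA_next = attn(SA, np.vstack([X, SA]), np.vstack([X, SA]))   # span flow
print(X_next.shape, AW_next.shape, SA_next.shape)
```

Because both generation flows read the same contextual stream X, the context is computed once and reused, which is the parallel sharing mentioned above.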

Loss function
The objective function of the ERNIE-GEN model is built on the contextual multi-flow attention mechanism, and the formula is:

$$\mathcal{L} = \lambda\, \mathcal{L}_{word} + (1 - \lambda)\, \mathcal{L}_{span}$$

In the pre-training stage λ = 0.5, balancing the losses of the word-level and span-level generation flows, while in the fine-tuning phase λ = 1.0, which makes the summaries generated at inference be output word by word and improves the effect of summary generation.
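The weighting scheme above can be sketched directly (the function name is illustrative; the loss values are placeholders):

```python
# Sketch of the combined objective: lambda weights the word-level loss
# against the span-level loss. lambda = 0.5 balances the two flows in
# pre-training; lambda = 1.0 in fine-tuning leaves only the word-level
# term, so decoding proceeds word by word.
def combined_loss(loss_word, loss_span, lam):
    return lam * loss_word + (1.0 - lam) * loss_span

print(combined_loss(2.0, 4.0, 0.5))  # pre-training: 3.0
print(combined_loss(2.0, 4.0, 1.0))  # fine-tuning: 2.0 (word-level only)
```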

Experiments
In this section, we present and analyse the data set, evaluation metrics, experimental environment and experimental results.

Evaluation index
The evaluation criteria for abstractive automatic text summarization can be divided into manual evaluation and automatic evaluation. Manual evaluation means that experts assess the summaries, comprehensively considering fluency, relevance to the central idea, interpretability and other aspects. The following mainly introduces automatic evaluation criteria, which are divided into internal and external criteria. Internal criteria include information content, coherence, readability, length, redundancy, etc., while external criteria are indirect, including retrieval accuracy, classification accuracy, etc. Among abstractive automatic text summarization tasks, the most commonly used criterion is ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [9]. ROUGE focuses on recall: it compares the summary automatically generated by the system with the reference summary produced manually, and evaluates summary quality by counting the number of overlapping basic units between the two. ROUGE is currently the most commonly used automatic evaluation standard for text summarization, and ROUGE-1, ROUGE-2 and ROUGE-L are the most commonly used members of the series. The formula for ROUGE-N is:

$$\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \{Ref\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{Ref\}} \sum_{gram_n \in S} Count(gram_n)}$$

Because the ROUGE module itself cannot evaluate Chinese summaries directly, Chinese text first needs to be transformed into a vector representation, on which ROUGE is then computed. The vector representation can be a word-embedding vector or an ID sequence. In this paper, the ID method is used: each output is segmented at the character level and, based on an established vocabulary, transformed into a vector of IDs.
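The character-to-ID step combined with ROUGE-1 recall can be sketched as below. This is a simplified stand-in for the actual ROUGE toolkit, with an assumed helper name:

```python
from collections import Counter

# Character-level ROUGE-1 recall sketch for Chinese: map each character to
# an ID via a growing vocabulary, then count overlapping unigrams between
# the candidate summary and the reference (recall-oriented, as above).
def rouge_1_recall(candidate, reference, vocab):
    to_ids = lambda text: [vocab.setdefault(ch, len(vocab)) for ch in text]
    cand, ref = Counter(to_ids(candidate)), Counter(to_ids(reference))
    overlap = sum((cand & ref).values())      # min-count n-gram matches
    return overlap / max(sum(ref.values()), 1)

print(rouge_1_recall("今天天气好", "今天天气不错", {}))  # 4 of 6 ref chars match
```

ROUGE-2 and ROUGE-L follow the same pattern with bigrams and longest common subsequence in place of unigrams.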

Experimental environment
In order to verify the performance of the Chinese text summarization method based on the ERNIE-GEN model, this paper uses a labelled text data set for verification and analysis in the experimental environment described in this section.

Experimental results and analysis
In the experiment, the pre-trained ERNIE-GEN model is fine-tuned on the Chinese summarization data set with the following fixed parameters: batch size 32, maximum encoder input 256 tokens, maximum decoder input 64 tokens, learning rate 5e-5, warmup proportion 0.1, weight decay 0.1, beam search size 5 and length penalty 1.0.

Table 2. Performance comparison of five models.

Model             RG-1   RG-2   RG-L
RNN context [10]  29.90  17.40  27.20
COPYNET [11]      34.40  21.60  31.30
DRGD [12]         36.99  24.15  34.21
RNN+MRT [13]      38.20  25.20  35.40
ERNIE-GEN-CTS     38.71  26.39  36.01

The fifth model, ERNIE-GEN-CTS, is the model of this paper. It constructs a contextual attention mechanism combining word-level and span-level generation flows, trains the encoder and decoder, and introduces weight-decay L2 regularization and beam search. From table 2 it can be seen that, under the ROUGE evaluation system, the ROUGE-1, ROUGE-2 and ROUGE-L values of the ERNIE-GEN-CTS model are higher than those of the first four models. The ERNIE-GEN-CTS model introduces the semantic sub-block [ATTN], which effectively compensates for the exposure bias problem in the inference stage, increases the amount of information entering the decoder, and greatly improves the summarization results. At the same time, the model uses beam search on the decoder side, which improves the accuracy of the summaries to some extent.
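The decoding setup above (beam size 5, length penalty 1.0) can be illustrated with a toy beam search. The scorer and vocabulary here are entirely hypothetical, not taken from the model:

```python
import math

# Toy beam-search sketch illustrating the decoding configuration above:
# keep the best `beam_size` partial sequences, scored by log-probability
# normalized by length raised to `length_penalty`.
def beam_search(step_fn, start, beam_size=5, max_len=4, length_penalty=1.0):
    beams = [([start], 0.0)]                 # (token sequence, log-prob)
    for _ in range(max_len):
        cands = []
        for seq, lp in beams:
            for tok, p in step_fn(seq):      # (next token, probability)
                cands.append((seq + [tok], lp + math.log(p)))
        cands.sort(key=lambda c: c[1] / (len(c[0]) ** length_penalty),
                   reverse=True)
        beams = cands[:beam_size]
    return beams[0][0]

# Hypothetical bigram scorer over a tiny vocabulary.
probs = {"<s>": [("新", 0.6), ("闻", 0.4)],
         "新":  [("闻", 0.9), ("的", 0.1)],
         "闻":  [("摘", 0.8), ("要", 0.2)],
         "摘":  [("要", 1.0)],
         "的":  [("摘", 1.0)],
         "要":  [("。", 1.0)],
         "。":  [("。", 1.0)]}
step = lambda seq: probs[seq[-1]]
print(beam_search(step, "<s>"))
```

With a length penalty of 1.0, scores are divided by the raw sequence length, a neutral setting that neither strongly favours nor punishes longer summaries.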
As shown in Table 3, this paper first fine-tunes with all the training data in LCSTS, and then, taking 3000 iterations as the standard, compares the results under the same number of training steps for different data volumes. Subsets of 10,000, 50,000, 100,000 and 1,000,000 pairs are intercepted, alongside the full set, and the results are shown in Table 3 and figures 3-8. The analysis shows that with a data volume of 50,000 the model is comparable to or even better than most models, which shows that it retains good inference ability in a small-scale data environment. Six small data sets of Chinese summary pairs are then used for fine-tuning tests, and the training results of the model differ under the same scale: the differences between the best and worst ROUGE-1, ROUGE-2 and ROUGE-L scores are 1.07, 0.94 and 1.04 respectively, but the results are similar on the whole, which shows that the generalization ability of ERNIE-GEN is good.

Conclusion
This paper proposes a Chinese short-text summarization model, ERNIE-GEN-CTS, based on the ERNIE-GEN pre-trained language model. The context-fusion attention module in the model effectively mitigates the exposure bias problem of previous summarization models and achieves good, stable results on small-sample Chinese data sets, which is instructive for future Chinese summarization models. However, there is still much room for further research, including short-text summarization data sets other than LCSTS, applicability to medium- and long-text summarization data sets, and supplementing the attention flows with longer spans and exclusive proper nouns.