Data Augmentation For Chinese Text Classification Using Back-Translation

Text classification is a basic task in natural language processing. When the amount of data is insufficient, classification accuracy suffers greatly. We propose to use back-translation to expand three Chinese text-classification datasets, and then train and evaluate deep learning classification models on them. The results show that expanding the data with back-translation is particularly helpful on smaller datasets; it can also reduce the imbalance in the sample distribution and improve classification performance.


Data Augmentation
Data augmentation [1] is a method of generating data based on visual or semantic invariance, and it is the simplest and most direct way to improve model performance. On smaller datasets, data augmentation can give a model better generalization ability and performance.
Data augmentation is commonly applied in computer vision [2]. In image processing, rotation, translation, and scaling do not change the meaning of an image. Before feeding data to the model, small adjustments are made to the images, such as rotation, translation, scaling, random cropping, and adding noise. Through these operations, the network model learns transformed versions of the samples. For example, without data augmentation, a model usually does not pay much attention to the position of the target object during training, so a simple flip can cause a classification error. However, one of the most important properties of an ideal model is that its classification result is independent of the target's position.
In text classification [3] tasks, text data plays a very important role in the classification model. Achieving good performance with a classifier usually requires abundant labeled data. However, in many cases, such as product reviews, labeled data is scarce and expensive to acquire [4].
At present, data augmentation is rarely applied in NLP [5]. Synonym replacement is a common text augmentation method that has been used in earlier work [6][7][8]. According to these papers, the best form of augmentation is to have humans rephrase sentences, but this is unrealistic and expensive given the large number of samples in our datasets. Replacing words or phrases with their synonyms is therefore a practical method for us [6]; it applies word similarity to achieve data augmentation [7]. EDA (Easy Data Augmentation) was proposed by Wei and Zou: for a given sentence in the training set, one of four operations (synonym replacement, random insertion, random swap, or random deletion) is randomly selected and applied [9].
To expand text data, we imitate the image-expansion techniques above and propose using back-translation [10]. Back-translation was used in the 1st place solution for the "Toxic Comment Classification Challenge" on Kaggle: the winners leveraged machine translation to augment both the training and test sets, translating French, German, and Spanish versions back to English. Back-translation is also often used in machine translation [11] to check translation accuracy [12]. In back-translation, the source text is converted back into the original language after two translations: the original sentence S is translated into another language (such as English) as S1, and then translated back into the original language as S2. The augmentation corpus is then composed of S and S2. Owing to differences in translation software and grammar, the back-translated sentence often differs from the source sentence in everything from word choice to complete grammatical structure, which can be understood as expanding the dataset: back-translation can generate diverse paraphrases while preserving the semantics of the original sentences, including synonym replacement [6][7][8], syntactic structure substitution, and deletion of irrelevant words [13]. As shown in Table 1, we take a Chinese computer review as an example and back-translate it with Google Translate.
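The S → S1 → S2 pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `translate` is a stand-in for a real machine-translation call (e.g. the Google Translate API), backed here by a toy phrase table so the example is self-contained.

```python
# Toy phrase table standing in for a real translation service.
TOY_PHRASE_TABLE = {
    ("zh-CN", "en"): {"这台电脑很好": "this computer is very good"},
    ("en", "zh-CN"): {"this computer is very good": "这台电脑非常好"},
}

def translate(text, src, dst):
    """Stand-in for a machine-translation call (e.g. Google Translate)."""
    return TOY_PHRASE_TABLE[(src, dst)].get(text, text)

def back_translate(sentence, src="zh-CN", pivot="en"):
    """Translate S into a pivot language (S1), then back to the original (S2)."""
    s1 = translate(sentence, src, pivot)
    return translate(s1, pivot, src)

s = "这台电脑很好"
s2 = back_translate(s)
# The augmentation corpus pairs the original S with its paraphrase S2.
augmented = [s, s2]
```

Swapping `translate` for a real API client turns this sketch into a working augmentation pipeline; the structure stays the same.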

Related Work
We need a machine translation service that translates text into another language and back into the original language. Google Translate can accurately translate text into many different languages, so the Google Translate API is well suited to our translation tasks. We also recommend Google Sheets, which provides a convenient translation formula. Taking Chinese-English-Chinese as an example: =GOOGLETRANSLATE(GOOGLETRANSLATE(A1, "zh-CN", "en"), "en", "zh-CN"). Moreover, with a simple operation you can apply this formula over a whole column, so Google Sheets can also be used for back-translation.
For EDA, we use nlpcda (NLP Chinese Data Augmentation), an open source Python package. It can be installed conveniently with pip install nlpcda, and it easily implements synonym replacement, random swap, random deletion, and so on. Through synonym-replacement experiments, we found the following problems with Chinese synonym replacement: 1) The number of synonym replacements is set manually and is not precise enough. For example, short sentences may need 2 replaced keywords while long sentences may need 3 or more, but the synonym replacement algorithm cannot adjust the number of substitutions to the length of the input sentence.
2) After the replacement synonyms are embedded in the sentence, the semantics may not be well preserved, which can even change the label.
In short, each word replacement must be very rigorous, because it may change the meaning of the entire sentence and thus invalidate it. And we cannot arbitrarily change the order of characters in text processing, because their order carries semantics.
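The first problem above can be made concrete with a minimal synonym-replacement sketch (the same operation nlpcda's synonym replacement performs, here with a toy English synonym dictionary for readability). Note that the replacement count n is a fixed hand-set parameter, applied identically to short and long sentences:

```python
import random

# Toy synonym dictionary; a real setup would use a Chinese synonym resource.
SYNONYMS = {"good": ["great", "fine"], "fast": ["quick"], "cheap": ["inexpensive"]}

def synonym_replace(tokens, n, rng):
    """Replace up to n randomly chosen tokens that have synonyms."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

rng = random.Random(0)
short = ["good", "laptop"]
long_ = ["good", "fast", "cheap", "laptop", "with", "good", "screen"]
# The same n=2 may be too aggressive for the short sentence
# and too weak for the long one.
print(synonym_replace(short, 2, rng))
print(synonym_replace(long_, 2, rng))
```

Because n is fixed rather than scaled to sentence length, the augmentation strength varies wildly across inputs, which is exactly problem 1.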
In this paper, the text classification models we use are LSTM and CNN. LSTM (Long Short-Term Memory) [14] is a special kind of RNN [15], a neural network structure for data closely tied to sequence order. LSTM has mature applications in machine translation, text classification, QA (Question Answering), and other fields. As the distance between steps increases, a plain RNN cannot make full use of historical information. LSTM improves the RNN structure by adding a memory cell and three gating units, so historical information is effectively controlled. Instead of completely washing away the hidden state of the previous step as an RNN does, this enhances its ability to handle long text sequences and also mitigates the vanishing-gradient problem.
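The memory cell and three gates described above can be written out explicitly. In the standard LSTM formulation, with σ the sigmoid, ⊙ the elementwise product, and [h_{t-1}, x_t] the concatenation of the previous hidden state and current input:

```latex
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            % forget gate
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            % input gate
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)     % candidate memory
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   % memory cell update
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            % output gate
h_t = o_t \odot \tanh(c_t)                         % hidden state
```

The additive update of c_t is what lets information flow across long distances instead of being rewritten at every step, which is why the gradient does not vanish as quickly as in a plain RNN.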
CNN (Convolutional Neural Network) [16] is one of the important algorithms of deep learning, widely used in computer vision and natural language processing. The hidden part of a CNN model consists of three kinds of layers. The convolutional layer extracts features; the pooling layer significantly reduces the number of parameters (dimension reduction); the fully connected layer is the traditional neural network part that outputs the desired result.
In this paper, we propose Chinese text data augmentation based on back-translation, which generates corpus data to enrich the sentence patterns and lexical features of the text and improve performance on text classification tasks. We apply this method to three Chinese text-classification datasets. By training and evaluating deep learning classification models on the datasets and comparing the results, we show that the method is effective.

Datasets
There are three datasets, covering binary sentiment classification (positive and negative) and multi-class classification. The first dataset, ChnSentiCorp_htl_all (Data1), contains 7766 hotel reviews: 5322 positive and 2444 negative. The second, waimai_10k (Data2), includes 12,000 labeled Meituan take-away reviews: 4000 positive and 8000 negative. The third, online_shopping_10_cats (Data3), has more than 60,000 reviews in 10 categories, about 30,000 positive and 30,000 negative; detailed statistics are shown in Fig. 1. These datasets all come from GitHub [17] and share a common feature: the amount of data in the different categories is unevenly distributed.

Back-translation
Firstly, we analyze the number of samples in each category of the original data. For binary classification, if the two categories differ in size by a factor of two or more, the target category for back-translation is the smaller one. We set α, the fraction of the target category's data to be back-translated.
In Data1, Negative is somewhat under-represented compared with Positive, so we select Negative, translate it to English, and back-translate it to Chinese. To verify the validity of the data augmentation, we try different values of α, setting α = {0.25, 0.5, 1}. In Data2, by contrast, Positive has only half as many samples as Negative, so we expand the Positive data. In Data3, because the water heater category is very small, we set α = 1 and use two pivot languages (English and French) for back-translation, so each sentence generates two paraphrases. The sample counts after augmentation are shown in Table 2. After back-translation, the new data is combined with the original data to form the augmented corpora, denoted Data1α, Data2α, Data3α with α = {a: 0.25, b: 0.5, c: 1}.
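The target-selection rule for the binary datasets can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the threshold (larger class at least twice the smaller) is our reading of the rule stated above, and the counts come from the dataset descriptions.

```python
def backtranslation_budget(counts, alpha):
    """Pick the minority class and how many of its samples to back-translate.

    counts: {class_name: sample_count} for a binary dataset.
    Returns (target_class, n_to_backtranslate), or None when the classes
    are balanced enough that no augmentation is needed.
    """
    (c_small, n_small), (c_large, n_large) = sorted(
        counts.items(), key=lambda kv: kv[1]
    )
    if n_large < 2 * n_small:  # less than a 2x gap: leave the data alone
        return None
    return c_small, int(alpha * n_small)

# Data2: 4000 positive vs 8000 negative take-away reviews.
for alpha in (0.25, 0.5, 1):
    print(alpha, backtranslation_budget({"positive": 4000, "negative": 8000}, alpha))
```

With α = 0.25, 0.5, 1 this yields budgets of 1000, 2000, and 4000 back-translated Positive reviews for Data2.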

Text data processing
Firstly, we delete all punctuation, keeping only letters, numbers, and Chinese characters. Then we use jieba to segment the words and remove Chinese stop words. Next, the Tokenizer function vectorizes the text corpus, turning each text into a sequence of integers in which each integer represents a word index. Before building the model, we analyze the distribution of review lengths and compute the average length to set a suitable max_length for the model, ensuring that a uniform text length is input to the model. Because the data in the files is ordered by category, we randomly shuffle it before splitting into training and test sets; for all datasets, 10% is held out for testing and the remaining 90% is used for training.
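The preprocessing steps above can be sketched with the standard library only. In the actual pipeline jieba performs the word segmentation and Keras's Tokenizer builds the word index; here a whitespace split and a plain dict stand in for them so every step stays visible:

```python
import re

def clean(text):
    """Keep only letters, digits, Chinese characters, and spaces."""
    return re.sub(r"[^0-9A-Za-z\u4e00-\u9fff ]", "", text)

def build_index(tokenized_texts):
    """Map each word to an integer index; 0 is reserved for padding/unknown."""
    index = {}
    for tokens in tokenized_texts:
        for tok in tokens:
            index.setdefault(tok, len(index) + 1)
    return index

def to_sequence(tokens, index, max_length):
    """Turn tokens into indices, padded/truncated to a uniform max_length."""
    seq = [index.get(t, 0) for t in tokens][:max_length]
    return seq + [0] * (max_length - len(seq))

texts = ["酒店 很好 !!", "外卖 很 慢 ..."]
tokenized = [clean(t).split() for t in texts]  # jieba.lcut in the real pipeline
index = build_index(tokenized)
seqs = [to_sequence(toks, index, 5) for toks in tokenized]
```

Setting max_length from the average review length, as done above, keeps the padded sequences short without truncating most reviews.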

Build the model
We implemented the models with Keras. The LSTM model uses an LSTM layer with 128 cells, a dropout layer with rate 0.5, a first dense layer with ReLU activation, and a second with softmax; we train with the categorical_crossentropy loss and the Adam optimizer. The CNN model starts from an embedding layer and convolves the text matrix with filters of different lengths through 3 convolutional layers. Then 2 max-pooling layers operate on the vectors extracted from each filter. Finally, each filter yields a number, and these are concatenated into a vector representing the sentence. We set the dropout rate to 0.5 between the BatchNormalization and fully connected layers to prevent overfitting.

Evaluation Metrics
Before the experiments, we first introduce the evaluation metrics used for text classification and how they are calculated. We compute these metrics with classification_report from scikit-learn.
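The per-class quantities reported in the result tables (the same ones scikit-learn's classification_report prints) are computed from the confusion counts of that class. A minimal sketch, with toy counts for illustration:

```python
def metrics(tp, fp, fn):
    """Precision, Recall, and F1 for one class from its confusion counts.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    precision = tp / (tp + fp)          # of predicted positives, how many were right
    recall = tp / (tp + fn)             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Toy example: 80 true positives, 20 false positives, 10 false negatives.
p, r, f1 = metrics(80, 20, 10)
```

Support, the remaining column in the tables, is simply the number of true samples of the class (tp + fn).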

RESULTS AND COMPARISON
In all tables, Precision, Recall, F1 score, and Support are reported for the back-translated category. In Table 3, when training on Data1c, CNN has the highest Precision, Recall, F1 score, and Test Accuracy, and LSTM also performs well. The experimental data of the two models indicate that classification performance is greatly improved.
As can be seen from Table 4, in the CNN model the Test Accuracy is best on Data2a. This shows that accuracy does not necessarily increase with the amount of expanded data. In Table 5, Precision decreases significantly, but Recall, F1 score, and Test Accuracy increase. We think the quality of the augmented corpus may not be good enough, introducing additional noise. Although using only two pivot languages is not enough, we must also weigh the simplicity and implementability of back-translation. Balancing the Data3 classes alone adds some noise to the training data, and even many repeated word features, so we do not get the expected effect. Table 6 compares the best accuracy with the baseline and shows the performance improvement. Across all experiments, most metrics improve after back-translation data augmentation. The average accuracy is 88.1%, which is 2.3% higher than the baseline average accuracy of 85.8%. Although the increase is relatively small, it at least shows that back-translation is effective.
Table 7 compares the EDA and back-translation augmentation techniques with the LSTM model, reporting the best Test Accuracy of each; EDA is not as good as back-translation. On Data3 it can be seen that EDA may even reduce model performance: during augmentation, random operations may change a sentence's meaning while the original label is kept, producing a sentence with the wrong label. This does not happen with back-translation.

CONCLUSIONS
In this paper, we introduce the back-translation data augmentation method and show that it can boost the performance of Chinese text classification, especially when training on smaller datasets. In future work, we will investigate how the quality of the back-translated corpus affects classification, and whether this method can also improve more advanced deep learning models.