Improving Neural Chinese Word Segmentation Using Unlabeled Data

Supervised word segmentation relies heavily on large-scale, high-quality labeled data. However, building such a corpus is difficult, especially for domain-specific data. In this paper, we propose a novel semi-supervised Chinese word segmentation (CWS) method. Specifically, we seek to select useful sample sentences from large-scale unlabeled data to extend the training set, by means of a sampling strategy based on character-based semantic similarity. The proposed similarity algorithm measures the similarity between unlabeled sentences and the training data, which helps identify informative samples. In addition, we integrate an attention mechanism into our word segmentation model to focus on available contextual information. Experiments on the PKU, MSR and Weibo benchmark datasets show that our method outperforms previous neural network models and state-of-the-art methods.


Introduction
Chinese word segmentation (CWS) is a preliminary and important task for many Chinese natural language processing (NLP) tasks, such as machine translation (MT), question answering (QA) and named entity recognition (NER). Neural word segmentation has shown promising progress in recent years [1] [2] [3]. However, neural word segmentation, mainly trained by supervised learning, relies on large-scale labeled data, which is usually expensive to obtain and limited in amount. The lack of labeled data makes it very difficult to capture useful semantic information from contexts and thus limits the performance of word segmentation models.
To address this problem, researchers have tried to use large-scale unlabeled data to improve supervised word segmentation. One intuitive idea is to adopt a semi-supervised self-training method. It exploits a small amount of labeled data to train an initial word segmentation model, which is then used to annotate unlabeled data. The annotation results are added to the training set to update the parameters of the word segmentation model in the next iteration. However, there remain some challenges; for example, some existing semi-supervised word segmentation methods cannot pick out samples containing rich features. These issues have greatly limited the effect of word segmentation.
In this paper, we propose a novel semi-supervised word segmentation method that leverages semantic similarity. Our work selects samples based on two aspects: the uncertainty of annotation and the semantic similarity between unlabeled sentences and the training data. Moreover, we adopt an attention-based neural word segmentation model that dynamically focuses on the character-based n-gram contextual embeddings. The contributions of this paper can be summarized as follows.
- Semantic similarity is introduced into semi-supervised learning for enhancing CWS for the first time; it is effective for measuring the diversity of samples.
- To improve the performance of CWS, we integrate an attention mechanism into our CWS model, which dynamically focuses on the contextual information and captures useful features (Section 3.2).
- We propose a new semantic similarity calculation method based on character-based n-gram embeddings, which incorporates rich contextual information (Section 3.3).

Related Work
Semi-supervised learning is widely used in word segmentation tasks. Zheng et al. [4] and Sun et al. [5] experimented with deriving statistics-based features such as mutual information, accessor variety and punctuation variety from unlabeled data, and added them to supervised discriminative training. Yang et al. [6] adopted partial-label learning with conditional random fields to make use of the valuable knowledge contained in unlabeled data. Melamud et al. [7] presented a neural model for learning a generic context embedding function from large corpora and applied it to several NLP tasks. Furthermore, Peters et al. [8] directly used a large-scale unlabeled corpus to pre-train language models. By using language models, character-level knowledge can be integrated into context vectors and the potential meaning of character knowledge can be well encoded.
One of the semi-supervised learning methods used in NLP is bootstrapping, which generates more training instances by automatically labeling large-scale data. This is also the main idea of our method. Yang et al. [9], Zeng et al. [10] and Zhang et al. [11] experimented with co-training for semi-supervised Chinese word segmentation. Huang et al. [12] adopted self-training to increase the quality of automatically parsed data. Liu et al. [13] combined self-training and character clustering to address domain adaptability. Inspired by this previous work, we propose a novel sampling strategy for self-training.

Semi-supervised Learning for CWS
Semi-supervised self-training can be regarded as a learning method that continuously increases the training data. Our method consists of two main stages. In the first stage, we train an initial segmentation model (Attention+BiLSTM+CRF) with a small amount of labeled data and use it to annotate unlabeled samples (see Section 3.2). In the second stage, we obtain samples from large-scale unlabeled sentences according to the sampling strategy (see Section 3.3). The annotated samples are added to the training data for the next iteration. We obtain the optimal word segmentation model once the iterations converge. Figure 1 illustrates the framework of our semi-supervised learning for CWS.
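The two-stage loop above can be sketched as follows. This is a minimal illustration, not the paper's code: `train`, `annotate`, and `select_samples` are hypothetical stand-ins for model training, automatic annotation, and the sampling strategy of Section 3.3.

```python
def self_train(labeled, unlabeled, train, annotate, select_samples,
               n_iters=5, per_iter=1000):
    """Iteratively extend the training data with selected, auto-labeled samples."""
    model = train(labeled)                       # stage 1: initial model
    for _ in range(n_iters):
        # stage 2: pick the most valuable unlabeled sentences
        picked = select_samples(model, unlabeled, k=per_iter)
        # annotate them with the current model and extend the training set
        labeled = labeled + [annotate(model, s) for s in picked]
        unlabeled = [s for s in unlabeled if s not in picked]
        model = train(labeled)                   # retrain on the extended set
    return model
```

In practice `select_samples` would implement the scoring strategy of Section 3.3; here it is left abstract.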

The Initial CWS model
3.2.1. BiLSTM+CRF Architecture for CWS. The CWS task is usually solved by a character-level sequence labeling algorithm. Specifically, each character in a sentence is labeled as one of {B, M, E, S}, indicating the beginning, middle, or end of a word, or a single-character word. For a given sentence $X = (x_1, x_2, \dots, x_n)$ containing $n$ characters, the aim of the CWS task is to predict the label sequence $Y = (y_1, y_2, \dots, y_n)$. The BiLSTM+CRF architecture (our baseline) for CWS is illustrated in Figure 2. Similar to other methods using neural networks, the first step of the BiLSTM+CRF based CWS method is to represent characters as distributed vectors. Formally, we look up the embedding vector $e_i \in \mathbb{R}^d$ from the embedding matrix for each character $x_i$, where $d$ is a hyper-parameter indicating the size of the character embedding.
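As a concrete illustration of the labeling scheme (a minimal sketch, not part of the paper's model), a segmented sentence can be converted to its character-level tag sequence:

```python
def to_bmes(words):
    """Convert a segmented sentence (list of words) into the character-level
    B/M/E/S tag sequence: B/M/E mark the beginning, middle and end of a
    multi-character word; S marks a single-character word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags
```

For example, the segmentation ["中国", "人"] yields the tags B, E, S.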
In order to incorporate information from both sides of the sequence, we use bidirectional Long Short-Term Memory (BiLSTM) as feature layers. The update of each BiLSTM unit can be described as follows:
$$h_i = [\overrightarrow{h_i} ; \overleftarrow{h_i}] \quad (1)$$
where $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ are the hidden states at position $i$ of the forward and backward LSTMs respectively, and $[\cdot\,;\cdot]$ is the concatenation operation. Following Lample et al. [14], we employ a conditional random field (CRF) layer to infer labels, which is beneficial for modeling the dependencies of adjacent labels. For example, a B (begin) label should only be followed by an M (middle) or E (end) label. Given a label sequence $y = (y_1, y_2, \dots, y_n)$, the CRF score for this sequence can be calculated as:
$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \quad (2)$$
where $A$ is a matrix of transition scores such that $A_{i,j}$ represents the score of a transition from tag $i$ to tag $j$; $y_0$ and $y_{n+1}$ are the start and end tags of a sentence, which we add to the set of possible tags.
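Under these definitions, the sequence score can be computed directly. The sketch below is an illustration under stated assumptions: `P` is the (n, k) emission matrix and `A` a (k+2, k+2) transition matrix whose last two indices are the added start and end tags.

```python
import numpy as np

def crf_score(P, y, A, start, end):
    """Sequence score s(X, y): emission scores P[i, y_i] plus transition
    scores A[y_i, y_{i+1}], including the start/end transitions."""
    s = A[start, y[0]] + P[0, y[0]]          # transition from start tag
    for i in range(1, len(y)):
        s += A[y[i - 1], y[i]] + P[i, y[i]]  # adjacent-tag transition + emission
    return s + A[y[-1], end]                 # transition to end tag
```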
$P$ is the score matrix output by the BiLSTM network; $P_{i,j}$ corresponds to the score of the $j$-th label for the $i$-th character in a sentence.
The output of the model is the tag sequence with the largest score $s(X, y)$. A softmax over all possible tag sequences yields a probability for the sequence $y$:
$$p(y|X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}} \quad (3)$$
During training, we directly maximize the log-probability of the correct tag sequence:
$$\log p(y|X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} \quad (4)$$
While decoding, we predict the output sequence with the highest score:
$$y^* = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y}) \quad (5)$$
where $Y_X$ denotes all possible tag sequences for sentence $X$. We use the Viterbi algorithm to find the optimal label sequence.
Figure 3. The attention mechanism. If the character at the current moment is "gei", the attention mechanism attends to the n-gram contextual embeddings of this character.
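Viterbi decoding over the CRF scores can be sketched as follows. This is an illustration, not the paper's implementation; `P` is the emission score matrix, `A` the tag-transition matrix, and start/end transitions are omitted for brevity.

```python
import numpy as np

def viterbi(P, A):
    """Find the highest-scoring tag sequence under emission scores P (n, k)
    and transition scores A (k, k)."""
    n, k = P.shape
    dp = P[0].copy()                     # best score ending in each tag at position 0
    back = np.zeros((n, k), dtype=int)   # backpointers
    for i in range(1, n):
        # scores[prev, cur] = best score of reaching `cur` at i via `prev`
        scores = dp[:, None] + A + P[i][None, :]
        back[i] = scores.argmax(axis=0)
        dp = scores.max(axis=0)
    path = [int(dp.argmax())]
    for i in range(n - 1, 0, -1):        # follow backpointers
        path.append(int(back[i, path[-1]]))
    return path[::-1], float(dp.max())
```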

3.2.2. Attention+BiLSTM+CRF Architecture for CWS. It is necessary to use an empirically good segmentation model in the process of semi-supervised word segmentation. Therefore, we improve the performance of the baseline system by integrating an attention mechanism, illustrated in Figure 3. During training, the attention mechanism selectively attends to the character-based n-gram contextual embeddings. The input of the BiLSTM+CRF architecture is computed as a weighted sum of the character-based n-gram contextual embeddings:
$$a_i = \sum_{j=1}^{N} \alpha_j g_j \quad (6)$$
The weight of each annotation is computed by:
$$\alpha_j = \frac{\exp(e_j)}{\sum_{k=1}^{N} \exp(e_k)} \quad (7)$$
where
$$e_j = f(W g_j) \quad (8)$$
Here, $g_j$ represents the character-based $j$-gram contextual embedding, and $f$ is the tanh function.
$W$ is a trainable weight parameter, and $a_i$ denotes the output of the attention mechanism. Because of the sparsity of higher-order n-gram contextual embeddings of characters, we set $N$ to 3; that is, we only use the unigram, bi-gram and tri-gram contextual embeddings of characters. To obtain these contextual embeddings, we first preprocess each sentence. After processing, the sentence "woshizhongguoren." (I am Chinese.) becomes "wo shi zhong guo ren .", "<BOS>wo woshi shizhong zhongguo guoren ren. .<EOS>" and "<BOS0><BOS>wo <BOS>woshi woshizhong shizhongguo zhongguoren guoren. ren.<EOS> .<EOS><EOS0>" respectively. We then train these vectors with the word2vec [15] toolkit on the resulting n-gram sequences.
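The n-gram padding scheme and the attention-weighted input can be sketched together as follows. This is a simplified illustration: the boundary-token names (a single `<BOS>`/`<EOS>` repeated) and the single weight vector `W` are simplifications of the scheme described above.

```python
import numpy as np

def ngram_sequences(chars, n_max=3, bos="<BOS>", eos="<EOS>"):
    """Build uni-/bi-/tri-gram token sequences from a character list,
    padding each end with n-1 boundary tokens."""
    seqs = {}
    for n in range(1, n_max + 1):
        padded = [bos] * (n - 1) + list(chars) + [eos] * (n - 1)
        seqs[n] = ["".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]
    return seqs

def attention_input(grams, W):
    """Attention-weighted sum of one character's n-gram contextual embeddings,
    a sketch of formulas (6)-(8); W is the trainable weight assumed by the text."""
    e = np.array([np.tanh(W @ g) for g in grams])    # alignment scores e_j
    alpha = np.exp(e) / np.exp(e).sum()              # softmax weights alpha_j
    return sum(a * g for a, g in zip(alpha, grams))  # weighted sum a_i
```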

Sampling Strategy
3.3.1. Similarity between unlabeled sentences and training data. To extract useful samples from the large-scale unlabeled corpus, we propose to calculate the character-based semantic similarity between unlabeled sentences and training set sentences. In general, the more a sample sentence differs from the training set sentences, the more new semantic and structural features it contains. Thus we seek to select sample sentences that differ most from the training set, which can be deduced from the similarity between unlabeled sentences and training set sentences. Formally, the character-based semantic similarity is defined as follows:
$$\mathrm{sim}(u, t) = \sum_{n=1}^{3} w_n \, d(v_u^n, v_t^n) \quad (9)$$
where $v_u^n$ denotes the unlabeled sentence vector and $v_t^n$ denotes the training set sentence vector; $d(\cdot, \cdot)$ denotes the distance between two vectors, which can be calculated via cosine distance or Euclidean distance; $w_n$ is the weight of the distance calculated from the three kinds of sentence vectors. We define the sentence vector as:
$$v_u^n = \frac{1}{L} \sum_{i=1}^{L} g_i^n \quad (10)$$
where $g_i^n$ represents the $n$-gram contextual embedding of the $i$-th character and $L$ denotes the length of the sentence. There are three kinds of sentence vectors due to the different values of $n$, which is set to 1, 2 and 3 respectively; we obtain these vector representations as described in Section 3.2.2. $v_t^n$ is defined similarly to $v_u^n$.
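The sentence-vector averaging and the weighted distance combination above can be sketched as follows. This is an illustrative sketch: it uses cosine similarity and equal weights for the three n-gram levels, both of which are assumptions.

```python
import numpy as np

def sentence_vec(ngram_embs):
    """Average a sentence's n-gram contextual embeddings into one vector."""
    return np.mean(ngram_embs, axis=0)

def similarity(u_vecs, t_vecs, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted combination of per-n cosine similarities between the sentence
    vectors of an unlabeled sentence and a training sentence."""
    total = 0.0
    for w, u, t in zip(weights, u_vecs, t_vecs):
        cos = np.dot(u, t) / (np.linalg.norm(u) * np.linalg.norm(t))
        total += w * cos
    return total
```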

The Uncertainty of Annotation.
The higher the uncertainty of a sample sentence's annotation, the more useful features the sample sentence contains. Therefore, we choose samples with higher annotation uncertainty from the unlabeled data.
Additionally, word segmentation can be cast as a two-class problem according to one factor: whether a character is the right boundary of a word in a sentence. Specifically, the labels B, M, E, S can be divided into two categories: B and M are grouped together and denoted as N, indicating that the character is not the right boundary of a word; E and S are grouped and denoted as Y, indicating that the character is the right boundary of a word.
We use information entropy to measure the uncertainty of sequence annotation, formally denoted as:
$$H(c) = -p_N(c)\log p_N(c) - p_Y(c)\log p_Y(c) \quad (11)$$
where $c$ is a character from an unlabeled sentence, $p_N(c) = p_B(c) + p_M(c)$, $p_Y(c) = p_E(c) + p_S(c)$, and $p_B(c)$ represents the posterior probability that $c$ is marked as B; $p_M(c)$, $p_E(c)$ and $p_S(c)$ are defined similarly.
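The per-character uncertainty can be computed directly. A minimal sketch, using base-2 logarithms (the base is an assumption):

```python
import math

def boundary_entropy(p_b, p_m, p_e, p_s):
    """Binary entropy of the boundary decision for one character:
    B/M posteriors are grouped into N (not a right boundary) and
    E/S posteriors into Y (a right boundary)."""
    p_n, p_y = p_b + p_m, p_e + p_s
    h = 0.0
    for p in (p_n, p_y):
        if p > 0:                 # 0 * log(0) is taken as 0
            h -= p * math.log(p, 2)
    return h
```

A character whose boundary decision is a coin flip has maximal entropy 1; a certain decision has entropy 0.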

Sampling Strategy.
Due to the lack of labeled data, CWS models cannot learn enough features. In general, the more a sample sentence differs from the sentences of the training set, the more semantic features it contains. Therefore, selecting useful sample sentences relies on two aspects: the uncertainty of annotation and the semantic similarity between unlabeled sentences and training set sentences. The scoring model for selecting sample sentences is finally defined as:
$$\mathrm{score}(u) = \lambda \frac{1}{L} \sum_{i=1}^{L} H(c_i) + (1 - \lambda)\,(1 - \mathrm{sim}(u, t)) \quad (12)$$
where $u$ represents a sentence in the unlabeled data, $L$ indicates the number of characters in the sentence, $c_i$ represents the $i$-th character of the unlabeled sentence, $t$ is a training set sentence, and $\lambda$ is a weight parameter. The higher the score, the more valuable the sample.
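A minimal sketch of the scoring model in formula (12), under the assumption that the two terms are the average per-character annotation entropy and one minus the similarity to the training set:

```python
def sample_score(entropies, sim_to_train, lam=0.5):
    """Score an unlabeled sentence: lam weights annotation uncertainty
    (average per-character entropy) against dissimilarity to the training set."""
    avg_h = sum(entropies) / len(entropies)
    return lam * avg_h + (1 - lam) * (1 - sim_to_train)
```

Sentences with uncertain annotations that are unlike the training data score highest and are selected first.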

Datasets
We evaluate our proposed method on three prevalent CWS datasets: the NLPCC 2016 Weibo dataset [16] and the PKU and MSR datasets from SIGHAN 2005 [17]. Table 1 gives the details of the three datasets. For PKU and MSR, we use 10% of the shuffled training set as the development set. All datasets are preprocessed by replacing numbers and continuous English characters with special flags. Performance is evaluated by F-score.
In our experiments, we use about 2.2 million unlabeled Weibo sentences collected from the Internet. Moreover, for the PKU and MSR datasets we use a comparable amount, about 2.3 million sentences, from the Sogou Lab news corpus.

Parameter Settings
The hyper-parameters used in the experiments are shown in Table 2. We use different training batch sizes for the three datasets: 512 for PKU, 256 for MSR, and 128 for Weibo.

The Results of the Initial Word Segmentation Model.
The initial word segmentation model for the semi-supervised method in this paper is an attention-based neural segmentation model (Attention+BiLSTM+CRF). Due to the sparsity of data, our experiments only use the unigram, bi-gram and tri-gram vector representations of characters. The results are shown in Table 3. The result of the first group is lowest, which demonstrates the effectiveness of character contexts. To prove the effectiveness of the attention mechanism, we also train concatenation-based models for comparison. As shown in Table 3, the average performance is boosted by 0.30% in F-score from the baseline to the attention-based model, and by 0.10% from the concatenation-based model to the attention-based model when N is set to 3, which indicates that the attention mechanism is effective at integrating contextual information. In addition, we observe that our attention-based segmentation model achieves an improvement on smaller datasets (e.g. Weibo and PKU) and has less effect on a dataset with sufficient training data (MSR). To ensure the precision of the word segmentation model, we finally choose Attention+BiLSTM+CRF (N=3) as our initial word segmentation model.
a: Our baseline neural word segmentation model, which uses the character embeddings as input.
b2: The concatenation of the unigram and bi-gram embeddings is used as the input.
b3: The concatenation of the unigram, bi-gram and tri-gram embeddings is used as the input.
c2: The attention-based segmentation model that focuses on the unigram and bi-gram embeddings.
c3: The attention-based segmentation model that focuses on the unigram, bi-gram and tri-gram embeddings.

The Results of Semi-supervised Learning Segmentation.
We first investigated the impact of $\lambda$ in formula (12) on segmentation performance. The parameter is searched from 0.1 to 1 with a step size of 0.1. We substitute each value of $\lambda$ into formula (12) and select the most valuable samples according to the sampling strategy. In each iteration, 1000 samples are selected, and the number of iterations is 5. As Figure 4 shows, the segmentation effect is best when $\lambda$ is set to 0.5 in both cases. The two distance metrics show little difference in Figure 4; cosine distance gives a small improvement over Euclidean distance.
To verify the effectiveness of our sampling strategy, we conduct a comparative experiment that selects different numbers of samples randomly. We apply the sampling strategy proposed in Section 3.3 to the unlabeled data; the annotation of the unlabeled data is based on the attention-based segmentation model. Experimental results are shown in Figure 5. As the amount of selected data increases, the performance of our method becomes better, stable, and convergent, showing that the proposed method can continuously learn knowledge from the selected instances. Our method achieves better performance than random sampling, demonstrating that the sampling strategy helps improve segmentation accuracy.
Furthermore, we add a comparative experiment that uses word-based semantic similarity, i.e., similarity calculated at the word level; the unlabeled sentences are annotated with the initial word segmentation model. As shown in Table 4, the character-based method outperforms the word-based semantic similarity representation on all three datasets. The results show that character-based models have the potential to capture morpheme patterns, thereby improving the generalization ability of word segmentation models. Our final results compare favourably with the best neural models, including one using adversarial learning [1] and one leveraging abundant pretraining [2]. As shown in Table 5, our method gives the best accuracies on all corpora except Weibo, where it underperforms the method of Yang et al. [2]. Compared with the previous state-of-the-art semi-supervised method [5], our semi-supervised method is more stable, and its performance improves as the amount of unlabeled data increases. The experimental results verify that our method is simple but robust, which is of practical importance.

Conclusion and Future Work
In this paper, we propose a novel semi-supervised CWS method based on semantic similarity. Specifically, we propose a sampling strategy that helps select more valuable sample sentences from massive unlabeled data to extend the training set, and we introduce character-based semantic similarity into this sampling strategy. Furthermore, we introduce an attention-based neural segmentation model, which dynamically focuses on useful contextual information. The experimental results on the Weibo dataset show that our method achieves better word segmentation through iteration. More importantly, they also show that our sampling strategy effectively chooses samples with richer features. In addition, experimental results on the PKU and MSR datasets show that our method is robust across domains.
In future research, we hope to use the proposed method to automatically acquire training corpora. Moreover, we wish to optimize the semi-supervised sampling strategy so that it achieves better domain adaptability.