A BERT-based Chinese Named Entity Recognition Method on ASEAN News

As the first step in building a knowledge graph to record information about the ASEAN countries, we aim to conduct Named-entity Recognition (NER) on Chinese news about the ASEAN countries. We employ a bi-directional gated recurrent unit in place of the LSTM architecture to improve the model's effectiveness and its capability to understand polysemous words. The state-of-the-art word embedding model, BERT, is also included to generate high-quality word vectors for the NER task. In addition, we propose a similarity-based dataset partition method that helps the model learn the polysemy within Chinese news. Experiments demonstrate that the combination of these improvements benefits the model's performance in identifying different types of named entities.


Introduction
Due to the Belt and Road diplomatic strategy and the establishment of the ASEAN-China community with a shared future, Chinese researchers have conducted many studies on a wide range of related topics. As a sub-problem of building a knowledge graph to record the ASEAN countries' trending information and diplomatic policies, automatically extracting key knowledge from the internet becomes a challenge. Taking a step toward addressing this challenge, we aim to utilize existing Named-entity Recognition (NER) technologies to extract key information from plain Chinese texts.
The concept of NER first appeared at the Message Understanding Conference [3], and it focuses on extracting structured information from semi-structured and unstructured texts. Early research in this direction was based only on dictionaries, manual rules, and statistical methods. With the improvement of computational technology, the emergence of machine learning and deep learning models tremendously pushed the development of a large number of Natural Language Processing (NLP) tasks. However, there are still many difficulties when adopting such technologies for Chinese texts. On the one hand, most NLP methods are based on English and are infeasible to apply to Chinese texts directly, because Chinese texts have no explicit delimiter, i.e., whitespace, to separate words. On the other hand, given the various types of entities in the ASEAN countries, a large number of polysemous words are introduced when Chinese is used to record information originally written in foreign languages. For example, the largest city of Vietnam is "Ho Chi Minh City", where "Ho Chi Minh" may be misidentified as a person's name. Besides, abbreviations and aliases are also difficult for algorithms to handle, e.g., "Yindunixiya" (Indonesia) is written as "Yinni" in most news, and the capital of Thailand, "Bangkok", can be referred to as the "City of Angels".
In this paper, instead of the LSTM model, we utilize a bi-directional gated recurrent unit with conditional random field layers to identify named entities, because this model suffers less from the gradient vanishing problem and achieves better effectiveness and efficiency than LSTM when processing long sequence data. On the other hand, we introduce the pre-trained BERT model to generate high-quality word vectors, which outperforms word2vec methods on word embedding tasks. Moreover, the order of the training examples has a noticeable impact on the trained model's performance, especially when handling polysemous words. Motivated by this, we propose a similarity-based dataset partition method, which enables the model to begin learning polysemy at an early training stage.

Related work
Let us start with a brief overview of the development of Named-entity Recognition (NER) on Chinese texts and word embedding methods.

Chinese Named-entity Recognition
As an essential sub-task of information extraction, NER aims to locate and classify named entities, e.g., organizations, locations, and people's names, from unstructured and semi-structured texts [3]. Compared with English, which has relatively more straightforward grammar and does not require word segmentation, conducting NER on Chinese text is much more difficult. The development of NER on Chinese texts started in the early 1990s. Su et al. [9] proposed a segmentation and tagging system for Chinese texts. Since then, great effort has been made to improve the performance of NER methods in this direction, e.g., [4,11,12].

Word Embedding
Currently, most NLP tasks, including NER, rely on word embedding technologies, which project words and phrases in human languages into vector spaces and enable downstream models to further process such information in vectorized form for specific tasks. Proposed by Mikolov et al. [6], the word2vec toolkit comprises the Continuous Bag-of-Words (CBOW) and continuous skip-gram models and significantly boosted word vectors' representational capability. However, word2vec embedding yields only one vector for each word, which means this technology cannot properly handle the polysemy of human language. Peters et al. [7] introduced a bi-directional LSTM, Bi-LSTM for short, as the feature extractor and improved word embeddings' handling of polysemy. At present, the most widely used pre-trained word vectors are mostly generated by BERT [2], which replaces the Bi-LSTM extractor with the Transformer [10] and is currently the state-of-the-art word embedding technology across most NLP tasks.

Methodologies
In this section, we introduce all the necessary components in our Chinese NER method, including data preparation and the details of our model.

Data Preparation
The data set presented herein consists of real news from the websites of Chinese embassies in ten ASEAN countries. It contains 1000 news texts covering politics, economy, trade, education, culture, science and technology, military, and other fields. We utilized a crawler script to download the plain texts of the news under the constraints of network regulations. After that, standard data cleaning was conducted on the downloaded data, such as deleting redundant symbols through regular expressions and manual rules, labelling the title and main body of the news texts, and unifying their format.
To label the plain texts effectively, we adopted the THU Lexical Analyzer for Chinese (THULAC) [5] toolkit on the cleaned data set for Chinese word segmentation and part-of-speech tagging. Then, we double-checked the labels manually to correct errors. In this paper, we focus on three entity types, i.e., locations (LOC), organizations (ORG), and people's names (PER). For better performance, we employ the IOBES tagging scheme [8], a variant of IOB, for sequence labelling. We present an example of IOBES tagging on our dataset in Fig. 1.
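The IOBES scheme marks each token as Beginning, Inside, or End of a multi-token entity, as a Single-token entity, or as Outside any entity. The conversion can be illustrated with a small sketch (hypothetical tokens and a hand-written converter, not part of our labelling pipeline):

```python
# A minimal illustration of IOBES sequence labelling. Input tokens are
# (word, entity_type or None) pairs after segmentation; consecutive words
# sharing an entity type are treated as one entity span.
def to_iobes(tokens):
    tags = []
    i = 0
    while i < len(tokens):
        word, ent = tokens[i]
        if ent is None:
            tags.append("O")
            i += 1
            continue
        # find the end of the span with the same entity type
        j = i
        while j + 1 < len(tokens) and tokens[j + 1][1] == ent:
            j += 1
        if i == j:                        # single-word entity
            tags.append(f"S-{ent}")
        else:                             # multi-word entity: B ... I ... E
            tags.append(f"B-{ent}")
            tags.extend(f"I-{ent}" for _ in range(i + 1, j))
            tags.append(f"E-{ent}")
        i = j + 1
    return tags

sentence = [("胡志明", "LOC"), ("市", "LOC"), ("位于", None), ("越南", "LOC")]
print(to_iobes(sentence))  # → ['B-LOC', 'E-LOC', 'O', 'S-LOC']
```

Note how "胡志明市" (Ho Chi Minh City) spans two segmented words and receives B/E tags, while the single-word entity "越南" (Vietnam) receives an S tag.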

Gated Recurrent Unit
Although the Transformer [10] outperforms traditional Recurrent Neural Networks (RNNs) on many tasks, RNNs still play an important role when dealing with sequence data. The LSTM network is a classical recurrent architecture: it utilizes three gate structures to control the data flow and decide which part of the information should be memorized and which part should be discarded. The Gated Recurrent Unit (GRU) [1] is a simplified version of LSTM, which has only two gates, i.e., the update gate and the reset gate. The update gate determines the information that the hidden layer discards at time $t$, while the reset gate determines the amount of information that the hidden layer retains at time $t$. An illustration of GRU is presented in Fig. 2. Given an input $x_t$, the above procedure can be written as follows:

$z_t = \sigma(W_z x_t + U_z h_{t-1}),$
$r_t = \sigma(W_r x_t + U_r h_{t-1}),$
$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \circ h_{t-1})),$
$h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t,$

where $\circ$ is the element-wise multiplication, and $(W_z, U_z)$, $(W_r, U_r)$, and $(W_h, U_h)$ represent the parameters of the update gate, reset gate, and hidden unit, respectively.

Figure 2. An illustration of a GRU neuron at time $t$ with a given input $x_t$. We let $r_t$ and $z_t$ represent the reset gate and update gate, the candidate hidden unit is denoted by $\tilde{h}_t$, and $\sigma$ and $\tanh$ denote the sigmoid and hyperbolic tangent activation functions, respectively.
This design mitigates the gradient vanishing problem and enables better effectiveness and efficiency than LSTM when processing long sequence data. Besides, because bi-directionality helps a model better understand the input sequence, we employ a bi-directional GRU to extract semantic information from the pre-trained word vector sequences.
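The GRU update procedure can be sketched as a single recurrent step in NumPy. The weights below are random placeholders for illustration only, not trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One GRU step: update gate z, reset gate r, candidate state, then the
# convex combination of the previous and candidate hidden states.
def gru_step(x_t, h_prev, params):
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # candidate hidden state
    return (1 - z) * h_prev + z * h_cand            # new hidden state

rng = np.random.default_rng(0)
d_in, d_h = 4, 3  # toy input and hidden dimensions
params = [rng.standard_normal(shape)
          for shape in [(d_h, d_in), (d_h, d_h)] * 3]

h = np.zeros(d_h)
for x_t in rng.standard_normal((5, d_in)):  # a length-5 input sequence
    h = gru_step(x_t, h, params)
print(h.shape)  # → (3,)
```

Because the new state is a convex combination of the previous state and a tanh-bounded candidate, every hidden activation stays within (-1, 1), which is part of why gradients are better behaved than in a plain RNN.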

The overall working pipeline
In this paper, we employ a BERT-base model pre-trained on Chinese texts to generate word vectors. For NER, we train a bi-directional gated recurrent unit (BiGRU) with conditional random field (CRF) layers to extract features from the word vector sequences and identify the named entities. The whole pipeline is visualized in Fig. 3.
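At inference time, the CRF layer decodes the BiGRU's per-token tag scores into the highest-scoring tag sequence via the Viterbi algorithm. The decoding step can be sketched as follows; this is a generic Viterbi decoder over illustrative scores, not our trained model:

```python
import numpy as np

# Viterbi decoding: given per-token tag scores (emissions, e.g. from a BiGRU)
# and a tag-transition score matrix, recover the best-scoring tag sequence.
def viterbi_decode(emissions, transitions):
    T, K = emissions.shape               # sequence length, number of tags
    score = emissions[0].copy()          # best score ending in each tag
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # score of moving from every previous tag to every current tag
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # follow back-pointers from the best final tag
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# toy example with 3 tags; the transition score discourages tag 0 -> tag 2,
# mimicking how a CRF penalizes invalid tag bigrams such as "O -> E-LOC"
emissions = np.array([[2.0, 0.5, 0.1],
                      [0.2, 1.5, 1.4],
                      [0.3, 0.2, 2.0]])
transitions = np.zeros((3, 3))
transitions[0, 2] = -5.0
print(viterbi_decode(emissions, transitions))  # → [0, 1, 2]
```

This is the key advantage of adding a CRF on top of the BiGRU: the transition matrix lets the model reject locally plausible but globally inconsistent IOBES sequences.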

Text similarity based dataset partition
To make better use of the training data and improve the model's performance, we propose a novel data partition method based on a hash-and-cosine similarity algorithm, namely HashCos. An intuitive way to mitigate polysemy is to reduce the similarity of adjacent training examples, which allows the trained model to see polysemous words as early as possible. Therefore, we combine a hash process and cosine distance to quantify the similarity between text examples. This algorithm consists of four steps, i.e., word segmentation, hashing, weighted merging, and cosine similarity calculation. The specific process is as follows: (1) As mentioned in Sec. 3.1, Chinese word segmentation is done with the THULAC toolkit. Besides, we store the top 20 most frequent words and take their frequencies as weights that represent the importance of these words in the text.
(2) Based on the segmentation result, we then apply a hash function to each word, which transforms the word into a binary number of a pre-defined length.
(3) After the hash processing, we merge all hashed results of a text example according to the weights calculated in (1): each 1-bit contributes the word's weight positively, and each 0-bit contributes it negatively. For example, assume the weight of the word "ASEAN" is 3 and its hash value is "101010"; the string of weighted numbers obtained is [3,-3,3,-3,3,-3]. For the whole text example, we accumulate the weighted numbers of the top 20 most frequent words. For instance, if the weighted numbers of "ASEAN" and "Thailand" are [3,-3,3,-3,3,-3] and [1,1,-1,1,1,-1], the weighted merging of these two words is [4,-2,2,-2,4,-4].
(4) Finally, the similarity between examples is computed via the cosine distance between the corresponding weighted numbers. In practice, we first randomly shuffle the data set and partition it into three parts at a ratio of 7:1:2, where the largest part is for training, the smallest part is the validation set, and the rest is the test set. For the training set, we randomly select an example as the first example and compute its similarity to the other training examples as described above. Then we sort the remaining training examples in descending order of their similarity to the selected example.
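Steps (2)-(4) can be sketched in pure Python, assuming the text is already segmented into words. MD5 stands in here for the unspecified hash function, and the toy documents are hypothetical:

```python
import hashlib
from collections import Counter
from math import sqrt

NBITS = 64  # pre-defined hash length (an assumption for this sketch)

# step (2): hash a word into a fixed-length bit vector
def word_hash(word):
    h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
    return [(h >> i) & 1 for i in range(NBITS)]

# steps (1) and (3): weight the top-k most frequent words by frequency and
# merge their bit vectors (1-bit -> +weight, 0-bit -> -weight)
def weighted_fingerprint(words, top_k=20):
    vec = [0] * NBITS
    for word, weight in Counter(words).most_common(top_k):
        for i, bit in enumerate(word_hash(word)):
            vec[i] += weight if bit else -weight
    return vec

# step (4): cosine similarity between two weighted-number vectors
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_a = ["ASEAN", "Thailand", "ASEAN", "trade"]
doc_b = ["ASEAN", "Thailand", "tourism"]
sim = cosine_similarity(weighted_fingerprint(doc_a),
                        weighted_fingerprint(doc_b))
print(round(sim, 3))
```

The fingerprint acts like a frequency-weighted simhash; sorting the training examples then only requires pairwise cosine computations over these fixed-length vectors rather than over raw texts.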

Experiments
In this section, we conduct experiments to show the improvement brought by pre-trained BERT word embeddings and our HashCos partition method. All experiments were run on an NVIDIA Tesla V100 GPU with the TensorFlow framework. We employ the same Adam optimizer with an initial learning rate of 0.001 and set the dropout rate to 0.6 to prevent overfitting across the different experimental setups. For comparison with our HashCos method, we also replaced it with the TF-IDF method for partitioning.
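The TF-IDF similarity baseline can be sketched as follows; this is a generic pure-Python TF-IDF with cosine similarity over toy, pre-segmented documents, not our actual corpus:

```python
from collections import Counter
from math import log, sqrt

# Each document becomes a sparse TF-IDF vector; example similarity is again
# measured by cosine similarity, so only the text representation differs
# from HashCos.
def tfidf_vectors(docs):
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({w: (c / len(doc)) * log(n / df[w])
                     for w, c in tf.items()})
    return vecs

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["ASEAN", "trade", "summit"],
        ["ASEAN", "trade", "talks"],
        ["weather", "report"]]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]) > cosine(v[0], v[2]))  # → True
```

Documents sharing content words ("ASEAN", "trade") score higher than disjoint ones, which is the property both partition methods exploit when ordering training examples.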
The experimental results are summarized in Tab. 1, where we evaluate the models on three standard criteria, i.e., precision, recall, and F1, and consider the models' performance on the three named entity types separately. No matter which kind of word vectors is employed, the models trained on the HashCos partition outperform the TF-IDF-based ones on all named entities and criteria. In addition, we noticed that organization entities are harder for deep learning models to identify than the others; the names of organizations are often abbreviated even in the news, which may increase the difficulty of identifying them.

Conclusions
In this paper, we replaced the word2vec method with BERT and introduced a BERT-BiGRU-CRF model to identify named entities in Chinese ASEAN news. Besides, a similarity-based dataset partition method has been proposed to mitigate the influence of polysemous words. This partitioning method is based on our HashCos algorithm, which computes the similarity between examples efficiently. Experiments show that the proposed HashCos-based partition outperforms the TF-IDF-based method, and combined with the BERT-BiGRU-CRF model, our method substantially improves performance on all three types of named entities.
In the next step of this study, we wish to extend our method to news in other languages, such as Thai and Malay, and eventually design a NER model that can extract key information from multilingual news.