Open access paper

Domain Adaptive Chinese Word Segmentation Method for power data security classification


Published under licence by IOP Publishing Ltd
Citation: Yang Dong et al 2021 J. Phys.: Conf. Ser. 1848 012050. DOI 10.1088/1742-6596/1848/1/012050


Abstract

Chinese word segmentation (CWS) is an important task in Chinese NLP and an essential pre-processing step for building a word-root database for the security classification of power data, which spans domains such as laws and regulations and electric power. Labelling a large training corpus for every domain is impractical, which makes it difficult for supervised statistical learning methods to perform CWS effectively. Therefore, a Chinese word segmentation approach based on a dictionary and a semi-supervised conditional random field (SS-CRF) is presented. First, a CRF model for CWS is trained with self-training and active learning algorithms and used to perform the segmentation. Dictionary features are then introduced to correct the CRF-based segmentation result by means of the reverse maximum matching (RMM) algorithm. Experiments on a cross-domain segmentation task show that the proposed method effectively improves the domain-adaptive performance of CWS.
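
To make the dictionary step concrete, the following is a minimal sketch of plain reverse maximum matching in Python. It is not the authors' implementation: the function name `rmm_segment`, the `max_len` parameter, and the toy dictionary are assumptions made for illustration, and the sketch does not cover how the dictionary output is combined with the CRF result.

```python
def rmm_segment(text, dictionary, max_len=None):
    """Segment `text` by reverse maximum matching (RMM).

    Scans from the end of the string towards the start and, at each
    position, takes the longest dictionary word that ends there.
    Characters not covered by any dictionary word become single-character
    tokens. `dictionary` is any container supporting `in` (a set is best).
    """
    if max_len is None:
        # Longest dictionary entry bounds the matching window (assumed default).
        max_len = max((len(w) for w in dictionary), default=1)

    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate first, then shrink it from the left.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break

    words.reverse()  # tokens were collected right to left
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(rmm_segment("研究生命起源", toy_dict))  # ['研究', '生命', '起源']
```

Scanning from the right is what separates RMM from forward maximum matching: on this toy input a forward scan would greedily take "研究生" first and leave "命" stranded, whereas the reverse scan recovers "研究 / 生命 / 起源".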


Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
