A General Framework for Chinese Domain Knowledge Graph Question Answering Based on TransE

This paper introduces a general question answering (QA) framework for Chinese domain knowledge graphs that can serve various industries. In this framework, the question is analyzed by shielding invalid characters, and answers are searched through TransE calculations over entities and relations rather than through traditional logical queries that rely heavily on predefined rules. In this way, the QA framework can be applied to different fields with little manual participation, which reduces the migration cost caused by the poor scalability of existing QA systems designed only for specific domains. The proposed framework is evaluated on the space launch knowledge graph and achieves HITS@1, HITS@3, and HITS@5 scores of 60.60%, 80.87%, and 85.61%, respectively, which is competitive with QA systems on other restricted-domain knowledge graphs. In addition, numerical results on the knowledge graphs of the Three Kingdoms and the COVID-19-Character verify the scalability of the QA framework.


Introduction
A knowledge graph (KG) is a kind of semantic network that describes things in the real world and reveals the relations between them. According to their coverage, KGs can be divided into general knowledge graphs and domain knowledge graphs [1]. General knowledge graphs, such as Google Knowledge Graph, FreeBase, and WikiData, focus on breadth and the integration of more entities, and are mainly used for Internet-oriented search, recommendation, question answering (QA), and other business scenarios. Domain knowledge graphs, in contrast, are oriented to specific industries and rely on data from those domains; they usually have a certain depth and completeness, which makes sophisticated application analysis and decision support possible in domains such as medicine, finance, and tourism. Compared with general knowledge graphs, domain knowledge graphs have clear subjects, rich entity attributes and data formats, and high accuracy, which makes them suitable for building QA systems for different business scenarios and users.
In recent years, Chinese domain knowledge graph question answering (CdKGQA) has made great progress; however, most QA systems are limited to specific industries and are difficult to extend, so the design of a general QA framework is imperative. Reference [2] designs an ontology-based aviation QA system, which analyses the question through morphological analysis and ontology-based classification and then converts it into a template-based SPARQL query for answer extraction. Reference [3] constructs a QA system based on a military knowledge graph, which combines shallow syntax analysis and question templates to identify entities and users' intentions and then queries the answer in the graph database. Reference [4] introduces a pipeline QA system to the knowledge graph of …; Reference [5] combines CRF, template matching, and SQL queries to design an intelligent QA system based on a knowledge graph of Chinese medicine. Reference [6] implements a QA system on the knowledge graph for Princes Jingjiang Residence, in which word substitution and the semantic similarity between question and template are used for question parsing, and Cypher queries and the Levenshtein distance are then used for answer generation.
However, a major problem with traditional QA systems is their weak scalability: different types of questions and logical query templates require too much manual participation. Besides, few studies consider the particularities of questions in Chinese scenarios: (1) unlike English sentences, Chinese sentences are composed of a sequence of Chinese characters that are not separated by spaces; (2) Chinese expressions are more diverse, and terminology, aliases, and abbreviations are common, especially in specific fields; (3) Chinese questions generally conform to the sequential logical syntax of "subject-predicate-object".
Based on previous work, this paper designs a general QA framework for Chinese domain knowledge graphs, which recognizes entities and relations by shielding invalid characters and introduces TransE into the answer extraction module. Experiments and analysis on the space launch knowledge graph show that the framework is feasible, and implementations on the knowledge graphs of the Three Kingdoms and the COVID-19-Character verify its scalability.

Related work
To design the general CdKGQA framework, this paper uses stop words and word embedding to parse the question, and uses TransE to reason about and search for the answer. In this section, stop words, word embedding, and TransE embedding are introduced in turn.
For natural language processing tasks, filtering stop words, that is, having the system automatically remove certain words, can improve the efficiency and accuracy of text processing. However, no single stop word list is suitable for all tasks. In search systems, stop words are words that appear frequently but do not help the result [7]; in classification systems based on support vector machines, stop words are function words without meaning and neutral words with weak category features [8]; in QA systems, stop words change dynamically with different questions [9]. For a QA system in a specific field, the domain stop word list is extracted through probability and content analysis after examining a large amount of text in the field, including text frequency, word frequency statistics, entropy calculation, CHI statistics, etc. [10]. Because characters are the basic unit of Chinese sentences, this paper proposes a method for constructing domain stop characters (called invalid characters in this paper), which are used to segment the question.
Word embedding transforms a word into a distributed representation, a fixed-length continuous dense vector that carries rich information, in which the value of each dimension represents a feature with a certain semantic and grammatical interpretation [11]. Therefore, the similarity of texts can be expressed by the cosine similarity between word vectors. In [12], the one-hot representation of a word is converted into a low-dimensional vector by CBOW or Skip-Gram, an approach used in almost all natural language processing tasks. In Chinese information processing, however, both the character (a single Chinese character) and the word (a word or phrase) can be used as the language unit for distributed representation. Reference [13] shows that the sparse distribution of words leads to problems such as out-of-vocabulary (OOV) words and overfitting, which can be avoided by character vectors, and also finds that character vectors achieve better results in most Chinese tasks.
Inspired by the translation invariance of word embeddings, TransE, a multi-relation translation model, was proposed in [14]. Relations in the knowledge graph are regarded as translation vectors between entities. During training, $v_{head} + v_{relation} \approx v_{tail}$ is used as the constraint, and the vectors of the head entity, relation, and tail entity are adjusted continually until all KG triples are embedded. Due to fewer parameters, the computational complexity with TransE is low.
In this paper, we propose a general QA framework (Figure 1) for Chinese domain knowledge graphs, which consists of three parts: question analysis, answer reasoning, and answer generation. First, the question analysis module obtains the question entities and relations by shielding invalid characters. Then, the answer reasoning module combines sequential logic with permutation and combination strategies to calculate the TransE embeddings of possible entities. Finally, the answer generation module calculates the cosine similarity between the candidate answers and the entities in the knowledge graph, and returns the reasoning paths and answer entities to the user.

Question Analysis Based on Invalid characters
The question analysis module identifies the entities and relations in natural language questions so that answers can be searched and generated from the knowledge graph. Because domain knowledge graphs have little domain data, many proper nouns, and fixed question types, this module segments the question by shielding invalid characters and then compares the word embeddings of the resulting segments with the entity and relation vectors in the knowledge graph to identify the entities and relations in the question.
The domain invalid character dictionary is constructed in two steps. First, the character frequencies of a large set of Chinese questions are counted, and the 100 most frequent characters are selected as the general question invalid character dictionary. Second, the entropy of each character is calculated over the specific domain knowledge graph:
$$H(w) = -\sum_{i=1}^{m} P_i(w) \log P_i(w) \qquad (1)$$
where $m$ is the number of triples and $P_i(w)$ is the probability that character $w$ appears in the $i$-th triple. The domain invalid character dictionary is obtained by sorting all characters in ascending order of entropy and removing the first 1/5, which are related to domain questions.
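The entropy-based filtering step can be sketched as follows. This is only a minimal illustration under stated assumptions: each triple is represented as a list of romanized character tokens, $P_i(w)$ is taken to be the share of the occurrences of $w$ that fall in the $i$-th triple, the entropy is the standard form $-\sum_i P_i(w) \log P_i(w)$, and the names `char_entropy` and `domain_invalid_chars` as well as the toy data are hypothetical rather than taken from the framework.

```python
import math


def char_entropy(char, triples):
    """Entropy of `char` over the triples: H = -sum_i P_i * log(P_i).

    Assumption: P_i is the fraction of all occurrences of `char`
    that fall in the i-th triple.
    """
    counts = [triple.count(char) for triple in triples]
    total = sum(counts)
    if total == 0:
        return 0.0  # the character never appears in the knowledge graph
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def domain_invalid_chars(candidates, triples, drop_fraction=0.2):
    """Sort candidates by ascending entropy and drop the lowest fraction."""
    ranked = sorted(candidates, key=lambda ch: char_entropy(ch, triples))
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[n_drop:]


# Toy triples, each flattened into a list of character tokens (made-up data).
triples = [
    ["wei", "xing", "de", "fa", "she"],
    ["huo", "jian", "de", "fa", "she"],
    ["wei", "xing", "de", "gui", "dao"],
    ["huo", "jian", "de", "zai", "he"],
]
candidates = ["de", "fa", "gui", "zai", "wei"]
invalid_chars = domain_invalid_chars(candidates, triples)
print(invalid_chars)
```

Under this reading, a character that is spread evenly over many triples has high entropy and stays in the dictionary, while a character concentrated in a few triples has low entropy and is treated as domain-related.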
In sequential logical questions, the subject, predicate, and object appear in sequence, and invalid characters connect them into a complete question. For example, the triple <mo zi hao wei xing yi hao, shi jian fa she (that is, "fa she shi jian"), ?> can be obtained by analysing the question "mo zi hao wei xing yi hao shi shen me shi jian fa she de?", in which the entity and relation are connected by invalid characters (such as "shi", "shen", "me", and "de"). Therefore, candidate entities and relations can be obtained by segmenting the question at invalid characters. However, these candidates may not align with the entities or relations in the domain knowledge graph. Table 1 shows the problems of head-tail off, mid-segment split, and synonymous expression.
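A minimal sketch of the segmentation step, assuming the question is already split into character tokens (romanized syllables stand in for Chinese characters here) and using the four invalid characters named in the example; `segment_question` is a hypothetical helper, not the framework's own code.

```python
def segment_question(tokens, invalid_chars):
    """Split a character sequence into candidate segments at invalid characters."""
    segments, current = [], []
    for tok in tokens:
        if tok in invalid_chars:
            # An invalid character closes the segment collected so far.
            if current:
                segments.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        segments.append(current)
    return segments


question = "mo zi hao wei xing yi hao shi shen me shi jian fa she de".split()
invalid_chars = {"shi", "shen", "me", "de"}
segments = segment_question(question, invalid_chars)
for segment in segments:
    print(" ".join(segment))
```

In this toy run the second segment comes out as "jian fa she", because "shi" is itself treated as an invalid character. The candidate therefore does not match the relation "fa she shi jian" exactly, which is the kind of misalignment the later vector comparison has to absorb.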

Answer reasoning based on TransE
In the answer reasoning phase, this paper introduces TransE to transform the answer query into a vector calculation, which differs fundamentally from previous approaches that generate SPARQL, Cypher, or other logical query statements relying heavily on manual definitions.
After traversing the question entities and relations in all possible combinations, we obtain either a single-relation form $(e_h, r)$ or a multi-relation form $(e_h, r_1, r_2, \ldots)$, which is the answer reasoning path.
Since $v_{e_h} + v_r \approx v_{e_t}$, the TransE vectors of the entities and relations on the answer reasoning path are added or subtracted to obtain the TransE vector of the tail entity. Then, the cosine similarities between this vector and the TransE vectors of all entities in the domain knowledge graph are calculated, and the three entities with the highest similarity are selected as the candidate answers for this path.
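The vector calculation along one reasoning path can be sketched as below. The sketch covers only the additive direction (head plus relations); the two-dimensional embeddings, the entity and relation names, and the helpers `predict_tail` and `top_candidates` are all made up for illustration, and real TransE vectors would come from training on the knowledge graph.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_tail(head_vec, relation_vecs):
    """Translate the head vector by every relation on the path: v_h + r_1 + r_2 + ..."""
    predicted = list(head_vec)
    for rel in relation_vecs:
        predicted = [p + r for p, r in zip(predicted, rel)]
    return predicted


def top_candidates(predicted, entity_vectors, top_k=3):
    """Return the top_k entities ranked by cosine similarity to the predicted vector."""
    scored = [(name, cosine(predicted, vec)) for name, vec in entity_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Made-up 2-D embeddings (hypothetical names and values).
entity_vectors = {
    "entity_a": [1.0, 0.0],
    "entity_b": [1.0, 1.0],
    "entity_c": [0.0, 1.0],
    "entity_d": [-1.0, 0.5],
}
relation_vectors = {"relation_r": [0.0, 1.0]}

predicted = predict_tail(entity_vectors["entity_a"], [relation_vectors["relation_r"]])
candidates = top_candidates(predicted, entity_vectors)
print(candidates)
```

Because the search is a nearest-neighbour lookup over all entity vectors, no query template is needed per relation; a multi-relation path only adds more vectors to the sum.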

Answer generation
Through the aforementioned steps, each question gets 3M candidate answers, where M is the number of answer reasoning paths. The final answer is determined jointly by the probabilities of the candidate entities and relations obtained in the question analysis module and by the reasoning path generated in the answer reasoning module. The confidence of each candidate answer is calculated as follows:
$$C_{a_j} = \cos_{a_j} \cdot \prod_{e,r \in A_j} prob_{e,r} \qquad (2)$$
where $A_j$ is the reasoning path corresponding to answer $a_j$, $prob_{e,r}$ is the probability of an entity or relation in the reasoning path $A_j$, and $\cos_{a_j}$ is the cosine similarity between the TransE vector calculated from the answer reasoning path and the corresponding entity in the knowledge graph. The confidences of all answers are sorted in descending order, and the top N results are returned to the user along with their answer reasoning paths.
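As a small illustration of the confidence score, the sketch below multiplies the cosine similarity by the probabilities along the reasoning path and sorts the candidates. The candidate records, their field names, and all numbers are invented for the example.

```python
import math


def confidence(cos_sim, path_probs):
    """Confidence = cosine similarity times the product of the path probabilities."""
    return cos_sim * math.prod(path_probs)


def rank_answers(candidates, top_n=5):
    """Score every candidate and return the top_n in descending order of confidence."""
    scored = [
        (cand["answer"], cand["path"], confidence(cand["cos"], cand["probs"]))
        for cand in candidates
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_n]


# Hypothetical candidates: answer, reasoning path, cosine similarity, path probabilities.
candidates = [
    {"answer": "answer_a", "path": ("e_h", "r_1"), "cos": 0.95, "probs": [0.9, 0.8]},
    {"answer": "answer_b", "path": ("e_h", "r_2"), "cos": 0.99, "probs": [0.6, 0.7]},
    {"answer": "answer_c", "path": ("e_h", "r_1", "r_3"), "cos": 0.80, "probs": [0.9, 0.9, 1.0]},
]
ranked = rank_answers(candidates, top_n=2)
for answer, path, score in ranked:
    print(answer, path, round(score, 4))
```

Note that in this toy data the candidate with the highest cosine similarity (`answer_b`) is not returned, because its path was recognized with low probability in the question analysis step.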

Experiment
This section implements a QA system on the space launch knowledge graph based on the above framework and performs an error analysis of the experimental results. Furthermore, QA systems for the knowledge graphs of the Three Kingdoms and the COVID-19-Character verify the scalability of the QA framework.

Dataset
Knowledge Graph. This paper collects China's space launch records over the years and all space launch records at home and abroad over the past three years from http://www.spaceflightfans.cn/. These raw data are then cleaned, structured, and manually checked to ensure the accuracy of the acquired knowledge. Finally, the space launch knowledge graph is constructed, comprising 4127 triples, 1611 entities, and 16 types of relations.
The knowledge graph of the Three Kingdoms and the COVID-19-Character provided by OpenKG are also used to verify the scalability of the QA framework. Among them, the knowledge graph of the Three Kingdoms contains 10459 triples, 5084 entities, and 33 relations; the knowledge graph of the COVID-19-Character contains 991 triples, 759 entities, and 41 relations.
Invalid Character. Statistics over 5.8 million questions from Baidu Zhidao yielded 12,874 Chinese characters whose frequencies follow a long-tail distribution, and the 100 most frequent characters were selected as the general invalid character dictionary. For domain questions, Equation (1) is used to filter the invalid characters. In this experiment, an 80-character domain invalid character dictionary is obtained.

Chinese Character Vector. This framework uses an open-source Chinese character embedding project, which trains Skip-Gram on Chinese Wikipedia and obtains 20,029 300-dimensional character vectors.
The TransE embedding of the domain knowledge graph is based on RotatE [15], with the relevant parameters set to Epoch = 40,000, Dim = 1000, Batch = 512.
QA Pair. Taking the space launch knowledge graph as an example, several question templates are generated based on the 16 relations. For example, the questions "$e_h$ de fa she shi jian shi?", "ni zhi dao $e_h$ shi shen me shi hou fa she de ma?", and "$e_h$ shi na tian fa she de?" can be generated according to the relation "fa she shi jian", where $e_h$ is the head entity. In this way, a total of 7736 QA pairs are generated. Inspired by EDA (Easy Data Augmentation) [16], the 5650 questions with one answer are expanded through random deletion and synonym replacement, resulting in 16,950 questions, 1/3 of which are randomly selected to evaluate the QA system.
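The template-based generation and the random-deletion part of the augmentation can be sketched as follows. The three templates are the ones quoted in the text; the placeholder tail value, the deletion probability, and the function names are assumptions made for the example, and synonym replacement is left out because it needs a synonym resource that the text does not specify.

```python
import random

# Question templates for one relation, with {head} standing for the head entity e_h.
TEMPLATES = {
    "fa she shi jian": [
        "{head} de fa she shi jian shi?",
        "ni zhi dao {head} shi shen me shi hou fa she de ma?",
        "{head} shi na tian fa she de?",
    ],
}


def generate_qa_pairs(triples, templates):
    """Fill every template of a triple's relation with its head; the tail is the answer."""
    pairs = []
    for head, relation, tail in triples:
        for template in templates.get(relation, []):
            pairs.append((template.format(head=head), tail))
    return pairs


def random_deletion(tokens, p, rng):
    """EDA-style random deletion: drop each token with probability p, keep at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [tok for tok in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]


# Hypothetical triple; "tail_value" is a placeholder for the real answer entity.
triples = [("mo zi hao wei xing yi hao", "fa she shi jian", "tail_value")]
qa_pairs = generate_qa_pairs(triples, TEMPLATES)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = qa_pairs[0][0].split()
augmented = random_deletion(tokens, p=0.1, rng=rng)
print(qa_pairs[0][0], "->", " ".join(augmented))
```

A practical version would probably need to protect the entity tokens from deletion, since removing part of the head entity changes what the question asks.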

Results of space launch knowledge graph
For each question, the QA system returns the five answers with the highest confidence and their corresponding answer reasoning paths. HITS@1, HITS@3, and HITS@5 (HITS@N is the percentage of questions whose top N answers contain the correct answer) are used to quantify the accuracy of the QA system, with results shown in Table 2. Although the system achieves a HITS@1 of only 60.60%, HITS@3 is 20.27 percentage points higher, and HITS@5 reaches 85.61%. The results also show that our QA system is competitive with KGQA systems in restricted domains [2,4] and is able to meet the needs of different industries. In addition, when the correct answer is missing from the top N, the returned reasoning paths are still highly correlated with the user question, and thus serve as relevant recommendations for users of the domain knowledge graph.
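For reference, HITS@N as defined here can be computed with a few lines of code. The ranked lists and gold answers below are invented, and `hits_at_n` is a hypothetical helper that assumes one gold answer per question.

```python
def hits_at_n(ranked_lists, gold_answers, n):
    """Fraction of questions whose top-n returned answers contain the gold answer."""
    if not gold_answers:
        raise ValueError("gold_answers must not be empty")
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_answers) if gold in ranked[:n]
    )
    return hits / len(gold_answers)


# Made-up system output: five ranked answers per question.
ranked_lists = [
    ["a1", "a2", "a3", "a4", "a5"],
    ["b1", "b2", "b3", "b4", "b5"],
    ["c1", "c2", "c3", "c4", "c5"],
    ["d1", "d2", "d3", "d4", "d5"],
]
gold_answers = ["a1", "b3", "c5", "zz"]

for n in (1, 3, 5):
    print(f"HITS@{n} = {hits_at_n(ranked_lists, gold_answers, n):.2%}")
```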
Furthermore, this paper analyses the 813 questions whose answers are missing from the top 5; Table 3 reports the types of errors in question analysis and answer reasoning. In the question analysis module, 254 errors (31.24%) are due to wrong question segmentation caused by invalid characters, and 176 errors (21.65%) are due to cosine similarity mismatches caused by the proximity of character vectors. As mentioned in section 3.2, there are problems of head-tail off, mid-segment split, and synonymous expression. Although character vectors can partially solve these problems, some wrong segmentations deviate too far to be matched correctly on the basis of character vectors. Besides, the vectors of similar characters tend to cluster: for entities such as "shi jian shi yi hao 01 xing", "shi jian shi yi hao 04 xing", and "shi jian shi yi hao 06 xing", the correct entity cannot be identified by the cosine similarity of character vectors, which affects subsequent steps. In the answer reasoning module, 240 errors (29.52%) are due to wrong reasoning paths caused by permutation and combination, and 143 errors (17.59%) are due to cosine similarity mismatches caused by the proximity of TransE embeddings. Although this paper proposes two permutation and combination strategies, sequential logic and distinguishing between entities and relations, they are not sufficient for more complex or ambiguous questions, which require stronger semantic understanding or reasoning ability to learn the reasoning path. In addition, as with character vectors, clustered TransE embeddings cannot be matched accurately to the answer by cosine similarity alone.
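The reported error breakdown can be checked with simple arithmetic. The short script below recomputes the shares from the four counts and, assuming the test set is exactly one third of the 16,950 augmented questions (an inference from the dataset description, not a figure stated in this paragraph), compares the number of misses implied by HITS@5 with the 813 analysed questions. The category labels are paraphrases.

```python
# Error counts as reported in the text (labels are paraphrased).
error_counts = {
    "question analysis: wrong segmentation": 254,
    "question analysis: character-vector mismatch": 176,
    "answer reasoning: wrong reasoning path": 240,
    "answer reasoning: TransE-embedding mismatch": 143,
}
total_errors = sum(error_counts.values())
shares = {
    name: round(100 * count / total_errors, 2) for name, count in error_counts.items()
}

# Assumption: the evaluation set is one third of the 16,950 augmented questions.
test_size = 16950 // 3
expected_misses = round(test_size * (1 - 0.8561))

for name, share in shares.items():
    print(f"{name}: {share}%")
print(total_errors, test_size, expected_misses)
```

Under that assumption the four categories account for every question missed in the top 5, split roughly evenly between question analysis (about 53%) and answer reasoning (about 47%).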

Scalability analysis
To verify the scalability of the QA framework, we use it to implement QA systems for the knowledge graphs of the Three Kingdoms and the COVID-19-Character separately. The results in Table 4 indicate that the QA framework has good scalability and can meet the needs of QA systems for different domain knowledge graphs. The QA system performs well on small and medium knowledge graphs, but its accuracy decreases slightly as the scale of the knowledge graph increases. A possible reason is that relations in larger knowledge graphs are more complicated and include high-level relation patterns such as symmetry or antisymmetry, inversion, and composition, which place higher demands on the intelligence of the QA system.
Table 4. QA results of different domain knowledge graphs.

Conclusions
This paper proposes a general QA framework for Chinese domain knowledge graphs, which parses the question with domain invalid characters and queries the answer with TransE embeddings. Experiments on different Chinese domain knowledge graphs verify the feasibility and scalability of the framework. In the future, we plan to combine invalid characters with named entity recognition based on pre-trained language models, and to apply improved models such as TransH or TransD to solve the 1-N and N-N problems in QA pairs.