Extraction of Space Domain Entity and Relation via Word Vector Representation and Clustering Method

Knowledge graphs have shown great value in search engines, natural language question answering, recommendation systems and other application scenarios in recent years. The basic elements of a knowledge graph are entities and relations, so how to automatically extract entities and relations from natural language texts is a key issue in knowledge graph construction. In this paper, we propose an unsupervised method to extract space domain entities and relations with the goal of building a space knowledge graph. Firstly, a neural network model is used to extract implicit semantic features of domain words, represented by dense vectors, from an original space domain corpus, and then new entities are discovered by clustering in the vector space with a small amount of labeled data. By concatenating space domain-specific word vectors and general domain word vectors, universal vector representations of entities are obtained, which include both general features and domain features. On this basis, semantic vectors of relations between entities are calculated, and more new entity relations can be extracted from the corpus using these semantic vectors. Compared with supervised methods, the entity and relation extraction method proposed in this paper only needs a small amount of labeled data, and is thus well suited to the construction of knowledge graphs in the space domain, where labeled data is rare and expensive.


Introduction
A knowledge graph is a technology that represents human knowledge in the form of structured triplets R(head entity, relation, tail entity) interconnected with one another, so that machines can calculate and analyze knowledge more easily. Google put forward the concept of the knowledge graph in 2012 in order to improve the intelligence of its search system, and achieved great success. Since then, with deepening research, the knowledge graph has become one of the most thriving artificial intelligence technologies. At present, the scale of knowledge graphs in the general domain has exceeded one billion, but the development of knowledge graphs in specific domains is relatively backward.
Space is a thrilling domain for human beings. Integrating knowledge graph technology into the space domain has the prospect of increasing the intelligence of space systems, which is very helpful in many space missions such as Mars exploration and deep space exploration. In this paper, a space knowledge graph is proposed in order to facilitate a smarter space information system. A main step in constructing a space knowledge graph is to recognize named entities and extract relations in the space domain.
Named entity recognition and relation extraction are the procedures for obtaining entities and relations from natural language texts, through which structured knowledge triplets are formed. There are two main types of methods to extract named entities and relations: supervised methods and unsupervised methods. Much attention has been paid to the study of supervised methods [1,2]. Guillaume Lample et al presented a bidirectional LSTM and conditional random field model for named entity recognition [3]. Jason C used a hybrid LSTM and CNN architecture to recognize named entities [4]. Dong C et al applied an LSTM-CRF neural network that utilizes both character-level and radical-level representations to recognize named entities [5]. Makoto M et al studied relation extraction using LSTM recurrent neural networks [6]. Compared to supervised methods, unsupervised methods have the advantage of being more economical and feasible. Thus, in this paper, an unsupervised method is adopted. Specifically, a vector representation model of space domain vocabulary is designed, through which space domain-specific entities and relations can be extracted by clustering in a certain vector space.

Methodology
In this section, we elaborate on our unsupervised method. We first introduce the vector representation model designed to capture features of space domain vocabulary, and then describe the procedures to recognize and mine domain-specific entities and relations from the raw corpus based on the pre-trained vector representation model.

Vector representation model of space domain vocabulary
In order to extract domain entities and relations from original texts of space domain, firstly, space domain words need to be expressed as vectors as the input of subsequent entity and relation extraction model. In this subsection, we design a vector representation model of space domain vocabulary, through which semantic and conceptual features of the vocabulary can be acquired from the collected original corpus in space domain.
The one-hot coding method is used in traditional vocabulary representation. However, this method has many defects. One is that the coding is extremely sparse while the dimension is huge: the dimension of each word vector equals the number of words in the vocabulary, while there is only one non-zero element. Besides, all the word vectors are unrelated to one another (mutually orthogonal), that is, word vectors contain no semantic information. Thus, dense vectors with relatively small dimensions are used to encode words in our study. Each dimension of the word vector measures semantic and conceptual features of the corresponding word in a certain aspect. Artificially defined features can hardly achieve completeness, so a neural network model is established to obtain the vector representation of words through training, as is shown in figure 1. In the general domain, there are word2vec, GloVe, BERT and other word vector models. The data used to train these models come from general corpora. However, the space domain is a professional field, and its semantic and lexical features have certain domain specificity.
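The contrast between one-hot coding and dense vectors can be illustrated with a minimal sketch (toy vocabulary and random placeholder vectors; in the paper the dense vectors are learned):

```python
import numpy as np

# Toy vocabulary to contrast one-hot coding with dense vectors.
vocab = ["satellite", "rocket", "orbit", "payload", "launch"]

# One-hot: dimension equals vocabulary size, a single non-zero element,
# and every pair of distinct words is orthogonal (no shared semantics).
one_hot = np.eye(len(vocab))
print(one_hot[0])               # [1. 0. 0. 0. 0.]
print(one_hot[0] @ one_hot[1])  # 0.0 -- unrelated by construction

# Dense: low-dimensional real vectors (random placeholders here), where
# semantic similarity can be measured, e.g. by cosine similarity.
rng = np.random.default_rng(0)
dense = rng.normal(size=(len(vocab), 4))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(dense[0], dense[1]))  # a generally nonzero similarity score
```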
If we use the general domain model only, a large part of space domain-specific information is not taken into account and the result will not be satisfactory. Therefore, our approach is to train a space domain vector representation model and fuse it with the general domain model to obtain a complete vocabulary representation. The training process is as follows:
1) Collect enough accurate and credible raw texts in the space domain to form the corpus S_c.
2) Obtain the domain lexicon from S_c and construct the training data. The input of the model is X = [x_1, x_2, ..., x_m], where x_i is the vector representation of the i-th word in X and m is the total number of words. The expected output Y corresponding to X consists of the 2k words on the left and right sides of each word in X, i.e., the input-output training pairs are (x_i; y_{i-k}, ..., y_{i-1}, y_{i+1}, ..., y_{i+k}).
3) Train the network. The output of the hidden layer is h = W1^T x, and the output layer is y_hat = softmax(W2^T h). Cross entropy is used as the loss function of training, and the optimal values of the weight matrices W1 and W2 are obtained by gradient descent; the rows of W1 give the word vectors.
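The training procedure above follows the skip-gram style of word2vec. A minimal numpy sketch (toy two-sentence corpus, illustrative hyperparameters, not the paper's exact implementation) is:

```python
import numpy as np

# Minimal skip-gram sketch: for each word, predict its 2k context words.
# h = W1^T x picks a row of W1 (x is one-hot); output is softmax(W2^T h);
# the loss is cross entropy summed over the context words, trained by SGD.
corpus = [["the", "rocket", "launched", "the", "satellite"],
          ["the", "satellite", "entered", "orbit"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, D, k, lr = len(vocab), 8, 2, 0.05   # vocab size, vector dim, window, step

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, D))  # input->hidden (rows = word vectors)
W2 = rng.normal(scale=0.1, size=(D, V))  # hidden->output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(200):
    for sent in corpus:
        for i, w in enumerate(sent):
            ctx = [idx[sent[j]]
                   for j in range(max(0, i - k), min(len(sent), i + k + 1))
                   if j != i]
            h = W1[idx[w]]            # hidden layer output
            y = softmax(W2.T @ h)     # predicted context distribution
            err = len(ctx) * y        # gradient of summed cross entropy
            for c in ctx:
                err[c] -= 1.0
            grad_h = W2 @ err
            W2 -= lr * np.outer(h, err)
            W1[idx[w]] -= lr * grad_h

word_vec = {w: W1[idx[w]] for w in vocab}  # learned dense word vectors
print(word_vec["satellite"].shape)          # (8,)
```

In practice a library implementation such as gensim's `Word2Vec` with `sg=1` would be used instead of hand-rolled SGD.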

Unknown entity recognition model
The obtained word vectors contain abundant semantic and lexical information about entities, and all the word vectors together constitute a characteristic vector space of the space domain. The metric distance between entities with similar semantics is closer than that between entities with unrelated semantics: the more similar two words are semantically in some respect, the closer they are in the vector space. According to these principles, a large number of unknown entities can be obtained from a small number of known entities. For entities composed of a single word, take a known entity as the center of a hypersphere with radius R and find all the words within radius R of the center; it is very likely that these words represent entities of the same type as the known entity. To improve this probability, unknown entities can be retrieved through a clustering method. In particular, let T be a set of entities of a certain category, such as "satellite" or "rocket", and suppose there are n known entities of this category, i.e., T = {e_1, e_2, ..., e_n}. Each entity is represented by its word vector, e_i ≐ w_i, where the symbol '≐' denotes 'is represented by'. Define the cluster center c = (1/n) * sum_{i=1}^{n} w_i. If the vector w_x of an unknown word lies within distance R of c in at least q dimensions, then the entity that w_x represents is judged to also belong to T, in which q is a confidence index that measures the credibility of the acquired entity.
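The retrieval step can be sketched as follows (toy entity names and vectors; the per-dimension radius R and confidence index q are illustrative values):

```python
import numpy as np

# Sketch of clustering-based entity retrieval: known entities of a
# category define a center c; an unknown word is proposed as a new
# entity of that category when it lies within radius R of c in at
# least q dimensions of the vector space.
word_vecs = {
    "Dongfanghong-1": np.array([0.90, 0.10, 0.50]),
    "Sputnik-1":      np.array([1.00, 0.20, 0.55]),
    "Explorer-1":     np.array([0.95, 0.15, 0.45]),
    "hydrogen":       np.array([-0.80, 0.70, 0.10]),
    "Landsat-8":      np.array([0.92, 0.18, 0.52]),
}
known_T = ["Dongfanghong-1", "Sputnik-1", "Explorer-1"]  # category "satellite"

c = np.mean([word_vecs[e] for e in known_T], axis=0)     # cluster center
R, q = 0.1, 3   # per-dimension radius and confidence index

candidates = [w for w, v in word_vecs.items()
              if w not in known_T and np.sum(np.abs(v - c) <= R) >= q]
print(candidates)   # ['Landsat-8']
```

A larger q makes the criterion stricter, yielding fewer but more credible candidates, which matches the discussion in the case study below.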

Entity relation extraction procedure
After entities are identified, the relations between them can be mined using the relation extraction model. A typical relation extraction model needs a lot of labeled data, which is very expensive. Using a clustering method, more relations of the same category can be extracted with only a little labeled data. The detailed procedure is as follows: firstly, the space domain-specific word vector of the word that constitutes an entity is concatenated with the general domain word vector, such as the output of the BERT model, to obtain the universal representation of the entity. The reason for adding the word vector from the general domain is that more varied corpora are used when training general domain models, so their word vectors may contain general semantic features that the domain-specific word vectors lack. By joining the two types of word vectors together, they complement each other and express more complete information about entities.
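The concatenation step is straightforward; a minimal sketch with toy values standing in for the learned vectors:

```python
import numpy as np

# Sketch: concatenate a space domain-specific vector with a general
# domain vector (e.g., a BERT output) to form the universal entity
# representation. The values below are toy placeholders.
domain_vec  = np.array([0.2, -0.5, 0.7])        # from the domain model
general_vec = np.array([0.1, 0.9, -0.3, 0.4])   # from a general model

universal = np.concatenate([domain_vec, general_vec])
print(universal.shape)   # (7,)
```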
With the complete vector representation of entities, relations can be extracted naturally. For a certain type of triple relation R(head, tail), assume n instances are known. For the i-th instance, let the universal vector representations of the head and tail entities be h_i and t_i. The relation vector is calculated as r = (1/n) * sum_{i=1}^{n} (t_i - h_i). Given the vector h of a new head entity, the tail entity vector can be predicted as t = h + r. In this formula, the two sides are not required to be strictly equal, but only to satisfy ||h + r - t|| <= ε_R, where ε_R is a minimal value set in advance. Similarly, if the tail entity is known, the head entity can be obtained as h = t - r.
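The translation-style extraction above can be sketched as follows (toy vectors and a hypothetical relation name; ε_R is an illustrative threshold):

```python
import numpy as np

# Sketch of relation extraction: the relation vector r is averaged from
# known (head, tail) pairs; a candidate tail t is accepted for a new
# head h when ||h + r - t|| <= eps_R.
known_pairs = [  # (head, tail) universal vectors for a sample relation
    (np.array([1.0, 0.0]), np.array([1.5, 0.5])),
    (np.array([0.8, 0.2]), np.array([1.3, 0.7])),
]
r = np.mean([t - h for h, t in known_pairs], axis=0)  # relation vector

h_new = np.array([0.9, 0.1])                          # new head entity
candidates = {"tail_A": np.array([1.4, 0.6]),
              "tail_B": np.array([0.0, 1.0])}
eps_R = 0.1

matches = [name for name, t in candidates.items()
           if np.linalg.norm(h_new + r - t) <= eps_R]
print(matches)   # ['tail_A']
```

The symmetric case, recovering a head from a known tail, simply tests ||t - r - h|| <= eps_R instead.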

Case study
The main concepts in the space domain include spacecraft, rockets, celestial bodies and so on. Entities belonging to the same type (e.g. satellites) have the same relation types with other entities as well as the same property types, so the contextual words around them in natural language texts are naturally similar. By fully training the vector representation model proposed in section 2 on such space domain-specific literature, certain dimensions of the word vectors will reflect this kind of similarity, which leads to clustering of entities of the same type in a certain subspace of the word vector space, as is illustrated in figure 3. The exact dimensions depicting the features of different categories of space entities are usually not known in advance, so the selection of the hyperparameter q in equations (11) and (13) matters. The larger the value of q, the higher the credibility will be, but correspondingly, the number of candidate entities retrieved will be reduced. Since it is difficult to determine whether the category characteristics of an entity are captured by a single dimension or by multiple dimensions, the value of q can be determined iteratively by experiment.

Conclusion
In this paper, we propose an unsupervised method to extract entities and relations in the space domain for the construction of a space knowledge graph. The key contribution of our method is the design of a vector representation model that simultaneously encodes implicit and overt features of space domain-specific words into a dense vector space. By clustering word vectors in this vector space through the calculation of vector norms, entities of different categories can be recognized and retrieved. A universal representation model integrating general domain and space domain-specific word vector features is also proposed to extract relations through algebraic operations in the vector space. Compared with supervised methods, the proposed method has the advantage of being easily extended to the construction of various other domain-specific knowledge graphs where labeled data is either expensive or scarce.