Considering Sub-graph Structure in Representation Learning

Representation learning aims to encode knowledge graphs into a low-dimensional vector space. However, since both the out-degree and the in-degree of entities in a knowledge graph follow a power-law distribution, only a small number of high-frequency entities play a key role in the training process, while the others have little effect. This aggravates the data sparsity problem. In this paper, we propose two knowledge graph representation learning models that fuse sub-graph structure. Building on the original model, our models incorporate sub-graph structure during training. By injecting sub-graph information into the representation learning model, the structural associations between different relations can be captured alongside the semantic associations between entities, so that entities and relations are modeled more accurately and the data sparsity problem of knowledge graph representation learning is effectively alleviated.


Introduction
Google formally proposed the concept of the knowledge graph and introduced it into search engines on May 17, 2012 [1]. A knowledge graph is essentially a huge semantic network expressed in the form of triples (head entity, relation, tail entity), abbreviated (h, r, t). For instance, the triple (Hemingway, author, The Old Man and the Sea) states the fact that the author of the novel "The Old Man and the Sea" is Hemingway. In recent years, a large number of knowledge graphs have been constructed, such as YAGO [2], Freebase [3], DBpedia [4], DeepDive [5], etc. Tasks such as question answering [6][7], web page ranking [8], and other downstream natural language processing tasks can be supported by large-scale knowledge graphs. Introducing knowledge graphs into big data analysis is a research hotspot in many vertical industries, such as finance and health care. However, like other large-scale datasets, large-scale knowledge graphs follow a long-tailed distribution and suffer from a serious data sparsity problem. Therefore, research on knowledge graph completion is of great significance.
As shown in figure 1, statistics on the FB15k dataset show that the out-degree and in-degree of entities in the knowledge graph follow a power-law distribution. This means the amount of information carried by triples is severely skewed: only a few high-frequency entities play a key role in the training process, while the others have little effect, resulting in a more serious data sparsity problem. In addition, statistics show that there is a short path between any two entities, with an average shortest-path length of about three. If these characteristics are exploited and higher-order structural information is incorporated into training, the data sparsity problem can be effectively alleviated. Furthermore, many sub-graph structures appear frequently in semantic networks, among which two-, three-, and four-point sub-graph structures are the most common. In a knowledge graph, a higher-order sub-graph structure contains not only association information between entities but also association information between relations, so a more accurate vector representation of entities and relations can be obtained via sub-graph structure.
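The degree skew described above can be checked directly from a list of triples. A minimal sketch (entity names and the helper name are illustrative, not from the paper):

```python
from collections import Counter

def degree_stats(triples):
    """Count each entity's out-degree (occurrences as head) and
    in-degree (occurrences as tail) over a list of (h, r, t) triples.

    FB15k's tab-separated triple files can be parsed into this form;
    a heavy-tailed distribution shows up as a few entities accounting
    for most triple occurrences.
    """
    out_deg, in_deg = Counter(), Counter()
    for h, _, t in triples:
        out_deg[h] += 1
        in_deg[t] += 1
    return out_deg, in_deg
```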
A triple in the knowledge graph can be regarded as a two-point sub-graph structure, and it contains only a single relation. To learn the associations between different relations, a higher-order sub-graph structure carrying more information is needed. The three-point sub-graph is the most common higher-order sub-graph, with a relatively simple structure. Since the average shortest-path length between any two entities is about three, the four-point sub-graph structure is better suited to capturing the associations among entities and relations in the knowledge graph. Higher-order sub-graph structures can be regarded as denser in data and richer in information. Although such a sub-graph structure does not contain all the information in the knowledge graph, it expresses richer structural features to a certain extent.

Our Model
In addition to the existing two-point sub-graph structure, we take two other sub-graph structures into consideration: the three-point sub-graph structure and the four-point sub-graph structure. Knowledge graph representation learning based on sub-graph structure takes full account of the association information between entities and the path information between relations, which provides richer constraint information for the model. For example, the relation path "father + father" between "Li Dan" and "Li Zhi", extracted from (Li Dan, father, Li Xian) and (Li Xian, father, Li Zhi), combined with the triple (Li Dan, grandfather, Li Zhi), shows that the three-point sub-graph structure additionally provides implied relational information: a new relation, "father + father = grandfather". In addition, supposing the knowledge graph also contains the triples (Li Zhi, father, Li Shimin) and (Li Dan, great-grandfather, Li Shimin), the four-point sub-graph structure provides even richer relational information, namely "father + father + father = great-grandfather". Obviously, sub-graph structure implies the composability of relations. Therefore, if we can find an appropriate method to model sub-graph structure, it will greatly improve the accuracy of knowledge representation.
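The extraction of three-point sub-graphs described above, i.e. two chained triples plus a direct triple connecting their endpoints, can be sketched as follows; the function name and data layout are illustrative assumptions, not the paper's exact procedure:

```python
def three_point_subgraphs(triples):
    """Enumerate three-point sub-graphs from a list of (h, r, t) triples.

    A sub-graph is a chain (e_i, r_ix, e_x), (e_x, r_xj, e_j) together
    with a direct triple (e_i, r_ij, e_j), as in the
    "father + father = grandfather" example.
    """
    by_head = {}   # head entity -> list of (relation, tail)
    direct = {}    # (head, tail) -> list of direct relations
    for h, r, t in triples:
        by_head.setdefault(h, []).append((r, t))
        direct.setdefault((h, t), []).append(r)

    subgraphs = []
    for e_i, r_ix, e_x in triples:
        for r_xj, e_j in by_head.get(e_x, []):
            for r_ij in direct.get((e_i, e_j), []):
                subgraphs.append((e_i, r_ix, e_x, r_xj, e_j, r_ij))
    return subgraphs
```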
Generally speaking, sub-graph structure includes rich semantic associations between nodes, interactions between nodes and relations, and semantic relations among relations. For simplicity, this paper chooses the simplest sub-graph structures, namely the three-point and four-point sub-graph structures in which all relations point in the same direction. Moreover, we combine multiple relation vectors using the method that best conforms to the translation assumption between the vectors in a triple. As shown in figure 2, for sub-graph (a) the constraint is h + r = t; for sub-graph (b), the constraint is e_i + r_ix + r_xj = e_j; and for the four-point sub-graph, e_i + r_ix + r_xy + r_yj = e_j.

Knowledge Graph Representation Learning Model Fusing Three-Point Sub-graph Structure
Based on the framework of the classical representation learning model TransE [9], we propose a knowledge representation learning model (TransT) that integrates the three-point sub-graph structure. In the TransE and TransT models, the entity set and relation set are denoted E and R, respectively, and a knowledge graph is represented by a set of triples. Starting from each triple, we can find the three-point sub-graphs related to it and form a set of three-point sub-graphs. Here, e denotes entities, r denotes relations, the subscripts i, x, and j denote entity indices, and the subscripts ix, xj, and ij denote the relations between the entities numbered i and x, x and j, and i and j, respectively. For each triple (h, r, t), TransE regards the relation vector r as a translation between the head and tail entity vectors: the score of a valid triple should be as small as possible, while that of an invalid triple should be as large as possible. In this section, we propose a knowledge graph representation learning method that integrates the three-point sub-graph structure. We hope that, by learning the three-point sub-graph structure, the energy score of a valid three-point sub-graph is as small as possible and that of an invalid three-point sub-graph is as large as possible. Built on TransE, the proposed model integrates the three-point sub-graph structure of the knowledge graph to improve the accuracy of the learned entity and relation vectors. The score function of TransE is shown in formula 1:

f_r(h, t) = ||h + r - t||_{L1/L2}    (1)
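The TransE score in formula 1 can be sketched in plain Python; lists of floats stand in for embedding vectors, so this is an illustration rather than the paper's implementation:

```python
def transe_score(h, r, t, norm=1):
    """TransE score f_r(h, t) = ||h + r - t|| under the L1 (norm=1)
    or L2 (norm=2) distance; a lower score indicates a more plausible
    triple. h, r, t are embedding vectors given as lists of floats.
    """
    diffs = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(abs(d) for d in diffs)
    return sum(d * d for d in diffs) ** 0.5
```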
Obviously, TransE only considers the two-point structure consisting of two entities and one relation. The proposed representation learning model incorporates the three-point sub-graph structure, which contains three entities and three relations, as shown in figure 2(b). Its scoring function is shown in formula 2.
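Since formula 2 is not reproduced here, the following sketch assumes one natural form consistent with the translation assumption stated above: the composed path r_ix + r_xj should both translate e_i to e_j and agree with the direct relation r_ij. The exact form and weighting of the terms in formula 2 are assumptions:

```python
def transt_subgraph_score(e_i, r_ix, r_xj, e_j, r_ij, norm=1):
    """Assumed three-point sub-graph energy: distance of the path
    translation e_i + r_ix + r_xj from e_j, plus distance of the
    composed relation r_ix + r_xj from the direct relation r_ij.
    All vectors are lists of floats; lower is better.
    """
    def dist(v):
        if norm == 1:
            return sum(abs(x) for x in v)
        return sum(x * x for x in v) ** 0.5

    path = [a + b + c - d for a, b, c, d in zip(e_i, r_ix, r_xj, e_j)]
    comp = [a + b - c for a, b, c in zip(r_ix, r_xj, r_ij)]
    return dist(path) + dist(comp)
```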

Knowledge Graph Representation Learning Model Fusing Four-Point Sub-graph Structure
Based on the framework of the classical representation learning model TransE, we propose a knowledge representation learning model (TransF) that integrates the four-point sub-graph structure. In the TransE and TransF models, the entity set and relation set are denoted E and R, respectively, and a knowledge graph is represented by a set of triples. Starting from each triple, we can find the four-point sub-graphs related to it and form a set of four-point sub-graphs. Here, e denotes entities, r denotes relations, the subscripts i, x, y, and j denote entity indices, and the subscripts ix, xy, yj, and ij denote the relations between the entities numbered i and x, x and y, y and j, and i and j, respectively. In this section, we propose a knowledge graph representation learning method that integrates the four-point sub-graph structure. We hope that, by learning the four-point sub-graph structure, the energy score of a valid four-point sub-graph is as small as possible and that of an invalid four-point sub-graph is as large as possible.
The model proposed in this subsection integrates the four-point sub-graph structure of the knowledge graph on the basis of TransE to improve the accuracy of the learned entity and relation vectors. The scoring function is shown in formula 4.
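By analogy with the three-point case, a four-point sub-graph energy consistent with the translation assumption can be sketched as follows; the exact form of the TransF scoring function is an assumption here:

```python
def transf_subgraph_score(e_i, r_ix, r_xy, r_yj, e_j, r_ij, norm=1):
    """Assumed four-point sub-graph energy: distance of the path
    translation e_i + r_ix + r_xy + r_yj from e_j, plus distance of the
    composed relation r_ix + r_xy + r_yj from the direct relation r_ij.
    All vectors are lists of floats; lower is better.
    """
    def dist(v):
        if norm == 1:
            return sum(abs(x) for x in v)
        return sum(x * x for x in v) ** 0.5

    path = [a + b + c + d - e
            for a, b, c, d, e in zip(e_i, r_ix, r_xy, r_yj, e_j)]
    comp = [a + b + c - d
            for a, b, c, d in zip(r_ix, r_xy, r_yj, r_ij)]
    return dist(path) + dist(comp)
```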

Evaluation
To verify the performance of the knowledge graph representation learning methods proposed in this paper, two typical knowledge graph completion tasks, link prediction and triple classification, are tested on FB15k. The proportion of correct entities ranked in the top 10 (Hits@10) for link prediction is shown in table 1. From table 1, we can see that TransT (Raw) is 7.9% higher than TransE (Raw), and TransF (Raw) is 8.8% higher than TransE (Raw); TransT (Filt) is 20.9% higher than TransE (Filt), and TransF (Filt) is 21.5% higher than TransE (Filt). These results show that the proposed representation learning models TransT and TransF are significantly better than TransE on the entity link prediction task, with TransF outperforming TransT. The triple classification accuracies are shown in table 2: TransT is 7.4% higher than TransE, and TransF is 7.7% higher. These results indicate that TransT and TransF are also significantly better than TransE on the triple classification task, and again TransF outperforms TransT.
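The Raw and Filt settings in table 1 differ only in whether other known-true answers are removed from the candidate list before ranking. A minimal sketch of the top-10 metric under both settings (function names are illustrative):

```python
def rank_tail(score_fn, h, r, candidates, true_t, known_tails=()):
    """Rank the true tail among candidate tails by ascending score.

    In the Filt(ered) setting, competing tails known to be true are
    removed before ranking; passing an empty `known_tails` gives the
    Raw setting. `score_fn(h, r, t)` should return a lower value for a
    more plausible triple, as in TransE.
    """
    scored = []
    for t in candidates:
        if t != true_t and t in known_tails:
            continue  # filtered setting drops competing true answers
        scored.append((score_fn(h, r, t), t))
    scored.sort()
    return 1 + [t for _, t in scored].index(true_t)

def hits_at_k(ranks, k=10):
    """Proportion of test cases whose correct entity is ranked in the top k."""
    return sum(1 for rk in ranks if rk <= k) / len(ranks)
```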