Path Aggregation Model for Attributed Network Embedding with a Smart Random Walk

Network embedding has attracted a surge of attention recently. In this field, how to preserve high-order proximity has long been a difficult task. Graph convolutional network (GCN) and random walk-based approaches can preserve high-order proximity to a certain extent. However, they partially concentrate on the aggregation process and sampling process respectively. Path aggregation methods combine the merits of GCN and random walk, and thus can preserve more high-order information and achieve better performance. However, path aggregation framework has not been applied in attributed network embedding yet. In this paper, we propose a path aggregation model for attributed network embedding, with two main contributions. First, we claim that there always exists implicit edge weight in networks, and design a tweaked random walk algorithm to sample paths accordingly. Second, we propose a path aggregation framework dealing with both nodes and attributes. Extensive experimental results show that our proposal outperforms the cutting-edge baselines on downstream tasks, such as node clustering, node classification, and link prediction.


Related Work
Plain Network Embedding. Network embedding can be traced back to matrix factorization methods [15,16], which utilize eigen-decomposition to preserve the adjacency matrix. Then, LINE [2] applies two different objective functions to capture the first and second order proximity. SDNE [4] extends LINE by applying a deep autoencoder to preserve node proximity. DeepWalk [1] and Node2Vec [3] utilize random walk to sample global nodes and feed them to language models to get the embedding. DNE [20] proposes a degree-biased random walk, aiming to better capture the global structure. However, none of these random walk strategies takes the influence of dominant nodes into consideration.
Attributed Network Embedding. SNE [21] applies a multi-layer network to fuse the structure and attributes information. AANE [17] utilizes symmetric matrix factorization to represent the attribute similarity. CANE [7] extends LINE to learn local structural and textural information. However, these approaches can only capture the local information. TADW [6] utilizes text-associated random walk to incorporate text features. MMDM [22] applies a max-margin random walk to sample global nodes according to the label information. TriDNR [23] separately learns embeddings from the structurebased random walk and label-fused Doc2Vec attribute model. GCN-based methods use graph convolutional networks for network embedding [9], and learn the global information by aggregating the features of neighbor nodes. GraphSAGE [24] extends the GCN framework to the inductive setting. The variants GAE [11], VGAE [12], and AGE [10] comprise a GCN encoder and a reconstruction decoder.

Proposed Method
We first give the notations and define the problem, and then introduce our novel attributed network embedding model.

Notations
The attributed network can be represented as . is the node set where n is the number of nodes. The set of neighbor nodes for is denoted by . We denote by the one-hot ID vector of node . is the adjacency matrix and is the attribute matrix. If there exists an edge between and , . Otherwise, .
is the attribute vector of node . The objective of network embedding is to cast each node to a vector (where ).

Overall Framework
We propose a path aggregation model for attributed networks, with a tweaked random walk algorithm to sample paths. As is illustrated in Figure 1, there are three modules in our proposal, (1) the local fusion module for fusing the local topology and attributes information of each node, (2) the random walk sampling module for sampling paths, i.e. the high-order neighbors, and (3) the path aggregation module for aggregating the high order neighbors and preserving high-order proximity.

A Tweaked Random Walk Revealing Implicit Edge-Weight
Random walk can be seen as a sampling process of the high-order neighbors. For traditional random walk strategy as proposed in DeepWalk [1], the next node is chosen randomly from the connected nodes. Then, some follow-up works improve the original random walk to sample nodes efficiently, i.e. LINE [2], node2vec [3], and VRRW [25]. The existing random walk strategies sample paths according to the topology of a network, which treats each edge equally. In the following, we claim that there exists implicit edge-weight in a plain network, and propose a tweaked random walk accordingly. Implicit Edge-Weight & Node Degree. According to our observation, in real world networks, the influence of each node is vastly different. Some nodes are dominant and are more likely to affect other nodes. As a result, for one target node, the edge of a dominant node always shows larger influence than that of a non-dominant node. For instance, in a citation network, one literature (target node) cites many other literature (edges), but the influence (edge weight) of each reference is vastly different. The target literature is more likely to be influenced by the highly-cited literature in the references, and share its idea. The relation of two nodes is always much more complex than an unmarked edge, which provides only one bit information. We claim that there always exist implicit edge weights in a network, representing the different influence of one node to another node. In this paper, we take the underlying difference into consideration. In this paper, we consider the nodes with more neighbors as the dominant nodes. Namely, the implicit edge-weight is determined by the node degree .
Degree-Based Random Walk Revealing Implicit Edge-Weight. Given that the edge connecting with a node of a larger degree have a larger implicit weight, random walk strategy can be tweaked by increasing the transfer probability [25,20] to nodes with larger degrees. Accordingly, we propose a novel random walk. For each pair of nodes connected by an edge , we define a transfer weight to represent the possibility of transferring from to , which is calculated as follows, As and represent the current node and one candidate of the next node respectively, represents the last node prior to . and represent the degrees of and , respectively. denotes the shortest distance between node and node . is the smoothing parameter. is the encouraging parameter. According to Equation (1), encourages further walk steps, where and are not connected. The following briefly describes the two key points of the proposed random walk strategy: Impact of Implicit Edge Weight. We increase the transfer probability according to the implicit edge weight. The equation is an increasing function of , which we regard as the impact of implicit edge weight on the transfer probability. For all the candidate nodes, the one with a larger degree is more likely to be chosen.
Mechanism to Avoid Stuck Steps. We use an encouraging mechanism to make the random walk go further. If applying the random walk without , the walk steps would be stuck near the dominant nodes, and capture same nodes repetitively. Then, it can hardly go further to capture more global information. Therefore, we encourage the further step to with . The final transfer probability is calculated by normalizing the transfer weights of neighbor nodes in , namely .

Local Fusion Module
The local fusion module fuses the local topology and attributes information, outputting the local embedding vector . Embedding Layer. The node ID vector and attribute vector are first projected to dense vectors and through the embedding layer, i.e.
, where and are weight matrices. and denotes the embedding sizes of the node ID and attribute respectively. It is noteworthy that the topology information is preserved by adding node ID vector. If we remove node id embedding, the nodes with same attributes will get same local embedding vectors. In this way, we only represent the attributes information of the node, without the structure information. Although the introduced node id does not contain structure information itself, it provides trainable dimension for structure information of different nodes. The actual local structure information comes from the first-order proximity in the loss function. After several epochs of training, the local structure information (in the loss function) of each node will feedback to change the original node id. The idea of utilizing node id to represent local structure has already been utilized in model ASNE [21].
Hidden Layers. After the embedding layer, the obtained ID vector and attribute vector are fed into a Multi-Layer Perception (MLP). The hidden representations for each layer are denoted as , which are defined as follows, where denotes the concatenation operation, adjusts the importance of attributes, denotes the activate function, and is the number of hidden layers. The final output is the local embedding vector .

Path Aggregation Module
The global embedding vector depicts the impact of high-order neighbor nodes on node . The path aggregation module aims to obtain the global embedding vector by updating it iteratively. The aggregation begins with an initial value, which is calculated as follows, Then, we update the initial value by aggregating the sampled nodes. After each random walk sampling, we obtain a path , where denotes the source node and ( ) denotes the (t-1)-th step. We aggregate the path nodes without , namely . In the following, we introduce three candidate aggregators.
Mean Aggregator. The first candidate aggregator function is the mean operator, which simply takes the element-wise mean of the vectors, Linear Aggregator. This aggregator is an extension to the mean aggregator by appending it with a linear transformation, LSTM Aggregator. We also examine a more complex aggregator based on an LSTM architecture, Compared to the mean aggregator, LSTMs have the advantage of larger expressive capability, and they can preserve the order of a sequence. It is noteworthy that we input the path to LSTM in reverse order.
For the LSTM aggregator, we introduce an extra task to help with the training of LSTM [26]. Specifically, we aim to maximize the likelihood of predicting the next node , where represents the input sequence of LSTM, and represents the next node sequence. For each input sequence, the goal is to predict its next node, so we use cross entropy to evaluate the loss, where denotes the output of the t-th step, and denotes the input of the (t+1)-th step. is the weight matrix, where denotes the hidden size of LSTM. Then, we update the initial value with the aggregation recursively, according to the following loss function, where denotes the 2-Norm of x. Finally, the obtained is the global embedding vector .

Model Optimization
The embedding model is able to capture the high-order proximity, but we should also strengthen the first-order proximity and guarantee the connected nodes still close with each other in the new embedding space. Accordingly, we use the following loss function as in LINE [2], where denotes the set of connected node pairs, represents the set of negative sampling, and represents the sigmoid function. The final loss function is calculated as follows, After training the global and local embedding vectors for each node, we obtain the final embedding vector by combining the two parts with weight coefficient ,

Experimental Results
We empirically validate the effectiveness of the new approach PANE in comparison to the cuttingedge baselines, on three network analysis tasks, i.e. node classification, node clustering and link prediction. Moreover, we implement ablation experiments to validate the effectiveness of the novel random walk strategy.

Experimental Setup
Datasets. Three widely used attributed networks datasets are adopted, i.e. Citeseer, Cora, and Pubmed [27]. The statistics of the three datasets are summarized in Table 1.
Parameter Settings. To ensure fairness, we set the final embedding dimension to 128 for all the methods and datasets. For all baselines, we set the optimal parameters as suggested in their original literatures. In our approach PANE, we use two layers of fully connected neural network in local fusion module. In specific, 256 units are applied to the first layer and 128 units are applied to the second layer. We set ϵ as 1, θ as 2, and utilize grid search to find optimal parameters for different aggregators. The initial learning rate of Adam optimizer is set to 0.0001.
Evaluation Metrics. In node classification, we use micro-average and macro-average as the evaluation metrics. To measure the performance of node clustering methods, we employ two metrics: Accuracy (ACC) and Adjusted Rand Index (ARI). For link prediction, results are evaluated with the area under curve (AUC).

Node Classification
In node classification, our goal is to classify each node into one of the multiple labels. We first randomly sample a portion (i.e. 5%, 15%, 25% in this paper) of the labeled nodes, and guarantee these selected nodes cover all classes. Then, after obtaining the network embedding, a one-vsrest logistic regression classifier could be trained and predicts the remaining labeled nodes. We repeat this process 10 times and report the mean value of the Macro-F1 and Micro-F1 scores.
The results of node classification on three datasets are shown in Table 2, 3, and 4 respectively. As shown, on Cora and Pubmed, PANE outperforms all the baseline methods. On Citeseer dataset, PANE achieves comparable results to AGE, but still outperforms other baseline methods. In particular, PANE outperforms the random walk-based baseline methods (i.e. CANE, DGENE) on three datasets with a significant improvement.

Node Clustering
In node clustering, we aim to assign distinct cluster to each node. After obtaining the embedding vectors, we applied k-means algorithm to the embedding vectors of nodes to cluster them into classes.
As the results of node clustering fluctuate a lot in different epochs, we regard the best score of each method as the final result.
The results are shown in Table 5. As shown, our proposal PANE consistently achieves significant improvement comparing to all the baselines on all three datasets in terms of both ARI and ACC. It verifies that PANE has the capability of modeling relationships between nodes precisely. In particular, competing with the strongest baseline AGE, our model outperforms it by 10.12% in terms of ARI and 3.76% in terms of ACC on average.

Link Prediction
In link prediction, we aim to assess the ability of node embedding by reconstructing the link between nodes. We follow the widely adopted way. For each dataset, the edges are randomly divided into three groups. 85%, 5% and 10% of the edges are utilized in training, validation (hyper-parameters tuning), and performance testing, respectively. As the test/validation set contains only positive instances, we randomly sample the same number of non-existing links as negative instances, and rank both positive and negative instances according to the prediction function. For each network, experiments are repeated 10 times on 10 different random edge partitions, and the average performances are recorded.
The results are shown in Table 6. As shown, compared with the baseline methods, our proposal PANE outperforms the baselines on all three datasets. In particular, compared with the strongest baseline AGE, PANE outperforms it by 1.27% on Cora, 2.19% on Citeseer, and 8.70% on Pubmed.

Conclusion
To better utilize the high order information of attributed networks, we propose a path aggregation model for attributed network embedding (PANE) with a tweaked random walk. PANE strengthens the leveraging of high order information from two aspects. On one hand, PANE strengthens the path sampling by proposing a tweaked random walk strategy, which can capture the nodes with more high order information preferentially. On the other hand, PANE strengthens the aggregation process by introducing a path-aggregation framework, which is rarely used in the field of attributed network embedding. By fully leveraging the rich high order information, PANE achieves competitive performances and outperforms the cutting-edge baselines in most cases.