Spatial-temporal Traffic Flow Prediction Model Based on the GAT and BiGRU

Real-time and accurate traffic flow prediction is crucial for improving the safety, stability, and efficiency of intelligent transportation systems. Because existing traffic flow prediction methods rarely analyze the problem from the perspective of the road network, this paper proposes a spatial-temporal traffic flow prediction model that combines a graph attention network (GAT) with a bidirectional gated recurrent unit (BiGRU) neural network. First, GAT is used to analyze the complex topology of the road network and obtain its spatial features. Second, BiGRU is used to learn the dynamic changes in traffic flow data and obtain the temporal features. Third, the resulting spatial-temporal features are passed through a fully connected layer to produce the prediction of future traffic flow. Finally, the model is validated and evaluated on the California highway dataset. The experimental results show that the GAT-BiGRU model is more accurate than the other benchmark models in predicting future traffic flow changes, especially in long-term prediction.


Introduction
As the number of cars on the road rapidly increases, traffic congestion has become an urgent problem. Traffic flow prediction, an important component of the Intelligent Transportation System (ITS) [1], has become a hot research topic. In recent years, researchers have proposed many traffic flow prediction methods. These methods can be broadly classified into two categories: model-driven methods and data-driven methods [2].
Model-driven methods have a model structure determined by theoretical assumptions, with parameters calculated from empirical data. Commonly used methods include the Auto-Regressive Integrated Moving Average (ARIMA) model [3] and the Kalman Filtering (KF) model [4]. Compared with model-driven methods, data-driven methods are more flexible. Data-driven methods can be further classified into two categories: traditional machine learning models and deep learning models.
The parameters of traditional machine learning models, such as Support Vector Regression (SVR) [5] and K-Nearest Neighbor (KNN) models [6], are often adjusted through adaptive learning. Although these models address the poor performance of traditional methods, achieving high-precision predictions can incur high training time costs and places specific requirements on the data samples [7]. In addition, traditional machine learning models lack the ability to handle high-dimensional data, so their applicability is relatively weak. Deep learning has been a popular research direction in recent years. It can handle multi-dimensional, non-linear data, so deep learning models perform better in predicting traffic flow. For example, the Stacked Autoencoder (SAE) model proposed by Lv et al. [8] and the Deep Belief Network (DBN) model proposed by Huang et al. [9] can learn the correlation of traffic flow sequences and capture complex temporal features, although they do not consider spatial correlation. To achieve more accurate traffic flow prediction, various combined deep learning models have emerged, such as the CNN-LSTM model proposed by Yao et al. [10], the A3T-GCN model proposed by Zhu et al. [11], the GCN-BiLSTM model proposed by Wu et al. [12], and the STGCN model proposed by Guo et al. [13].
In summary, deep learning models have become the mainstream method for traffic flow prediction research. Existing traffic flow prediction models seldom consider the spatial and temporal characteristics of road networks together, so this paper proposes a new deep learning model, the GAT-BiGRU model.

Problem description
In a traffic network, road sections are connected directly or indirectly. In addition, traffic entities interact with one another, which produces spatial-temporal correlation between adjacent road sections. Traffic flow prediction can be viewed as learning from historical traffic flow samples (i.e., the traffic sequence values of the traffic network G over a period T as input) and obtaining, through training, a function F that predicts the traffic information for the next s time steps. The historical traffic flow sample can be represented as $X=(x_t, x_{t-1}, \ldots, x_{t-T+1}; G)$, and the future predicted traffic flow sample can be represented as $Y=(y_{t+1}, y_{t+2}, \ldots, y_{t+s})$. The function F is shown in equation (1):

$$(y_{t+1}, y_{t+2}, \ldots, y_{t+s}) = F(x_t, x_{t-1}, \ldots, x_{t-T+1}; G) \tag{1}$$
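As a minimal sketch of how the historical sample X and the target Y described above could be cut from a single detector's series, the following uses a sliding window. The function name `make_samples` and the toy values are hypothetical, the graph G is omitted, and each window is stored oldest-first rather than in the newest-first order used in the notation.

```python
def make_samples(series, T, s):
    """Build (X, Y) pairs from a list of per-step flow observations.

    X holds the T most recent observations up to time t (oldest first),
    Y holds the next s observations t+1 .. t+s.
    """
    samples = []
    # t is the index of the last observed step in each window.
    for t in range(T - 1, len(series) - s):
        history = series[t - T + 1 : t + 1]
        future = series[t + 1 : t + s + 1]
        samples.append((history, future))
    return samples


# Toy flow values for one detector (illustrative only, not from PeMSD4).
flow = [10, 12, 15, 14, 13, 17, 20]
pairs = make_samples(flow, T=3, s=2)
```

Each pair is one training example for F; with a full road network, every entry of the series would be a vector of per-node flows instead of a scalar.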

Graph attention network
Graph Attention Network (GAT) [14] is a neural network model for graph-structured data. It assigns weights to different neighbor nodes based on their importance, using hidden self-attention layers to make up for the shortcomings of graph convolution and its approximations. By stacking layers, each node attends to the features of its neighborhood, and different weights can be assigned to different nodes in the neighborhood without performing costly matrix operations or knowing the graph structure in advance.
The calculation process of GAT is shown in Figure 1. In the GAT calculation process, the input features must be transformed into higher-dimensional features. First, a shared linear transformation parameterized by a weight matrix W is applied to the feature vector h of each node. Then, self-attention is performed on the nodes (i.e., the attention mechanism is shared across nodes), and the attention coefficients $e_{ij}$ between each node and its neighboring nodes are calculated. Finally, the softmax function is used to normalize the coefficients so that they can be readily compared across different nodes.
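The three steps just described (shared linear transformation, attention coefficients, softmax normalization) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the scoring function LeakyReLU(aᵀ[Wh_i ‖ Wh_j]) with slope 0.2 follows the original GAT formulation [14] and is an assumption here, and the names `attention_coefficients`, `a`, and all toy values are hypothetical.

```python
import math


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_coefficients(features, neighbors, W, a, i):
    """Normalized attention of node i over the nodes listed in neighbors[i]."""
    # Step 1: shared linear transformation applied to every node.
    wh = [matvec(W, h) for h in features]
    # Step 2: unnormalized coefficients e_ij for each neighbor j.
    scores = []
    for j in neighbors[i]:
        concat = wh[i] + wh[j]  # list concatenation [Wh_i || Wh_j]
        scores.append(leaky_relu(sum(ak * ck for ak, ck in zip(a, concat))))
    # Step 3: softmax over the neighborhood (shifted by the max for stability).
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return {j: v / total for j, v in zip(neighbors[i], exps)}


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy node features
neighbors = {0: [0, 1, 2]}                       # node 0 attends to itself, 1 and 2
W = [[1.0, 0.0], [0.0, 1.0]]                     # identity transform for readability
a = [0.5, -0.5, 1.0, 0.25]                       # toy attention vector
alpha = attention_coefficients(features, neighbors, W, a, i=0)
```

The returned coefficients sum to one over the neighborhood, which is what makes them comparable across nodes.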
A multi-head attention mechanism is introduced in the GAT to enrich the model and stabilize the training process. The computation of multi-head attention for a single node is shown in Figure 2. The normalized attention coefficients are used to compute a linear combination of the corresponding features $h_1^m$ (the figure shows the features of node 1 at layer m), and the result serves as the final output feature $h_1^{m+1}$ (the features of node 1 at layer m+1) of the node. Two methods can be used to aggregate the outputs of the attention heads: concatenation (concat) and averaging (avg).
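The difference between the two aggregation options can be shown with a short sketch. The function name `aggregate_heads` and the toy head outputs are hypothetical, and the non-linearity normally applied after aggregation is left out for brevity.

```python
def aggregate_heads(head_outputs, mode="concat"):
    """Combine the per-head output vectors of one node.

    head_outputs: list of K vectors, one per attention head.
    mode="concat" joins them into one vector of length K * d;
    mode="avg" averages them element-wise into a vector of length d.
    """
    if mode == "concat":
        return [v for head in head_outputs for v in head]
    if mode == "avg":
        k = len(head_outputs)
        return [sum(vals) / k for vals in zip(*head_outputs)]
    raise ValueError("mode must be 'concat' or 'avg'")


heads = [[1.0, 2.0], [3.0, 4.0]]           # two heads, feature size 2 (toy values)
concat_out = aggregate_heads(heads, "concat")
avg_out = aggregate_heads(heads, "avg")
```

Concatenation grows the feature dimension with the number of heads, while averaging keeps it fixed; in the original GAT formulation [14], concatenation is used in hidden layers and averaging in the final layer.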

Spatial-temporal traffic flow prediction model based on GAT and BiGRU
Traditional convolutional neural networks are only suitable for processing Euclidean data. The graph representation used in the GAT, however, can process non-Euclidean data, which makes it better suited to traffic network data. In practice, the traffic flow on a road network is influenced not only by the traffic flow of spatially adjacent road segments but also by historical traffic flow over time. Therefore, this paper combines the GAT (which can efficiently extract the spatial features of the road network) with BiGRU (which can extract the temporal features of traffic flow) and proposes a traffic flow prediction model for road networks, called GAT-BiGRU.

Spatial feature extraction model
The attention mechanism in the GAT is used to extract the spatial features of road networks. In this paper, the last step of spatial feature extraction is to update the hidden features of the nodes. To make this operation easier to follow, an adjacency matrix is introduced, and the attention coefficients (weights assigned based on the importance of nodes) are mapped onto it. The adjacency matrix varies dynamically with time, as shown in equation (2):

$$A_N^t = \begin{bmatrix} \alpha_{11} & \cdots & \alpha_{1N} \\ \vdots & \ddots & \vdots \\ \alpha_{N1} & \cdots & \alpha_{NN} \end{bmatrix} \tag{2}$$

where $\alpha_{ij}$ denotes the attention coefficient, which indicates the degree of influence between node i and node j, and $A_N^t$ denotes the attention coefficient matrix formed by the N nodes in the road network at time t.
The multi-dimensional variable corresponding to the attention coefficient matrices is shown in equation (3):

$$A = (A_N^1, A_N^2, \ldots, A_N^T) \in \mathbb{R}^{T \times N \times N} \tag{3}$$

where T denotes the length of the time series, N denotes the total number of nodes in the traffic network, and A denotes the collection of attention coefficient matrices of the N nodes over T time steps.

Temporal feature extraction model
In recent years, recurrent neural networks (RNNs) have been widely used for processing time series data. GRU, an improvement on the RNN model, effectively alleviates the vanishing and exploding gradient problems of traditional RNN models. A GRU consists of a reset gate and an update gate. Simply put, the larger the update gate, the more information from the previous time step is retained, and the larger the reset gate, the less information from the previous time step is discarded. In this paper, BiGRU is used to extract the temporal features of traffic flow. The structure of the BiGRU network is shown in Figure 3. It mainly consists of four parts: the input layer, the forward hidden layer, the backward hidden layer, and the output layer. The input layer holds the input data, which is passed to both the forward and backward hidden layers at each moment. The data therefore flows through two GRU networks in opposite directions at the same time, and the output of the output layer is determined by both GRUs together. Assuming that $x_t$ is the input vector at moment t, the computational process of the GRU network can be expressed as follows:

$$r_t = \sigma(W_{ir} x_t + b_{ir} + U_{hr} h_{t-1} + b_{hr}) \tag{4}$$

$$z_t = \sigma(W_{iz} x_t + b_{iz} + U_{hz} h_{t-1} + b_{hz}) \tag{5}$$

$$g_t = \tanh(W_{ig} x_t + b_{ig} + r_t \odot (U_{hg} h_{t-1} + b_{hg})) \tag{6}$$

$$h_t = (1 - z_t) \odot g_t + z_t \odot h_{t-1} \tag{7}$$

Here $r_t$ is the reset gate, $z_t$ is the update gate, $g_t$ is the candidate state, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise product.

Thus, the network structure of BiGRU can be expressed as follows:

$$\overrightarrow{h}_t = \mathrm{GRU}(x_t, \overrightarrow{h}_{t-1}) \tag{8}$$

$$\overleftarrow{h}_t = \mathrm{GRU}(x_t, \overleftarrow{h}_{t+1}) \tag{9}$$

$$h_t = A_t \overrightarrow{h}_t + B_t \overleftarrow{h}_t + b_t \tag{10}$$

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the states of the forward and backward hidden layers at moment t, respectively; $A_t$ and $B_t$ denote the weights of the states of the forward and backward hidden layers at moment t, respectively; and $b_t$ denotes the bias of the hidden layer state at moment t.
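To make the roles of the reset gate, the update gate, and the forward/backward combination concrete, here is a minimal scalar sketch (input size 1, hidden size 1). It is an illustration under assumptions, not the authors' code: the gate equations follow the common GRU formulation using the parameter names listed in the algorithm steps (W_ir, U_hr, b_ir, and so on), the combination weights A, B and bias b are held constant over time, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x_t, h_prev, p):
    """One GRU update for scalar input and scalar hidden state."""
    r = sigmoid(p["W_ir"] * x_t + p["b_ir"] + p["U_hr"] * h_prev + p["b_hr"])  # reset gate
    z = sigmoid(p["W_iz"] * x_t + p["b_iz"] + p["U_hz"] * h_prev + p["b_hz"])  # update gate
    g = math.tanh(p["W_ig"] * x_t + p["b_ig"] + r * (p["U_hg"] * h_prev + p["b_hg"]))  # candidate
    # A larger update gate keeps more of the previous state.
    return (1.0 - z) * g + z * h_prev


def bigru(xs, p_fwd, p_bwd, A=1.0, B=1.0, b=0.0):
    """Run a forward and a backward GRU over xs and combine their states."""
    fwd, h = [], 0.0
    for x in xs:                # left to right
        h = gru_step(x, h, p_fwd)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(xs):      # right to left
        h = gru_step(x, h, p_bwd)
        bwd.append(h)
    bwd.reverse()               # align backward states with time order
    return [A * f + B * k + b for f, k in zip(fwd, bwd)]


names = ["W_ir", "W_iz", "W_ig", "U_hr", "U_hz", "U_hg",
         "b_ir", "b_iz", "b_ig", "b_hr", "b_hz", "b_hg"]
params = {name: 0.1 for name in names}   # toy parameter values
states = bigru([1.0, 2.0, 1.0], params, params)
```

In practice a framework layer such as a bidirectional GRU in PyTorch would replace this loop; the sketch only exposes the arithmetic behind each step.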

The GAT-BiGRU combined model
The GRU model captures the features of historical traffic flow data in the forward direction only, but in practice traffic flow data is also affected by related factors such as driver operations. Therefore, in this paper, the BiGRU model is selected to obtain forward and backward information for each node. Since overfitting is prone to occur when using the BiGRU model for temporal feature extraction, a Dropout layer is added to the network so that learning errors can be propagated correctly along the time axis during training.
The algorithm of the GAT-BiGRU based traffic flow prediction model is shown in Table 1.

Table 1. Algorithm steps.

Input: sequence X
Output: prediction result Y
Step 1. The raw data is pre-processed to obtain the input sequence X, whose element $x_i^m$ denotes the traffic flow collected at location m at time i.
Step 2. Initialize the network parameters W, Wir, Wiz, Wig, Uhr, Uhz, Uhg, bir, biz, big, bhr, bhz, bhg.
Step 3. The matrix X is input into the GAT to obtain the spatial feature vectors.
Step 4. The output vector of the GAT is used as the input of the BiGRU. According to formulas (4)-(7), the state value of the current hidden layer of the unidirectional GRU is calculated.
Step 5. The forward and backward stacked GRU networks are combined, and the output h of the BiGRU network is obtained according to formulas (8)-(10).
Step 6. h is passed to the fully connected layer to obtain the predicted result Y.
Step 7. The Adam optimization algorithm is used to update the weights.
Step 8. Repeat the above process until the maximum number of epochs is reached.
Step 9. The training is complete.
In this paper, time series data is input into the model, and a GAT structure with multiple attention heads enables the model to learn spatial dependencies through multiple attention modules, thereby capturing spatial features. The time series data carrying these spatial features is then input into the BiGRU structure to learn temporal features, and the predicted values are output through a fully connected neural network. Figure 4 shows how traffic flow information moves through the GAT-BiGRU model. The historical traffic flow information X1 of the road network at one moment is first input into the model to obtain the hidden state h1. Then h1 and the traffic flow information X2 of the next moment, obtained through the GAT structure, are input into the BiGRU to obtain the hidden state h2. This process is repeated until the predicted result at time t is obtained (the final predicted value is output through the fully connected layer).

Description of experimental datasets
In this paper, the traffic dataset PeMSD4 from the Caltrans Performance Measurement System in California, USA, was used to conduct experiments and validate the model. The flow feature was used as the target to predict the traffic flow in the next hour. 80% of the data was used as the training set and 20% as the test set. The specific information about the dataset is shown in Table 2.
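The 80/20 split and the one-hour prediction horizon can be illustrated with a small sketch. Two points are assumptions rather than statements from the text: that the split is chronological (training data precedes test data), and that observations arrive every 5 minutes, so that one hour corresponds to 12 steps. The function name `chronological_split` is hypothetical.

```python
def chronological_split(samples, train_ratio=0.8):
    """Split time-ordered samples into a training part and a test part.

    The first train_ratio share goes to training and the rest to testing,
    so no future observation leaks into the training set.
    """
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]


# Assumed sampling interval in minutes (not stated in the text above).
INTERVAL_MINUTES = 5
steps_per_hour = 60 // INTERVAL_MINUTES   # prediction horizon s for "the next hour"

train, test = chronological_split(list(range(10)))
```

A random shuffle before splitting would also give an 80/20 ratio, but for time series it lets windows from the test period overlap with training windows, which is why a chronological cut is assumed here.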

Experimental parameter setting
The experiments were implemented in the deep learning framework PyTorch (GPU version), using the Adam optimizer to optimize the model. The detailed hyper-parameter settings are shown in Table 3. Since the number of attention heads has a significant impact on model performance, a suitable value had to be selected to achieve the best experimental results. Through repeated experiments and comparisons, as shown in Figure 5, the MAE and RMSE errors were smallest when the number of attention heads H=8, so H was fixed to 8 in all subsequent experiments.

Experimental results analysis and performance comparison
To verify the superiority of the model, this paper compared the predictive performance of the GAT-BiGRU model with four benchmark models (GAT [14], GCN [11], GAT-GRU, and STGCN [13]). Figure 6 shows the predicted and true value curves of the GAT-BiGRU model on the PeMSD4 dataset. As can be seen from Table 4 and Figure 6, the combined model proposed in this paper achieved the best performance on both evaluation metrics. Among the models, GCN and STGCN use graph convolution to extract the spatial features of traffic flow, whereas GAT, GAT-GRU, and GAT-BiGRU use the graph attention mechanism to extract spatial features from traffic flow data. Compared with the single GAT and GCN models, the combined model obtained better prediction results. In addition, the GAT-BiGRU model outperformed the STGCN and GAT-GRU models, which can also extract temporal and spatial correlation features, thus demonstrating the superiority of the proposed model in mining the spatial-temporal correlation features of traffic information.
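The two evaluation metrics used in the comparison, MAE and RMSE, follow their standard definitions: the mean of the absolute errors and the square root of the mean squared error. A minimal sketch, with toy values that are illustrative only and not results from the paper:

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [100.0, 120.0, 90.0]   # toy observed flows
y_pred = [110.0, 115.0, 90.0]   # toy predicted flows
mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
```

Lower values are better for both metrics, and RMSE is never smaller than MAE on the same predictions.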

Conclusion
This paper proposes a spatial-temporal traffic flow prediction model. An adjacency matrix is constructed, and the highway network is modeled using GAT and BiGRU. GAT captures the spatial features of the graph through its spatial topology, while the BiGRU model captures the dynamic changes of road traffic flow and obtains the temporal features of the data. The GAT-BiGRU model then completes the spatial-temporal traffic flow prediction task. Experiments and evaluations on a real traffic dataset show that the GAT-BiGRU combined model has higher prediction accuracy and better performance than the single GAT and GCN models and the combined GAT-GRU and STGCN models. In practice, highway traffic flow is also affected by a variety of external factors such as weather and accidents. Therefore, future work will take such external factors into account to further improve the predictive accuracy of the model.

Figure 1. The calculation process of GAT.

Figure 2. Multi-headed attention calculation for a single node.

Figure 3. BiGRU network structure.

Figure 6. Comparison of predicted and true values of GAT-BiGRU.

Table 2. Details of the dataset.

Table 4 shows the comparison results of different models in predicting traffic flow for the next hour.

Table 4. The comparison results of different models.