Traffic flow prediction based on transformer

Nowadays, with the double blessing of economic growth and increasingly diverse lifestyles, travel has become an indispensable leisure activity. Meanwhile, owing to their convenience, time savings and accessibility, private cars play an increasingly important role on the road. However, more vehicles inevitably mean congestion, especially during the peak hours of the day. To keep private cars from turning from "convenience" into "inconvenience", this paper explores how to accurately predict the change of traffic flow over time and perceive an upcoming peak period in advance, so as to provide reliable traffic information for publication and allocate traffic resources reasonably. To capture the temporal features in traffic flow data, extract their internal patterns and process them more accurately, this research applies a combined model of a convolutional neural network (CNN) and a transformer: the CNN is used for feature extraction, and the extracted features are passed to the transformer model for further predictive analysis. The experimental dataset comes from the official website of Highways England. The results show that the model iterates rapidly, keeps the prediction error small, and can be applied to large datasets.


Introduction
Nowadays, travel is becoming more and more common and convenient with the increase in the number of private cars. As a result, traffic congestion has gradually become a noticeable problem affecting the normal operation of cities, especially during the morning, evening and holiday travel peaks. How to dispatch traffic flow reasonably has become an urgent problem for urban managers. As a key step towards solving the congestion problem, traffic flow prediction has become an inviting research topic in recent years; it aims to predict the number and speed of vehicles on a specific road section at a specific time in the future.
The prediction of traffic flow is essentially the prediction of a multivariate time series, and can be divided into long-term, medium-term and short-term prediction according to the forecasting horizon. As with forecasts of the weather, outside temperature, gold prices or the stock market, accurate forecasting of traffic flow can provide people with more convenience and even more profit [1]. Recently, many researchers have contributed a wealth of results to the study of traffic flow prediction. Early research mainly used methods such as Bayesian models, support vector machines (SVM) and autoregressive moving average models to predict traffic flow [2]. For instance, Cheng et al. proposed a short-term traffic flow prediction mechanism based on support vector regression (SVR) by exploiting the linear relationship between the states corresponding to each indicator in the prediction process, and further analysed the global state of traffic flow [3]. Although the prediction accuracy and stability of traditional models are slightly inferior, they have been widely applied in the transportation field because of their strong generality. However, a traffic system is a complex time-dependent system with highly unpredictable and nonlinear characteristics, which makes traffic data both periodic and subject to sudden changes under the influence of many factors [4]. Even though the data processing and computing capabilities of map navigation and GPS technology continue to improve, some traditional traffic volume forecasting models and algorithms are still unable to cope with these complex seasonal and cyclical changes [5].
Recent works rely on the powerful feature representation ability of the convolutional neural network (CNN) to achieve precise traffic flow prediction. Since deep learning methods are deeper and more complex than classical artificial neural networks, they have been increasingly used to improve prediction accuracy and have achieved more desirable results. Wang et al. combined the discrete wavelet transform (DWT) with a graph convolutional network (GCN) to explore the spatio-temporal features of traffic flow sequences in depth; the combined model effectively reduces the prediction error [6]. Qi et al. improved the wolf pack algorithm on the basis of the wavelet neural network [7]; this sort of combination reduces errors and speeds up computation. Zhang et al. added spatio-temporal analysis to a CNN and applied a spatio-temporal feature selection algorithm (STFSA) to determine the optimal input data [2]. Zhou et al. combined a CNN with a gated recurrent unit (GRU) network [5]: the CNN first extracts the data features, and the results are fed to the GRU for prediction, thereby improving accuracy. Li et al. introduced the simple and easily extensible KNN nonparametric regression method to handle complex and variable nonlinear traffic flow data and achieved more acceptable prediction results [8][9]. Another example is the combined convolutional neural network and long short-term memory (CNN-LSTM) model, which was widely applied and improved by Li et al. [10][11]; it realized traffic flow prediction over time-based periods [3].
However, when processing a huge amount of data, some of the models cited above still face a serious challenge: the computational cost increases while efficiency and stability decrease. For example, if the input sequences are too long, the vanishing gradient problem may appear while training an LSTM model [1]. In response, Heess et al. [12] first introduced an attention mechanism in image processing to effectively extract the crucial information in an image, and their results proved that the method is feasible. Srivastava et al. used the attention mechanism to propose a special neural-network-based architecture applied in biology [13]. Wang et al. proposed a neural network model based on the attention mechanism that allows important words to obtain higher weights, and then optimized the objective function to significantly improve performance [14]. Although the attention mechanism is a relatively new concept, it has achieved fairly wide application in a short period of time.
Considering that fluctuations in traffic flow are always related to current and past traffic conditions, this paper aims to capture the temporal relationships in traffic flow data and proposes a traffic flow forecasting method based on the transformer. Specifically, through the study of various models and the analysis of experimental results, this paper adopts a convolutional neural network to extract the features of the traffic data, and then applies the transformer model for prediction [15]. Because the transformer model contains an attention mechanism, it can track the relationships among the data and capture the dynamic spatio-temporal correlation in the traffic network. This allows the advantages of the convolutional neural network in precise feature extraction, and of the attention mechanism in effectively capturing and remembering long-sequence information, to both play their roles.

Convolutional neutral network (CNN)
CNN is a deep learning network architecture that learns directly from data or images. It is specially designed to deal with structures such as one-dimensional time series, two-dimensional spatio-temporal matrices or image data. Compared with traditional neural network algorithms, it improves learning performance through effective feature extraction, parameter sharing and sparse connections. As shown in Figure 1, a CNN commonly consists of the six layers listed below:
(1) Input layer: receives the dataset or image.
(2) Convolutional layer: convolution is the key computational step in a convolutional neural network. Convolution kernels connect each neuron only to a local region of the previous layer, rather than to all of its neurons, which reduces the amount of computation. Each convolution performs a feature extraction on the input data.
(3) Activation layer: the activation function converts the linear mapping in the network into a nonlinear one, and thus establishes a complex functional relationship between the temporal characteristics of the traffic flow and its predicted value.
(4) Pooling layer: subsamples the output of the convolutional layer. This not only retains the most representative features after convolution, but also further reduces the number of parameters in the model, which shortens computing time and helps prevent overfitting.
(5) Fully connected layer: the neurons of one layer are linked to the neurons of the previous layer through weights, establishing a functional relationship between the extracted features and the output, i.e. between the temporal characteristics extracted by the convolutional layers and the predicted traffic flow, so as to obtain a higher-level representation of the temporal characteristics.
(6) Output layer.
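To make the layer-by-layer pipeline above concrete, the following is a minimal numpy sketch of the convolution, activation and pooling steps applied to a toy traffic-flow series. It is purely illustrative, not the paper's implementation: the series values and the hand-picked "rising edge" kernel are invented for the example.

```python
import numpy as np

def conv1d(x, kernel):
    """Convolutional layer (valid mode): slide the kernel over the
    series and take dot products, one feature value per position."""
    n, k = len(x), len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(n - k + 1)])

def relu(x):
    """Activation layer: nonlinear mapping, zero for negative inputs."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Pooling layer: keep the most representative feature per window."""
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, size)])

# Toy traffic-flow series (vehicles per interval) and a hypothetical
# kernel that responds to increases in flow: x[i+2] - x[i].
series = np.array([10., 12., 15., 40., 80., 85., 82., 50., 20., 12.])
kernel = np.array([-1., 0., 1.])
features = max_pool(relu(conv1d(series, kernel)))
print(features)  # pooled "rising traffic" features
```

A real CNN would learn the kernel weights by backpropagation rather than fixing them by hand; the point here is only the data flow of layers (2)-(4).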

Transformer
The transformer model is somewhat similar to the sequence-to-sequence (seq2seq) model, which is mainly used in natural language processing applications where the input and output are sequences of indefinite length, such as machine translation. Both models essentially consist of an encoder and a decoder, which embed a multi-head attention mechanism to learn the dependencies among vectors. Multi-head attention divides the inputs into three parts: the query vector (Q), the key vector (K) and the value vector (V) [16]. Specifically, the query and key vectors are used to calculate the attention weights, and the result is multiplied by the value vector. Formula (1) gives the mathematical expression of the attention in the transformer:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V        (1)

where d_k is the dimension of K.
The encoder reads the input sequence and extracts its features, and then passes the obtained information to the decoder to generate the output sequence. The transformer model introduces a self-attention mechanism. Compared with the general attention mechanism, which only focuses on the similarity between the dataset and the predicted values, the self-attention mechanism can also capture the similarity within the dataset and within the predicted values themselves. In addition, the transformer adds a positional encoding module to overcome the fact that the attention mechanism itself is not sensitive to position information. A diagram of the transformer model [16] is provided below in Figure 2.
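Formula (1) can be sketched directly in numpy as a single-head scaled dot-product attention; this is an illustrative toy with random Q, K, V matrices (3 time steps, d_k = 4), not the paper's trained model.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,  formula (1).
    Scaling by sqrt(d_k) keeps the logits in a range where the
    softmax gradient does not vanish."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

# Toy example: 3 positions in the sequence, key dimension d_k = 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)           # (3, 4): one context vector per position
print(w.sum(axis=-1))      # each row of attention weights sums to 1
```

Multi-head attention simply runs several such heads in parallel on learned linear projections of Q, K and V and concatenates the results.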

Dataset information
The experimental data were taken from the official website of Highways England, which manages the core road network in England: it operates information services, publishes and handles traffic incident information, and liaises with other government agencies. Highways England has recorded the traffic flow on England's M-class and A-class highways for recent years. This paper selects the traffic recorded by several observation points on an interchange of highways A2 (east-west) and M25 (north-south) from January 1st to January 31st, 2022. Each observation point records every 15 minutes; besides the total volume at each moment, it records separate volumes for different vehicle sizes. The author collected around 30 datasets containing almost 90,000 records in total. The point with ID M25/4094A is taken as a typical sample and shown in Table 1.

Data processing
The algorithm divides the whole dataset into a training set and a test set at a ratio of 8:2. Furthermore, to verify the training effect more accurately, the training set is itself divided into a training part and a validation part, also at a ratio of 8:2. First, as a pre-processing step, the missing values (N/A) are filled in using the subsequent data. The general procedure of the experiment is to train the model on the training set and then apply it to predict the data in the test set; the results are compared with the true values to gauge the magnitude of the error. Before running the model, the dependent variable (traffic flow) is standardized to shrink the magnitude of the data and increase the speed of iteration and convergence. The stride is set to 15 records, used to predict the next 15 values. The researcher also determines other parameters such as the batch size and the number of epochs. In general, a larger batch size reduces training time but requires more epochs to reach the desired precision; after a comprehensive comparison, the default batch size is 256 and the default number of epochs is 1000. To avoid overfitting, an early-stopping function determines when training terminates: it stops after a further 10 iterations if the monitored indicator shows no improvement, which also selects the optimal number of iterations. Finally, a proper initial learning rate and momentum in the optimizer are also crucial for the model to reach the global optimum. The model training parameter settings are summarized in Table 2.
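The standardization, the stride-15 windowing and the nested 8:2 splits described above can be sketched as follows. This is a minimal illustration on a synthetic stand-in series, not the paper's actual pipeline or data; the N/A back-filling step is omitted for brevity.

```python
import numpy as np

STRIDE = 15  # 15 past records are used to predict the next 15

def standardize(x):
    """Shrink the magnitude of the data to speed up convergence."""
    return (x - x.mean()) / x.std()

def make_windows(series, n_in=STRIDE, n_out=STRIDE):
    """Slice the series into (input, target) pairs of length 15 each."""
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        y.append(series[i + n_in:i + n_in + n_out])
    return np.array(X), np.array(y)

flow = standardize(np.arange(100, dtype=float))   # stand-in for real flow data
X, y = make_windows(flow)

split = int(len(X) * 0.8)                         # 8:2 train/test split
X_train, X_test = X[:split], X[split:]
inner = int(len(X_train) * 0.8)                   # 8:2 train/validation split
X_tr, X_val = X_train[:inner], X_train[inner:]
print(X.shape, len(X_tr), len(X_val), len(X_test))
```

In practice the splits would be made before windowing (or with a gap) to avoid overlap between training and test windows; the sketch keeps the simpler form for readability.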

Model training
In this research, the CNN model is utilized as an encoder and the transformer model as a decoder.
There are mainly two steps in the data prediction: first, extracting the features from the data, and second, processing those features. The CNN and the transformer are responsible for these two operations, respectively.
(1) Activation function. There are several well-known activation functions in deep learning, such as Sigmoid, Tanh and Softmax, which are simple, convenient and fast. For convolutional neural networks the most commonly used is the ReLU function: compared with other activation functions it has strong expressive power, and it avoids the vanishing gradient problem because its gradient is constant over the non-negative range. ReLU is therefore the function selected here.
(2) Loss function. One of the goals of this research is to minimize the prediction errors, so standards are needed to measure them. The paper adopts the mean squared error (MSE), mean absolute error (MAE) and root mean square error (RMSE):

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
RMSE = √MSE

where yᵢ is the true value, ŷᵢ the predicted value and n the number of samples.
(3) Optimizer. Adaptive Moment Estimation (ADAM) is chosen: it is effective, it remains stable when parameters must be updated without drastic gradient descent, the initial learning rate can be set to a reasonably adequate value and needs little further adjustment, and it scales to extremely large datasets.
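The three error metrics are standard and can be written in a few lines; the sketch below uses made-up true/predicted traffic volumes purely to show the computation.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the residuals."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of MSE, in the data's own units."""
    return float(np.sqrt(mse(y_true, y_pred)))

# Invented example values (vehicles per 15-minute interval).
y_true = np.array([100., 120., 80., 90.])
y_pred = np.array([110., 115., 85., 90.])
print(mse(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred))
```

MSE penalizes large errors quadratically, MAE treats all errors linearly, and RMSE restores the units of the original data, which is why the three are usually reported together.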

Experiment results
To verify convergence, the researcher first plots the loss value after each epoch in Figure 3. The chart shows that the optimum appears at the 3rd epoch, where the loss value is less than 0.032; the trained model is taken at this point. For a prediction problem, error metrics are also indispensable criteria of model performance, so the MSE, MAE and RMSE are plotted in Figures 4-6, which likewise confirm the rapid convergence. Finally, the forecast is compared with the true values of the test set in Figure 7. Even though some gaps are difficult to eliminate, the two lines almost coincide, so the model achieves a desirable goodness of fit.

Discussion
Generally speaking, the model is able to capture the characteristics of traffic volume at different times of day and can predict the peaks and troughs of the traffic volume. The amount of data obtained is relatively small and lacks complexity, so the test conditions are limited; some underfitting and slightly undesirable results are therefore inevitable when using this model for data analysis. Nevertheless, convergence is rapid during training, and the error is quickly reduced in a short time. At the same time, thanks to the attention mechanism, the model can extract the temporal features of the traffic data more accurately during operation, which improves its performance.
In fact, traffic flow is affected by many factors, such as weather, holidays, occasional events or policy changes, so many other practical studies on traffic flow forecasting take more complicated situations into account. Moreover, traffic flow has an obvious temporal-spatial correlation; adding the spatial dimension alone can increase the complexity of the dataset substantially, and the model applied to such a dataset must be correspondingly more elaborate. Through the study and analysis of various spatio-temporal series prediction methods, the combination of a convolutional neural network with the attention-based transformer model proposed in this paper to predict traffic volume over a given period is well suited to dealing with such sophisticated, large datasets.
Moreover, as mentioned above, this experiment sets the stride to 15 records to predict 15 values. This should be regarded as an empirical value: the researcher varied it over several tests and found that a length of 15 gives the most balanced computing budget. In other words, it neither imposes a large computational burden nor produces imprecise results from insufficient computation. This shows that some parameters which seem unimpressive can lead to entirely different results once they are changed.

Conclusion
The variation tendency of traffic volume has always been a main concern of the relevant departments and of all travellers. Timely and reliable forecasts provide an effective basis for traffic management departments to formulate schemes that allocate traffic resources reasonably. This paper combined a convolutional neural network with a transformer model; the test results show that although underfitting occurs because the amount of data is rather small, leading to a slight error, the model performs acceptably and minimizes the error in a short time. It can reasonably be speculated that in practical applications with huge amounts of data, the advantages of this model will be more prominent. Future research on data prediction could further optimize this model so that it also applies when the amount of data is limited, or explore other methods better suited to those situations.

Figure 2. Diagram of transformer model structure.

Figure 3. Diagram of epoch-validation loss from the model.

Figure 6. RMSE-epoch chart.

Figure 7. Comparison of the predicted value and real value.

Table 1 .
Traffic flow dataset on highway M25.

Table 2 .
Model training parameter settings.