Short-term Forecast for Average Speed of Road Section based on Floating Car Data with Support Vector Regression

Recent intelligent transportation systems have made remarkable progress in traffic data collection, resource allocation and intelligent programming. However, progress in real-time traffic data processing and prediction remains limited. To predict the average speed of road sections in real time, we introduce a prediction method that mines GIS floating car data with the support vector regression algorithm. The results indicate that the proposed method outperforms other commonly used algorithms, including linear regression, artificial neural networks, Bayesian regression and ridge regression. In addition, its quick convergence and good fit support the suitability of the method in this domain.


Introduction
The rapid development of cities has created great demand for intelligent systems that provide better services for daily life and manage limited resources well. Among them, the Intelligent Transportation System (ITS), which aims to optimize urban transportation resources, has attracted increasing attention from industry, academia and government agencies [1]. By integrating information, communications, abundant sensors and other suitable technologies, ITS attempts to support automatic data analysis, efficient resource allocation and smart service assignment for the people, road networks and vehicles involved [2]. Although ITS has been widely employed in city planning for a few decades, a significant recent change is the explosion in the volume of transportation data. Likely causes are the rapid development of sensors for data collection and the various social platforms for sharing transportation information. The availability of massive data is driving a shift from conventional technology-driven transportation systems to data-driven ones. To exploit data from multiple sources and improve the performance and efficiency of transportation systems, machine learning techniques have been introduced to address related problems. A significant advantage of machine learning in dealing with transportation big data lies in mining task-relevant patterns, uncovering predictable regularities and optimizing the decision-making process [3]. Instead of relying on collected expert experience, machine learning can capture plausible rules and improve the generalization of systems in the era of big data [4].
With respect to ITS, there are several issues to address, such as traffic status supervision, automatic path planning and transportation vehicle allocation. Among them, the prediction of real-time traffic status is regarded as one of the most critical topics, since it bridges the gap between data collection and ITS design. Specifically, real-time forecasts of the average speed of road sections can assist traffic control and help keep traffic moving freely if decision-makers respond properly. Meanwhile, intuitive traffic information allows travelers to choose routes more sensibly. In this paper, we focus on the use of machine learning algorithms for short-term prediction of the average speed of road sections. After collecting the GIS floating car data generated on urban roads and pruning redundant information, we calculated the average speed of floating cars on selected road sections. Considering its robustness and generality, a classical algorithm called support vector regression (SVR) was adopted to construct the prediction model.
The remainder of this paper is arranged as follows. Related work is summarized in Section 2, in which we discuss methodologies for average speed prediction in road networks. Section 3 details the methods and techniques for collecting and processing the GIS floating car data, as well as the calculation of the average speed of cars on a road. The SVR prediction model is introduced in Section 4, where we elaborate on the principles of the model. In Section 5, we collect transportation data from Shenzhen in the form of time windows to verify the plausibility and effectiveness of SVR for real-time average velocity prediction.

Related work
The field of function estimation and regression prediction has witnessed significant progress since the adoption of the SVM method proposed by Vapnik [5]. A novel and efficient pairing ν-support vector regression (pair-v-SVR) algorithm was introduced by Pei-Yi Hao, which combines the advantages of twin support vector regression (TSVR) and the classical ε-SVR algorithm [6]. In the identification of nonlinear systems in RKHS spaces, the SVR method has proven its effectiveness, showing excellent fitting residuals, while the regularization network has shown its superiority in reducing computation time [7]. Considering the complex characteristics of traffic systems, such as nonlinearity, time variation, randomness and uncertainty, Thomas Epelbaum et al. utilized deep learning models to capture regression patterns in time series data [8]. These algorithms are designed to address real-time average speed prediction of road sections based on Floating Car Data (FCD).
Earlier work on traffic flow characteristics focuses mainly on the relationships among three quantities: traffic flow, average speed and density. Greenshields first developed a linear model describing the relationship between velocity and density in 1935 [9]. The model, which was adopted by the U.S. Department of Transportation, assumed that the flow rate is linear with the velocity before it reaches the maximum and then follows a curve between the maximum and the coordinate origin [10]. Natalia Isaenko et al. designed an integrative framework capable of recognizing and selecting a suitable method for traffic forecasting with individual FCD [11].
The main challenge in analyzing FCD comes from geographic data error. Map-matching is an important information processing step that can minimize such errors effectively. Jia-Ching Ying et al. developed a novel modularity-based map-matching algorithm called Urban Map-Matching (UrbMatch) that utilizes urban GPS trajectories [12]. The spatial and temporal conditional random field (ST-CRF) method offers better performance and robustness on low-frequency trajectory data (e.g., one GPS point every 1-2 minutes) [13]. Mahdi Hashemi put forward a weight-based map-matching algorithm that can be applied in real time to complex urban road networks [14]. To calculate the average speed of road sections, Yanace J. L. et al. applied the least squares method to the instantaneous speed of the floating car [15]. However, from a statistical perspective, it is difficult to control the estimation error.

Floating vehicle sample collection
A floating car, also called a probe vehicle, is a vehicle equipped with a GPS positioning system and a wireless communication device. Such vehicles collect their own traffic data on the road, such as speed, transmission time, latitude and longitude, direction, passenger status, the distance from the previous point and other information. The collection of FCD is a sampling survey of the road traffic network.
The number of floating car samples should be determined before floating car traffic information is collected. The number of floating cars in the road network should be large enough to ensure the accuracy of traffic flow parameter estimation. At the same time, however, the relationships among road coverage, the information update cycle and the number of floating vehicles should be further discussed. Earlier works [16,17] provide ways to determine the number of floating car samples with multiple factors considered. The relationship among the number of floating car samples, traffic parameters, road coverage and information refresh cycle can be approximated by a formula in which N is the number of floating car samples, β is the road coverage, and the floating car density is itself expressed through the average speed of traffic flow, the information update cycle t and the length l of the link.
Given the length of the link, the information refresh cycle, and the traffic flow rate, the relationship between coverage and sample size can be obtained.

Floating car data processing
After the floating car data are obtained, the first step is generally to match them to the map and prune anomalous data. Map-matching is crucial for floating car information processing. The underlying idea is to compare the vehicle trajectory obtained by the data acquisition system with the road information in the electronic map database, and then map the vehicle to the most probable position on the map with an effective algorithm [18]. In this paper, we use the ST-Matching algorithm to incorporate the spatial connectivity information of the road network. Figure 1 illustrates the change after matching crossroad data points with the ST-Matching algorithm. To eliminate invalid data, the criteria for identifying them must be defined. Based on the requirements of predictive modeling and the actual situation, this paper identifies the following screening criteria:
(1) Wrong data.
    (a) Repeated records.
    (b) Records with the same car ID that have different positions but the same time.
    (c) Two neighboring points with the same car ID that have different positions but a distance of 0.
(2) Invalid data.
    (a) Taxis without passengers tend to cruise at low speed or park while looking for business, so their data cannot reflect the real situation.
    (b) If a car with the same ID stays at the same latitude and longitude for a long time (more than 2 min), we regard the vehicle as being in an abnormal driving condition, and its data should be removed.
    (c) If two neighboring floating car points have too long a time interval (more than 120 seconds), they cannot reflect the real average speed.
    (d) Due to the speed limit on urban roads, the calculated car speed should satisfy this constraint: data whose average speed exceeds 80 km/h should be removed [19].
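The screening rules above can be illustrated with a minimal sketch. The record layout (`Fix` with `car_id`, `t`, `lat`, `lon`) and the function names are assumptions made for illustration and are not taken from the original data set. As a simplification, zero-distance pairs are dropped regardless of how long the vehicle stood still, and the passenger-status rule is omitted.

```python
import math
from collections import namedtuple

# Hypothetical record layout; field names are assumptions for illustration.
Fix = namedtuple("Fix", "car_id t lat lon")  # t is a timestamp in seconds

MAX_GAP_S = 120      # neighboring points further apart in time are not paired
MAX_SPEED_KMH = 80   # upper bound derived from the urban speed limit


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))


def valid_segment_speeds(fixes):
    """Return (car_id, t_start, t_end, speed_kmh) for point pairs that pass screening."""
    # Remove exact duplicates, then group the points of each car in time order.
    by_car = {}
    for f in sorted(set(fixes), key=lambda f: (f.car_id, f.t)):
        by_car.setdefault(f.car_id, []).append(f)

    out = []
    for car_id, pts in by_car.items():
        # A timestamp reported with more than one position is contradictory.
        counts = {}
        for p in pts:
            counts[p.t] = counts.get(p.t, 0) + 1
        pts = [p for p in pts if counts[p.t] == 1]

        for a, b in zip(pts, pts[1:]):
            dt = b.t - a.t
            if dt > MAX_GAP_S:
                continue  # interval too long to reflect the real speed
            dist = haversine_m(a.lat, a.lon, b.lat, b.lon)
            if dist == 0:
                continue  # stationary vehicle or zero-distance pair
            speed = dist / dt * 3.6  # m/s to km/h
            if speed > MAX_SPEED_KMH:
                continue  # violates the urban speed constraint
            out.append((car_id, a.t, b.t, speed))
    return out
```

Averaging the returned speeds per road section and time interval would then give one possible estimate of the section's average speed.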

Support vector regression model
Two typical SVR algorithms, ε-SVR and ν-SVR, are commonly used in regression. ν-SVR is superior to ε-SVR in terms of regression accuracy and maneuverability [20]. To avoid human error in the choice of coefficients, we choose ν-SVR as the prediction model. One advantage of SVR is the powerful feature transformation trick, which maps the input into a high-dimensional Hilbert space. The nonlinearity of the mapping gives the model greater capacity to characterize the underlying decision boundary. Besides, the model has a built-in tolerance mechanism that makes it more stable even in the presence of response noise. We briefly describe the ν-SVR formulation following Vapnik [21]. Given a training dataset of pairs (x, y), where x is the attribute vector and y is the response variable, the objective of SVR is to learn a decision function that maps x to y. Note that the decision function involves a feature transformation, and its parameters are learned in a tuning process.
The hyperparameter C controls the degree of tolerance to noise, and the norm of the parameter vector can be viewed as a penalty term that generally leads to better generalization [22]. The number of support vectors is controlled by the parameter ν, in the range (0, 1]. Note that training samples falling inside the ε-tube have zero loss, and samples outside the ε-insensitive zone are linearly penalized using the slack variables ξ_i, ξ_i* ≥ 0, i = 1,…,N. Problem (3) is usually solved in its Lagrangian dual form (see [23] for details), in which the kernel function takes the place of the inner product between transformed inputs.
Solving the dual formulation (4) yields the optimal parameter values, which are used to construct the optimal SVR function. In the optimal solution, training samples with non-zero coefficients are the support vectors (SVs), corresponding to data points on the boundary of or outside the ε-insensitive zone.
The non-linear kernel can be computed with the dot product in (7) to extend linear SVR to a non-linear setting. The kernel implicitly captures the non-linear mapping of the data x into the feature space. Commonly used kernel functions include the polynomial kernel, the radial basis function (RBF) kernel and the sigmoid kernel [24].
The predictors of the model are the average speeds of the target road section, as well as of its upstream and downstream sections, over the preceding periods. To construct the input matrix, we take the average speed of road l at time t and arrange these values in the input data format, where p is the number of periods (1 minute each) that are traced back when predicting the average speed. The data are divided into a training set and a testing set to carry out the experiments.
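For orientation, the ν-SVR primal problem and the resulting regression function are sketched below in the form commonly given in the SVM literature and in the LIBSVM documentation. This is assumed, not confirmed, to correspond to the problem and the optimal SVR function referred to in this section. The symbols w, b, φ, α_i and α_i^* follow the usual notation and are not taken from the original text.

```latex
% Assumed standard form of the nu-SVR primal problem
\[
\begin{aligned}
\min_{w,\,b,\,\varepsilon,\,\xi,\,\xi^*}\quad
  & \frac{1}{2}\lVert w\rVert^{2}
    + C\Big(\nu\varepsilon + \frac{1}{N}\sum_{i=1}^{N}\big(\xi_i+\xi_i^*\big)\Big) \\
\text{s.t.}\quad
  & \big(w^{\top}\phi(x_i)+b\big)-y_i \le \varepsilon+\xi_i, \\
  & y_i-\big(w^{\top}\phi(x_i)+b\big) \le \varepsilon+\xi_i^*, \\
  & \xi_i,\ \xi_i^* \ge 0,\quad i=1,\dots,N,\qquad \varepsilon \ge 0 .
\end{aligned}
\]

% Regression function built from the dual coefficients
\[
f(x)=\sum_{i=1}^{N}\big(\alpha_i^*-\alpha_i\big)\,K(x_i,x)+b,
\qquad
K(x_i,x_j)=\phi(x_i)^{\top}\phi(x_j).
\]
```

In this form ε is itself an optimization variable, so the tube width is set by the data once ν has been chosen, and only samples with α_i^* − α_i ≠ 0 contribute to f(x).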
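The three kernel families named here can be written out in a few lines. The sketch below uses their usual textbook forms; the parameter names `gamma`, `coef0` and `degree` follow the LIBSVM convention and are an assumption, since the original does not state its parameterization.

```python
import math


def dot(x, z):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, gamma=1.0, coef0=1.0, degree=3):
    """K(x, z) = (gamma * <x, z> + coef0) ** degree"""
    return (gamma * dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    """K(x, z) = tanh(gamma * <x, z> + coef0)"""
    return math.tanh(gamma * dot(x, z) + coef0)
```

The RBF kernel depends only on the distance between the two inputs, so it equals 1 for identical inputs and decays toward 0 as they move apart, with `gamma` setting how fast.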
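One possible way to build such an input matrix is sketched below. The function name, the dictionary layout and the column order (p lagged speeds per section, followed by the target speed) are assumptions for illustration; the original layout is not recoverable from the text.

```python
def build_lagged_matrix(speeds, target, p):
    """Build rows of lagged average speeds plus the value to predict.

    speeds: dict mapping a section name to its list of per-period average
            speeds (all lists have the same length).
    target: name of the section whose speed is predicted.
    p:      number of past periods traced back.

    Each row holds, for every section in sorted order, the speeds at
    periods t-p .. t-1, followed by the target section's speed at period t.
    """
    sections = sorted(speeds)
    n_periods = len(speeds[target])
    rows = []
    for t in range(p, n_periods):
        row = []
        for s in sections:
            row.extend(speeds[s][t - p:t])   # p lagged values of section s
        row.append(speeds[target][t])        # response variable
        rows.append(row)
    return rows
```

With five hypothetical sections and p = 3, each row has 5 × 3 + 1 = 16 entries, and a series of n periods yields n − p rows.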

Experiment
In this section, we use the taxi floating car data of Futian District, Shenzhen, from 17:00 to 19:00 on October 1st, 2017. After data preprocessing, we compare the SVR predictions with those of other commonly used regression algorithms.

Data Processing
We select the average speed data of the unidirectional section of Hongli Road from east to west (from Caitian Road to Xintian Road). The attributes of the target section and its related sections are shown in Table 1. We first run a map-matching process, and the scattered points are then projected onto the corresponding line segments of the road based on the driving direction of the vehicle. The speed and driving direction of the vehicle can be calculated from the change in latitude and longitude.
According to the calculation formula mentioned in Section 3.1, the road coverage rate is calculated as follows. We define the length of a time interval as 5 minutes, and the road coverage rate of each road segment is calculated under the minimum sample quantity and the lowest average speed. The minimum number of samples is taken from the time interval with the fewest FCD points on the road. As can be seen in Table 2, the road coverage rate exceeds 90% even with the minimum average speed and the minimum number of samples. In this case, the sample size in any time interval satisfies the minimal accuracy requirement.
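The projection and direction steps can be sketched as follows, assuming a local planar approximation that is adequate over the length of a single road section. The helper names are hypothetical, and this is not the ST-Matching procedure itself, only the geometric step applied once a point has been assigned to a segment.

```python
import math

EARTH_RADIUS_M = 6371000.0


def to_local_xy(lat, lon, lat0, lon0):
    """Approximate meters east (x) and north (y) of the reference point (lat0, lon0)."""
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def project_onto_segment(p, a, b):
    """Closest point to p on the segment a-b; all points are (x, y) in meters."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return a  # degenerate segment
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))  # clamp so the result stays on the segment
    return (ax + t * dx, ay + t * dy)


def heading_deg(p, q):
    """Compass heading of the move from p to q: 0 = north, 90 = east, 270 = west."""
    return math.degrees(math.atan2(q[0] - p[0], q[1] - p[1])) % 360.0
```

For an east-to-west section, consecutive points of a vehicle travelling in the selected direction would have a heading close to 270 degrees, which gives a simple way to separate the two directions of travel.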

Prediction model
According to the requirements of SVR model training and the characteristics of the road sections, the average speed of the road sections needs to be prepared in the following format, where p is the backtracking coefficient.
The calculated data of the various road sections are arranged in a matrix according to the format in Table 4. Taking p = 3 as an example, a matrix of size [111 × 16] is formed. We then normalize these data. Finally, we take the first 80 rows as the training set and the last 31 rows as the testing set.
During model training, the parameter C ranges over the interval [-5, 200] with a step of 5, and the parameter g of the RBF kernel ranges over the interval [-5, 5] with a step of 1. We performed a grid search over these parameters to obtain the optimal values. The result is shown in Figure 2. Based on this rough selection, we narrow the search range: we set the parameter C in the interval [-1, 50] with a step of 0.5 and the parameter g in the interval [-1, 1] with a step of 0.1, and carry out a finer search for the optimum. The result is shown in Figure 3.
According to the optimization results, the best value of C is 16 and that of g is 0.5. The mean square error at this point is 18. To better evaluate the performance of the model, we compared the predictions of SVR, ANN, linear regression, Bayesian ridge and ridge regression on the same dataset. The results are shown in Figure 5, and the MSE of each algorithm on the testing set is shown in Table 4.
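A minimal sketch of the normalization and partition steps follows. The text does not say which normalization was used, so column-wise min-max scaling is an assumption, as are the function names.

```python
def min_max_normalize(rows):
    """Scale every column of a row-major matrix to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column has no spread, so it is mapped to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def chronological_split(rows, n_train):
    """Keep time order: the first n_train rows train, the rest test."""
    return rows[:n_train], rows[n_train:]
```

Applied to a matrix of 111 rows with `n_train=80`, this gives the 80/31 partition described above. Splitting by position rather than at random keeps every test row later in time than every training row.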
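The coarse-then-fine search can be sketched independently of any particular SVR library. In the sketch below, `score(c, g)` is a hypothetical callable that stands in for the validation error of a model trained with parameters C and g. The function names and the `shrink` factor are assumptions for illustration.

```python
def frange(start, stop, step):
    """Evenly spaced values from start to stop inclusive."""
    n = int(round((stop - start) / step))
    return [start + i * step for i in range(n + 1)]


def grid_search(score, c_values, g_values):
    """Return (best_score, best_c, best_g), minimizing score over the grid."""
    best = None
    for c in c_values:
        for g in g_values:
            s = score(c, g)
            if best is None or s < best[0]:
                best = (s, c, g)
    return best


def coarse_to_fine(score, c_range, g_range, c_step, g_step, shrink=10):
    """Coarse grid first, then a finer grid around the coarse optimum."""
    _, c0, g0 = grid_search(score,
                            frange(c_range[0], c_range[1], c_step),
                            frange(g_range[0], g_range[1], g_step))
    # Refine within one coarse step on either side of the coarse optimum.
    fine_c = frange(c0 - c_step, c0 + c_step, c_step / shrink)
    fine_g = frange(g0 - g_step, g0 + g_step, g_step / shrink)
    return grid_search(score, fine_c, fine_g)
```

The second pass costs far fewer evaluations than running the fine step over the whole original range, at the risk of missing an optimum that lies outside the neighborhood chosen by the coarse pass.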
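The comparison metric is straightforward to state in code. The sketch below computes the mean square error and ranks models by it; the model names and numbers in the usage note are placeholders and not results from this study.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def rank_models(y_true, predictions):
    """predictions maps a model name to its predicted values.

    Returns (name, mse) pairs sorted from lowest to highest error.
    """
    scored = [(name, mean_squared_error(y_true, pred))
              for name, pred in predictions.items()]
    return sorted(scored, key=lambda item: item[1])
```

For example, `rank_models([30.0, 32.0], {"model_a": [31.0, 32.0], "model_b": [35.0, 30.0]})` places `model_a` first because its error is smaller.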

Discussion and conclusion
In this work, we demonstrated the superior performance of SVR in forecasting the average speed of road sections. Instead of considering data from a single road section, we aggregate the data of related road sections in a high-dimensional space, aiming at quick convergence. Our results suggest that SVR handles this type of input well, with a smaller mean square error than the other algorithms. The condition considered in this work is a typical scene on the urban road network. SVR can deal with forecast problems under longer backtracking times and complex road network conditions. With access to more data, we could find more regular patterns in the daily long-term data of each road section in our forecast index system. In future work, we will address these limitations, adapting our forecast method to accommodate daily long-term data and road network domains.