Using support vector regression to predict PM10 and PM2.5

Support vector machines (SVM), a novel and powerful machine learning tool, can be used to predict atmospheric PM10 and PM2.5 (particulate matter with a diameter less than or equal to 10 and 2.5 micrometers, respectively). This paper describes the development of a successive over-relaxation support vector regression (SOR-SVR) model for PM10 and PM2.5 prediction, based on the daily average aerosol optical depth (AOD) and meteorological parameters (atmospheric pressure, relative humidity, air temperature, wind speed), all measured in Beijing during 2010–2012. The Gaussian kernel function, together with k-fold cross-validation and grid search, is used in the SVR model to select the parameters that give the best generalization capability. The results show that the values predicted by the SOR-SVR model agree well with the actual data and that the model generalizes well when predicting PM10 and PM2.5. In addition, AOD plays an important role in predicting particulate matter with the SVR model and should be included in the prediction model. If only the meteorological parameters are considered and AOD is excluded from the SVR model, the prediction results for particulate matter are not satisfactory.


Introduction
In recent years, with the rapid development of industrialization and urbanization in China, air pollution has become a growing problem, especially for urban communities. The health effects of airborne particulates differ with particle size. In this contribution, PM10 and PM2.5 (particulate matter with a diameter less than or equal to 10 and 2.5 micrometers, respectively) are considered because of their effects on human health. Epidemiologic studies indicate strong links between the concentration of PM10 or PM2.5 and public morbidity and mortality from respiratory and cardiovascular diseases. Recently, PM has become the primary air pollutant in most major cities in China, which not only threatens people's health but also reduces atmospheric visibility and degrades the city scenery [1][2][3]. Support vector machines (SVM) are a statistical learning technique grounded in machine learning and generalization theory, and they can be regarded as a method for minimizing the risk. Moreover, their generalization capability makes it possible to apply them to modeling dynamical and non-linear data sets. This study is motivated by the growing popularity of support vector regression (SVR), which has been reported to generalize better than conventional methods such as artificial neural networks (ANN) [2][4][5][6][7][8]. This paper presents a study that uses the SVR model to investigate PM10 and PM2.5, which were measured in Beijing during 2010–2012. The successive over-relaxation support vector regression (SOR-SVR) models are trained on the PM2.5 (or PM10) data, the aerosol optical depth (AOD), and meteorological parameters (atmospheric pressure, relative humidity, air temperature, wind speed), all measured in Beijing over the same period. With the SVR model, based on AOD and meteorological parameters, we can predict the regional particulate matter (PM) in Beijing, China.

Support vector regression
where w is the normal vector, b is the threshold, C is a regularization constant determining the trade-off between the training error and the generalization performance, ξ and ξ* are the slack variables, and ε is the tolerance (error acceptance). Then the function as This problem is called ε-support vector regression (ε-SVR) and a data point
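For reference, the standard ε-SVR primal problem from the literature can be written in LaTeX as below. This is a sketch of the textbook form, not a recovery of the equations printed in the original paper, which may differ in detail. Here w is the normal vector, b the threshold, C the regularization constant, ξ and ξ* the slack variables, and ε the tolerance; the feature map φ, the sample index i, and the sample count n are notation assumed for this illustration.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*}\quad
  & \tfrac{1}{2}\, w^{T} w + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right) \\
\text{s.t.}\quad
  & y_i - w^{T}\varphi(x_i) - b \le \varepsilon + \xi_i, \\
  & w^{T}\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \\
  & \xi_i \ge 0,\quad \xi_i^{*} \ge 0, \qquad i = 1,\dots,n, \\[4pt]
f(x) \;=\; & w^{T}\varphi(x) + b .
\end{aligned}
```

In this form, errors smaller than ε are ignored, and larger deviations are penalized linearly through the slack variables with weight C.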

SOR-SVR model
For the standard SVR model, if we append the term b² to wᵀw, that is, maximize the margin between the parallel separating planes with respect to both w and b, and rewrite the problem in matrix and vector form, this leads to a reformulation of the SVR problem in which K is the kernel function for the nonlinear case and L is the strictly lower triangular part of the symmetric matrix Q. The dual problem can thus be simplified. Solving Eq. (6) with the successive over-relaxation (SOR) method gives the iterative formula of the SOR algorithm [4], in which the projection operator denotes the nonnegative gradient projection. The regression function then follows from the dual solution. This SVR problem is called the SOR-SVR model, and Figure 1(a) illustrates the diagram of this model.
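The SOR iteration described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dual is derived here from the standard ε-SVR primal with b²/2 added to the objective, the Gaussian kernel is parameterized as exp(-gamma * ||x - x'||²), and the function names and the default relaxation factor `omega` are hypothetical. Each coordinate update is a relaxed Gauss-Seidel step projected back onto the box [0, C].

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    # Assumed kernel form: exp(-gamma * squared Euclidean distance).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def sor_svr_fit(X, y, C=100.0, eps=0.1, gamma=0.3, omega=1.3, sweeps=200):
    """Projected SOR on the box-constrained dual
    min 0.5 * z'Qz + q'z  subject to 0 <= z <= C, with z = [alpha; alpha*]."""
    n = len(y)
    # The "+ 1" is the contribution of the appended b^2 term.
    H = gaussian_kernel(X, X, gamma) + 1.0
    Q = np.block([[H, -H], [-H, H]])
    q = np.concatenate([eps - y, eps + y])
    z = np.zeros(2 * n)
    history = []  # dual objective after each sweep
    for _ in range(sweeps):
        for i in range(2 * n):
            grad_i = Q[i] @ z + q[i]
            # Relaxed coordinate step (0 < omega < 2), then projection.
            z[i] = np.clip(z[i] - omega * grad_i / Q[i, i], 0.0, C)
        history.append(0.5 * z @ Q @ z + q @ z)
    beta = z[:n] - z[n:]  # alpha - alpha*
    return beta, z, np.array(history)

def sor_svr_predict(X_train, beta, X_new, gamma=0.3):
    # f(x) = sum_i beta_i * (K(x_i, x) + 1); the threshold b equals sum(beta).
    return (gaussian_kernel(X_new, X_train, gamma) + 1.0) @ beta
```

Because the threshold is absorbed into the objective, the dual has only box constraints and no equality constraint, which is what makes a coordinate-wise SOR sweep applicable.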

Collect and analyze dataset
The data used in this study describe air quality in Beijing and comprise the particulate matter (PM2.5) concentrations monitored by the U.S. Embassy in Beijing, China, the aerosol optical depth (AOD) measured by the CE318 sun photometer at RADI (Institute of Remote Sensing and Digital Earth, CAS), and the meteorological data, including air pressure, temperature, relative humidity, and wind speed (in m/s), measured by the China Meteorological Administration. The three observation sites are not far from each other, and the data were measured from 2010 to 2012. Taking the PM2.5 prediction as an example, after deleting the outliers from the dataset we obtain the daily averages of PM2.5, AOD, and the meteorological variables; 300 daily records are then selected for the simulation and discussion. The variables of the dataset are presented in table 1, and the time series are shown in figure 2. The AOD values have been converted to 550 nm using the Ångström formula.
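The conversion of AOD to 550 nm with the Ångström formula can be illustrated with a short sketch. The source does not state which photometer channels were used, so the two-wavelength form below and the example wavelengths (440 nm and 870 nm) are assumptions, and the function names are hypothetical.

```python
import math

def angstrom_exponent(aod_1, wl_1, aod_2, wl_2):
    """Angstrom exponent alpha from AOD at two wavelengths (same units)."""
    return -math.log(aod_1 / aod_2) / math.log(wl_1 / wl_2)

def aod_at(target_wl, aod_1, wl_1, aod_2, wl_2):
    """Interpolate AOD to target_wl using tau(wl) = tau_1 * (wl / wl_1) ** (-alpha)."""
    alpha = angstrom_exponent(aod_1, wl_1, aod_2, wl_2)
    return aod_1 * (target_wl / wl_1) ** (-alpha)

# Example with assumed channel values: AOD 0.5 at 440 nm and 0.25 at 870 nm.
aod_550 = aod_at(550.0, 0.5, 440.0, 0.25, 870.0)
```

The interpolated value lies between the two measured values, and evaluating the function at either measured wavelength returns the corresponding input AOD.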

Design and test SOR-SVR model
Several issues need to be considered when applying the SOR-SVR model. First of all, some parameters must be determined before running the algorithm. These parameters are the error acceptance (ε), the regularization constant (C), and the kernel-specific parameters. In this work, the Gaussian kernel function is used, and its kernel parameter determines the learning performance of the kernel function.
When training and testing the SVR model, k-fold cross-validation and grid search are used to obtain the optimal values of ε, C, and the kernel parameter. In k-fold cross-validation, for given values of ε, C, and the kernel parameter, the dataset is divided into k parts; k-1 parts are used as the training dataset and the remaining part as the testing dataset, which yields the average relative testing error. In this way, the model is trained and tested k times, and the average of the k testing errors is taken as the final error. Over different values of ε, C, and the kernel parameter, we search for the combination that gives the minimum final error and select it as the optimal parameter set for the SOR-SVR model. The diagram of the prediction with the SVR model is shown in figure 1(b). Figure 3 compares the actual and predicted PM2.5 when the training dataset is also used as the testing dataset, for different parameter sets (ε, C, kernel parameter). As figure 3(a) shows, although the regression results match the actual PM2.5 extremely well, the generalization ability of this model is poor because of over-fitting. With k-fold cross-validation, over-fitting can be prevented; combined with grid search, it yields the optimal parameters (ε = 0.1, C = 100, kernel parameter = 0.3) for good generalization ability.
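The k-fold cross-validation and grid search procedure described above can be sketched generically as follows. This is an illustration under assumptions rather than the authors' code: `fit` and `predict` are placeholders for whatever SVR solver is used, the folds are shuffled with a fixed seed, and the Gaussian kernel ridge regressor at the end is only a stand-in (it is not the SOR-SVR model) so that the example runs on its own.

```python
import itertools
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cv_error(fit, predict, X, y, params, k=5, seed=0):
    """Average relative testing error over the k folds."""
    folds = k_fold_indices(len(y), k, seed)
    errors = []
    for j, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for m, f in enumerate(folds) if m != j])
        model = fit(X[train_idx], y[train_idx], **params)
        pred = predict(model, X[test_idx])
        errors.append(np.mean(np.abs(pred - y[test_idx]) / np.abs(y[test_idx])))
    return float(np.mean(errors))

def grid_search(fit, predict, X, y, grid, k=5, seed=0):
    """Evaluate every parameter combination; return (best_params, best_error)."""
    names = list(grid)
    best_params, best_err = None, np.inf
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        err = cv_error(fit, predict, X, y, params, k, seed)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err

# Stand-in regressor (an assumption, NOT the paper's SOR-SVR):
# Gaussian kernel ridge regression with hypothetical parameters lam and gamma.
def fit_kernel_ridge(X, y, lam, gamma):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    coef = np.linalg.solve(np.exp(-gamma * d2) + lam * np.eye(len(y)), y)
    return X, coef, gamma

def predict_kernel_ridge(model, X_new):
    X_train, coef, gamma = model
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2) @ coef
```

With an SVR solver plugged in, the grid would range over ε, C, and the kernel parameter, for example `{"eps": [...], "C": [...], "gamma": [...]}`, where those keyword names are again hypothetical. The relative error assumes strictly positive targets, which holds for PM concentrations.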

Results
With the optimal parameters, we select 250 days of data as the training dataset and 50 days of data as the testing dataset; the predicted PM2.5 is shown in figure 4. The figure shows that the results predicted by the SOR-SVR model agree well with the actual data: the correlation coefficient is R² = 0.87 and the average error is 12.66 μg/m³, which indicates that the SOR-SVR model has good generalization ability for predicting PM2.5. In addition, to investigate the role that AOD plays in the prediction of PM2.5 with the SVR model, we also used the meteorological parameters alone (atmospheric pressure, relative humidity, air temperature, and wind speed) to predict PM10 and PM2.5 in the same way as above. Compared with the case that includes the AOD variable, the predicted results are not satisfactory: the correlation coefficient R² is about 0.7 and the average error is significantly larger. Because of space limitations, the detailed results are not shown here.
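The two scores reported above can be computed as in the following sketch. The source does not define them precisely, so two assumptions are made here: R² is taken as the squared Pearson correlation between actual and predicted values, and the average error is taken as the mean absolute error. The function name is hypothetical.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return (R^2, mean absolute error) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r = np.corrcoef(y_true, y_pred)[0, 1]    # Pearson correlation
    mae = np.mean(np.abs(y_pred - y_true))   # same units as the inputs
    return r ** 2, mae
```

Note that a squared correlation is insensitive to a constant bias in the predictions, which is one reason to report it together with an error in concentration units.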
In our next study, the SOR-SVR model can be applied to AOD retrieved by satellite remote sensing, combined with the corresponding meteorological data, to predict the regional distribution of particulate matter.

Discussion and Conclusions
This paper studies the potential of applying support vector machines (SVM) to PM10 and PM2.5 prediction, based on the daily average aerosol optical depth (AOD) and meteorological parameters such as atmospheric pressure, relative humidity, air temperature, and wind speed. The prediction model is developed using successive over-relaxation support vector regression (SOR-SVR). The Gaussian kernel function, together with k-fold cross-validation and grid search, is used to obtain the optimal parameters (ε, C, and the kernel parameter) for better generalization capability. For the final model, 250 days of daily average data are selected as the training dataset and 50 days of data as the testing dataset. The predicted results agree well with the actual data and show that the SOR-SVR model has good potential and generalization ability for predicting PM in the atmosphere. In addition, AOD plays an important role in predicting particulate matter with the SVR model and should be included in the model. If only the meteorological parameters are considered and AOD is excluded, the prediction results for particulate matter are not satisfactory. With the SOR-SVR model, AOD retrieved by satellite remote sensing can further be used, together with the corresponding meteorological data, to predict the regional distribution of PM10 and PM2.5 in large cities such as Beijing, China.