Research on air quality prediction method in Hangzhou based on machine learning

Air pollution is the subject of many current environmental studies, and air quality directly affects human health and quality of life. In this paper, a Bayesian network model is used to predict air quality in Hangzhou. Six air pollutants (SO2, NO2, O3, CO, PM2.5 and PM10) serve as the evaluation factors of the model, and the AQI value is its output. The model is then used to predict air quality, and the predictions are compared with the actual values. The results show that the prediction accuracy exceeds 80% and that the predicted value is close to the actual value in most cases, which indicates that the Bayesian network model has practical value as a means of air quality prediction.


Introduction
In recent years, the rapid development of China's economy has made environmental problems prominent, especially air pollution. Air quality is directly related to the quality of human life, health and safety [1]. According to 2019 statistics, 7 of the 10 most air-polluted cities in the world are in China, which means that China has a long way to go in air pollution control. Therefore, studying the causes of air pollution through big data and predicting the future status and trend of air quality can give environmental monitoring departments a scientific basis for controlling, managing and effectively preventing air pollution [2].
Many scholars have done research on air quality prediction methods. Wu used the GM(1,1) model with fractional order accumulation (FGM(1,1)) to predict the average annual concentrations of SO2, NO2, O3, PM2.5 and PM10 in the Beijing-Tianjin-Hebei region from 2017 to 2020 [3]. Nevin used the Fuzzy C-Auto Regressive Model (FCARM) as a prediction model to reflect the regional behavior of weekly PM10 concentrations in Turkey [4]. Zhu adopted two hybrid models (EMD-SVR-Hybrid and EMD-IMFs-Hybrid) to forecast air quality index (AQI) data, and the AQI forecasting results for Xingtai showed that the two proposed hybrid models are superior to ARIMA, SVR, GRNN, EMD-GRNN, Wavelet-GRNN and Wavelet-SVR [5]. Yang proposed a new air quality monitoring and early warning system, including an assessment module and a forecasting module [6]. In the air quality assessment module, fuzzy comprehensive evaluation is used to determine the main pollutants and evaluate the degree of air pollution more scientifically.
The methods studied by the above scholars are mostly traditional mathematical and physical model analysis methods. As more and more air quality monitoring sites are set up, the temporal and spatial resolution of data acquisition becomes ever finer. For processing and using such massive data, prediction models built with intelligent algorithms such as machine learning have great research prospects [7][8][9].
Wu proposed a novel optimal-hybrid model, which fuses the advantages of secondary decomposition (SD), an AI method and an optimization algorithm for AQI forecasting. The results indicated that the proposed optimal-hybrid model comprehensively captures the characteristics of the original AQI series and has a high correct rate of forecasting AQI classes [10]. Sagar chose a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) model for air quality forecasting and obtained good results [11]. Seng proposed a comprehensive multi-output, multi-index supervised learning prediction model based on LSTM, which was trained to obtain the predicted values of air quality pollution indicators [12]. Zhang improved the supervised LSTM model by introducing unsupervised feature learning for air quality predictions [13].
Machine learning achieves classification and prediction through feature extraction and data fitting. In this paper, a machine learning method is used to predict the air quality of Hangzhou. The daily average monitoring data of air pollutants (SO2, NO2, O3, CO, PM2.5 and PM10) in Hangzhou from March 2018 to April 2021 are used as the training sample database to build a Bayesian network model that predicts the AQI of Hangzhou.

AQI and Bayesian network model
AQI describes the degree of air cleanliness or pollution and its impact on health [14]. The AQI currently used in China is divided into five levels, as shown in table 1 [15]. The AQI is the maximum of the individual air quality indices (IAQI) of the pollutants, as shown in formula (1):

$$\mathrm{AQI} = \max\{\mathrm{IAQI}_1, \mathrm{IAQI}_2, \ldots, \mathrm{IAQI}_n\} \quad (1)$$

where $n$ is the number of pollutant items. The IAQI is calculated by formula (2):

$$\mathrm{IAQI}_m = \frac{\mathrm{IAQI}_{Hi} - \mathrm{IAQI}_{Lo}}{BP_{Hi} - BP_{Lo}} \left(C_m - BP_{Lo}\right) + \mathrm{IAQI}_{Lo} \quad (2)$$

where $\mathrm{IAQI}_m$ is the individual air quality index of pollutant $m$, $C_m$ is the mass concentration value of pollutant $m$, $BP_{Hi}$ and $BP_{Lo}$ are respectively the high and low concentration limits closest to $C_m$ obtained by looking up the table, and $\mathrm{IAQI}_{Hi}$ and $\mathrm{IAQI}_{Lo}$ are the individual air quality indices corresponding to $BP_{Hi}$ and $BP_{Lo}$ in the same table.
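Formulas (1) and (2) can be sketched in Python as follows. This is a minimal illustration rather than the code used in this paper: the function names are hypothetical, and the breakpoint values for PM2.5 and PM10 are assumed to follow the 24-hour limits of the Chinese AQI technical regulation (HJ 633-2012), so they should be checked against the standard before use. The official procedure also rounds the result to an integer, which is omitted here.

```python
from bisect import bisect_left

# IAQI grid shared by all pollutants (assumed, per HJ 633-2012).
IAQI_LEVELS = [0, 50, 100, 150, 200, 300, 400, 500]

# 24-hour average concentration breakpoints in ug/m3.
# ASSUMED values for illustration; verify against the standard.
BREAKPOINTS = {
    "PM2.5": [0, 35, 75, 115, 150, 250, 350, 500],
    "PM10": [0, 50, 150, 250, 350, 420, 500, 600],
}


def iaqi(pollutant, concentration):
    """Individual AQI by piecewise-linear interpolation, formula (2)."""
    bps = BREAKPOINTS[pollutant]
    if not 0 <= concentration <= bps[-1]:
        raise ValueError("concentration outside the breakpoint table")
    # Index of BP_Hi: the first breakpoint that is >= the concentration.
    hi = max(bisect_left(bps, concentration), 1)
    lo = hi - 1
    slope = (IAQI_LEVELS[hi] - IAQI_LEVELS[lo]) / (bps[hi] - bps[lo])
    return slope * (concentration - bps[lo]) + IAQI_LEVELS[lo]


def aqi(concentrations):
    """AQI is the largest individual index, formula (1)."""
    return max(iaqi(name, value) for name, value in concentrations.items())
```

For example, under these assumed breakpoints a PM2.5 concentration of 55 lies between 35 and 75, so its IAQI is interpolated between 50 and 100.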

Bayesian network model
A Bayesian network, also known as a belief network or decision network, is a directed acyclic graph (DAG) that represents the interdependence between nodes [16]. It is compact and intuitive. The core of Bayesian network reasoning includes prediction and diagnosis. Random variables are represented by nodes, each of which holds a conditional probability table containing the probability information between nodes. A Bayesian network combines conditional probability with network topology, and combines prior and conditional probabilities to obtain a posterior probability for prediction, which is its advantage over other algorithms. The following Bayesian formula (3) is used for the calculation of Bayesian network nodes.
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \quad (3)$$

In the formula, the probability of event $A$ is $P(A)$ (and likewise $P(B)$ for event $B$), the probability of event $A$ under the condition that event $B$ has occurred is $P(A \mid B)$, and the probability of event $B$ under the condition that event $A$ has occurred is $P(B \mid A)$. The composition and construction of a Bayesian network have three steps: (1) determining the variable nodes and variable domains; (2) Bayesian network learning, including structure learning and parameter learning, which determines the network topology and the conditional probability tables; (3) Bayesian network reasoning.
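As a small worked illustration of Bayes' formula (3), the sketch below computes a posterior probability from a prior and two conditional probabilities, obtaining the denominator from the law of total probability. The function name and the numbers in the usage note are hypothetical and are not taken from the data of this paper.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' rule."""
    # Evidence P(B) by the law of total probability.
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    if evidence == 0:
        raise ValueError("P(B) is zero, so the posterior is undefined")
    return p_b_given_a * prior_a / evidence
```

With an assumed prior of 0.1 for a polluted day, and assumed probabilities 0.8 and 0.2 of observing a high pollutant level on polluted and clean days respectively, `posterior(0.1, 0.8, 0.2)` equals 0.08 / 0.26, or roughly 0.31: the observation raises the probability of pollution about threefold.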
In this paper, Python is used to process data and build Bayesian models.

Air quality dataset
The acquired air quality data may be incomplete, missing or inconsistent. Because these problems affect data modelling and training, the original data set must be verified and cleaned. Non-standard and missing data can be processed with programs in Python. The AQI curve of Hangzhou over the last three years is shown in figure 1, and the change of the AQI and SO2 index in Hangzhou from May 1, 2020 to May 1, 2021 is shown in figure 2. As can be seen from figure 1, the AQI of Hangzhou changes periodically with the seasons, and the overall air quality index shows gradual improvement. Figure 2 clearly shows that the SO2 index follows the AQI curve and that the maximum peak occurs from December to January. The reason is that in winter residential power consumption increases, pollutant emissions increase and the air pressure decreases, so air quality is at its lowest of the year.
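The cleaning step described above could look like the following sketch, which uses only the Python standard library. The column names, the rules for discarding a row and the sample records are all assumptions made for illustration; the paper does not specify its cleaning rules, and the sample values are invented rather than taken from the Hangzhou data set.

```python
import csv
import io
import math

POLLUTANTS = ["SO2", "NO2", "O3", "CO", "PM2.5", "PM10"]


def clean_rows(csv_text):
    """Keep one row per date whose AQI and pollutant fields are valid numbers."""
    cleaned = []
    seen_dates = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        date = (row.get("date") or "").strip()
        if not date or date in seen_dates:
            continue  # undated or duplicated day
        try:
            values = {key: float(row[key]) for key in ["AQI"] + POLLUTANTS}
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric field
        if any(not math.isfinite(v) or v < 0 for v in values.values()):
            continue  # negative or non-finite readings treated as errors
        seen_dates.add(date)
        cleaned.append({"date": date, **values})
    return cleaned


# Invented sample records, for illustration only.
SAMPLE = """date,AQI,SO2,NO2,O3,CO,PM2.5,PM10
2020-05-01,62,6,35,110,0.7,30,58
2020-05-02,,5,30,95,0.6,25,50
2020-05-03,48,5,28,-1,0.6,22,45
2020-05-01,62,6,35,110,0.7,30,58
2020-05-04,55,n/a,31,100,0.6,27,52
2020-05-05,40,4,22,80,0.5,18,37
"""

rows = clean_rows(SAMPLE)
```

Here the rows with an empty AQI, a negative O3 reading, a repeated date and a non-numeric SO2 field are dropped. Whether such rows should be dropped or imputed is a design choice that depends on how much data is missing.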

Selection of prediction factors and calculation of mutual information value
The six air pollutants and the AQI are collected so that the six pollutants can serve as predictors, but the correlation between the six pollutants and the AQI first needs to be verified. The method selected in this paper is to calculate the mutual information values between the data of each of the six pollutants and the AQI [12]. If a mutual information value is greater than the set threshold (0.01 bits), it is appropriate to select that pollutant index as a prediction factor of air quality in Hangzhou.
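The mutual information value is calculated as

$$I(X;Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log_2 \frac{p(x_i, y_j)}{p(x_i)\,p(y_j)}$$

where $X$ and $Y$ are random variables, $n$ and $m$ are respectively the numbers of values of $X$ and $Y$, $x_i$ and $y_j$ are respectively the $i$-th and $j$-th attribute values of $X$ and $Y$, $p(x_i, y_j)$ is the probability that the states of $X$ and $Y$ are $x_i$ and $y_j$ respectively, and $p(x_i)$ and $p(y_j)$ are respectively the probabilities that the state of $X$ is $x_i$ and that the state of $Y$ is $y_j$. The output value calculated by Python is shown in table 3.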
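A minimal sketch of the mutual information calculation for two discrete sequences is given below. It estimates the probabilities from empirical counts and uses a base-2 logarithm so that the result is in bits, matching the 0.01-bit threshold; the function name is hypothetical and the code is not the program used to produce table 3.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
        ratio = count * n / (count_x[x] * count_y[y])
        mi += (count / n) * log2(ratio)
    return mi
```

Two identical, evenly split binary sequences share exactly 1 bit, while two independent ones share 0 bits, so a pollutant whose value with the AQI exceeds the threshold carries at least some information about it.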

Data discretization
Like other machine learning and classification algorithms, the Bayesian network model requires the sample data to be discretized. According to the five-level classification standard of the air quality index (AQI) currently adopted in China, data with an AQI of 0-50 are marked as 1, data with an AQI of 51-100 are marked as 2, and data with an AQI of 101-150 are marked as 3. According to figure 1 and further data analysis, days with an AQI above 150 account for less than 5% of the days in Hangzhou over the last three years, and days with an AQI above 200 account for 0%. Therefore, data with an AQI above 150 are marked as 4. In other words, label 1 means that the air quality is excellent, label 2 that it is good, label 3 that it is slightly polluted, and label 4 that it is severely polluted or above.
The data obtained for SO2, NO2, O3, CO, PM2.5 and PM10 are the contents in each cubic meter of air. Because the properties and hazards of the pollutants differ, their absolute values cannot be compared directly, so the pollutants should be discretized in the same way as the AQI values. In this paper, the air pollutant data are discretized according to the "ambient air quality standard" (GB3095-2012 standard). Table 4 shows the attribute values of the corresponding standards for discretization. All air quality data can be discretized by this method, and some discretized data are shown in table 5.
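The four-level AQI labelling described above can be written as a short function. The thresholds are those stated in the text; the function name is hypothetical.

```python
def aqi_label(aqi):
    """Map an AQI value to the discrete labels 1-4 used in this paper."""
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    if aqi <= 50:
        return 1  # excellent
    if aqi <= 100:
        return 2  # good
    if aqi <= 150:
        return 3  # slightly polluted
    return 4      # all levels above 150 are merged into one label
```

Merging everything above 150 into a single label keeps the rarest classes from being split into even smaller groups, at the cost of not distinguishing between the higher pollution levels.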
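Discretizing a pollutant concentration against a list of sorted limits can be sketched with the standard `bisect` module, as below. The cut points shown are placeholders chosen for illustration and are not the attribute values of table 4; the names `CUT_POINTS`, `pollutant_level` and `discretize_day` are hypothetical.

```python
from bisect import bisect_left

# PLACEHOLDER upper bounds of levels 1-3 (ug/m3), NOT the values of table 4.
CUT_POINTS = {
    "PM2.5": [35, 75, 115],
    "PM10": [50, 150, 250],
}


def pollutant_level(pollutant, concentration):
    """Map a concentration to a discrete level 1-4 using sorted upper bounds."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    # bisect_left counts the cut points that lie strictly below the value.
    return bisect_left(CUT_POINTS[pollutant], concentration) + 1


def discretize_day(day):
    """Discretize every pollutant reading of one day (dict of name to value)."""
    return {name: pollutant_level(name, value) for name, value in day.items()}
```

A value equal to a cut point falls in the lower level, so each level includes its upper bound. Replacing the placeholder lists with the limits of the applicable standard is all that is needed to adapt the sketch.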

Experimental environment
The computer runs Windows 10 with a dual-core, four-thread 4415U CPU and 12 GB of memory. The Python platform is used to build the model and obtain the data.

Bayesian network construction
The Sklearn library, a third-party Python library, is used for machine learning. Sklearn supports four categories of machine learning algorithms (classification, regression, dimensionality reduction and clustering) and also includes three modules: feature extraction, data processing and model evaluation.
This study uses the historical data of SO2, NO2, O3, CO, PM2.5 and PM10 to predict the future AQI. Although the AQI is calculated directly from the values of SO2, NO2, O3, CO, PM2.5 and PM10, the impact of the current pollutant and AQI values on the future AQI is not known without a model. To establish the Bayesian network prediction model, the naive Bayesian algorithm of the Sklearn library is used to perform maximum a posteriori estimation between any two of the six pollutants and the AQI, calculate the prior and conditional probabilities of the variables, reason over the whole set of probability conditions, and simulate the relationships between the variables. The resulting Bayesian network diagram is shown in figure 3.
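To make the maximum a posteriori step concrete, the following is a from-scratch sketch of a naive Bayes classifier for discretized features, with Laplace smoothing. It is an illustration only and not the Sklearn code used in this paper (the text does not state which Sklearn estimator was chosen; `CategoricalNB` in `sklearn.naive_bayes` is a comparable one). The class and function names are hypothetical, and pairing the pollutant levels of day t with the AQI label of day t+1 is an assumption about how the next-day samples are built.

```python
from collections import Counter, defaultdict
from math import log


class CategoricalNaiveBayes:
    """Naive Bayes for discrete features with Laplace smoothing (illustrative)."""

    def __init__(self, n_levels=4, alpha=1.0):
        self.n_levels = n_levels  # number of discrete levels per feature
        self.alpha = alpha        # Laplace smoothing strength

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.class_counts_ = Counter(y)
        self.n_samples_ = len(y)
        # counts_[c][j][v]: samples of class c whose feature j has value v
        self.counts_ = defaultdict(lambda: defaultdict(Counter))
        for row, label in zip(X, y):
            for j, value in enumerate(row):
                self.counts_[label][j][value] += 1
        return self

    def _log_posterior(self, row, c):
        # Log prior plus the sum of log conditional probabilities.
        score = log(self.class_counts_[c] / self.n_samples_)
        denominator = self.class_counts_[c] + self.alpha * self.n_levels
        for j, value in enumerate(row):
            score += log((self.counts_[c][j][value] + self.alpha) / denominator)
        return score

    def predict(self, X):
        # Maximum a posteriori class for every row.
        return [max(self.classes_, key=lambda c: self._log_posterior(row, c))
                for row in X]


def next_day_pairs(features, labels):
    """Pair the features of day t with the AQI label of day t+1."""
    return features[:-1], labels[1:]
```

Working in log space avoids numerical underflow when many conditional probabilities are multiplied, and the smoothing term keeps a level that never occurred with a class during training from forcing the posterior to zero.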

Model validation and evaluation
Using the bidirectional reasoning ability of the Bayesian network, 80% of the data samples are used to train the Bayesian model, and the remaining 20% are used to test the resulting prediction model on predicting the air quality of the next day. Four levels of discrete data are used in this paper. A prediction is considered effective if the predicted air quality level is the same as the actual level. The ratio of effective predictions to all predictions gives a final prediction accuracy greater than 90%, as shown in figure 4. The air quality prediction results of Qiandaohu from September 1, 2020 to September 20, 2020 are shown in table 6. The table shows that the air quality of Qiandaohu is relatively stable and that the predicted results are completely consistent with the actual air quality levels. Some prediction results are shown in figure 5.

Figure 5. Comparison between predicted and actual air quality values

As can be seen from figure 5, the prediction accuracy of the Bayesian network prediction model in this paper is more than 80%, and the predicted value is close to the actual value in most cases. However, some predictions are still inaccurate, with the actual air pollution level higher than the predicted value. The reason is that the air quality of Hangzhou is good all year round, so samples with light pollution or above are few and the training data for those levels are insufficient. The established Bayesian network model therefore has some reference value, but its accuracy can be further improved.
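The evaluation protocol described above, an 80/20 split followed by counting predictions whose level matches the actual level, can be sketched as follows. The function names are hypothetical, and splitting in chronological order is an assumption: the paper does not state whether the samples were split by time or at random.

```python
def chronological_split(samples, train_fraction=0.8):
    """Split time-ordered samples into a training part and a later test part."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


def accuracy(predicted, actual):
    """Share of predictions whose level equals the actual level."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)
```

With imbalanced classes such as these, overall accuracy is dominated by the common levels 1 and 2, so reporting the accuracy for each level separately would show more directly how the model performs on the rare polluted days.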

Conclusion
In this paper, a Bayesian network model, an air quality prediction model based on machine learning, is used to predict the air quality in Hangzhou. Six air pollutants (SO2, NO2, O3, CO, PM10 and PM2.5) are used as the evaluation factors of the model, the AQI value is used as its output, and the mutual information values between them are calculated to establish the Bayesian network model. The model is trained and verified with historical air quality data to obtain its overall accuracy. Finally, the model is used to predict air quality, and the predictions are compared with the actual values. The results show that the prediction accuracy is more than 80% and that the predicted value is close to the actual value in most cases, which shows that the Bayesian network model has reference value as a means of air quality prediction. In addition, temperature, wind, precipitation and other meteorological conditions, as well as the seasons, are factors that directly affect air quality. Considering these factors and further improving the Bayesian network model to raise the accuracy of air quality prediction is therefore one of the future research directions.