Advances in the Quality Control Methods of Air Temperature Data at Surface Automatic Weather Stations

Meteorological data contain errors and deviations for many reasons. As the scientific requirements on data quality grow, quality control techniques for meteorological data have been continuously explored and developed. Air temperature is one of the routine and important elements of surface meteorological observation, with one of the longest observation records. This paper therefore reviews the quality control methods for air temperature data in China and abroad and analyses their advantages, disadvantages, and applicability.


Introduction
Surface meteorological observation data are the foundation of meteorological and climate research, underpinning climate change prediction, weather dynamics analysis, numerical modelling, data assimilation, agricultural decision-making, and disaster prevention [1][2][3]. The quality of these data is therefore very important: surface observations must be representative, accurate, and comparable [4]. In operational practice, however, errors and deviations arise from the observation instruments, observation techniques, station siting, and observation methods [5,6]. These include systematic errors, gross errors, random errors, and micrometeorological errors [7][8][9].
A systematic error is a significant deviation of the data average from reality; its magnitude or sign is constant or regular and is related to the instrument and the site. A gross error is a significant deviation from the true value without any synoptic meaning. Random errors are an inherent attribute of measurement that cannot be eliminated by a single observation; their magnitude and sign are irregular, but they follow a normal distribution with zero mean. A micrometeorological error is produced by the disturbance of a small-scale weather system and appears as an anomalous value compared with other stations at the same time [10,11]. With the growing scientific requirements on data quality, quality control (QC) techniques for meteorological data have been explored and developed.

Air temperature
Air temperature is a physical variable that reflects how hot or cold the air is [18]. It is indispensable not only in theoretical research but also in applications such as national defense and economic development. Air temperature is one of the routine and important elements of surface meteorological observation, with one of the longest observation records. It shows a typical diurnal variation and basically conforms to the normal distribution. This paper therefore focuses on air temperature and summarizes advances in air temperature QC methods. QC for surface meteorological data falls into two types: QC across multiple stations through networking, and QC at a single station [19]; both apply equally to air temperature.

Research on air temperature QC at a single station
QC at a single station is essential, especially where the surrounding stations are too sparse or too new to provide effective reference data. The traditional single-station QC combines threshold inspection, range inspection, extreme value inspection, time consistency inspection, and internal consistency inspection into one comprehensive scheme [20,21]. In recent years, new single-station QC methods have been studied in China.
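The traditional checks above can be sketched in a few lines. The thresholds below (a plausible physical range and a maximum step between consecutive readings) are illustrative only, not operational values:

```python
import math

def qc_single_station(temps, lower=-80.0, upper=60.0, max_step=5.0):
    """Flag a series of consecutive air-temperature readings (deg C).

    Combines two classic single-station checks:
      * range check -- a value outside [lower, upper] is physically implausible;
      * step check  -- a jump larger than `max_step` between consecutive
                       readings violates time consistency.
    Thresholds are illustrative, not operational values.
    Returns one flag per reading: 'ok', 'range', 'step', or 'missing'.
    """
    flags = []
    for i, t in enumerate(temps):
        if t is None or math.isnan(t):
            flags.append('missing')
        elif not (lower <= t <= upper):
            flags.append('range')
        # only compare against the previous value when it passed QC itself
        elif i > 0 and flags[-1] == 'ok' and abs(t - temps[i - 1]) > max_step:
            flags.append('step')
        else:
            flags.append('ok')
    return flags
```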

Referring to metadata
Ren et al. [10] pointed out that using metadata is a very helpful auxiliary approach. Chinese meteorological stations keep stable long-term metadata records, which are valuable observation background data: besides the basic station information, site information, and observation equipment information, they record the factors influencing data quality and the processing methods applied, such as the weather and climate background, disastrous weather, and observers' notes. The type I error of mathematical statistics, namely "treating the truth as false" [22], can be reduced by analysing singular values together with the metadata, so that records of real extreme weather events are retained.

Refining indices by extending traditional quality control methods
The historical extreme value inspection computes the extreme values and the standard deviation of temperature from the station's own historical observations; the acceptable temperature range of the station is then determined by N times the standard deviation [14,15,23]. For the time consistency of air temperature in China, the inspection is carried out separately in four areas [24]. The longest time the air temperature is allowed to remain constant ranges from 8 to 11 across the four areas; data are judged wrong if this duration exceeds the threshold.
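A minimal sketch of the two refined indices, with N and the persistence threshold treated as illustrative, tunable parameters:

```python
import statistics

def climatological_limits(history, n_sigma=4.0):
    """Station-specific limits from historical data: mean +/- N * std dev.

    `n_sigma` is illustrative; operational values are tuned per station.
    """
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return mu - n_sigma * sigma, mu + n_sigma * sigma

def longest_constant_run(series):
    """Length (in time steps) of the longest run of identical values.

    If this exceeds the regional threshold (8 to 11 in the scheme
    described above), the run is flagged as suspect.
    """
    best = run = 1
    for prev, cur in zip(series, series[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best
```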

Method based on the blackboard model
The blackboard model was first proposed by Allen Newell and is well known for its application in the HEARSAY-II speech recognition system [25]. It is an intelligent processing technique in which heterogeneous knowledge sources share a central database, so that meteorological data, meteorological knowledge, and expert experience can be diagnosed comprehensively [26] and the validity of the observed data can be judged.

Wavelet threshold method
Wavelet analysis plays an important role in signal recognition and noise processing; noise can be eliminated by threshold processing. Thanks to its nonuniform time-frequency resolution, it extracts transient and steady information from non-stationary signals by applying a narrow window to rapidly changing (high-frequency) bands and a wide window to slowly varying (low-frequency) bands. Applied to meteorological data QC, the processing results directly reflect the variation of the meteorological variables: the low-frequency part shows the intensity and trend of the variation, and the high-frequency part reveals its complexity. The method can thus test whether the transient change of the air temperature at a given time exceeds the limit [27].
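The principle can be illustrated with a one-level Haar transform and soft thresholding; an operational system would use a deeper multi-level decomposition (e.g. via PyWavelets), so this is only a minimal numpy sketch:

```python
import numpy as np

def haar_denoise(signal, threshold):
    """One-level Haar wavelet soft-threshold denoising (minimal sketch).

    The low-frequency (approximation) coefficients keep the trend, while
    small high-frequency (detail) coefficients, treated as noise, are
    shrunk toward zero. `signal` must have even length.
    """
    x = np.asarray(signal, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-frequency part
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-frequency part
    # soft thresholding of the detail coefficients
    detail = np.sign(detail) * np.maximum(np.abs(detail) - threshold, 0.0)
    # inverse Haar transform
    out = np.empty_like(x)
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out
```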

Gene expression programming QC method
Gene expression programming (GEP) is an adaptive evolutionary algorithm proposed by the Portuguese scientist Ferreira [28]. It is efficient and accurate in application, using simple codes to solve complex problems [29]. Sun [30,31] pointed out that temperature and relative humidity are highly correlated at the bottom of the troposphere. The core of GEP QC [32] is to build a model of the relationship between air temperature and relative humidity and use it to simulate the air temperature at the next time step. If the difference between the simulated and observed values exceeds a certain level and is not caused by weather, the data are judged wrong. The GEP QC method outperforms traditional methods in error detection rate.
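The check logic can be sketched as follows. A plain linear regression stands in for the evolved GEP expression (GEP itself would search for the functional form), and the tolerance is an assumed, tunable limit:

```python
import numpy as np

def fit_temp_model(humidity, temperature):
    """Fit a simple linear model T ~ a * RH + b.

    A stand-in for the evolved GEP expression: linearity is assumed here
    purely to illustrate the check, not part of the cited method.
    """
    a, b = np.polyfit(humidity, temperature, 1)
    return lambda rh: a * rh + b

def gep_style_check(model, rh_obs, t_obs, tolerance):
    """Flag observations whose departure from the simulated temperature
    exceeds `tolerance` (an assumed limit)."""
    return [abs(model(rh) - t) > tolerance for rh, t in zip(rh_obs, t_obs)]
```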

Integrated learning algorithm based on particle swarm optimization of phase space reconstruction and extreme learning machine
Ding [33] and Duan et al. [34] revealed that the atmosphere is essentially a chaotic system, with chaos and short-term predictability. Zhang et al. [35] confirmed this characteristic of air temperature through maximum Lyapunov exponent analysis of the time series [36]. Based on the chaotic nature of the atmosphere, as well as the continuity and stability of air temperature over short periods, an integrated learning algorithm combining particle swarm optimization, phase space reconstruction, and the extreme learning machine (PSO-PSR-ELM) was proposed for air temperature QC. The algorithm describes the dynamic characteristics of chaotic time series and predicts the air temperature series better than earlier approaches. The historical air temperature series of the single station is reconstructed into an appropriate high-dimensional vector space (the sample space), which is used to simulate the value at the next time step. If the difference between the observed and simulated values is larger than the average level and is not caused by meteorological factors, the data are judged wrong. Zhang et al. [35] showed that the method recognizes errors efficiently and has high stability, applicability, and flexibility.
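The phase space reconstruction step can be sketched with a simple delay embedding; a nearest-neighbour analogue predictor stands in for the trained ELM, and the embedding dimension and delay are assumed parameters:

```python
import numpy as np

def phase_space_reconstruct(series, dim, delay):
    """Delay embedding of a scalar series into `dim`-dimensional vectors.

    Each row [x(t), x(t+delay), ..., x(t+(dim-1)*delay)] is one point of
    the reconstructed attractor; in the PSO-PSR-ELM scheme these vectors
    form the sample space fed to the learner.
    """
    x = np.asarray(series, dtype=float)
    n = len(x) - (dim - 1) * delay
    return np.stack([x[i : i + n] for i in range(0, dim * delay, delay)], axis=1)

def predict_next(series, dim, delay):
    """Predict the next value by the analogue (nearest-neighbour) method:
    find the historical state closest to the current one and return the
    value that followed it. A toy stand-in for the trained ELM."""
    emb = phase_space_reconstruct(series, dim, delay)
    current, history = emb[-1], emb[:-1]
    nearest = int(np.argmin(np.linalg.norm(history - current, axis=1)))
    return series[nearest + (dim - 1) * delay + 1]
```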

QC methods of the air temperature at multiple stations
Data from a single station alone are not sufficient for QC; adjacent stations provide additional information for checking the inspected data. Spatial inspection methods such as the Madsen-Allerup method [41], the inverse distance weighting (IDW) method [42,43], the spatial probability method [44], the interpolation method [45], and the spatial regression method [46] are widely used. In recent years, new spatial inspection methods have been studied in China.
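The IDW method, the simplest of these, can be sketched as follows; the power parameter and tolerance are common but assumed choices:

```python
import math

def idw_estimate(target_xy, neighbors, power=2.0):
    """Inverse-distance-weighted estimate at `target_xy`.

    `neighbors` is a list of ((x, y), value) pairs from surrounding
    stations; closer stations get larger weights. `power` = 2 is a
    common choice.
    """
    num = den = 0.0
    for (x, y), value in neighbors:
        d = math.hypot(x - target_xy[0], y - target_xy[1])
        if d == 0.0:
            return value           # coincident station: use its value directly
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

def spatial_check(observed, estimated, tolerance):
    """Flag the observation when it departs from the spatial estimate by
    more than `tolerance` (an assumed limit)."""
    return abs(observed - estimated) > tolerance
```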

Second iterative spatial consistency inspection
Tao et al. [47] proposed a second iterative spatial consistency inspection for removing gross errors from automatic station data. Gross errors are wrong data without meteorological meaning, distributed uniformly within the range of the observed variable [48,49]. The second iteration eliminates the uncertainty about gross errors: data that fail the first iteration are re-inspected in a second iteration, while differences among the data that passed the first iteration are tolerated, so that wrong observations cannot contaminate the checking of surrounding data. The method thus resolves the uncertainty of wrong data.

Spatial difference inspection
Regarding the stability of differences, Yin [50] found that daily series of meteorological variables that are spatially uniform and highly autocorrelated are not only stable but also normally distributed. Under the normal distribution model, the probability of a given difference between the inspected station and the reference station in the daily series is estimated; if this probability is smaller than the significance level α, the difference is a small-probability event and the datum is judged wrong. By the difference stability principle, the higher the correlation of a variable between two adjacent stations, the more stable the difference of that variable between them; the reference station is therefore chosen as the one with the maximum correlation coefficient within a certain range. The method inspects a single variable better than the spatial consistency method [50], and it is not limited by terrain.
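A minimal sketch of the difference inspection, assuming for simplicity that the difference statistics are estimated from the same daily series being checked (an operational scheme would use an independent historical sample):

```python
import math
import statistics

def difference_check(inspected, reference, alpha=0.01):
    """Spatial difference inspection for paired daily series.

    The difference series between the inspected station and its
    best-correlated neighbour is assumed normally distributed; a
    difference whose two-sided tail probability falls below `alpha` is a
    small-probability event, i.e. suspect. Returns one flag per pair.
    """
    diffs = [a - b for a, b in zip(inspected, reference)]
    mu = statistics.mean(diffs)
    sigma = statistics.stdev(diffs)
    flags = []
    for d in diffs:
        z = abs(d - mu) / sigma
        # two-sided tail probability of the standard normal, via erf
        p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
        flags.append(p < alpha)
    return flags
```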

Comprehensive interpolation method based on the linear regression
For interpolating observation series with many missing values, Wang et al. [51] determined the optimal function of adjacent reference stations and the sample time window by a sliding optimization method, and then established a linear regression model of daily air temperature between the station with missing records and the reference stations. The model parameters are obtained by minimizing the absolute error (the least absolute deviations, LAD, criterion) rather than by the least squares method, which minimizes the root mean square error; this improves the parameter stability [52]. The interpolations of the standardized series obtained by the LAD and De Gaetano methods are then averaged, giving a comprehensive interpolation of single-day missing air temperature values whose results are excellent. The inhomogeneity of the interpolated series significantly influences the interpolation error. Since the LAD solution has the property of a median [53], the method is well suited to interpolation around extreme values.
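The LAD fit can be sketched via iteratively reweighted least squares, one common way to minimise the L1 objective; the cited work's exact solver may differ:

```python
import numpy as np

def lad_line_fit(x, y, iters=50, eps=1e-6):
    """Least-absolute-deviations fit of y ~ a*x + b via iteratively
    reweighted least squares.

    Unlike ordinary least squares, the L1 objective behaves like a
    median, so a single gross outlier barely moves the fitted line.
    """
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    A = np.column_stack([x, np.ones_like(x)])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]   # ordinary LS start
    for _ in range(iters):
        r = y - A @ coef
        w = 1.0 / np.maximum(np.abs(r), eps)      # downweight large residuals
        W = A * w[:, None]
        coef = np.linalg.solve(A.T @ W, A.T @ (w * y))
    return coef   # (slope, intercept)
```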

Temperature QC method based on empirical orthogonal function
Zou et al. [54] and Qin et al. [55] proposed a QC method based on the empirical orthogonal function (EOF), following the assumption in variational data assimilation that the observation error and the background field error conform to the Gaussian distribution. The traditional QC method is based on the difference between the observation error and the background field error, while the EOF QC method is based on the difference between the observation error obtained by EOF analysis and the corrected background field error. Wang et al. [56,57] used the double-weighted standard deviation of the observation increment to assess whether the method is influenced by season, model resolution, and prediction skill, using winter and summer temperature data together with T639 and NCEP background fields. The results show that the difference between the observation error and the background field error closely follows the normal distribution in both winter and summer, with better results in summer. In both seasons, the larger the difference between the background and observation fields, the higher the probability that the observation is eliminated. The elimination probability with the NCEP background field is higher than with the T639 field in both seasons, and more markedly so in winter.
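The EOF decomposition underlying the method can be sketched with an SVD of the anomaly matrix; the number of retained modes is an assumed parameter:

```python
import numpy as np

def eof_analysis(field, n_modes):
    """EOF decomposition of a (time x station) anomaly matrix via SVD.

    The leading modes capture coherent large-scale variability; the
    residual after removing them isolates local noise, which is what an
    EOF-based QC examines against a dispersion threshold.
    """
    anomalies = field - field.mean(axis=0)
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    reconstructed = u[:, :n_modes] * s[:n_modes] @ vt[:n_modes]
    residual = anomalies - reconstructed
    return reconstructed, residual
```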

Improved Kriging method
Based on the spatial correlation and continuity of air temperature, Ye et al. [58] proposed an improved Kriging method based on the Gaussian model for data QC. Kriging is a statistical algorithm for the unbiased, optimal estimation of regionalized variables based on their correlation and variability. In the improved method, the IDW method is cross-validated with the Kriging method, and the weight is determined by the distance between each adjacent station and the target station: the closer the distance, the larger the weight. The method therefore accounts not only for the distances to the adjacent stations but also for the spatial distribution and correlation of the variables. Compared with either single method, its QC has greater advantages when stations are sparse, being more stable and adaptable; the differences in error detection rate among cities are small, as shown in Fig. 10.
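Ordinary kriging with a Gaussian variogram can be sketched as follows; the variogram parameters are placeholders that would in practice be fitted to the station data:

```python
import numpy as np

def gaussian_variogram(h, sill=1.0, corr_range=2.0, nugget=0.0):
    """Gaussian variogram model: variance grows smoothly with distance."""
    return nugget + sill * (1.0 - np.exp(-(h / corr_range) ** 2))

def ordinary_kriging(coords, values, target, sill=1.0, corr_range=2.0):
    """Ordinary kriging estimate at `target` from neighbouring stations.

    Solves the kriging system (variogram matrix plus the unbiasedness
    constraint) for weights that honour both distance and the spatial
    correlation structure.
    """
    coords = np.asarray(coords, float)
    n = len(coords)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    # bordered system: [[Gamma, 1], [1^T, 0]] [w; mu] = [gamma0; 1]
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gaussian_variogram(d, sill, corr_range)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = gaussian_variogram(np.linalg.norm(coords - np.asarray(target, float),
                                              axis=1), sill, corr_range)
    w = np.linalg.solve(A, b)[:n]
    return float(w @ np.asarray(values, float))
```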

Particle swarm optimization of multi-quadric equations fitting method
Zhang et al. [59] proposed the particle swarm optimization of multi-quadric equations fitting (PSO-MEF) method based on the spatial stability and continuity of air temperature. Applied to the QC of surface air temperature data, it becomes a spatial networking QC method that combines information from different adjacent stations.
The method uses PSO-MEF to fit a smooth surface model from the locations and temperature observations of the adjacent stations at a given time. Feeding the location of the inspected station into the model yields the estimated temperature at that station and time. If the difference between the observed and estimated temperatures at the inspected station exceeds the average level at that time, the value is flagged as uncertain. Being a purely mathematical approximation, the method adapts well to different climates and regions and is very effective at detecting uncertain data.
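The multiquadric surface fitting step can be sketched as below; the smoothing parameter delta is held fixed here, whereas the PSO step of the actual method would tune it:

```python
import numpy as np

def multiquadric_fit(coords, values, delta=1.0):
    """Fit multiquadric radial-basis coefficients to station observations.

    The surface is f(p) = sum_i c_i * sqrt(|p - p_i|^2 + delta^2); the
    smoothing parameter `delta` is the quantity a PSO step would tune,
    here simply fixed.
    """
    coords = np.asarray(coords, float)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    phi = np.sqrt(d ** 2 + delta ** 2)
    return np.linalg.solve(phi, np.asarray(values, float))

def multiquadric_eval(coords, coeffs, target, delta=1.0):
    """Evaluate the fitted surface at `target` (e.g. the inspected station)."""
    d = np.linalg.norm(np.asarray(coords, float) - np.asarray(target, float),
                       axis=1)
    return float(np.sqrt(d ** 2 + delta ** 2) @ coeffs)
```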

Air temperature QC method based on the fuzzy c-means clustering
Liu et al. [60] studied the QC and preliminary correction of surface air temperature data in Anhui Province using the fuzzy c-means (FCM) clustering algorithm. First, FCM clusters the air temperature data, and an outlier velocity (temporal) and an outlier rate (spatial) are defined to identify outlying values. The field of experts (FoE) model [61] is then used to correct them. Because the FoE refers to air temperature information before and after the current time at both the station and its neighbours, it can not only correct outlying values but also effectively interpolate continuously missing data. And because each FCM pass groups the station's temperatures by similarity, correct values that are very small or very large are retained, which reduces the misjudgment of extreme weather.
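A minimal fuzzy c-means sketch follows; the outlier indices and the FoE correction step of the cited work are beyond this illustration:

```python
import numpy as np

def fuzzy_c_means(data, n_clusters, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means: returns (centers, membership matrix).

    Each sample belongs to every cluster with a degree in [0, 1]; samples
    whose maximum membership is low sit far from all clusters and are the
    outlier candidates a QC step would examine.
    """
    x = np.asarray(data, float)
    rng = np.random.default_rng(seed)
    u = rng.random((len(x), n_clusters))
    u /= u.sum(axis=1, keepdims=True)        # memberships sum to 1 per sample
    for _ in range(iters):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(x[:, None] - centers[None, :], axis=2)
        d = np.maximum(d, 1e-12)             # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u
```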

Conclusion
In recent years, great progress has been made in air temperature QC methods at both single and multiple stations, and this paper has summarized them. At present, methods from data assimilation, statistics, and particle swarm optimization are being combined with traditional QC methods, which improves the stability of QC. Research on air temperature QC should continue to build on the characteristics of air temperature itself, and it is worth extending these methods to the QC of variables with similar characteristics, such as air pressure and humidity.