Single and Multiple Imputation Method to Replace Missing Values in Air Pollution Datasets: A Review

Imputation plays an essential role in handling missing data. The conventional techniques for this problem are single imputation (SI) and multiple imputation (MI), each with its own strengths and limitations in replacing missing data. This article reviews the state of the art of imputation methods used in published studies to replace missing values in air pollution data. A comprehensive review of the literature shows that the use of SI and MI has increased slightly over the years. This paper summarizes the trends and approaches used in these imputation methods. It then identifies a gap: machine-learning approaches are seldom used to substitute missing values in air pollution data. Future research should extend machine-learning approaches that impute missing values with higher accuracy and performance.


Introduction
Air pollution consists of harmful gases and excessive trapped particles [1], which increase the risk of respiratory illness and other health problems. According to [1], six common air pollutants affect human health and the environment: particulate matter (PM), ozone (O3), carbon monoxide (CO), sulfur oxides (SO2), nitrogen oxides (NO2) and lead. These pollutants are a real threat to public health. To address this issue, researchers need to evaluate, model and predict air pollution trends. Unfortunately, the air pollution data obtained from continuous ambient air quality monitoring (CAAQM) stations usually contain missing values. Such missing observations are typically due to machine failure, routine maintenance and human error.
Missing data is a prevalent problem in many scientific fields, including environmental research. It can significantly distort research findings and hence lead to incorrect conclusions [2]. Ignoring the missingness can leave the analysis with insufficient sampling, errors in measurements or faults in data acquisition [2].
Imputation plays an essential role in handling missing data. There are two broad approaches: the single imputation (SI) method and the multiple imputation (MI) method. The researcher needs to identify the pattern of missing data carefully before applying an imputation method, so that the imputed data, whatever their purpose, produce unbiased results [7]. The varieties of missing data and their descriptions [8] are depicted in Figure 1. This paper provides a comprehensive review of recent studies in different applications of imputation, covering both SI and MI.
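The varieties of missing data referred to above are named in the Discussion as MCAR, MAR and MNAR. As an illustration only, the sketch below simulates the three mechanisms on a toy series. The function names, the threshold rule and the `rate` parameter are assumptions made for this example and are not taken from the reviewed studies.

```python
import random


def mcar_mask(n, rate, rng):
    """MCAR: every value has the same chance of being missing."""
    return [rng.random() < rate for _ in range(n)]


def mar_mask(covariate, threshold, rate, rng):
    """MAR: missingness depends on another, fully observed variable."""
    return [(c > threshold) and (rng.random() < rate) for c in covariate]


def mnar_mask(values, threshold, rate, rng):
    """MNAR: missingness depends on the unobserved value itself."""
    return [(v > threshold) and (rng.random() < rate) for v in values]


def apply_mask(values, mask):
    """Replace masked entries with None to mark them as missing."""
    return [None if m else v for v, m in zip(values, mask)]


# Hypothetical hourly PM10 readings and a fully observed humidity covariate.
rng = random.Random(42)
pm10 = [38.0, 52.0, 91.0, 47.0, 120.0, 66.0]
humidity = [60.0, 85.0, 70.0, 90.0, 55.0, 88.0]

pm10_mcar = apply_mask(pm10, mcar_mask(len(pm10), 0.3, rng))
pm10_mar = apply_mask(pm10, mar_mask(humidity, 80.0, 0.5, rng))
pm10_mnar = apply_mask(pm10, mnar_mask(pm10, 90.0, 0.5, rng))
```

Under MNAR the high readings themselves tend to disappear, so a simple fill such as the observed mean would be biased downward. This is one reason the mechanism should be identified before a method is chosen.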

Research Methodology
In this paper, only the Scopus database is used to find relevant articles, since Scopus covers more journals than other services [35,36]. Figure 2 illustrates the number of documents available in the Scopus database for the SI and MI techniques used to replace missing values in air pollution data. The number of documents on the single imputation method increased slightly in 2014 and 2019, with three papers in each of those years. However, more documents were recorded for the MI technique than for the SI technique. The database includes 12,599 documents for the search keyword 'imputation' alone.

Imputation Method
Imputation is an attractive approach because it produces a complete data set by replacing missing data with substituted values. Two types of imputation method are reviewed in detail in this paper: the single imputation (SI) method and the multiple imputation (MI) method. Imputation can be performed using software packages such as SPSS, SAS, R and NORM, among many others.

Single Imputation (SI)
In the single imputation method, the imputed value is treated as the actual value [3]. The method generates only one replacement value for each missing data point. According to [4], the resulting completed data set is then used for inference purposes. Table 1 lists the reviewed studies that apply SI.

Reference | Year | Application | SI Method
[30] | 2019 | Estimating an hourly NO2 concentration | Spatial interpolation; linear regression models
[31] | 2019 | Environmental contaminant on a health outcome | SI with a constant value; likelihood-based estimation
[16] | 2019 | Reconstruct incomplete air quality datasets | Mean imputation; conditional mean imputation; KNN imputation
[32] | 2015 | Functional data analysis | Means of curve estimation using a regression approach
[22] | 2015 | The compositional approach of left-censored data | EM algorithm
[24] | 2014 | Environmental epidemiological | Linear Regression Model (LM); Partial Least Squares Model (PLS)
[14] | 2014 | Air quality datasets | Listwise deletion; unconditional mean imputation; principal component analysis (PCA); EM algorithm
[25] | 2013 | Filling missing observations in short gap length | Mean imputation; regression imputation; stochastic regression imputation
[36] | 2008 | Imputing missing value to annual hourly PM10 concentrations | Linear interpolation; mean imputation
[33] | 2006 | Imputing missing value on multilevel structure | SDEM method
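Two of the single-imputation methods listed in Table 1, mean imputation and linear interpolation, can be sketched as follows. This is a minimal illustration that assumes an evenly spaced series with missing readings marked as `None`. The function names and the edge-handling rule (carrying the nearest observed value outward) are choices made for this example, not the implementation of any cited study.

```python
def mean_impute(series):
    """Replace every missing value with the mean of the observed values."""
    observed = [x for x in series if x is not None]
    if not observed:
        raise ValueError("series has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in series]


def linear_interpolate(series):
    """Fill each gap on a straight line between its observed neighbours."""
    known = [i for i, x in enumerate(series) if x is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    # Leading and trailing gaps have only one neighbour: carry it outward.
    for i in range(known[0]):
        filled[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):
        filled[i] = series[known[-1]]
    # Interior gaps: interpolate between the two surrounding observations.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical hourly PM10 readings with a two-hour gap.
pm10 = [40.0, None, None, 70.0, 55.0]
by_mean = mean_impute(pm10)
by_line = linear_interpolate(pm10)
```

Both functions return exactly one value per gap, which is the defining property of SI. Mean imputation ignores the time ordering, while linear interpolation uses it, which is why interpolation is usually reserved for short gaps.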

Multiple Imputation (MI)
Multiple imputation replaces each missing data point with several different plausible estimates. Because SI ignores the uncertainty of the imputed values and usually underestimates the variance, MI is preferred over SI, since it accounts for both within-imputation and between-imputation uncertainty. Table 2 lists the reviewed studies that apply MI.

Reference | Year | Application | MI Method
[18] | 2018 | Ozone | Multiple regression techniques; Artificial Neural Network Model
[19] | 2017 | Estimating daily PM2.5 concentrations | MI (the combination of MAIAC and CTM)
[20] | 2017 | Serum concentrations of dioxin-like compounds | MI survey analysis method
[21] | 2015 | Chemical speciation | Model-based MI
[22] | 2015 | Left-censored data under a compositional approach | ML
[23] | 2014 | Spatio-temporal data | MI; random forest
[24] | 2014 | Environmental epidemiological | MI (through LM and PLS); Bayesian approach
[14] | 2014 | Complex missing data in air quality datasets | MI
[25] | 2013 | Filling missing observations in short gap length |
[26] | 2013 | Environmental epidemiological | MI
[27] | 2007 | Air toxics | MI; optimal linear estimation
[33] | 2006 | Imputing missing value on multilevel structure | Model-based MI
[28] | 2004 | Environmental health | MI (regression)
[29] | 2001 | Incomplete multivariate time series |
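The within-imputation and between-imputation components mentioned for MI are commonly combined with Rubin's rules. The sketch below is a simplified illustration: it creates m completed copies of a series by drawing each missing value from a normal distribution fitted to the observed values, estimates the mean on each copy, and pools the results. The imputation model and all function names are assumptions for this example. In particular, drawing from a fixed fitted normal ignores parameter uncertainty, so this is not a fully proper MI procedure.

```python
import random
import statistics


def impute_draw(series, rng):
    """Fill each missing value with a random draw based on the observed data."""
    observed = [x for x in series if x is not None]
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)
    return [rng.gauss(mu, sd) if x is None else x for x in series]


def pool_rubin(estimates, variances):
    """Pool m estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                    # pooled estimate
    within = sum(variances) / m                                   # within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, within, between, total


def multiple_impute_mean(series, m=5, seed=0):
    """Estimate the series mean from m imputed copies and pool the results."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = impute_draw(series, rng)
        estimates.append(statistics.mean(completed))
        # Variance of the sample mean on this completed copy.
        variances.append(statistics.variance(completed) / len(completed))
    return pool_rubin(estimates, variances)


# Hypothetical ozone readings with three missing hours.
ozone = [31.0, None, 28.0, 35.0, None, 40.0, 33.0, None]
pooled_mean, within, between, total = multiple_impute_mean(ozone, m=5, seed=1)
```

The total variance exceeds the within-imputation variance whenever the m estimates differ. That excess is the uncertainty that a single imputation leaves out.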

Discussion
Both SI and MI have their strengths and weaknesses. The method chosen must suit the variety of missing data present: MCAR, MAR or MNAR. Table 3 summarizes the advantages and disadvantages of the SI and MI methods in handling missing values [14].
Table 3. Summary of pros and cons by SI and MI method [14].

The results in Table 1 and Table 2 provide insight into the use of SI and MI. The imputation techniques in these lists can be characterized as following a statistical analysis (SA) approach (e.g. linear interpolation, regression-based imputation), a machine learning (ML) approach (e.g. ANN), or both. Figure 3 shows the application of imputation techniques using the SA and ML approaches to replace missing values in air pollution data. The majority of the articles applied the SA approach for both imputation methods, whereas only one article [17] used the ML approach in MI and none did in SI. This finding should encourage further work on ML-based imputation, since ML is useful when dealing with big data. A total of three articles [5,18,30] used both approaches to impute missing values. These findings suggest that future research should explore imputation that combines both approaches. Such a combination yields a hybrid model, which is highly recommended for solving the missing value problem, especially for long gaps. Currently, based on the Scopus database, three articles [37][38][39] in the environmental area use a hybrid method to impute missing values.

Conclusions
The application of the single imputation (SI) and multiple imputation (MI) methods has increased over the last decade. With the application of hybrid and machine learning methods, the imputation of missing values shows better results. Through these methods, researchers mainly aim to achieve higher accuracy and efficiency.
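Accuracy in this setting is often assessed by hiding values that are actually known, imputing them, and comparing the imputed values with the hidden ones. The sketch below illustrates that idea with the root mean squared error (RMSE) and mean absolute error (MAE). The helper names and the simple mean-fill imputer are assumptions for this example, not a procedure prescribed by the reviewed studies.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_fill(series):
    """Illustrative imputer: fill gaps with the mean of the observed values."""
    observed = [x for x in series if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in series]


def evaluate_imputer(complete, hidden_idx, imputer):
    """Hide known values, impute them, and score against the truth."""
    hidden = set(hidden_idx)
    masked = [None if i in hidden else v for i, v in enumerate(complete)]
    imputed = imputer(masked)
    actual = [complete[i] for i in hidden_idx]
    predicted = [imputed[i] for i in hidden_idx]
    return rmse(actual, predicted), mae(actual, predicted)


# Hypothetical complete series; positions 1 and 2 are hidden for the test.
complete = [10.0, 20.0, 30.0, 40.0]
score_rmse, score_mae = evaluate_imputer(complete, [1, 2], mean_fill)
```

Any imputer with the same signature, statistical or machine-learning based, can be passed to `evaluate_imputer`, which allows candidate methods to be compared on the same hidden positions.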