Visualization of Multivariate Time Series pollutant variables in Malaysia

Visualization and exploratory analysis are a crucial preliminary part of any data analysis process. Several visualization approaches have been introduced to evaluate the behavior of time-dependent data. However, visualization becomes challenging when the data are high-dimensional and voluminous. Environmental data such as pollutant variables are often collected for several variables over time, resulting in multivariate time-series data. To deal with this issue, this study presents several graphical approaches, which include time-series plots of the variables drawn individually and on a single plot, correlation matrix visualization, and smoothing of multivariate time series. A case study involving data on air-pollution variables in Klang, Malaysia, is analyzed. The results show that all the methods provide an informative visualization of the behavior of multivariable time series of pollutant data.


Introduction
Multivariate time-series data are a type of data derived from a combination of related variables of the phenomenon being studied. At present, multivariable time-series data are available in various fields of study, particularly in science, engineering, and business. Thus, a good visualization technique should be adopted to provide useful preliminary insights into the behavior of multivariable time-series data. In general, the visualization approach aims to exploit the ability of the human eye to detect structures and arrangements in images [1]. The visualization of time-dependent data has a long history. According to Tufte [2], time-series plots were drawn for the first time in the 10th or 11th century. They appeared in an illustration of planetary orbits in a text from a monastic school. In science, time-series charts were rediscovered by Lambert in the 18th century. Lambert [3] applied line graphs to present the periodic variation in soil temperature in relation to depth under the surface. At present, a wider repertoire of methods is applied to visualize time-dependent data in many areas.
The data on air pollution in Malaysia consist of five main variables, namely, carbon monoxide (CO), nitrogen dioxide (NO2), ozone (O3), particulate matter (PM10), and sulfur dioxide (SO2). Several researchers have analyzed and visualized multivariable pollutants using different approaches. For example, Al-Dhurafi et al. [4] and Sánchez-Balseca and Pérez-Foguet [5] used a compositional approach in which the proportion of each pollutant is considered as a component of compositional data. Masseran and Safari [6], Martins et al. [7], and Masseran and Mohd Safari [8] modeled and visualized the behavior of rare occurrences of air pollution indices in terms of extreme-value models. Al-Dhurafi et al. [9] described and visualized the multivariable pollutants in terms of their distributional form, which was derived from their structure and descriptive status. Chen and Wu [10] and Masseran and Safari [11] evaluated and visualized the fluctuation of pollutant variables, which corresponds to five determinant states of a Markov chain. Kumar and Ridder [12] and Masseran [13] modeled and visualized the fluctuation of PM10 data using an ARIMA-GARCH model. Al-Dhurafi et al. [14] analyzed and visualized the dependency of multivariable pollutants corresponding to their unhealthy air-quality status using the generalized Pareto distribution and Pickands dependence function plots. Masseran and Hussain [15] modeled and visualized the dependency fluctuation among the air-pollution variables using a dynamic copula approach.
In this study, we present alternative visualization techniques to analyze multivariate time-series data. In Section 2, we describe the study area and the main variables involved. The visualization techniques are applied to the air-pollution data and discussed in Section 3. Section 4 concludes the study with general remarks on using the described techniques and suggestions for further research.

Data and area of study
The air-pollution index data were obtained from the Department of Environment, Malaysia, for the period from January 1, 2007 to December 31, 2016. This study focuses on Klang, a large city with a land area of approximately 573 sq. km. As an urban and industrial region, Klang is densely populated. It is recognized as the 13th busiest transshipment port and 16th busiest container port in the world. The major industrial activities in Klang are importation and exportation. However, due to industrial activity and rapid economic growth, the city's air quality can be affected negatively if not monitored properly [16]. Thus, this study proposes multivariable visualization as an alternative tool for monitoring pollutant behavior in Klang. The data have a small percentage of missing values at random points. To estimate these missing values, single imputation based on the average of the last known and next known observations is used. This method is easy to implement and can provide good results for data that are missing at random [17].
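The single-imputation rule described above can be sketched in a few lines. The example below is a minimal illustration in Python (the study itself works in R). The function name `impute_neighbor_mean`, the toy PM10 values, and the handling of gaps at the start or end of a series (copying the nearest observed value) are assumptions made for this sketch, not details taken from the study.

```python
from typing import List, Optional


def impute_neighbor_mean(series: List[Optional[float]]) -> List[float]:
    """Fill each missing value (None) with the mean of the last known
    and next known observations.

    Assumption: a gap at the very start or end of the series has only
    one known neighbor, so that neighbor's value is copied.
    """
    n = len(series)

    # Forward pass: last observed value at or before each position.
    last_known: List[Optional[float]] = [None] * n
    previous = None
    for i, value in enumerate(series):
        if value is not None:
            previous = value
        last_known[i] = previous

    # Backward pass: next observed value at or after each position.
    next_known: List[Optional[float]] = [None] * n
    following = None
    for i in range(n - 1, -1, -1):
        if series[i] is not None:
            following = series[i]
        next_known[i] = following

    filled: List[float] = []
    for i, value in enumerate(series):
        if value is not None:
            filled.append(value)
            continue
        before, after = last_known[i], next_known[i]
        if before is None and after is None:
            raise ValueError("series contains no observed values")
        if before is None:
            filled.append(after)
        elif after is None:
            filled.append(before)
        else:
            filled.append((before + after) / 2)
    return filled


# Hypothetical daily PM10 readings with missing days marked as None.
pm10 = [50.0, None, 60.0, None, None, 90.0]
print(impute_neighbor_mean(pm10))
```

Because the neighbors are always observed values rather than previously imputed ones, every day inside a run of consecutive missing days receives the same estimate.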

Visualization techniques for multivariable air-pollution data
Multivariate plots are useful for visualizing the relationships or interactions among attributes. The purpose of these plots is to learn about the trends, correlations, and distributions among the variables. Furthermore, the bivariate scatter plot is important for determining the spread over groups of data, typically for a pair of variables. Both types of plot are available in the R programming environment.

The autoplot( ) function
The autoplot( ) function from the ggplot2 package was used to create a ggplot version of the time-series plot. Before plotting, we transformed the mts object into a zoo object through the as.zoo function. The present study illustrates the use of the autoplot( ) function to plot the trends of multivariate time-series data of pollutant variables in Malaysia. Standard time-series plots of the data can be produced either simultaneously on a single plot or individually, with one panel per variable. For example, figure 1 illustrates 10 years of daily air-pollution data on the five main pollutant variables (CO, NO2, O3, PM10, and SO2) in Klang.
In addition, figure 2 shows the trends of all the variables in a single plot. However, this plot does not present the empirical distribution of the data, the correlation among the variables, or the significance levels of the series. As shown in figures 1 and 2, most of the air-pollution data fluctuate around their means. However, for the variables CO, PM10, and SO2, several "shock points" or outliers are distant from the mean. This result implies that, for particular periods, a volatility effect occurred in the CO, PM10, and SO2 data. Although figures 1 and 2 show the simultaneous fluctuations of the multivariable pollutant data, these graphs give no information on the distributions of the series or the correlations among them.

Correlation matrix visualization
This section presents the ggcorrplot( ) function for plotting the correlations among the variables. A correlation matrix describes the correlation coefficients among the variables in the form of a table. Usually, a correlation matrix is presented in square form, with the same variables in the rows and the columns. Figure 3 presents the visualization of the correlation matrix between the variables. The line of 1.00s is the main diagonal, which runs from top left to bottom right. This line indicates that each variable is perfectly correlated with itself. The correlation values shown above the main diagonal are a mirror image of those below it; that is, the matrix is symmetric. In statistics, the value of the correlation coefficient, r, is always between -1 and +1. Table 1 presents the ranges of the r value, which are used to interpret the correlation coefficient.
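The properties listed above (a unit main diagonal, symmetry, and values between -1 and +1) follow directly from the definition of the Pearson correlation coefficient. The short Python sketch below computes such a matrix by hand to make this concrete. It is an illustration only: the study uses R, the function names `pearson_r` and `correlation_matrix` are hypothetical, and the toy values are invented rather than taken from the Klang data.

```python
import math
from typing import Dict, List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(data: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Square matrix of pairwise correlations, rows and columns in key order."""
    names = list(data)
    return [[pearson_r(data[a], data[b]) for b in names] for a in names]


# Invented toy values for three pollutants (not the study's data).
toy = {
    "CO": [1.2, 1.5, 1.1, 1.8, 1.6],
    "NO2": [0.020, 0.025, 0.018, 0.030, 0.027],
    "O3": [0.040, 0.031, 0.045, 0.022, 0.028],
}

matrix = correlation_matrix(toy)
for name, row in zip(toy, matrix):
    # Each printed row is one row of the square correlation table.
    print(name, [round(value, 2) for value in row])
```

A plotting function such as ggcorrplot( ) takes a matrix of this kind as input and maps each coefficient to a color.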

Correlation matrix with corresponding distribution plots
This section illustrates a correlation matrix that is accompanied by the corresponding distribution plots. The chart.Correlation( ) function from the PerformanceAnalytics package can be used to generate this plot. Figure 4 illustrates the chart correlation of the five main pollutant variables in Klang.
overall level of the series. This figure provides a visualization of the variation that is present among the five compositional pollutant variables: CO, NO2, O3, PM10, and SO2. The figure also reveals some clustered events in terms of high, medium, and low values for each pollutant variable. However, this information is not clearly visible in figure 5. Thus, to overcome this problem, a smoothing technique needs to be adopted.
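One simple way to obtain high, medium, and low categories of this kind is to cut each series at its own quantiles. The Python sketch below assumes tertile cut points, so that each variable is compared only with its own history. This is an assumption made for illustration, and the function name `classify_levels` and the toy values are hypothetical; the text above does not state how the categories in figure 5 were derived.

```python
import statistics
from typing import List, Sequence


def classify_levels(values: Sequence[float]) -> List[str]:
    """Label each observation as low, medium, or high.

    Assumption: the cut points are the tertiles of the series itself.
    """
    # statistics.quantiles with n=3 returns the two tertile cut points.
    lower_cut, upper_cut = statistics.quantiles(values, n=3)
    labels = []
    for value in values:
        if value <= lower_cut:
            labels.append("low")
        elif value <= upper_cut:
            labels.append("medium")
        else:
            labels.append("high")
    return labels


# Invented toy series (not the study's data).
so2 = [4.0, 9.0, 1.0, 7.0, 2.0, 8.0, 3.0, 6.0, 5.0]
print(classify_levels(so2))
```

Coloring each time point by its label, with one row per pollutant, gives an image in which runs of the same color mark clustered events.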

Smoothing multivariate time-series data
Smoothing is a useful visualization technique that helps to spot trends in noisy data and to compare the trends of two or more fluctuating time series. Although smoothing does not provide a model, it can be a first step in describing several components of the series. When the plot is noisy, it can be simplified by smoothing the individual series. The resulting plot provides an idea of how the data are organized along the time axis. Figure 6 shows the smoothed multivariate time-series data in Klang. In figure 6, the variation among the five compositional pollutant variables, together with their clustered events in terms of high, medium, and low values, is clearly represented. The mvtsplot( ) function has an argument called smooth.df, which can be used to apply a natural spline smoother to each of the time series in the time-series matrix. The smooth.df argument specifies the number of degrees of freedom to be used in the natural spline smoother for each time series. Once the smoother is fitted to the data, each observed value is replaced by its fitted value, which is then plotted. The smoothed trend line indicates that the level of air quality is influenced by these clustered events.
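The step of replacing each observed value by a fitted value can be illustrated with a much simpler smoother than the natural spline used by mvtsplot( ). The Python sketch below swaps in a centered moving average, purely to show the effect of smoothing on a noisy series. It does not reproduce the spline fit, and the function name `moving_average`, the window width, and the toy values are assumptions made for this example.

```python
from typing import List, Sequence


def moving_average(series: Sequence[float], window: int) -> List[float]:
    """Centered moving average, used here as a stand-in smoother.

    Near the ends of the series the window is truncated, so fewer
    observations are averaged there.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        start = max(0, i - half)
        stop = min(len(series), i + half + 1)
        chunk = series[start:stop]
        # The fitted value replaces the observed value at position i.
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Invented toy series with a single spike (not the study's data).
co = [10.0, 10.0, 40.0, 10.0, 10.0]
print(moving_average(co, window=3))
```

The role of the window width here is loosely comparable to that of smooth.df: a wider window, like fewer degrees of freedom, gives a smoother curve that follows individual spikes less closely.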

Conclusion
This study has presented several visualization methods for multivariate time series that are useful for exploratory analysis prior to formal model fitting. Several steps of data visualization for multivariable air-pollutant data have been proposed. First, the trends of the pollutant variables are plotted using the autoplot( ) function, both as individual graphs and on a single plot. Next, the ggcorrplot( ) and chart.Correlation( ) functions are proposed for plotting the correlation matrix and the corresponding distributions. In addition, the mvtsplot( ) function is proposed to reveal the underlying patterns in the data by using color, with the smoothing argument added to apply a natural spline smoother to the time-series matrix. This study has described these visualization functions and applied them to ambient air-pollution data. For future study, we encourage interested users to apply their own functions and modifications to the existing code to suit their own research purposes.