Review of methods for processing missing values in time series data

Missing values are a common problem in many time series. Gaps can be caused by human error, technical problems, non-working measuring stations and so on. The usual methods of handling missing values assume that a time series model can make predictions for the period one needs to describe. To build such a model, data from some interval before the period under investigation are needed, and this interval must not contain any missing values. The ordinary approach therefore assumes that plenty of data precede the period in question. In this research it is assumed that missing values can occur at any time point, so there is no single uninterrupted segment of the series that can be used to train models. Missing values in such series must be handled first; only after that is it possible to construct mathematical models of the time series and make forecasts. At that stage one can evaluate the quality of the constructed models and check whether the handled missing values fit the known data.


Introduction
Supporting modern metrological systems often requires processing time series. This type of data usually contains sequential measurements of some value under investigation. Some of these values are often wrong or unavailable, so the problem of processing missing values can arise in any domain of knowledge. The usual approach to time series forecasting assumes that mathematical models are trained on time series data without missing values. A thorough analysis of ARIMA models and other well-known mathematical constructions, together with the stages of building them, is available in [1-3]. This approach is inappropriate, however, if the time series data contain missing values: before the stage of time series model construction starts, all missing values have to be handled. This paper presents a review of methods for dealing with this problem. Current approaches to this task are based not only on regression and interpolation methods [1-3] but also on bootstrap aggregating methods [4] and on principal component analysis, which is used in multidimensional tasks [5]. The problem also arises when the values of a time series are measured with different time steps between them [6]. Models for further analysis of the processed time series can be found in [7]. Time series processing can be applied not only to the construction of economic mathematical models [7] but also, for example, in medicine [8] or in research on educational problems [9]. There are research papers showing that these problems can be solved with game-based methods [10,11].

Methods
The methods of handling missing time series values considered in this research include polynomial and spline interpolation, regression methods, time series model construction and neural networks. First of all, the simplest way to handle a single missing value is to use the values of its closest neighbours. In formula (1) the absent value ts(t) is obtained from a linear expression involving the previous value ts(t-1) and the next one ts(t+1).
Another approach that should be mentioned is a linear regression model trained with the ordinary least squares method [1-3]. Neither method uses any specific knowledge about the behaviour of the time series, so both can be applied to various tasks. The main problem of linear regression is that it constructs a single line for the whole dataset. If it is applied locally, using only a local part of the time series data, its quality decreases. At the same time, it can be used to handle many missing values, not only single ones.
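The two simple methods above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments. It assumes that missing values are encoded as NaN, that time points are equally spaced, and that the linear expression in formula (1) is the arithmetic mean of the two neighbours (the weights are an assumption). The function names are hypothetical.

```python
import numpy as np

def fill_by_neighbours(ts):
    """Fill each isolated NaN with the mean of its two neighbours.

    The equal weights are an assumed form of formula (1).
    """
    ts = np.asarray(ts, dtype=float)
    filled = ts.copy()
    for t in range(1, len(ts) - 1):
        neighbours_known = not np.isnan(ts[t - 1]) and not np.isnan(ts[t + 1])
        if np.isnan(ts[t]) and neighbours_known:
            filled[t] = 0.5 * (ts[t - 1] + ts[t + 1])
    return filled

def fill_by_linear_regression(ts):
    """Fit one OLS line to all known points and evaluate it at the gaps."""
    ts = np.asarray(ts, dtype=float)
    t = np.arange(len(ts), dtype=float)
    known = ~np.isnan(ts)
    slope, intercept = np.polyfit(t[known], ts[known], deg=1)
    filled = ts.copy()
    filled[~known] = slope * t[~known] + intercept
    return filled

series = [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0]
print(fill_by_neighbours(series))
print(fill_by_linear_regression(series))
```

The neighbour rule leaves a value untouched when one of its neighbours is also missing, which is why it suits single gaps only, while the regression line fills every gap at once.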
Interpolation methods can also be applied either to the whole time series or to the part of it near the time point under investigation. Polynomial interpolation methods have well-known problems such as Runge's phenomenon [12].
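Runge's phenomenon can be illustrated with a short numerical sketch. The test function 1/(1 + 25x^2), the equally spaced nodes on [-1, 1] and the node counts are the classical textbook setting and are assumptions of this example; they are not taken from the experiments of this paper. The Chebyshev basis is used only for numerical stability of the fit, since the interpolating polynomial itself does not depend on the basis.

```python
import numpy as np

def runge(x):
    """Classical test function for Runge's phenomenon."""
    return 1.0 / (1.0 + 25.0 * x ** 2)

def max_interpolation_error(n_nodes):
    """Largest error of the degree (n_nodes - 1) interpolant on equispaced nodes."""
    nodes = np.linspace(-1.0, 1.0, n_nodes)
    # With deg = n_nodes - 1 the least-squares fit passes through every node.
    poly = np.polynomial.Chebyshev.fit(nodes, runge(nodes), deg=n_nodes - 1)
    dense = np.linspace(-1.0, 1.0, 2001)
    return float(np.max(np.abs(poly(dense) - runge(dense))))

for n in (5, 11, 21):
    print(n, max_interpolation_error(n))
```

Adding more equispaced nodes makes the maximum error larger rather than smaller, with the oscillations concentrated near the ends of the interval.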
To handle such problems one can use spline interpolation methods [1-3], which use polynomials of lower degree than global polynomial interpolation. Spline functions rely mainly on local information and are only weakly influenced by high-variance periods situated far from the time point under investigation. Cubic splines are usually used. They do not suffer from Runge's phenomenon; a more thorough treatment can be found in [13]. In the experimental part of this work natural cubic splines are used.
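Filling gaps with a natural cubic spline can be sketched as below. The sketch assumes that SciPy is available, that missing values are encoded as NaN and that time points are equally spaced; the function name is hypothetical and the code is not the implementation used in the experiments.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def fill_by_natural_cubic_spline(ts):
    """Interpolate NaN entries with a natural cubic spline through known points."""
    ts = np.asarray(ts, dtype=float)
    t = np.arange(len(ts), dtype=float)
    known = ~np.isnan(ts)
    # bc_type="natural" sets the second derivative to zero at both ends.
    spline = CubicSpline(t[known], ts[known], bc_type="natural")
    filled = ts.copy()
    filled[~known] = spline(t[~known])
    return filled

series = [10.0, 10.4, np.nan, 11.1, 10.9, np.nan, np.nan, 10.2, 10.0]
print(fill_by_natural_cubic_spline(series))
```

The same call handles single gaps and short runs of gaps. Gaps at the very beginning or end of the series would be extrapolated by the spline, which is usually less reliable than interpolation.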
Autoregressive (AR) models, which are part of the more complex ARIMA model [1], can also be used in such cases. Segments of the time series without missing values are used as the training set to estimate the coefficients of the AR or ARIMA model, and the missing values are calculated as forecasts of the model. Enough data without gaps are needed to construct forecasting models, so if there are many missing values one has to use cubic splines or other methods.
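The idea of filling a gap with AR forecasts can be sketched as follows. This is a simplified illustration under stated assumptions: the AR(p) model includes an intercept, its coefficients are estimated by ordinary least squares on one gap-free segment that precedes the gap, and the order p is chosen by the user. The function names are hypothetical, and a full ARIMA fit would normally be delegated to a statistics library.

```python
import numpy as np

def fit_ar(segment, p):
    """OLS estimate of x[t] = c + a_1 x[t-1] + ... + a_p x[t-p]."""
    x = np.asarray(segment, dtype=float)
    n = len(x)
    # Column i holds the lag-i values aligned with targets x[p], ..., x[n-1].
    lags = [x[p - i:n - i] for i in range(1, p + 1)]
    design = np.column_stack([np.ones(n - p)] + lags)
    coef, *_ = np.linalg.lstsq(design, x[p:], rcond=None)
    return coef  # [c, a_1, ..., a_p]

def forecast_ar(history, coef, steps):
    """Iterate the fitted recursion to forecast `steps` values after `history`."""
    p = len(coef) - 1
    buffer = list(np.asarray(history, dtype=float)[-p:])
    forecasts = []
    for _ in range(steps):
        recent_first = buffer[::-1][:p]
        value = coef[0] + float(np.dot(coef[1:], recent_first))
        forecasts.append(value)
        buffer.append(value)
    return np.array(forecasts)

known_segment = [0.0, 1.0, 1.5, 1.75, 1.875, 1.9375]
coef = fit_ar(known_segment, p=1)
print(coef)
print(forecast_ar(known_segment, coef, steps=2))  # values for a gap of length 2
```

The segment must contain clearly more than p points for the estimate to be meaningful, which reflects the requirement of enough gap-free data stated above.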

Experiments
The methods enumerated above have been tested on a currency exchange rate time series (U.S. dollar to Russian ruble) [14]. There are two types of experiments: processing of single missing values and processing of randomly situated series of missing values.

Processing of single missing values
The behaviour of the source time series of the daily U.S. dollar / Russian ruble exchange rate for dates between 2020-01-01 and 2020-04-09 is shown in figure 1. Single values are deleted at random positions that are separated from each other; they are marked with black dots in figure 1. Among approximately 250 values there are twelve missing ones. The missing values are handled with the methods listed in the first column of table 1. The processed time series is then compared with the source one using the RMSE metric [2,3] given by expression (2):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t}\bigl(\tau(t) - ts(t)\bigr)^{2}} \qquad (2)$$
Here τ(t) denotes the value of the processed time series, ts(t) is the value of the source one, t runs over all handled time points and N is their number. The results are also shown in figures 2-6.
For time series with low volatility, the best ways to handle single missing values appear to be the "closest neighbours" method and cubic splines. They do not suffer from Runge's phenomenon and usually work well. Autoregressive models also show good results, but they require many points to estimate the coefficients. The main problem of the linear regression method is clearly seen in figure 3: there is a single line for all missing values. It would be better to construct lines for the local surroundings of the missing values. This is essentially the idea underlying cubic splines.
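The RMSE comparison over the handled time points can be sketched as below. The function name and the toy numbers are hypothetical and are not results from table 1; the only assumption is that the source and processed series are aligned arrays and that the indices of the handled points are known.

```python
import numpy as np

def rmse_on_handled(source, processed, handled_idx):
    """RMSE between processed and source values, restricted to handled points."""
    source = np.asarray(source, dtype=float)
    processed = np.asarray(processed, dtype=float)
    diff = processed[handled_idx] - source[handled_idx]
    return float(np.sqrt(np.mean(diff ** 2)))

source = [1.0, 2.0, 3.0, 4.0]       # toy source series
processed = [1.0, 2.5, 3.0, 3.5]    # toy series after filling points 1 and 3
print(rmse_on_handled(source, processed, [1, 3]))
```

Restricting the sum to the handled points keeps the untouched values, whose error is zero by construction, from diluting the metric.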

Processing of series of missing values
The same kind of experiment has been carried out for series of consecutive missing values. The number of consecutive missing values in each series is random but less than 5.
The behaviour of the source time series of the daily U.S. dollar / Russian ruble exchange rate for dates between 2020-01-01 and 2020-04-09 is shown in figure 7. The values that are going to be removed are marked with black dots.
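One possible way to generate such a test set is sketched below. The procedure is an assumption, since the text only states that the run lengths are random and less than 5: runs are placed at random positions, kept apart by at least one known value, and the first and last points of the series are never removed. The function name, the number of runs and the seed are hypothetical.

```python
import numpy as np

def remove_random_runs(ts, n_runs, max_len=4, seed=0):
    """Replace up to n_runs random runs (each 1..max_len long) with NaN."""
    rng = np.random.default_rng(seed)
    ts = np.asarray(ts, dtype=float)
    missing = np.zeros(len(ts), dtype=bool)
    placed, attempts = 0, 0
    while placed < n_runs and attempts < 1000:
        attempts += 1
        length = int(rng.integers(1, max_len + 1))
        start = int(rng.integers(1, len(ts) - length))  # keeps both end points
        # Reject the position if it touches or overlaps an earlier run.
        if missing[start - 1:start + length + 1].any():
            continue
        missing[start:start + length] = True
        placed += 1
    masked = ts.copy()
    masked[missing] = np.nan
    return masked, missing

masked, missing = remove_random_runs(np.linspace(60.0, 75.0, 100), n_runs=5)
print(int(missing.sum()), "values removed")
```

The returned boolean mask records which points were removed, so the filled series can later be compared with the source values at exactly those points.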