Determining the Rolling Window Size of Deep Neural Network Based Models on Time Series Forecasting

Time series forecasting has always been a significant task in various domains. In this paper, we propose DeepARMA, an LSTM-based recurrent neural network, to tackle this problem. DeepARMA is derived from an existing time series forecasting baseline, DeepAR, and overcomes two of its weaknesses: (1) rolling window size determination: DeepAR chooses the rolling window size in an ad hoc, unreliable way, which may lead to unnecessary computation and model inefficiency; (2) neglect of noise: a purely autoregressive model cannot handle data composed of various kinds of noise, and neither can most time series models, including DeepAR. To solve these two problems, we first combine a classic information-theoretic criterion, AIC, with the network to determine a proper rolling window size. We then propose a jointly learned neural network that fuses the white Gaussian noise series given by ARIMA models into DeepAR's input, which is exactly why we name the network 'DeepARMA'. Our experiments on a real-world dataset demonstrate that our improvements settle the two problems raised above.


Introduction
Time series forecasting plays a pivotal role in many application domains, including stock market prediction [1], demand or supply forecasting [2], disease propagation analysis [3], etc. Due to its importance and practicality, a great many methods have been developed to solve this problem.
Up to now, existing time series forecasting methods can be divided into two groups: conventional methods and modern methods. Conventional methods, including ARIMA [4], one of the most dominant models, are mainly based on statistics and mathematics. They normally come with theoretical guarantees, such as the Box-Jenkins methodology [5], exponential smoothing [6] and state space models [7]. There are also methods inspired by traditional machine learning algorithms, such as support-vector machines [8] and hierarchical Bayesian methods [9]. Although conventional methods benefit from strong interpretability, they suffer from the linearity assumption, which leads to poor performance in non-linear situations.
Modern methods are mainly based on deep learning and neural networks. Some of them use deep learning approaches to enhance conventional methods. DeepAR [10] builds an LSTM-based [11] autoregressive RNN to solve the probabilistic forecasting problem. Deep SSM [12] gives the state space model the ability to learn complex patterns by combining it with a jointly learned recurrent neural network. In recent years, the introduction of the Transformer [13] has allowed patterns to be extracted directly from sequences: [14] introduces a time series forecasting baseline that applies an improved Transformer with LogSparse self-attention to shrink the computation. To solve long sequence time-series forecasting (LSTF) problems, Informer [15] proposes a ProbSparse self-attention mechanism and a self-attention distilling operation to reduce the space and time complexity of the model so that it can handle longer inputs.
However, all these models fail to consider white noise. Additionally, it has been demonstrated that the rolling window size fundamentally affects the performance of the model, yet current models often select it in an ad hoc, unreliable way, which can make the model inefficient. To solve these problems, we propose DeepARMA, an LSTM-based RNN developed from DeepAR. Our contributions are threefold:
1. We propose a method to determine the rolling window size by combining the deep neural network with conventional models, making the choice interpretable and reliable.
2. DeepARMA takes white Gaussian noise into consideration and treats it as one of the covariates; accordingly, it performs better than DeepAR, especially when the time series is mainly composed of noise. To obtain the white Gaussian noise series, ARIMA is used to estimate the variance of the noise.
3. We successfully apply DeepARMA to time series forecasting with noise and perform experiments on a real-world gyroscope static drift dataset. The results validate our improvements over DeepAR.

Problem definition
The problem definition is briefly introduced in this section. Given the past values z_{1:t_0-1} of a time series and associated covariates x_{1:T}, where x can be univariate or multivariate, we model the following conditional distribution:

P(z_{t_0:T} | z_{1:t_0-1}, x_{1:T})    (1)

Note that t_0 and T are learnable parameters, respectively standing for the rolling window size and the number of prediction steps.
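As an illustration, the split of a series into a conditioning range of length t_0 and a prediction range can be sketched as follows (a minimal sketch; the names `make_windows`, `context_len` and `pred_len` are ours, not from the paper):

```python
import numpy as np

def make_windows(series, context_len, pred_len, stride=1):
    """Slice a 1-D series into (context, target) pairs for rolling-window forecasting."""
    windows = []
    total = context_len + pred_len
    for start in range(0, len(series) - total + 1, stride):
        context = series[start:start + context_len]         # conditioning range z_{1:t0-1}
        target = series[start + context_len:start + total]  # prediction range z_{t0:T}
        windows.append((context, target))
    return windows

series = np.arange(10.0)
pairs = make_windows(series, context_len=4, pred_len=2)
# each pair holds 4 conditioning values followed by the 2 values to predict
```

The context length here plays the role of the rolling window size t_0 that the paper later selects via AIC.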

DeepAR
Our model, DeepARMA, is developed from an existing autoregressive LSTM-based network, DeepAR [10]. We emphasize that we are not claiming to invent a new autoregressive recurrent architecture; we simply propose a network derived from DeepAR that achieves better generalization performance. We briefly introduce its core architecture here and refer readers to [10] for more details.
DeepAR assumes that the model distribution is a product of likelihood factors

Q(z_{t_0:T} | z_{1:t_0-1}, x_{1:T}) = ∏_{t=t_0}^{T} ℓ(z_t | θ(h_t)),    (2)

where h_t = h(h_{t-1}, z_{t-1}, x_t) is the output of the autoregressive recurrent network and ℓ is a parametric likelihood whose parameters are given by the network output θ(h_t).
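Because the factors multiply, training maximizes the sum of per-step log-likelihoods. A minimal sketch with a Gaussian likelihood (one common choice in DeepAR; the function names are ours):

```python
import math

def gaussian_log_likelihood(z, mu, sigma):
    """Log-density of one observation under N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

def sequence_log_likelihood(zs, mus, sigmas):
    """Log of the product of per-step likelihood factors, i.e. the sum of log-likelihoods."""
    return sum(gaussian_log_likelihood(z, m, s) for z, m, s in zip(zs, mus, sigmas))
```

In the full model, mu and sigma at each step would come from the network output θ(h_t); here they are passed in directly to keep the sketch self-contained.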

Determining the proper window size
The rolling window size largely determines the complexity of the model, especially during training, yet little attention has been paid to it, including in DeepAR. In our approach, we apply AIC [16], a classic information-theoretic criterion that estimates a measure of model fit, to determine a proper window size. AIC is defined by

AIC = -2 ln(l) + 2k,    (3)

where l is the maximized likelihood and k is the number of independently adjusted parameters of the model. However, it is still laborious to estimate the model for every possible window size within a limited range. We address this issue by first using ARIMA to make the prediction and computing the corresponding smallest AIC(p, q), which stands for the most appropriate AIC value for the model with AR order p and MA order q. On this foundation, the rolling window size can be chosen according to the maximum of p and q. Although ARIMA is a conventional model built on a linearity assumption, it can at least give a lower bound on the proper window size. In our study, we take twice the maximum of p and q plus the prediction length as the default rolling window size.
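The selection rule above can be sketched as follows (a minimal illustration; the AIC values in `table` are made-up placeholders, not results from the paper):

```python
import math

def aic(max_log_likelihood, k):
    """AIC = -2 ln(l) + 2k, where l is the maximized likelihood and k the parameter count."""
    return -2.0 * max_log_likelihood + 2 * k

def choose_window_size(aic_table, pred_len):
    """Pick the (p, q) with the smallest AIC, then set the window to 2 * max(p, q) + pred_len."""
    (p, q), _ = min(aic_table.items(), key=lambda item: item[1])
    return 2 * max(p, q) + pred_len

# Hypothetical AIC(p, q) values for a few ARMA(p, q) fits (illustrative numbers only).
table = {(1, 1): 152.3, (2, 6): 120.7, (3, 2): 140.1}
window = choose_window_size(table, pred_len=12)  # smallest AIC at (2, 6): 2 * 6 + 12 = 24
```

In practice the `aic_table` entries would come from fitting ARIMA models of each order, e.g. with a statistics library, and reading off their reported AIC.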

Replacing AR with ARMA
Most prevalent modern time series forecasting models, whatever architecture they are based on (CNN, RNN or Transformer), hardly ever consider the effect of white Gaussian noise. This matters a great deal when the input time series is mainly composed of white Gaussian noise, in which case an MA model is more likely to fit. Therefore, instead of building a purely autoregressive model like DeepAR, we contribute DeepARMA, an autoregressive moving average LSTM-based recurrent neural network.

Figure 1. Summary of DeepARMA
As shown in Figure 1, the network receives as input the previous values of the time series z, the covariates x and the white Gaussian noise ξ, while the other parts mirror the architecture of DeepAR. Note that the variance of the white Gaussian noise ξ is given by the ARIMA(p, q) model that is also used to determine the rolling window size.
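A minimal sketch of how the per-step input could be assembled, assuming the noise variance has already been estimated from ARIMA residuals (the names `build_inputs` and `noise_var` are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def build_inputs(z, x, noise_var):
    """Stack [z_{t-1}, x_t, xi_t] per step, where xi ~ N(0, noise_var)."""
    T = len(z)
    xi = rng.normal(0.0, np.sqrt(noise_var), size=T)  # white Gaussian noise covariate
    prev_z = np.concatenate(([0.0], z[:-1]))          # lagged target, zero-padded at t = 0
    return np.column_stack([prev_z, x, xi])

z = np.linspace(0.0, 1.0, 8)
x = np.ones((8, 2))                           # two ordinary covariates
inputs = build_inputs(z, x, noise_var=0.25)   # shape (8, 4): lag + 2 covariates + noise
```

The resulting matrix would then be fed step by step into the LSTM stack, exactly where DeepAR would feed only the lagged target and covariates.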

Datasets and environment
We perform experiments on a real-world dataset. This gyroscope static drift dataset consists of 57,936 groups of data collected from a static silicon micro-gyroscope and accelerometer. Each group includes the recording time, 3-axis acceleration and 3-axis angular velocity. These data ought to be recorded at a fixed interval of 5 ms; in practice, however, the interval may vary slightly due to the limitations of the experimental equipment. We evaluate our model using only the z-axis sensor data, under the assumption that the 3-axis sensor data are independent of each other. Note that these sensor data are assumed to be mainly composed of various kinds of noise.
We select two time-series forecasting methods for comparison: ARIMA and DeepAR. We implement our model in PyTorch, and all models were trained and tested on a single Nvidia GeForce GTX 1050. Both neural models contain a stack of 3 LSTM layers and are optimized with the Adam optimizer at a learning rate of 1e-6. Both are trained for 40 epochs with a batch size of 32.

Experiments
We first show the procedure for determining the window size and how effective choosing a proper window size is. First of all, we use the ARIMA model to make the prediction in order to find the corresponding smallest AIC(p, q). The computed AIC(p, q) values are shown in Table 1. As shown there, AIC(2, 6) has the smallest value, implying that the rolling window size should be greater than 6. Thus, assuming that the prediction length is also twice the maximum of p and q, we set the default rolling window size to 24.
Further, we compare against the two baselines under different rolling window sizes, including the default one. We use ND (Normalized Deviation) as the accuracy criterion, defined by

ND = Σ_t |z_t − ẑ_t| / Σ_t |z_t|,    (4)

where ẑ_t denotes the predicted value at step t.
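The ND metric can be computed directly from the actual and predicted series (a minimal sketch; the function name is ours):

```python
import numpy as np

def normalized_deviation(actual, predicted):
    """ND = sum_t |z_t - zhat_t| / sum_t |z_t|."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs(actual - predicted).sum() / np.abs(actual).sum()
```

Lower ND means more accurate forecasts; a perfect prediction gives ND = 0.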