MLGN: multi-scale local-global feature learning network for long-term series forecasting

Although Transformer-based methods have achieved remarkable performance in long-term series forecasting, they can be computationally expensive and lack the ability of CNNs to explicitly model local features. CNN-based methods, such as the temporal convolutional network (TCN), use convolutional filters to capture local temporal features. However, the intermediate layers of TCN suffer from a limited effective receptive field, which can result in the loss of temporal relations during global feature extraction. To address these problems, we propose combining local features and global correlations to capture the overall view of a time series (e.g. fluctuations and trends). To fully exploit the underlying information in the time series, a multi-scale branch structure is adopted to model different potential patterns separately. Each pattern is extracted with a combination of interactive learning convolution and causal frequency enhancement to capture both local features and global correlations. Our proposed method, the multi-scale local-global feature learning network (MLGN), achieves O(L) time and memory complexity and consistently achieves state-of-the-art results on six benchmark datasets. In comparison with the previous best method, Fedformer, MLGN yields 12.98% and 11.38% relative improvements for multivariate and univariate time series, respectively. Our code and data are available on GitHub at https://github.com/Zero-coder/MLGN.


Introduction
Time series forecasting (TSF) allows future events to be predicted by analyzing past patterns in time series data. It plays a crucial role in a wide range of scientific and engineering fields, including health [1,2], energy [3][4][5], traffic flow [6,7], weather forecasting [8][9][10], financial investment [11,12] and remaining useful life prediction for industrial machines [13,14], among others. By providing insights into future trends and patterns, TSF facilitates decision-making and enables proactive measures to be taken. The accuracy and reliability of TSF models can greatly affect the effectiveness of planning, risk management, and resource allocation in various domains. Long-term TSF in particular is highly demanded in real-world applications, and this paper focuses on this task. The problem of long-term series forecasting (LTSF) can be described as follows: given a look-back window of size L, the input at timestamp t is X_{t−L+1:t} = {x_{t−L+1}, ..., x_t}, and the prediction output is X_{t+1:t+T}, where T is the length of the prediction horizon (T ≫ L).
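As a concrete illustration, the following minimal PyTorch sketch shows how one (input, target) pair is sliced from a multivariate series under this formulation; the series, the timestamp and all sizes are hypothetical.

```python
import torch

# Illustrative sizes only: look-back length L, horizon T (T >> L), c variables.
L, T, c = 96, 720, 7
series = torch.randn(10000, c)          # a toy multivariate series with 10000 time steps

t = 5000                                # current timestamp
x_input = series[t - L + 1 : t + 1]     # X_{t-L+1:t}, shape (L, c)
y_target = series[t + 1 : t + T + 1]    # X_{t+1:t+T}, shape (T, c)
print(x_input.shape, y_target.shape)    # torch.Size([96, 7]) torch.Size([720, 7])
```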
As shown in figure 1, there are three main types of deep learning models commonly used for sequence modeling and TSF: recurrent neural network-based models (RNNs) [15][16][17][18][19][20], Transformer-based models [21][22][23][24][25], and CNN-based methods [26][27][28][29][30]. While RNN-type methods have demonstrated impressive performance [24], they are often limited by gradient vanishing or exploding, which significantly hampers their performance. Transformer-based methods have been proposed to learn long-term temporal correlation in TSF tasks and have shown successful results. Transformers are capable of capturing temporal dependencies among time points thanks to their attention mechanisms. However, using the attention mechanism to compute global correlations in long-term series results in quadratic complexity. Furthermore, Transformer-based methods lack the ability of CNNs to explicitly model local features [30], making them relatively unsuitable for larger datasets in real TSF applications. The temporal convolutional network (TCN) is a CNN-based model that uses 1D causal dilated convolution to extract local temporal features from time series data. It captures long-term dependencies by stacking many causal dilated convolution layers, which greatly increases the complexity of the network and the training difficulty of the model. Additionally, the effective receptive field of the middle layers is constrained, leading to information loss during temporal feature extraction.
Therefore, an effective forecasting method should possess two characteristics: (1) the ability to extract both local and global features, and (2) low time and memory complexity.
To achieve the goals mentioned above, we propose a multi-scale local-global feature learning network (MLGN). We employ a multi-scale sequence decomposition module to partition the series into distinct trend and seasonal components. These two components are indicative of the long-term development and the seasonal patterns of the time series, respectively. For the trend part, we use a simple linear layer to make its prediction. For the seasonal part, we use multiple branches of local-global layers with different scales to independently model different potential temporal patterns of the seasonal component. In each branch, we extract local features of the sequence using an interactive learning convolution (ILC)-based local module, and model the global correlation with a causal frequency enhancement (CFE)-based global module. Finally, a fuse operation merges the information about different temporal patterns from the diverse branches. Our proposed method achieves linear complexity in both time and memory usage, and MLGN achieves cutting-edge accuracy on various real-world datasets. The contributions can be outlined as follows:
• To utilize the complex temporal information of the prospective long-term horizon, we propose MLGN as a multi-scale branch architecture with a multi-scale sequence decomposition module for seasonal-trend decomposition. Notably, MLGN exhibits linear complexity in both time and memory usage.

TSF models
Classical methods, such as ARIMA [31], SARIMA [32] and Holt-Winters [33], provide theoretical guarantees for their performance [34]. Nonetheless, these methods are predominantly suited to univariate forecasting problems, which limits their applicability to intricate time series data. In light of the growing availability of data and computing power, recent studies [35,36] indicate that deep learning-based TSF techniques can yield more precise forecasts than traditional methods. Deep learning techniques such as RNNs have demonstrated superior performance in TSF compared to traditional methods. However, although RNNs are widely used for modeling sequential data, their recurrent nature complicates the accurate capture of long-term dependencies. In particular, the well-known difficulties associated with gradient vanishing or exploding, coupled with inefficient training procedures, can substantially restrict the practical usefulness of RNNs. Subsequently, the Transformer [21] emerged, exhibiting significant capability in sequence modeling and achieving outstanding results in many domains, including TSF. To address the quadratic complexity of the attention mechanism, Informer [22] introduces the ProbSparse self-attention mechanism to lower the complexity of the conventional self-attention approach, resulting in a computational complexity of O(L log L). Autoformer [24] introduces an auto-correlation mechanism as a substitute for conventional self-attention. This enables the formation of series-wise connections, with a computational complexity of O(L log L). Fedformer [37] designs two frequency-enhanced attention modules based on Fourier and wavelet transforms, respectively; both are used to replace the self-attention and cross-attention modules, with a time complexity of O(L). These Transformer-based models have demonstrated outstanding performance in learning long-term sequence dependencies [38]. CNN-based methods are commonly used to extract local temporal features by leveraging convolution kernels. In the TSF field, TCN [39] employs causal convolution to account for temporal causality and dilated convolution to expand the receptive field. This enables it to better integrate the local information of a sequence and yields competitive results for short- and medium-term forecasting [40,41]. However, due to the limited receptive field size, TCN often requires a greater number of layers to capture the global relationships of a time series, which increases the complexity of the network and the training difficulty of the model. In this paper, we propose MLGN, which specializes in modeling the local features and global information of time series. To achieve this, we design a down-sampling ILC for local feature extraction and a CFE module for global correlation discovery, as detailed in the following sections.

Modeling local and global features for long-term TSF
Feature modeling requires consideration of both local and global relationships, as both are important for accurately capturing the underlying patterns in the data. Some studies have explored how to integrate local and global feature modeling in a general way to achieve high efficiency and interpretability. Three well-known architectures that combine local and global feature modeling are CNN-LSTM [42,43], the dual attention network (DANET) [44] and Lite-Transformer [45].
The combination of CNN and LSTM effectively harnesses the local feature extraction capability of CNN and the ability of LSTM to capture long-range temporal dependencies, capturing both local and global features and thereby enhancing the prediction and analysis of sequential data. However, such approaches still suffer from the inability of LSTM to parallelize the processing of data features, resulting in time-consuming training. While the gradient problem of RNNs has been mitigated to some extent in LSTM and its variants, it still poses significant challenges when dealing with longer sequences.
DANET is a dual attention network that flexibly models the correlation between local and global features and has achieved cutting-edge performance in many segmentation applications. It adopts multiple dilated convolutions to capture local position features. DANET incorporates a position attention module and a channel attention module to capture long-range dependencies across the spatial and channel dimensions, respectively. The position attention module conditionally merges local features at each position by weighting all position features, so that corresponding features are correlated with each other regardless of their distance. The channel attention module conditionally highlights interdependent channel features by integrating correlated features among all channel maps. However, DANET lacks a detailed analysis of the learned local and global features and their impact on the model's output. Another limitation of DANET is its cubic complexity with respect to the sequence length due to the position attention mechanism.
Lite-Transformer is another architecture that integrates local and global feature modeling. It adopts convolution to extract local information and self-attention to capture long-term correlation, but separates them into two branches for parallel processing. The paper also presents a visual analysis of the feature weights extracted from the two branches, which provides a good interpretation of the model's results. However, the parallel structure of the two branches may lead to redundant computation, and the model still suffers from quadratic complexity with respect to the sequence length due to the self-attention mechanism.
Despite the effectiveness of the architectures above in modeling both local and global features, their quadratic or even higher computational complexity limits their practical application in many real-world scenarios. To address these limitations, we propose a novel framework for modeling both local and global features in time series data. Rather than relying on attention mechanisms, our approach introduces a new module that uses ILC operations to extract local information. Additionally, we propose the CFE module to model the global correlations across the different branches of local features. Our proposed method not only achieves state-of-the-art results but also reduces time and memory usage to linear complexity in the sequence length.

Model
We design a novel network architecture for TSF (section 3.1), as illustrated in figure 2. Our approach begins with a multi-scale sequence decomposition module (section 3.2), which partitions the sequence into distinct trend and seasonal components, as shown in figure 3. These two components are indicative of the long-term development and the seasonal patterns of the time series, respectively. For the seasonal part, we use multiple branches of local-global layers with different scales to independently model different potential temporal patterns of the seasonal component (section 3.3), as shown in figure 4. In each branch, we extract local features of the sequence using an ILC-based local module (figure 6(a)), and model the global correlation with a CFE-based global module (figure 7). For the trend part, we use a simple linear layer to make its prediction (section 3.4), as shown in figure 8.

MLGN Framework
The overall structure of MLGN is shown in figure 2. Inspired by traditional time series decomposition algorithms [46] and deep learning models [24,30,37,38], we design a multi-scale sequence decomposition (MSDecomp) block to separate the complex patterns of the input series. We then use a seasonal component prediction module to predict the seasonal information and a trend component prediction module to predict the trend information, and add the two prediction results to obtain the final prediction Y_pred. We define c as the number of variables in the multivariate time series. Further elaboration is provided in the subsequent sections.

Multi-scale sequence decomposition
Previous algorithms for series decomposition, as outlined by [24,38], use a moving average to smooth out periodic oscillations and accentuate long-range tendencies. For a time series X ∈ R^{L×c}, the procedure is:
X_trend = AvgPool(Padding(X)),  X_seasonal = X − X_trend,
where X_trend, X_seasonal ∈ R^{L×c} denote the trend and seasonal parts, respectively, and the Padding(·) operation keeps the length of the time series unchanged after AvgPool(·). However, the kernel of the AvgPool(·) operation is predetermined by hand, resulting in significant disparities between the trend and seasonal series obtained with different kernels. Furthermore, because real-world data commonly exhibit intricate periodic patterns alongside the trend component, extracting the trend with average pooling over a single fixed window can be challenging. To address this issue, we propose a multi-scale sequence decomposition block. This block comprises a series of average pooling kernels with varying sizes, allowing different trend components to be extracted from the original input, and employs a group of data-dependent weights to blend the extracted trend components into the final trend. Concretely, for the original input X ∈ R^{L×c}, the procedure is:
S(X) = (AvgPool(Padding(X))_{kernel_1}, ..., AvgPool(Padding(X))_{kernel_n}),
X_trend = Sum(LogSoftmax(M(X)) ⊙ S(X)),  X_seasonal = X − X_trend,
where S(·) is a group of average pooling kernels and LogSoftmax(M(X)) represents the scaling factors used for blending the trend components.
The different kernel sizes are consistent with the multi-scale information in the seasonal component prediction block. The effectiveness of multi-scale sequence decomposition is demonstrated experimentally in section 4.4.1.
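A minimal PyTorch sketch of the idea behind MSDecomp (not the authors' implementation): several average-pooling kernels produce candidate trends, and data-dependent weights blend them. The `weight_proj` layer and the plain softmax blending below are illustrative stand-ins for M(·) and the LogSoftmax scaling factors described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSDecomp(nn.Module):
    """Sketch: multi-scale seasonal-trend decomposition with learned blending weights."""

    def __init__(self, kernel_sizes=(5, 10, 12, 22, 46), n_vars=7):
        super().__init__()
        self.kernel_sizes = kernel_sizes
        # M(.): produces one blending score per pooling kernel at each time step
        self.weight_proj = nn.Linear(n_vars, len(kernel_sizes))

    def forward(self, x):                                  # x: (batch, L, c)
        trends = []
        for k in self.kernel_sizes:
            # replicate-pad the ends so the pooled series keeps length L
            front = x[:, :1, :].repeat(1, (k - 1) // 2, 1)
            back = x[:, -1:, :].repeat(1, k - 1 - (k - 1) // 2, 1)
            padded = torch.cat([front, x, back], dim=1).transpose(1, 2)   # (batch, c, L+k-1)
            trends.append(F.avg_pool1d(padded, kernel_size=k, stride=1).transpose(1, 2))
        trends = torch.stack(trends, dim=-1)               # (batch, L, c, n_kernels)
        # data-dependent blending factors (softmax used here in place of LogSoftmax)
        w = torch.softmax(self.weight_proj(x), dim=-1).unsqueeze(2)       # (batch, L, 1, n)
        trend = (trends * w).sum(dim=-1)                   # weighted mix of candidate trends
        return x - trend, trend                            # seasonal part, trend part


# toy usage
x = torch.randn(4, 96, 7)
seasonal, trend = MSDecomp()(x)
print(seasonal.shape, trend.shape)                         # (4, 96, 7) (4, 96, 7)
```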

Seasonal component prediction block
The trend component typically presents a smoother changing trend, which makes forecasting it comparatively easy (section 3.4). The seasonal component, however, fluctuates rapidly over the short term and has a stochastic nature, and therefore requires higher resolution and more sophisticated modeling techniques [47]. As illustrated in figure 4, the seasonal component prediction block is dedicated to the intricate modeling of the seasonal component. We use a multi-scale local and global (MLG) feature extraction module to capture and learn both local features and global correlations, with each branch at a different scale modeling distinct underlying patterns of the time series. The outputs from these branches are fused to achieve comprehensive utilization of the sequence information. The block can be summarized as:
Y_seasonal,0 = Concat(X_seasonal, X_zero),
Y_seasonal,l = MLG(Y_seasonal,l−1),  l = 1, ..., N,
Y_seasonal = Truncate(Projection(Y_seasonal,N)),
where X_zero ∈ R^{T×c} denotes placeholders filled with zeros, Y_seasonal,0 ∈ R^{(L+T)×c} denotes the expanded representation of X_seasonal, Y_seasonal,l ∈ R^{(L+T)×c} represents the output of the lth layer of the MLG feature extraction module, and Y_seasonal ∈ R^{T×c} represents the final prediction of the seasonal component after a linear projection of Y_seasonal,N and a Truncate operation. The details of MLG are given below.
MLG Feature Extraction Module: This module is composed of several branches with different scale sizes, which are used to capture potentially diverse temporal patterns. One notable benefit of this design is that every MLG layer can access both the local and global perspectives of the whole time series, which aids in extracting valuable temporal characteristics. In each branch, as illustrated in figure 5, the local-global module is responsible for modeling local features and global correlations of the series. Concretely, in the local module, we design an ILC module (figure 6) that performs down-sampling on the input series or features, generating two sub-series [48]. Each sub-series is subsequently processed by a group of convolutional kernels designed to discover distinctive and significant temporal features. To mitigate the loss of information caused by down-sampling, an interactive learning mechanism is integrated between the two sub-sequences. Further elaboration on the ILC module is provided later. The process of the local block is:
Y_seasonal,l^{local,i} = ILC_i(Y_seasonal,l−1),
where Y_seasonal,l−1 denotes the output of the (l−1)th MLG layer and Y_seasonal,l^{local,i} denotes the compressed local features of the ith branch, a short sequence whose length is determined by Scale_i (with c variables).
The ILC module splits the input sequence X into two distinct sub-sequences, X_odd and X_even, by separating the odd and even elements. Despite the coarser temporal resolution, the sub-sequences still retain the essential information of the original sequence. After the splitting procedure, we apply different convolutional kernels to X_odd and X_even to extract their respective features. Because these kernels operate independently, the extracted features retain different temporal dependencies, which augments their representational ability. However, down-sampling may result in information loss, so we introduce an interactive learning mechanism to enable the exchange of information between the two sub-series. The interactive learning convolution can be described as follows. Firstly, two distinct Conv1D layers, namely α and β, are used to map the hidden states of X_odd and X_even, respectively. The mapped features are then transformed into exponential form and interact with X_odd and X_even through element-wise products (equation (5)). This procedure can be interpreted as a scaling transformation of X_odd and X_even, where the learned scaling factors are produced by neural network modules that take each other into account. Secondly, we realign the extracted features from the two sub-sequences into a new sequence representation.
Compared with traditional one-dimensional convolution, ILC has a better capability to extract temporal features. ILC down-samples a sequence into two sub-sequences and extracts local features from them, which enlarges the receptive field of the convolutional kernel, and then learns to model interactive information between the two sub-sequences in order to obtain a global view of the entire sequence. We conduct comparative experiments in section 4.4.2 to validate the effectiveness of ILC.
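A minimal PyTorch sketch of the interactive learning idea (split into even/odd sub-series, exchange information via exponential scaling, realign). The kernel size, the cross-wise pairing of α and β, and the interleaved realignment are assumptions; the per-sub-series feature convolutions and the branch-specific compression to a shorter sequence are omitted here.

```python
import torch
import torch.nn as nn

class ILC(nn.Module):
    """Sketch: interactive learning convolution on an even-length sequence."""

    def __init__(self, n_vars=7, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # alpha / beta: the two Conv1D mappings used for the interactive scaling
        self.alpha = nn.Conv1d(n_vars, n_vars, kernel_size, padding=pad)
        self.beta = nn.Conv1d(n_vars, n_vars, kernel_size, padding=pad)

    def forward(self, x):                          # x: (batch, length, c), length even
        x = x.transpose(1, 2)                      # (batch, c, length) for Conv1d
        x_even, x_odd = x[..., 0::2], x[..., 1::2] # down-sample into two sub-series
        # interactive learning: each sub-series is rescaled by exp(conv(the other))
        x_odd_new = x_odd * torch.exp(self.alpha(x_even))
        x_even_new = x_even * torch.exp(self.beta(x_odd))
        # realign the two sub-series into a single sequence again (interleave)
        out = torch.stack([x_even_new, x_odd_new], dim=-1).reshape(x.shape)
        return out.transpose(1, 2)                 # (batch, length, c)


# toy usage
y = ILC()(torch.randn(4, 96, 7))
print(y.shape)                                     # torch.Size([4, 96, 7])
```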
The global module is designed to capture the overall information and context of the output generated by the local module. A commonly used method for modeling global information is the self-attention mechanism. In this paper, however, we design a CFE module consisting of a causal Fourier transform and an inverse Fourier transform [37,49,50], as shown in figure 7. The causal Fourier transform pads the sequence of length S with S−1 zero placeholders and uses a sliding window of length S to ensure causality. We then perform a Fourier transform on each window sequence, which means that frequency features can be used to measure the global relevance of the entire series in a causal, sequential manner. Next, we perform weighted averaging (softmax) on the causal frequency sequence generated by the causal Fourier transform. The global module can be formalized as:
Y_seasonal,l^{global,i} = F^{-1}(Softmax(CF(Y_seasonal,l^{local,i}))),
where Y_seasonal,l^{local,i} denotes the result of the local feature modeling for branch i, Y_seasonal,l^{global,i} ∈ R^{(L+T)×c} represents the result of this pattern (i.e. this branch) after global correlation modeling and up-sampling back to the full length, and CF(·) and F^{-1}(·) represent the causal Fourier transform and the inverse Fourier transform, respectively.
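A minimal sketch of one reading of the CFE chain (causal zero-padding, sliding-window FFT, softmax weighting, inverse FFT). The frequency-weighting scheme and the choice of returning the last step of each window are assumptions, and the transposed-convolution up-sampling is omitted.

```python
import torch
import torch.nn.functional as F

def causal_frequency_enhancement(x):
    """Sketch of CFE.  x: (batch, S, c) -> (batch, S, c)."""
    batch, S, c = x.shape
    # pad S-1 zeros at the front so the window ending at step t only sees steps <= t
    padded = F.pad(x, (0, 0, S - 1, 0))                       # (batch, 2S-1, c)
    windows = padded.unfold(dimension=1, size=S, step=1)      # (batch, S, c, S) causal windows
    freq = torch.fft.rfft(windows.contiguous(), dim=-1)       # Fourier transform per window
    weights = torch.softmax(freq.abs(), dim=-1)               # softmax weighting of frequencies
    enhanced = torch.fft.irfft(freq * weights, n=S, dim=-1)   # back to the time domain
    return enhanced[..., -1]                                  # current step of each window


# toy usage
out = causal_frequency_enhancement(torch.randn(4, 48, 7))
print(out.shape)                                              # torch.Size([4, 48, 7])
```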
Then we use a 2D convolution to fuse the results from the branches of different patterns, and we truncate the output of the Nth MLG layer, Y_seasonal,N, to obtain the final result of the seasonal component prediction block, Y_seasonal. The procedure is:
Y_seasonal,l = Conv2D(Y_seasonal,l^{global,1}, ..., Y_seasonal,l^{global,n}),
Y_seasonal = Truncate(Projection(Y_seasonal,N)),
where Y_seasonal,l ∈ R^{(L+T)×c}, Y_seasonal,N ∈ R^{(L+T)×c} and Y_seasonal ∈ R^{T×c} denote the output of the lth MLG layer, the output of the Nth (last) MLG layer, and the final result of the seasonal component prediction block, respectively.
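A small sketch of the fuse step under the assumption that the branch outputs are stacked along a new axis and merged by a 2D convolution; the channel layout and the 1×1 kernel are illustrative choices, not the authors' configuration.

```python
import torch
import torch.nn as nn

# Illustrative sizes: n branches, each producing an (L+T) x c feature map.
n_branches, batch, length, c = 4, 8, 96 + 720, 7
branch_outputs = [torch.randn(batch, length, c) for _ in range(n_branches)]

stacked = torch.stack(branch_outputs, dim=1)           # (batch, n_branches, L+T, c)
fuse = nn.Conv2d(in_channels=n_branches, out_channels=1, kernel_size=1)
fused = fuse(stacked).squeeze(1)                       # (batch, L+T, c): merged patterns
print(fused.shape)
```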

Trend component prediction block
By explicitly handling the trend component of the input sequence generated by the decomposition module, the performance of a basic linear model is improved when the data exhibit a distinct trend [38]. Therefore, in this paper, we use a simple fully connected layer to predict the trend component. The linear layer performs a direct regression of the historical time series to make future predictions by means of a weighted summation, as depicted in figure 8. The mathematical expression is Y_i = W X_i, where W ∈ R^{T×L} is a linear layer along the temporal axis, and Y_i and X_i are the prediction and input for the ith variate, respectively. Note that the fully connected layer uses shared weights across the variables and does not extract any spatial correlations. The trend component prediction process is:
Y_trend = FC(X_trend),
where Y_trend ∈ R^{T×c} denotes the prediction of the trend part and FC(·) represents the fully connected layer.
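A minimal sketch of the trend prediction: a single linear layer applied along the temporal axis with weights shared across variables; the sizes are illustrative.

```python
import torch
import torch.nn as nn

L, T, c = 96, 720, 7
fc = nn.Linear(L, T)                                   # W in R^{T x L}, shared by all variables

x_trend = torch.randn(8, L, c)                         # trend component from MSDecomp
y_trend = fc(x_trend.transpose(1, 2)).transpose(1, 2)  # regress along time: (8, T, c)
print(y_trend.shape)                                   # torch.Size([8, 720, 7])
```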

Experiments
In this section, we present the experimental evaluation of our proposed MLGN model. Specifically, we conduct comprehensive experiments on six real-world datasets to validate the effectiveness of our approach. We compare our model with previous state-of-the-art baselines and analyze the results in terms of various metrics. Furthermore, we perform ablation studies on each module of the proposed MLGN model to investigate their individual contributions to the overall performance. The ablation studies show that our model consistently outperforms the variants without certain modules, which demonstrates the importance of each component in our framework. Finally, we evaluate the efficiency of our model by comparing it with the latest mainstream models, analyzing the running time and memory consumption of the different models on the same hardware platform. In summary, the extensive experiments in this section demonstrate the effectiveness and efficiency of the proposed MLGN model for long-term sequence prediction and provide insights into the contributions of its individual components.

Datasets
The following is a detailed description of the six real-world datasets: Electricity: logs the hourly electricity consumption of 321 customers from 2012 to 2014.
ETT: stores the temporal sequences of power load and oil temperature obtained from electricity transformers between 2016 and 2018. ETTm1/ETTm2 are measured every 15 minutes, while ETTh1/ETTh2 are measured hourly. Table 1 presents an overview of the general statistics of the datasets. We adopt the conventional practice of partitioning each dataset into training, validation and test subsets in chronological order. The ETT datasets are divided in a 3:1:1 ratio, while the others follow a 7:2:2 split.
Implementation details: The experiments are carried out using PyTorch [53] and run on an NVIDIA GeForce RTX 4080 16 GB GPU, with each experiment repeated three times. The models are trained using ADAM [54] with an L2 loss, a batch size of 32, and an initial learning rate of 10 × 10^{-4}. Every Transformer-based baseline uses two encoder layers and one decoder layer. The test MSE/MAE is reported as the performance metric for different prediction lengths; a smaller MSE/MAE implies better performance in time series prediction tasks.
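For reference, a straightforward sketch of the two metrics (not the authors' evaluation code):

```python
import torch

def mse(y_true, y_pred):
    # mean squared error over all samples and variables
    return torch.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # mean absolute error over all samples and variables
    return torch.mean(torch.abs(y_true - y_pred))
```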
The metrics are defined as
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2,  MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,
where n represents the number of samples, y_i the true value of the ith sample, and ŷ_i the predicted value of the ith sample.
Furthermore, under the same-length series forecasting settings of input-96-predict-96 and input-24-predict-24, MLGN demonstrates superior and consistent predictive accuracy compared to Fedformer across all six datasets, with an overall relative MSE reduction of 7.58% and a relative MAE reduction of 2.45%. Under the LTSF settings of input-96-predict-720 and input-24-predict-60, MLGN also shows superior and consistent accuracy compared to Fedformer across all six datasets, with an overall relative MSE reduction of 20.22% and a relative MAE reduction of 7.77%. These findings indicate that our proposed MLGN model is effective in handling LSTF tasks.

Main results
Visualization of forecasting results: The forecasting results on the test sets of the univariate Electricity and Traffic datasets are visualized in figures 9 and 10. Our model demonstrates superior performance among the models evaluated (visualizations of the other models can be found in appendix A). In particular, MLGN outperforms the Transformer-based models in predicting the overall changes and peaks within the time series. We can clearly observe that the predictions generated by the MLGN model are in close agreement with the ground truth. Specifically, the MLGN model exhibits excellent capabilities for capturing periodic signals, as evidenced by its ability to accurately model the cyclical patterns in the data.
The forecasting results on the test sets of the multivariate ETTm1 and ETTm2 datasets are visualized in figures 11 and 12. Our model demonstrates superior performance among the models evaluated (visualizations of the other models can be found in appendix A). In addition, the results suggest that MLGN outperforms the other models in detecting and predicting turning points in the time series and tracks the actual outcomes more closely.

Ablation studies

4.4.1. Multi-scale sequence decomposition vs single-scale sequence decomposition
Autoformer [24] and Dlinear [38] utilize single-scale series decomposition as an internal block of deep models and achieve excellent performance. Nonetheless, the patterns obtained through this decomposition are basic and may not be sufficient to handle the complex and dynamic nature of time series. Therefore, we design a multi-scale sequence decomposition module, which uses a series of average pooling kernels with varying sizes to capture the trend component of the input sequence and a group of data-dependent weights to compute their weighted average. The default sizes of the average pooling kernels are 5, 10, 12, 22, and 46. For comparison, we replace the multi-scale sequence decomposition module in MLGN with a single-scale sequence decomposition module that uses a single average pooling kernel of size 25. As demonstrated in table 4, the multi-scale sequence decomposition module significantly outperforms the single-scale module in most cases. The Electricity dataset contains the hourly electricity consumption of 321 users, while the ETT dataset mainly concerns oil temperature, which is relatively stable. On ETT we achieve similar performance because it has no obvious temporal pattern compared to Electricity, whereas on the Electricity dataset the advantage of our proposed method is more pronounced. These results verify that the multi-scale sequence decomposition structure better matches the complex temporal patterns of real-world time series.

4.4.2. ILC vs regular convolution (RC)
To validate the rationality and effectiveness of the ILC module, we replace the ILC in the local module of MLGN with a regular Conv1D and compare the two methods on the multivariate ETTm1, ETTm2 and Electricity datasets, as shown in table 5. MLGN-RC refers to the MLGN model employing regular convolution, which extracts local features by sliding the convolution kernel along the whole sequence but fails to consider the interaction between different positions. Consider a convolutional filter with kernel size k: for the bottom layer of a conventional TCN, the receptive field of each CNN layer is simply k, whereas for ILC, thanks to the down-sampling, the receptive field of each convolutional filter is already enlarged to roughly 2k (e.g. a kernel of size 3 covers about 6 steps of the original sequence). Additionally, the ILC module can receive features from other branches, further expanding the receptive field. To compensate for the loss of information during down-sampling, the interactive learning process aggregates the information extracted from the two down-sampled sub-sequences. Compared with the one-dimensional dilated convolution in an ordinary TCN, these sub-sequences thus provide a more complete local and global view of the time series, allowing complex temporal characteristics to be extracted more effectively.

4.4.3. CFE vs masked self-attention and isometric convolution
By utilizing the local module of MLGN, we acquire a brief sequence that characterizes the local features.
Building upon this, we incorporate a CFE module into the global module to model the overall information of the sequence. CFE models the global correlations from the different branches of local features. The causality-preserving Fourier transform ensures that only information from past and present time steps is used, avoiding information leakage and conforming to the causal nature of TSF tasks. The Fourier transform can also capture periodic patterns and changes across different frequencies in the time series, enriching the representation of the sequence. The inverse Fourier transform then maps the frequency-domain representation back to the time domain, providing an enhanced time-domain sequence as input to the next layer. This module endows the model with the capability of joint time-frequency modeling, enabling it to learn representations in both the temporal and spectral domains and improving predictive performance. Masked self-attention has previously been the primary choice for this purpose; for comparison, we substitute the CFE module in the global module of MLGN with masked self-attention and with isometric convolution [30], respectively. The results, presented in table 6, show that the CFE module generally outperforms masked self-attention and isometric convolution in terms of MSE/MAE in most cases.

Model analysis

4.5.1. The impact of input length and output length on prediction results
In time series prediction tasks, the input length determines the amount of historical information the model can leverage. Generally, models with a strong capacity to capture long-term temporal dependencies should perform better as the input length grows. Consequently, we run experiments with varying input lengths but an identical prediction horizon to validate our model. As illustrated in figure 13, when the input sequence is relatively long and the prediction length is relatively short, Transformer-based models tend to perform worse. This is due to repetitive short-term patterns that overly influence future predictions, and an inadequate ability to extract long-term temporal features, as discussed in [22]. In contrast, the overall prediction performance of MLGN improves gradually as the input length increases, because MLGN can adaptively capture both long-term and short-term temporal dependencies.
We also conduct experiments with diverse prediction lengths but an identical input length. As depicted in figure 14, when the prediction length is larger, both Transformer-based models and MLGN exhibit inferior performance. Nevertheless, the performance of MLGN changes steadily as the prediction length O expands, whereas the performance of Transformer-based models is considerably unstable and declines rapidly. This indicates that MLGN preserves superior long-term robustness, which holds significant relevance for practical real-world applications, such as early warning systems for weather forecasting and long-term planning of energy consumption.

Robustness of our model against noise
To validate the robustness of our model against noise, we conduct a simple noise injection experiment on datasets with different granularities (10 min, 1 h, 1 d). Specifically, we randomly select a proportion ε of the points in the original input sequence and perturb each selected value within the range [−2Mi, 2Mi], where Mi is the original value (a minimal sketch of this protocol is given at the end of this subsection). We train the model on the noise-injected data and record the MSE and MAE metrics. The results are presented in table 7. As the perturbation proportion ε increases, the MSE and MAE metrics increase only slightly, indicating that MLGN has excellent noise resistance and can handle up to 10% noise on datasets of different granularities.

Table 8 summarizes the time complexity and memory usage of model training. Theoretically, both the time complexity and the memory usage of MLGN are linear, endowing it with high efficiency and scalability. Linear complexity means that the runtime and memory usage grow proportionally to the size of the input data, allowing larger datasets to be handled; moreover, algorithms with linear complexity are typically easier to implement and optimize than those with higher complexity.
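A minimal sketch of the noise-injection protocol used in the robustness test above; the exact sampling distribution inside [−2Mi, 2Mi] is an assumption.

```python
import torch

def inject_noise(x, eps=0.1):
    """Perturb a random proportion eps of the points of x within [-2*M_i, 2*M_i]."""
    noisy = x.clone()
    mask = torch.rand_like(x) < eps              # select roughly a proportion eps of entries
    low, high = -2 * x.abs(), 2 * x.abs()        # abs() keeps the interval ordered for negative values
    noisy[mask] = (low + (high - low) * torch.rand_like(x))[mask]   # uniform draw per entry
    return noisy


# toy usage: 10% perturbation of a (batch, L, c) input window
noisy_input = inject_noise(torch.randn(8, 96, 7), eps=0.10)
```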

Conclusion and future work
In this research, we introduce a new neural network architecture, MLGN, which extracts features from time series at multiple scales and makes predictions by combining a trend forecasting component with a seasonal forecasting component. MLGN achieves O(L) time and memory complexity and consistently achieves cutting-edge performance on various real-world datasets. We propose a local-global structure to facilitate information aggregation and long-term dependency modeling for time series, and we design a down-sampling ILC for local feature extraction and a CFE mechanism for global correlation discovery. Extensive experiments further demonstrate the effectiveness and efficiency of our modeling approach for long-term forecasting tasks. The Fourier transform is the most widely used method for extracting frequency information, but it has some issues that lower model performance, such as high-frequency noise caused by the Gibbs phenomenon and the computational overhead of the inverse transformation in the Fourier transform-inverse Fourier transform (FT-IFT) process. Theoretically, the discrete cosine transform (DCT) can avoid the Gibbs effect [50], and since the frequency features produced by the DCT are real-valued, they can directly participate in neural network computations, facilitating the fusion of time and frequency information. Therefore, in future work we will consider improving MLGN by exploring DCT-based alternatives to the Fourier transform. To further validate the generalizability of MLGN, we will also apply it to other time series tasks such as classification, anomaly detection, and imputation.

Figure 1 .
Figure 1. The mainstream sequence modeling architectures for time series forecasting.
Finally, we utilize the inverse Fourier transform to convert the frequency-domain signals back into time-domain signals. We demonstrate that, for the majority of cases, CFE outperforms self-attention; the detailed experiments are shown in section 4.4.3. To maintain a consistent sequence length, we use a transposed convolution to up-sample the output of the causal frequency enhancement block. The global module can be formalized as given in section 3.3.

Exchange: gathers the panel dataset of daily currency exchange rates for eight nations between 1990 and 2016. ILI: gathers the weekly reports on the ratio of influenza-like illness patients to total patients from the United States Centers for Disease Control and Prevention between 2002 and 2021. Traffic: holds records of hourly road occupancy rates monitored by 862 sensors along the highways of the San Francisco Bay Area from 2015 to 2016. Weather: comprises meteorological time series with 21 weather indicators measured every 10 minutes by the Max Planck Institute for Biogeochemistry.

Table 4 .
Ablation of multi-scale sequence decomposition. MLGN-M adopts the multi-scale series decomposition method as the baseline. MLGN-S replaces the multi-scale series decomposition method with the single-scale decomposition module of Autoformer.

Figure 15 .
Figure 15. Efficiency analysis. In the running time experiment, we maintain a constant prediction length of T = 720 and vary the input length as L = [24, 48, 168, 336, 720]. For the memory usage experiment, we set L = 96 and T = 720.
Figure 15 shows a comparison of MLGN with five previous models: Fedformer, Autoformer, Pyraformer, Informer, and Transformer. The results demonstrate that during training MLGN has significantly lower training time and memory usage than the other models. The larger number of parameters in Transformer-based methods contributes to their longer training times. MLGN's memory efficiency makes it suitable for processing large datasets or running on resource-limited devices.

Figure A4 .
Figure A4. Univariate forecasting cases using the Traffic dataset with model MLGN.

Figure A5 .
Figure A5. Univariate forecasting cases using the Traffic dataset with model Autoformer.

Figure A6 .
Figure A6. Univariate forecasting cases using the Traffic dataset with model Informer.

Figure A8 .
Figure A8. Multivariate forecasting cases using the ETTm1 dataset with model Autoformer.

Figure A9 .
Figure A9. Multivariate forecasting cases using the ETTm1 dataset with model Informer.

Figure A10 .
Figure A10. Multivariate forecasting cases using the ETTm1 dataset with model Transformer.

Figure A11 .
Figure A11. Multivariate forecasting cases using the ETTm2 dataset with model MLGN.

Figure A12 .
Figure A12. Multivariate forecasting cases using the ETTm2 dataset with model Autoformer.

Figure A13 .
Figure A13. Multivariate forecasting cases using the ETTm2 dataset with model Informer.

Figure A14 .
Figure A14. Multivariate forecasting cases using the ETTm2 dataset with model Transformer.

Figure C15 .
Figure C15. The total runtime of the inference phase.

Figure E16 .
Figure E16. Visualization of the learned trend part prediction result Y_trend and the seasonal part prediction result Y_seasonal in the dataset, under MLGN with setting Input-96-Prediction-96.

• We propose a local-global structure to facilitate information aggregation and long-term dependency modeling for time series, surpassing the performance of dilated causal convolution-based methods such as TCN and of Transformer-based methods. We design a down-sampling ILC for local feature extraction and a CFE module for global correlation discovery.
• MLGN achieves relative improvements of 12.98% and 11.38% for multivariate and univariate time series, respectively, over six benchmark datasets covering five real-world scenarios: economics, energy, weather, disease, and traffic.

Table 1 .
Statistics of datasets.

Table 2 .
The outcomes of multivariate forecasting for various prediction horizons O ∈ {96, 192, 336, 720} on six real-world datasets. For the ILI dataset, the input sequence horizon is set to 36, while for the other datasets it is set to 96.

Univariate results: To evaluate the predictive performance of our model in simple univariate scenarios, we conduct univariate forecasting experiments and summarize the results in table 3. Our proposed MLGN method remains at the forefront, achieving the best performance for long sequence time-series forecasting (LSTF) tasks. Compared to the previous state-of-the-art model, Fedformer, MLGN exhibits an overall relative MSE reduction of 11.38% and a relative MAE reduction of 4.91%. The majority of the assessment indices (37/48) surpass those of existing models, indicating that MLGN's predictions in a straightforward univariate setting are not only practical but also achievable.
Multivariate results: To test the predictive performance of the model in complex environments, we conduct multivariate forecasting experiments; the results are reported in table 2. Our proposed MLGN delivers state-of-the-art results across all benchmarks and prediction horizons, which also indicates the effectiveness and stability of our method. Compared to the previous best state-of-the-art model, Fedformer, MLGN exhibits an overall relative MSE reduction of 12.98% and a relative MAE reduction of 11.03%.

Table 3 .
The outcomes of univariate forecasting for various prediction horizons O ∈ {96, 192, 336, 720} on six real-world datasets. For the ILI dataset, the input sequence horizon is set to 36, while for the other datasets it is set to 96.

Table 5 .
Ablation of interactive learning convolution. MLGN-ILC adopts the interactive learning convolution as the baseline. MLGN-RC replaces the interactive learning convolution with a regular one-dimensional convolution.

Table 6 .
Ablation of the causal frequency enhancement module. We substitute the causal frequency enhancement module in MLGN with masked self-attention and isometric convolution, and execute the experiments on the multivariate Electricity, Exchange, and Traffic datasets. The best results are emphasized in bold.

Regarding the ILC ablation in table 5, compared to MLGN-RC, MLGN-ILC exhibits an overall relative reduction of 16.22% in MSE and 13.03% in MAE across all three datasets. Specifically, on the Electricity dataset, MLGN-ILC demonstrates an impressive 40.02% reduction in MSE (from 1.177 to 0.778) and a 28.83% reduction in MAE (from 1.654 to 1.297).

Table 7 .
Results of the robustness test against noise interference. ε represents the proportion of noise injected, with MLGN as the baseline.

Table 8 .
Complexity analysis of different forecasting models.