The impact of memory on learning sequence-to-sequence tasks

The recent success of neural networks in natural language processing has drawn renewed attention to learning sequence-to-sequence (seq2seq) tasks. While there exists a rich literature that studies classification and regression tasks using solvable models of neural networks, seq2seq tasks have not yet been studied from this perspective. Here, we propose a simple model for a seq2seq task that has the advantage of providing explicit control over the degree of memory, or non-Markovianity, in the sequences: the stochastic switching Ornstein-Uhlenbeck (SSOU) model. We introduce a measure of non-Markovianity to quantify the amount of memory in the sequences. For a minimal auto-regressive (AR) learning model trained on this task, we identify two learning regimes corresponding to distinct phases in the stationary state of the SSOU process. These phases emerge from the interplay between two different time scales that govern the sequence statistics. Moreover, we observe that while increasing the integration window of the AR model always improves performance, albeit with diminishing returns, increasing the non-Markovianity of the input sequences can improve or degrade its performance. Finally, we perform experiments with recurrent and convolutional neural networks that show that our observations carry over to more complicated neural network architectures.


Introduction
The recent success of neural networks on problems in natural language processing [1][2][3][4][5] has rekindled interest in tasks that require transforming a sequence of inputs into another sequence (seq2seq). These problems also appear in many branches of science, often in the form of time series analysis [6,7]. Despite their ubiquity in science, there has been little work analysing the learning of seq2seq tasks from a theoretical point of view in simple toy models. Meanwhile, a large body of work emanating from the statistical physics community [8][9][10][11][12] has studied the performance of simple neural networks on toy problems to develop a theory of supervised learning, where a high-dimensional input such as an image is mapped to a low-dimensional label, like its class.
An important recent insight from the study of supervised learning was the importance of data structure for the success of learning. The challenge of explaining the success of neural networks on computer vision tasks [13][14][15][16] inspired a new generation of data models that take the effective low-dimensionality of images [17] into account, such as object manifolds [18], the hidden manifold [19,20], spiked covariates [21,22], or low-dimensional mixture models embedded in high dimensions [23][24][25].

Figure 1: A flexible, minimal model for sequence-to-sequence learning tasks with varying degrees of memory along the sequence. Left: The motion of a Brownian particle (black filled circle) in a switching parabolic potential "trap" yields dynamics given by the SSOU (stochastic switching Ornstein-Uhlenbeck) model of eq. (1). Example trajectories for sequences of the particle's position $X_t$ (black line) and the trap centre $C_t$ (blue line) are shown as a function of time $t$. The blue dashed lines are set at the trap centres $C_0 = \pm 1$. Middle: We train three types of models to reconstruct the trap positions $C_t$ from the particle trajectories $X_t$: auto-regressive models (AR) that make predictions of the trap position based on the past $W$ observations of the particle's position, 1D convolutional neural networks (CNN) that act as a set of $f$ parallel AR models with added nonlinearity in the output, and recurrent neural networks (RNN) that use feedback loops with a $d$-dimensional internal state to capture dependencies across time steps. Right: Sample reconstruction of the trap positions $\hat{y}_t$ (red dashed line) for one example input sequence ($X_t$, black solid line) compared with the actual hidden trap position ($C_t$, blue solid line). The sequence is a zoomed-in view of the sample sequence in the left panel. Parameters: $\kappa = 10$, $k = 1$, $D = 0.5$, simulation time step $\Delta t = 0.02$, AR model window size $W = 2$. For this example, we used 5000 training samples, evaluated on 5000 test samples, with a mini-batch size of 32 and 5 epochs.
By deriving learning curves for neural networks on these and other data models [26][27][28][29], the importance of data structure for the success of neural networks compared to other machine learning methods was clarified [21,23,24,30]. The aforementioned breakthroughs on seq2seq tasks make it an important challenge to extend the approach of studying learning in toy models to the realm of seq2seq tasks. What are the key properties of sequences, akin to the low intrinsic dimension of images, that need to be modelled? How do these properties interact with different machine learning methods like auto-regressive models and neural networks?
Here, we make a first step in this direction by proposing to use a minimal, solvable latent variable model for time-series data that we call the stochastic switching Ornstein-Uhlenbeck process (SSOU). The input data consists of a sequence $x = (x_t)_{t=0}^{T}$, which depends on a latent, unobserved stochastic process $c = (c_t)_{t=0}^{T}$. The learning task is to reconstruct the unobserved sequence $c$ from the input sequence $x$, cf. fig. 1. The key data property we aim to describe and control is the memory within the input sequence $x$. In the simplest scenario, the sequence $x_t$ is memoryless: given the value of the present token $x_t$, the value of the next token $x_{t+1}$ is statistically dependent solely on $x_t$ but not on previous tokens $x_{t' < t}$. Such a sequence can be described by a Markov process. By tuning the dynamics of the latent process $c_t$, we can control the memory of the sequence, i.e. we can increase the statistical dependence of future tokens on their full history $x_{t' < t}$ (we introduce a precise, quantitative measure for the memory below). Adding memory thus makes the process non-Markovian and allows us to model richer statistical dependencies between tokens. At the same time, the presence of memory in a process generally makes the mathematical analysis harder. Our goal is to analyse how different neural network architectures (auto-regressive (AR), convolutional (CNN), and recurrent (RNN)) handle the memory when solving the seq2seq task $x \to c$.

A model for sequence-to-sequence tasks
We first describe a latent variable model for seq2seq tasks, which we call the stochastic switching Ornstein-Uhlenbeck (SSOU) model. This model was introduced recently in biophysics to describe experimental recordings of the spontaneous oscillations of the tip of hair-cell bundles in the ear of the bullfrog [31]. Furthermore, a simplified variant of the model with exponential waiting times has also been fruitfully explored in the experimental context to describe the relaxation oscillations of colloidal particles in switching optical traps [32,33].
We consider observable sequences $X = (X_t)$ which are described by a one-dimensional stochastic process whose dynamics is driven by an autonomous latent stochastic process $C = (C_t)$ and a Gaussian white noise that is independent of $C_t$. Here and in the following we index sequences by a time variable $t \geq 0$, and use bold letters such as $X$ to denote sequences, or trajectories. A sequence of length $T$ can be sampled from the stochastic differential equation

$$dX_t = -\kappa (X_t - C_t)\, dt + \sqrt{2D}\, dB_t. \qquad (1)$$

Here, $dB_t$ is the increment of the Wiener process in $[t, t + dt]$, with $\langle dB_t \rangle = 0$ and $\langle (dB_t)(dB_{t'}) \rangle = \delta(t - t')dt$. The angled brackets $\langle \cdot \rangle$ indicate an average over the noise process. We denote the parameters $D > 0$ and $\kappa > 0$ as the diffusion coefficient and trap stiffness, for reasons described below.
We employ this setup as a seq2seq learning task where we aim to reconstruct the hidden sequence of trap positions $C$ given a sequence of particle positions $X$; see fig. 1 for an illustration. The key idea in our model is to let the location of the potential $C_t$ alternate in a stochastic manner between the two positions $C_0 = \pm 1$. The waiting time $\tau$ spent in each of these two positions is drawn from the waiting-time distribution $\psi_k(\tau)$. For a generic choice of $\psi_k(\tau)$, the process $C_t$ (and hence $X_t$) is non-Markovian and has a memory. Here, we use a one-parameter family of gamma distributions (defined in eq. (6) below), which allows us to quantitatively control the degree of memory in the sequence of tokens $C$ in a simple manner; we discuss this in detail in section 4.1.

Physical interpretation of the SSOU model
A well-known application of eq. (1) in statistical physics is to describe the trajectories of a small particle (e.g. a colloid) in an aqueous solution undergoing Brownian motion [34] in a parabolic potential with time-dependent centre $C_t$. Such a physical model is often realized with microscopic particles immersed in water and trapped with optical tweezers [32,35,36]. When $\kappa = 0$, the particle is driven only by the noise term $dB_t$ and performs a one-dimensional free diffusion along the real line with diffusion coefficient $D$. By choosing $\kappa$ positive, the particle experiences a restoring force $-\kappa(X_t - C_t)$ towards the instantaneous centre of the potential. This force, $F(X_t, C_t) = -\partial_X V(X, C)|_{X=X_t,\, C=C_t}$ with the harmonic potential $V(X, C) = \frac{\kappa}{2}(X - C)^2$ centred on $C_t$, tends to confine the particle to the vicinity of the point $C_t$, which is why the potential is often called a particle trap. This motivates the name "stiffness" for $\kappa$: the higher $\kappa$, the stronger the restoring force that confines the particle to the centre of the trap $C_t$. From a physical point of view, the dynamics of $X_t$ consists of alternating relaxation towards the two minima of the potential between consecutive switches, cf. fig. 1.

Architectures and training
Auto-regressive model We first consider the arguably simplest machine-learning model that can be trained on sequential data, an auto-regressive model of order $W$, AR($W$), see fig. 1. The output $\hat{y} = (\hat{y}_t)$ of the model for the trap position given the sequence $x$ is given by

$$\hat{y}_t = \sigma\Big( \sum_{\tau=0}^{W-1} w_\tau\, x_{t-\tau} + b \Big), \qquad (3)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoidal activation function, and the weights $w_\tau$ and the bias $b$ are trained. Its basic structure (one layer of weights followed by a non-linear activation function) makes the AR model the seq2seq analogue of the famous perceptron, which has been the object of a large literature in the theory of neural networks focused on classification [8][9][10][12]. Note that the window size $W \geq 1$ governs the number of tokens accessible to the model. We do not pad the input sequence, so the output sequence is shorter than the input sequence by $W - 1$ steps.
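To make this concrete, here is a minimal sketch of the AR($W$) model, implemented as a single 1D convolution with one filter followed by a sigmoid; the use of PyTorch and all names are our own illustrative choices, not taken from the original implementation:

```python
import torch
import torch.nn as nn

class ARModel(nn.Module):
    """AR(W), sketching eq. (3): one 1D convolution (a single filter of size
    W, with bias) followed by a sigmoid. Input (batch, T) -> (batch, T-W+1)."""
    def __init__(self, window: int):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=1, out_channels=1,
                              kernel_size=window, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x.unsqueeze(1))       # add channel dim: (batch, 1, T)
        return torch.sigmoid(h).squeeze(1)  # (batch, T - W + 1)
```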

Convolutional two-layer network
The auto-regressive model can also be thought of as a single-layer 1D convolutional neural network [37,38] with a single filter and a kernel of size $W$ with sigmoid activation. We also consider the natural extension of this model, CNN($W$), a 1D convolutional neural network with two layers (see fig. 1). The first layer has the same kernel size ($W$), but contains $f$ filters with rectified linear activation [39]. In the second layer, we apply another convolution with a single filter and a kernel size of 1 with sigmoid activation. In this way, we can compare a CNN and an AR model whose "integration windows" have the same length $W$. This allows us to investigate the effect of the additional nonlinearities of the second layer of the CNN on the prediction error.
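A matching sketch of CNN($W$) under the same assumptions (PyTorch, our own naming):

```python
import torch
import torch.nn as nn

class CNNModel(nn.Module):
    """CNN(W): f parallel filters of size W with ReLU, combined by a
    kernel-size-1 convolution with sigmoid, as described above."""
    def __init__(self, window: int, filters: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, filters, kernel_size=window),  # f filters, kernel W
            nn.ReLU(),
            nn.Conv1d(filters, 1, kernel_size=1),       # combine the filters
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x.unsqueeze(1)).squeeze(1)  # (batch, T - W + 1)
```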

Recurrent neural network
We also apply a recurrent network to this task that takes $x_t$ as the input at step $t$, followed by a layer of Gated Recurrent Units [40] with a $d$-dimensional hidden state, followed by a fully connected layer with a single output and the sigmoid activation function (see fig. 1). We refer to this family of models as GRU($d$).
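And a corresponding sketch of GRU($d$), again assuming PyTorch:

```python
import torch
import torch.nn as nn

class GRUModel(nn.Module):
    """GRU(d): a GRU layer with a d-dimensional hidden state, followed by a
    fully connected read-out with sigmoid applied at every time step."""
    def __init__(self, hidden: int):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.readout = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.gru(x.unsqueeze(-1))  # hidden states: (batch, T, d)
        return torch.sigmoid(self.readout(h)).squeeze(-1)  # (batch, T)
```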
Generating the data set We train all the models on a training set with $N$ sequences $\{x^{(n)}\}_{n=1}^{N}$, which can in principle be of different lengths. To obtain a sequence, we first generate a trajectory $C$ for the latent variable by choosing an initial condition at random from $C_0 \in \{-1, 1\}$ and then drawing a sequence of waiting times from the distribution $\psi_k(\tau)$. Using the sampled trap positions $C$, we then sample the trajectory $X$ from eq. (1). Since in practice we do not have access to the full trajectory, as any experiment has a finite sampling rate, we subsample the process $X_t$ by taking every $s$th element of the sequence. That is, for a given sequence $(X'_t)_{t'=0}^{T'}$ of length $T'$, we construct a new sequence $x = (x_t)_{t=1}^{T}$, where $x_t = X'_{s \times t}$ and $T = \lfloor T'/s \rfloor$. The subsampled sequence $x$ is then used as an input sequence. We verified that the finite time step for the integration of the SDE and the subsampling preserve the statistics of the continuum description that we use to derive our theoretical results, cf. appendix B.2.2. For convenience, we also introduce $y = (1 + c)/2$ to shift the trap position to 0 or 1. We then use this subsampled and shifted sequence of trap positions as the target.
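A minimal sketch of this data-generation pipeline, assuming the gamma parametrisation of eq. (6) (shape $k$, mean 1) and Euler-Maruyama integration of eq. (1); the function name, defaults, and the choice to start the particle at the initial trap centre are ours:

```python
import numpy as np

def generate_sequence(T=30.0, dt=0.005, k=1.0, kappa=2.0, D=0.5, s=30,
                      rng=None):
    """Sample one (input, target) pair from the SSOU model: a gamma-switching
    trap trajectory c, an Euler-Maruyama integration of eq. (1) for x, and
    subsampling by a factor s. Targets are shifted to {0, 1} via y = (1+c)/2."""
    rng = rng or np.random.default_rng()
    n = int(T / dt)
    # latent trap: alternate between -1 and +1 with gamma waiting times
    c = np.empty(n)
    pos, i = rng.choice([-1.0, 1.0]), 0
    while i < n:
        wait = rng.gamma(shape=k, scale=1.0 / k)  # mean 1, variance 1/k
        steps = max(1, int(round(wait / dt)))
        c[i:i + steps] = pos
        pos, i = -pos, i + steps
    # Euler-Maruyama: dX = -kappa (X - C) dt + sqrt(2 D) dB
    x = np.empty(n)
    x[0] = c[0]  # assumption: start at the initial trap centre
    noise = np.sqrt(2.0 * D * dt) * rng.standard_normal(n - 1)
    for j in range(n - 1):
        x[j + 1] = x[j] - kappa * (x[j] - c[j]) * dt + noise[j]
    return x[::s], (1.0 + c[::s]) / 2.0  # subsampled input and target
```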
Training and evaluation We train the models by minimising the mean squared loss function using Adam [41] with standard hyperparameters. For each sample sequence, the loss is defined as

$$\mathcal{L}(\hat{y}, y) = \frac{1}{T - \tau_0 + 1} \sum_{t=\tau_0}^{T} (\hat{y}_t - y_t)^2, \qquad (4)$$

where the offset $\tau_0$ in the lower limit of the sum is necessary for the AR and CNN models, since their output has a different length than the input sequence. For those models we choose $\tau_0 = W - 1$.
For RNN models, however, we use $\tau_0 = 1$, as the input and output sequence lengths are the same. We assess the performance of the trained models by first thresholding their outputs $\hat{y}_t$ at 0.5 to obtain the sequence $\hat{c}_t \in \{\pm 1\}$, since the location of the centre of the potential has two possible values $\pm 1$. We then use the misclassification error

$$\epsilon = \frac{1}{N_\tau} \sum_t \delta(\hat{c}_t \neq c_t), \qquad (5)$$

where $N_\tau$ is the length of the test sequence and $\delta(x) = 1$ if the condition $x$ is true and 0 otherwise, as the figure of merit throughout this work. Unless otherwise specified, we used 50000 training samples with mini-batch size 32 to train the models, and we evaluated them over 10000 test samples. To consistently compare different models with different output lengths and to reduce boundary effects, we only consider the predictions after an initial time offset $\tau_h$.
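A sketch of this evaluation metric, assuming the predictions and targets have already been aligned to the same length (names are ours):

```python
import numpy as np

def misclassification_error(y_hat, y, tau_h=0):
    """Eq. (5): threshold the model output at 0.5, map back to {-1, +1},
    and count mismatches, discarding an initial offset tau_h."""
    c_hat = np.where(y_hat >= 0.5, 1.0, -1.0)  # thresholded prediction
    c_true = 2.0 * y - 1.0                     # targets back to {-1, +1}
    return float(np.mean(c_hat[tau_h:] != c_true[tau_h:]))
```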

The waiting time distribution of the trap controls the memory of the sequence
The key data property we would like to control is the memory of the sequence $X$. In our SSOU model, the memory is controlled by tuning the memory in the latent sequence $C$, which in turn depends on the distribution of the waiting time $\tau$. To fix ideas, we choose gamma-distributed waiting times,

$$\psi_k(\tau) = \frac{k^k\, \tau^{k-1}\, e^{-k\tau}}{\Gamma(k)}, \qquad (6)$$

where we set $k \geq 1$, while $\Gamma(k) = \int_0^\infty dx\, x^{k-1} e^{-x}$ denotes the Gamma function. Note that for any choice of $k$, the normalization condition $\int_0^\infty \psi_k(\tau)\, d\tau = 1$ holds, and the average waiting time between two consecutive switches is $\langle \tau \rangle_k = \int_0^\infty \tau\, \psi_k(\tau)\, d\tau = 1$. We can control the shape of the distribution by changing the value of $k$: the variance of the waiting time, for example, is $k$-dependent, $\langle \tau^2 \rangle_k - \langle \tau \rangle_k^2 = 1/k$ (see fig. 2). Furthermore, $k$ also controls the "degree of non-Markovianity" of the latent sequence $C$, as follows.

Figure 2: Quantifying the memory of non-Markovian input sequences. Left: waiting-time distribution $\psi_k(\tau)$ given by eq. (6) for three choices of $k$. Right: measure $M(t)$ (see eq. (8)) associated with the parameters in the left panel, which quantifies the memory of the past time $t_1$ in the sequence. We obtain a memoryless sequence $X_t$ with $M(t) = 0$ by choosing $k = 1$, which creates an exponential waiting-time distribution, eq. (6) (red curves). As the value of $k$ increases, the memory becomes stronger (blue and green curves). Parameters: $D = 0.5$, $\kappa = 2$, $\Omega_1 = \Omega_2 = [0.5, 1.5]$, $t_3 - t_2 = 0.5$, with $N = 10^5$ trajectories generated via the Euler-Maruyama method [42] with simulation time step $\Delta t = 5 \times 10^{-3}$.
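As a quick numerical check of these moments under the parametrisation of eq. (6) (a sketch; the seed and sample size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
for k in (1, 2, 5):
    # gamma with shape k and scale 1/k: mean 1 and variance 1/k, cf. eq. (6)
    tau = rng.gamma(shape=k, scale=1.0 / k, size=100_000)
    print(f"k={k}: mean={tau.mean():.3f}, variance={tau.var():.3f}",
          f"(expected 1 and {1.0 / k:.3f})")
```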
For the choice $k = 1$, the waiting-time distribution is exponential, making the process $C_t$ Markovian. This result, together with the fact that eq. (1) is linear, implies that the observable process $X_t$ is Markovian, too. Thus the probability distribution of the tokens $x_t$ obeys the Markov identity [34]

$$p_{1|2}(x_{t_3} | x_{t_2}; x_{t_1}) = p_{1|1}(x_{t_3} | x_{t_2}) \qquad (7)$$

for any $t_1 < t_2 < t_3$, which represent different instants in time; $p_{1|m}$ denotes the conditional probability density (at one time instant) with $m$ conditions (at $m$ previous time instants). Equation (7) is the mathematical expression of the intuitive argument that we gave earlier: in a Markovian process, the future state $x_{t_3}$ at some time $t_3$, given the state $x_{t_2}$ at some earlier time $t_2 < t_3$, is conditionally independent of the previous tokens $x_{t_1}$ for all times $t_1 < t_2$. However, most of the sequences analysed in machine learning are non-Markovian: in written language, for example, the next word does not just depend on the previous word, but instead on a large number of preceding words. We can generate non-Markovian sequences by choosing $k > 1$. To systematically investigate the impact of sequence memory on the learning task, it is crucial to quantify the degree of non-Markovianity of the sequence for a given value of $k$, beyond the binary distinction between Markovian and non-Markovian. Yet defining a practical measure that quantifies conclusively the degree of non-Markovianity is a non-trivial task [43,44] and a subject of ongoing research, mainly in the field of open quantum systems [45][46][47][48][49].
Here, we introduce a simple quantitative measure of the degree of non-Markovianity of the input sequence, motivated directly by the defining property of Markovianity given in eq. (7), which reads

$$M(t_1) = \langle x_{t_3} \mid x_{t_2} \in \Omega_2,\ x_{t_1} \in \Omega_1 \rangle - \langle x_{t_3} \mid x_{t_2} \in \Omega_2 \rangle. \qquad (8)$$

Equation (8) crucially involves conditional expectations, where the expectation of the future system state (at time $t_3$) is conditioned on the present state (at time $t_2$), or on the present and past states (at times $t_2$ and $t_1$). We have further introduced the state-space regions $\Omega_1 \subseteq \mathcal{X}$ and $\Omega_2 \subseteq \mathcal{X}$, which are subsets of the entire state space $\mathcal{X} = \mathbb{R}$ accessible to the process $X$. The measure $M$ defined in eq. (8) quantifies the drop in uncertainty about future mean values when knowledge about the past ($x_{t_1}$) is available in addition to knowledge of the present state. Because of noise, this generally depends on how far in the past the additional data point lies, $t_2 - t_1$; the decay of $M$ with $t_2 - t_1$ quantifies the "memory decay" with increasing elapsed time. From eq. (7) it directly follows that for any Markovian process the measure $M$ defined in eq. (8) vanishes at all times, whereas $M \neq 0$ reveals the presence of non-Markovianity in the form of memory of the past time $t_1$.

In a stationary process, $M$ generally depends on the two time differences $t_3 - t_2$ and $t_2 - t_1$, and on the choices of $\Omega_1$ and $\Omega_2$. To spot the non-Markovianity, $t_3 - t_2$ should be comparable to (or smaller than) the dynamical time scales of the process, which can be defined by the decay of correlation functions shown in fig. 6 in the appendix (if $t_3 - t_2$ is so large that $x_{t_3}$ and $x_{t_2}$ are fully uncorrelated, $M$ trivially vanishes even for non-Markovian processes). The choice of $\Omega_1$ and $\Omega_2$ is in principle arbitrary, but for practical purposes they should correspond to regions of the state space that are frequently visited.

We plot $M$ in fig. 2 for three different values of $k$, obtained from numerical simulations of the SSOU. Here, we fix $t_2$, $t_3$, $\Omega_1$, $\Omega_2$, and vary $t_1$. As expected, for a Markovian switching process with $k = 1$, $M(t_1)$ vanishes at all times $t_1$, while for $k > 1$, $M(t_1)$ takes non-zero values. The non-Markovianity measure $M(t_1)$ defined in eq. (8) captures different facets of non-Markovianity. On the one hand, its decay with $t_2 - t_1$ measures how far the memory reaches into the past. On the other hand, the magnitude of $M(t_1)$ tells us how much the predictability of the future given the present state profits from additionally knowing the past state at time $t_1$. For the SSOU model, we observe in fig. 2 that increasing $k$ increases both the decay time and the amplitude of $M$, showing that the parameter $k$ conclusively controls the degree of non-Markovianity. We further note that $M(t_1)$ displays oscillations for $k > 1$, reflecting the oscillatory behaviour of $C$ and $X$, and that $M(t_1)$ always decays to zero for sufficiently large values of $t_2 - t_1$, indicating the finite persistence time of the memory in the system. These observations are robust with respect to the details of the non-Markovianity measure: for example, we numerically confirmed that changing $t_3 - t_2$, or $\Omega_{1,2}$, leaves the essential features of $M$ and its dependence on $k$ unchanged.
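A minimal sketch of how $M(t_1)$ of eq. (8) can be estimated from an ensemble of simulated trajectories; the function and its arguments are our own illustrative choices, and we assume trajectories sampled on a regular time grid of step `dt` (e.g. generated as in the data-generation sketch above):

```python
import numpy as np

def estimate_M(trajs, dt, t1, t2, t3, omega1=(0.5, 1.5), omega2=(0.5, 1.5)):
    """Estimate eq. (8): the conditional mean of x_{t3} given x_{t2} in
    Omega_2 and x_{t1} in Omega_1, minus the mean conditioned on x_{t2} alone.
    trajs: array of shape (n_trajectories, n_time_steps)."""
    i1, i2, i3 = (int(round(t / dt)) for t in (t1, t2, t3))
    x1, x2, x3 = trajs[:, i1], trajs[:, i2], trajs[:, i3]
    in2 = (x2 >= omega2[0]) & (x2 <= omega2[1])
    in12 = in2 & (x1 >= omega1[0]) & (x1 <= omega1[1])
    return x3[in12].mean() - x3[in2].mean()
```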
In conclusion of this analysis, in the following we can simply use $k$ as the control parameter for the degree of non-Markovianity of $X$.

4.2 The performance of auto-regressive models

4.2.1 The interplay between two time scales determines prediction error
To gain some intuition, we first consider the simplest possible students, namely auto-regressive models AR(2) from eq. (3) with window size $W = 2$. The student predicts the value of $c_t$ using only information about the present and previous tokens $x_t$ and $x_{t-1}$, giving it access to the current particle position and allowing it, in principle, also to estimate the velocity of the particle. We show the performance of this model, obtained numerically, in fig. 3. The error of AR(2) varies significantly, from less than 5% to essentially random guessing at 50% error, as we vary the two time scales that influence the sequence statistics,

$$t_\kappa = \frac{1}{\kappa} \quad \text{and} \quad t_{\mathrm{diff}} = \frac{1}{2D}, \qquad (9)$$

the relaxation time and the diffusive time scale, respectively. The relaxation time $t_\kappa$ determines how quickly the average position of the particle is restored to the centre of the trap, relative to the average waiting time $\langle \tau \rangle_k = 1$. A faster relaxation means that the particle responds more quickly to a change in $c_t$, making reconstruction easier. The diffusive time scale $t_{\mathrm{diff}}$ sets the typical time the system needs to randomly diffuse a distance $\sim |C_0|$. The larger this time scale, the slower the purely random motion of the particle relative to the motion that follows the particle trap, hence improving the error.

A "phase diagram" for learnability
For memoryless sequences ($k = 1$), we can explain this performance more quantitatively by solving for the data distribution $p(x_t)$. Interestingly, we observe that along the "phase boundary" between unimodal and bimodal particle distributions $p(x_t)$, the error is roughly homogeneous and approximately equal to 25%. The presence of two "phases" in terms of the bimodality/unimodality of the distribution $p(x_t)$ is rationalised using recent analytical results for the loci of the phase boundary in the case $k = 1$ [31], given by eq. (10). This analytical result is obtained by finding the parameter values at which the derivative of $p(x_t)$ at $x_t = 0$ changes sign, which corresponds to a transition between unimodal and bimodal (see appendix A.1 for a detailed derivation). The line separating the two phases is drawn in black in the density plot of fig. 3. In particular, we observe that reconstructing the trap position is easier when the marginal distribution of the particle position $p(x_t)$ is bimodal, i.e. when it presents two well-defined peaks around the trap centres $\pm C_0$ (see fig. 3, top left). On the contrary, when the distribution $p(x_t)$ is unimodal (corresponding to slow relaxation, i.e. large relaxation times $t_\kappa$), the learning performance worsens, as the data is not sufficiently informative about the latent state $C$.
If we add memory to the sequence by choosing $k > 1$, an analytical solution for the phase boundary is challenging, as it requires knowledge of the steady-state distribution of the joint system. For $k = 1$, the system is Markovian and we can solve the Fokker-Planck equation for the steady-state distribution. For $k \neq 1$, the steady-state distribution is the solution of an integro-differential equation which is not solvable analytically. However, by means of numerical simulations we can verify whether the same mechanism (a transition from a unimodal to a bimodal particle distribution $p(x_t)$) also drives the error in the non-Markovian case. Interestingly, when increasing the degree of memory to $k = 5$ (top right of fig. 3), we find that the AR(2) model yields a larger error than for $k = 1$ Markovian sequences for all the parameter values explored in the learnability phase diagram. This means not only that AR(2) is unable to extract all the available information from the sequence, but also that the task has become harder with increasing $k$. A look at the density plots of the particle distribution $p(x_t)$ explains this finding: the transition to unimodality happens for a smaller value of the relaxation time $t_\kappa$ in the non-Markovian case (bottom right of fig. 3). We go into detail on how the degree of non-Markovianity makes the task harder by quantifying this transition in the probability distribution in the next section.
Further statistical characterisation of the error We can quantify the degree of bimodality of the particle distribution $p(x_t)$, which drives the error of the auto-regressive model, by using the Sarle coefficient [50],

$$\mathrm{Sa} = \frac{\sigma^2 + 1}{K}, \qquad (11)$$

with $\sigma$ and $K$ the skewness and kurtosis of $x_t$, respectively. The Sarle coefficient takes values from 0 to 1 and is equal to 5/9 for the uniform distribution; higher values may indicate bimodality. We can see a clear correlation between the Sarle coefficient and the error $\epsilon$ of the AR(2) model in the scatter plot on the left of fig. 4. Indeed, there appears to exist an upper bound on the error in terms of the Sarle coefficient. Another measure we can compute is the excess kurtosis, which for a zero-mean random variable reads

$$\mathrm{Kurt} = \frac{\langle x_t^4 \rangle}{\langle x_t^2 \rangle^2} - 3, \qquad (12)$$

where the average is over $p(x_t)$ and we have used the fact that $\langle x_t \rangle = 0$ by symmetry. The excess kurtosis vanishes for a Gaussian distribution and is positive or negative for non-Gaussian distributions, depending on the behaviour of the tails of the distribution (hence the name excess kurtosis). A negative value of the excess kurtosis indicates that the distribution is sub-Gaussian or platykurtic, i.e. having narrower tails than the Gaussian distribution. In our case, we obtain negative values of the excess kurtosis whose magnitude grows as the distribution becomes more bimodal. Furthermore, we find that the more negative the excess kurtosis, the lower the prediction error of the AR(2) model (right of fig. 4). Taken together, this analysis shows that the prediction error becomes large when the peaks of the distribution become less distinguishable and/or in the presence of fat tails.
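For concreteness, a sketch of how both bimodality measures can be computed from samples of $p(x_t)$, assuming SciPy's moment conventions:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def bimodality_measures(x):
    """Sarle coefficient, eq. (11), and excess kurtosis, eq. (12),
    estimated from samples x of the stationary distribution p(x_t)."""
    g = skew(x)
    K = kurtosis(x, fisher=False)       # plain (non-excess) kurtosis
    sarle = (g ** 2 + 1.0) / K          # eq. (11)
    excess = kurtosis(x, fisher=True)   # eq. (12): kurtosis minus 3
    return sarle, excess
```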

4.3 The interplay between sequence memory and model architecture

The statistical properties of the input sequence x with memory
We saw in fig. 3 that increasing the sequence memory by increasing the control parameter $k$ of the waiting-time distribution, eq. (6), makes the problem harder: the error of the same student increased. We can understand the root of this difficulty by studying the variance of the distribution $p(x_t, c_t)$. Using the tools introduced by Tucci et al. [31], we can calculate these correlations analytically for any integer $k$. As we describe in more detail in appendix A.2, we find expressions for $\langle x_t^2 \rangle$ and $\langle x_t c_t \rangle$, eq. (13), in terms of $Q_n \equiv -4(1 - q_n)/(k q_n)^2$ and $q_n \equiv 1 - e^{i\pi(1+2n)/k}$ with $n = 0, \ldots, k-1$; we recall that $\langle x_t \rangle = \langle c_t \rangle = 0$. We plot both correlation functions as we vary $k$ on the left of fig. 5. First, we note that the variance of the particle distribution decreases with $k$ (red line). In other words, as we increase the memory of the sequence, the particle spends on average more time around the origin, making reconstruction harder. This is also borne out by a decrease in the correlation between $x_t$ and $c_t$ (blue line). This loss of correlation is closely related to the error of the simplest reconstruction algorithm, in which we estimate the trap positions $c$ by thresholding the particle position, $\hat{c}_t = |C_0|\, \mathrm{sign}(x_t)$, where $|C_0|$ identifies the amplitude of the $C_t$ process. As shown by a light blue line in the right panel of fig. 5, the error of this parameter-free, memoryless algorithm increases monotonically with $k$.

Increasing the integration window of autoregressive and convolutional neural networks
The existence of memory in the sequences suggests choosing student architectures that can exploit these additional statistical dependencies. We first studied the performance of AR($W$) models with varying $W$. As shown in fig. 5, we observed that for a fixed $k$, increasing $W$ generally helps reduce the error $\epsilon$ (solid lines), so models with a larger integration window can take advantage of the additional temporal correlations in the sequence and make better predictions in the large-$k$ limit. For the convolutional networks described in section 3 with $f = 10$ filters (dashed lines), we observe that the additional filters and non-linearities of the CNN do not improve the predictions compared to the AR($W$) models. We conclude that it is truly the size of the integration window that makes a difference on this data set, and that even a simple AR model can take full advantage of the information in a given time window.

Recurrent neural networks
We numerically studied the performance of a gated recurrent unit GRU($d$) [40] with hidden state size $d$, which can be thought of as a slightly more streamlined variant of the classic long short-term memory [51]. Although such an RNN reads in the input sequence one token at a time, it can exploit long-range correlations by continuously updating its internal state $h \in \mathbb{R}^d$, cf. fig. 1. We show the test performance of GRU models with various internal state sizes $d$ in fig. 5. The GRUs display good performance: as soon as the GRU($d$) has a hidden state with dimension $d = 2$, it captures all the memory of the sequence at all $k$, achieving the limiting performance of AR($W$) models with the largest window size. Indeed, the error curves for GRUs with $d \geq 2$ all coincide. This observation, together with the limited performance gain from increasing $W$ in the AR($W$) and CNN($W$) models, suggests that the error is approaching the Bayes limit. Note that the rate of this convergence depends on $k$, since the size of the integration window required to achieve the optimal performance depends on the degree of non-Markovianity of the sequence (see fig. 5). While we focused on comparing the performance of the models here, it is important to accurately compare the computational cost of these models as well. For a detailed comparison of the complexity we refer the reader to appendix A.3.

Finally, we found that for all three architectures (AR, CNN and GRU) there is a peak in the reconstruction error, usually around $k \approx 5$. This peak can be understood by noting that the tokens $x_t$ become more concentrated around the origin as $k$ is increased, as we have shown in eq. (13); therefore, $x_t$ is less correlated with $c_t$ in the highly non-Markovian limit. However, as the degree of non-Markovianity is increased through increasing $k$, the history of the token positions $x_t$ can help with predicting the trap's position $c_t$ more accurately. The trade-off between these two phenomena is evident in the non-monotonic dependence of the error on $k$ for models with a larger integration window in fig. 5. For small $k$, the correlations between different times in the sequence are not strong enough to compensate for the loss of information about the trap's position in the signal, but for larger $k$ the trap's position is more regular and the history of $x_t$ can be used to infer $c_t$ more efficiently.

Concluding perspectives
We have applied the stochastically switching Ornstein-Uhlenbeck process (SSOU) to model seq2seq machine learning tasks. This model gives us precise control over the memory of the generated sequences, which we used to study the interaction between this memory and various neural network architectures trained on this data, ranging from simple auto-regressive models to convolutional and recurrent neural networks. We found that the accuracy of these models is governed by the interaction of the different time scales of the data, and we discovered an intriguing, non-trivial interplay between the architecture of the students and the sequence, which leads to non-monotonic error curves.
An important limitation of the SSOU model considered here is that it only allows one to study correlations that decay exponentially fast. As a next step, it would be interesting to explicitly consider the impact of long-range memory in the data sequence. This could be implemented in the SSOU model, e.g., by choosing a long-ranged waiting-time distribution, such as a power law. Another limitation concerns the linearity of the model analysed here. We speculate that a nonlinear model could provide insights into the different ways AR and CNN models utilise the data. In our model, one could implement a tunable nonlinearity by considering, e.g., a double-well potential with increasing amplitude of the potential barrier. Similarly, to enrich the statistical dependencies within the sequence, one could generate sequences that are non-symmetric under the sign inversion $x \to -x$ by implementing two different waiting-time distributions. It would also be interesting to analyse the impact of sequence memory on the dynamics of simple RNN models [52,53], such as those with low-rank connectivity [54,55].
Another exciting avenue would be to apply neural networks to noisy non-Markovian signals extracted from experiments in physical or biological systems [56][57][58][59][60][61][62]. Examples include the recent application of the SSOU to infer the mitigation of the effects of non-Markovian noise [63,64] and the underlying heat dissipation of spontaneous oscillations of the hair-cell bundles in the ear of the bullfrog [31]. Applying our techniques to such a biological system could be fruitful to decipher the hidden mechanisms and statistics of switching in hearing and to infer thermodynamic quantities beyond energy dissipation.

A.2 Correlation functions of the SSOU model

The autocorrelation functions of the processes $C_t$ and $X_t$ can be calculated explicitly (eqs. (A.5) and (A.6)), with $Q_n \equiv -4(1 - q_n)/(k q_n)^2$ and $q_n \equiv 1 - e^{i\pi(1+2n)/k}$ for $n = 0, \ldots, k-1$; in fig. 6, we compare eq. (A.6) with the numerical estimate of $C_X$ for three different choices of $k$. Note that the value of the sum $\sum_{n=0}^{k-1} Q_n = 1$ implies the correct initial condition $C_C(0) = C_0^2$. Moreover, one can deduce the expression of the stationary variance $\langle x_t^2 \rangle$ of the process $X_t$ from the initial value $C_X(0)$. We conclude this section with the value of the covariance $\langle x_t c_t \rangle$, which we derive using methods similar to those in ref. [31].

A.3 Complexity of the models
In this section we analyse the size and computational complexity of the models used in this study. We note that the exact complexity depends on the specific implementation of the models. Therefore, we first analyse the scaling of the model complexity with respect to the number of the model's memory units, and then characterise the number of floating-point operations (FLOPs) in our particular implementation.
In this scaling analysis we do not consider the scaling with the sequence length, since it is the same for all of the models. The auto-regressive (AR) models we examine feature a 1D kernel of size $W$; both the model's parameter count and the computational complexity of its forward pass increase linearly with $W$. Similarly, the 1D convolutional neural networks (CNN) comprise $f$ kernels of size $W$; for these models, the number of parameters and the computational cost scale primarily as $Wf$. The gated recurrent units (GRU) have a hidden state of size $d$; the number of parameters and the computational cost of these models scale with $d^2$.
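These scalings can be checked directly by counting parameters; a sketch reusing the illustrative PyTorch classes (ARModel, CNNModel, GRUModel) from the architecture sketches above:

```python
# Parameter counts for the sketch models defined earlier:
# AR(W) has W + 1 parameters; CNN(W, f) has f(W + 1) + (f + 1);
# a one-layer GRU(d) on scalar inputs has 3d^2 + 9d, plus a (d + 1) read-out.
for W in (2, 5, 10):
    n_ar = sum(p.numel() for p in ARModel(W).parameters())
    n_cnn = sum(p.numel() for p in CNNModel(W, filters=10).parameters())
    print(f"W={W:2d}: AR {n_ar:3d} parameters, CNN {n_cnn:4d} parameters")
for d in (1, 2, 4):
    n_gru = sum(p.numel() for p in GRUModel(d).parameters())
    print(f"d={d}: GRU {n_gru} parameters")
```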
We employ software profiling to accurately compare the size and computational cost of the models used to produce fig. 5. As demonstrated in fig. 7, our analysis agrees with the numerical results: the AR models exhibit the smallest size and computational cost, followed by the CNNs, and finally the RNNs. Additionally, our simple analysis correctly predicts the linear scaling of the cost with the number of parameters.

B Additional experimental results
We use a fixed number of 40 epochs and do not perform any hyperparameter tuning (with the exception of the GRU(1) model, as detailed in appendix B.4), as the models are rather simple and varying these parameters has negligible effects on the results.

B.1 Additional details on creating the figures
In this section, we provide additional details on the generation of some of the figures of the main text.

B.2 Ensuring the consistency between continuous-time theory and sampled sequences
Here we discuss how we ensured that our theory, which is based on the continuous-time description of the stochastic processes $X_t$ and $C_t$ (cf. section 2), and the sequences we feed into the machine learning models are consistent. If the sequence is sampled with a very small time step, the student almost never sees a jump of the trap, and effectively samples from an (equilibrium) distribution in one trap. On the other hand, in the limit of very large time steps, the temporal correlations in the sequence are not visible; recall, e.g., that the non-Markovianity measure eventually decays to zero for infinitely large time differences. Thus, it is crucial to sample with a time step that is smaller than, but comparable to, the characteristic time scales of the data sequence. In practice, one would thus need to estimate those time scales in a preceding data-analysis step, before feeding the data into the network. For this purpose, a suitable method is to compute the autocorrelation function of the input sequence data, which reveals the time scales. For the SSOU model, we provide the analytical expression for the correlation function in eq. (13). For the SSOU, the characteristic time scales are given by the oscillation period and the decay time of the autocorrelation function, which are of order of magnitude 1 for our parameter choice (see fig. 6).
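A sketch of such a preceding analysis step: estimating the empirical autocorrelation of an input sequence to read off its decay time and oscillation period (the function is our own illustration):

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Empirical autocorrelation of a stationary sequence x; its decay time
    and oscillation period set the scales the sampling step must resolve."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    acf = [np.mean(x[:len(x) - lag] * x[lag:]) for lag in range(max_lag)]
    return np.array(acf) / np.var(x)
```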

B.2.1 Correlations function and integration time step
To generate trajectories with the correct statistics, we vary the time step $\Delta t$, which corresponds to the discrete version of $dt$ in the numerical integration of eq. (1). We find that $\Delta t = 0.005$ reproduces the correct statistics in fig. 6.

B.2.2 The role of subsampling
As mentioned earlier, the trajectories are first generated by numerically integrating eq. (1) with a sufficiently small $dt$ such that the statistical properties of the trajectories match their theoretical values, which we verified by checking that the numerically estimated correlation functions match our theoretical result (see fig. 6).
As we discuss in the main text, we subsample the full trajectories before training the machine learning models. This subsampling also simplifies the optimization and accelerates the training. However, in doing so, we need to make sure that the qualitative properties of the model remain unchanged. Therefore, we compare AR models trained on the original trajectories with those trained on subsampled trajectories. We observe that the key properties of the problem, namely the strictly increasing error with $k$ in the memoryless setting ($W = 1$) and the non-monotonic behaviour of the errors due to the trade-off between the memory ($W$) and the non-Markovianity ($k$) in more complicated models, are preserved.
As mentioned in the main text, we subsample the sequences by taking every $s$th element of the original sequence in the dataset when training the models in section 4.3. To investigate the effect of this subsampling, we compare AR models trained on the original dataset with those trained on subsampled sequences with $s = 30$, and show the results in fig. 8. The qualitative features of the error curves as a function of the non-Markovianity $k$ are preserved. The monotonicity of the errors for memoryless models and the trade-off between the improvements brought by memory and the difficulty of correlating the particle's position $x_t$ with the trap's position $c_t$, the two main features of fig. 5, are present in both the original and the subsampled cases.

B.3 The thresholding algorithm
As a baseline benchmark, we consider a simple thresholding algorithm, where $\hat{c}_t = \mathrm{sgn}(x_t)$. Put simply, this algorithm infers the potential's location by considering only the position of the particle at a single point $x_t$ and choosing the configuration of the potential $c_t$ closest to that position. In fig. 9 we compare the performance of the AR(1) model with the thresholding algorithm and observe that AR(1) matches the performance of the thresholding algorithm.

B.4 Selecting the best GRU(1) model
To find the best-performing GRU(1) model, for each value of $k$ shown in fig. 5 we train five different instances of the network, randomly initialized as suggested in Ref. [65]. We choose the best instance according to its performance on a cross-validation set (see fig. 10).

Figure 3: Performance of auto-regressive AR(2) models for Markovian and non-Markovian sequences of the SSOU model across its "learnability phase diagram". Reconstruction error of the AR(2) model (in % of incorrectly predicted trap positions) for different values of the diffusive time scale $t_{\mathrm{diff}} = 1/(2D)$ and the relaxation time $t_\kappa = 1/\kappa$, see eq. (9). We show these "phase diagrams" for (Left) Markovian sequences ($k = 1$) and (Right) non-Markovian sequences ($k = 5$). In these diagrams, the two axes represent ratios of time scales, $t_{\mathrm{diff}}/t_\kappa$ (vertical axis) vs $t_\kappa/\langle \tau \rangle_k = 1/\kappa$ (horizontal axis), since $\langle \tau \rangle_k = 1$ for all $k$. The bottom row depicts the distribution of the particle's position $p(x_t)$ along the dashed line superimposed on the error heat map. The density shows a clear transition from a bimodal distribution to a unimodal one. Parameters: total simulation time for each parameter value $\tau = 30$, simulation time step $\Delta t = 0.01$, AR($W$) window size $W = 2$, $\kappa$ varying from 0.1 to 0.6, $\tau_h = 2$, remaining training parameters as in fig. 1.

Figure 4: Statistical characterisation of the error. (Left) Scatter plot of the error of several AR(2) models trained on Markovian sequences versus the Sarle coefficient, a measure of the bimodality of the distribution $p(x_t)$, eq. (11). (Right) Scatter plot of the error of several AR(2) models trained on Markovian sequences versus the excess kurtosis, eq. (12), another measure of the bimodality of the distribution. Parameters: similar to those used in the right panel of fig. 3.

Figure 5: The impact of sequence memory on recurrent and convolutional neural networks. (Left) Analytical predictions for the covariance of the particle and trap position, eq. (13), in the non-Markovian case as a function of $k$. As we increase the non-Markovianity, correlations between $x_t$ and $c_t$ decay, complicating the reconstruction task. (Right) Prediction error $\epsilon$ for students with different architectures: auto-regressive (AR), convolutional (CNN) and gated recurrent neural networks (GRU). The "memory units" of the architectures correspond to the size of the kernel ($W$) in the AR and the first layer of CNN models, and the number of units ($d$) in the hidden state of GRU models. The CNN models all have $f = 10$ filters in their first layer. Parameters: similar to fig. 3, with $\kappa = 2$.

Figure 6: Autocorrelation function of the process $C_t$. We compare the analytical formula of $C_X(t)$ in eq. (A.5) (solid line) with its numerical estimate (circles), calculated by generating $N = 5 \times 10^4$ trajectories via the Euler-Maruyama method with $\Delta t = 0.01$, $\kappa = 2$, $D = 0.5$ and various values of $k$.

Figure 7: Complexity of the models. We analyze the size and computational complexity of the models used to generate fig. 5, i.e., auto-regressive (AR), convolutional (CNN), and gated recurrent neural network (GRU) architectures. Computational complexity is measured in terms of floating-point operations (FLOPs). The "memory units" in these architectures correspond to the kernel size ($W$) in AR and the GRU hidden state size ($d$). All CNN models employ 10 filters in their first layer.

Figure 10: Validating the GRU(1) model. The accuracies of the best-performing GRU(1) models on the cross-validation and test sets match.