Accuracy vs memory advantage in the quantum simulation of stochastic processes

Many inference scenarios rely on extracting relevant information from known data in order to make future predictions. When the underlying stochastic process satisfies certain assumptions, there is a direct mapping between its exact classical and quantum simulators, with the latter asymptotically using less memory. Here we focus on studying whether such quantum advantage persists when those assumptions are not satisfied, and the model is doomed to have imperfect accuracy. By studying the trade-off between accuracy and memory requirements, we show that quantum models can reach the same accuracy with less memory, or alternatively, better accuracy with the same memory. Finally, we discuss the implications of this result for learning tasks.


I. INTRODUCTION
The ability to learn from experience and to make predictions about possible future outcomes is crucial in all quantitative sciences. Since humans and machines have a limited amount of memory, learning complex processes requires distilling and storing only the information from the training data that is relevant for making predictions. For temporal data, the current state of the art in classical machine learning is based on the transformer architecture [1], which uses the self-attention mechanism to dynamically focus on what is relevant in the data stream. Such an architecture is the culmination of a series of tweaks performed by the machine learning community over the last decades to solve practical problems, making it difficult to extract the underlying mathematical principles, in spite of some progress [2]. On the other hand, models of stochastic processes called ε-machines have been developed on rigorous mathematical grounds to formally define what past information a learner needs to store in order to predict future outcomes [3][4][5]. However, numerically fitting such a model from data is more complicated [6].
From a quantum information perspective, whenever a stochastic process can be exactly expressed as an ε-machine with a finite amount of memory, it has been shown that a unitary quantum simulator is capable of exactly simulating the same process while asymptotically using less memory [7][8][9][10]. In other terms, a quantum advantage in memory use can be achieved by using quantum states and probability amplitudes, which provide other sources of stochasticity when the memory states are not orthogonal. Moreover, this advantage has also been observed in tensor network simulations [11][12][13].
* leonardo.banchi@unifi.it

In this work we study whether the quantum advantage persists when we relax the requirement of exact simulation. Real-world datasets, in general, cannot be exactly modelled as an ε-machine with a given amount of memory. Therefore, it is unclear whether we can formally expect quantum advantage with them. Nonetheless, starting from a generic stochastic process, it is reasonable to assume that there exists a larger ε-machine, possibly with infinite memory, capable of exactly modelling the data. The latter can be exactly expressed as a unitary quantum simulator, which typically shows a memory advantage and never requires more memory [9]. Does the advantage persist in constrained memory scenarios? If this were the case, quantum simulators of real-world stochastic processes could reach the same accuracy as classical ones with less memory, or, alternatively, achieve better accuracies with the same amount of memory. Moreover, since the information contained in the quantum memory constrains the generalization error [14,15], it is tempting to expect that a memory advantage results in the ability to learn the model with less data. Motivated by the above questions, and given the difficulty in extracting general predictions from real-world data, we focus on toy problems that are easier to interpret and model. Instead of the mentioned bottom-up construction, where, given a real-world problem and a learner with a constrained memory, we may assume that there exists an abstract ε-machine in a larger space, here we consider a top-down approach: we start from stochastic processes that can be expressed as an ε-machine, and apply quantum or classical compression methods to reduce the memory, at the cost of losing accuracy in future predictions. Since compression typically destroys the ε-machine structure, we lose the direct mapping between the classical and quantum simulators, which may then display different accuracies. We introduce different compression methods,
adapted from the tensor network literature, and discuss several figures of merit to investigate the trade-off between simulation accuracy and memory use, finding that the quantum advantage persists even in constrained memory situations.
Finally, we consider whether the quantum advantage found in the top-down modelling results in a similar advantage in the bottom-up approach, which is closer to real-world scenarios. We train a model with a constrained memory via maximum likelihood and find that quantum models have higher accuracies.

FIG. 1. Pictorial representation of the evolution for the first two time steps. Both the classical and the quantum simulators use two registers: an outcome register for the emitted symbols x_t, and a memory register that keeps a compressed representation of the history of previous interactions and outcomes. At each time step, the outcome register is first reset to a reference value, then it interacts with the memory (orange box), and finally it is measured to observe the classical outcome x_t. For classical simulators, the orange box is mathematically modelled as a transition probability, Eq. (1), while for quantum simulators the orange box models a unitary operation, as in Eq. (8), followed by a projective measurement to extract x_t.

II. NOTATION AND BACKGROUND
We consider systems described by random variables X_t, for t ∈ Z, which can take discrete values x_t ∈ {1, ..., d} for some positive integer d. For clarity, we will refer to t as a time index, so the x_t's describe outcomes at different times, but the same formalism can model spatial correlations in one-dimensional systems [4,16]. We assume that correlations between outcomes at different times (possibly far apart) can be completely described via an auxiliary memory state, which is another time-dependent random variable S_t with outcomes s_t ∈ {1, ..., D} for some discrete memory dimension D.
The model works as follows: starting from an initial memory state distribution p^i_α = P(s_0 = α), the model emits the first outcome x_1 and then updates the internal memory state to a new value. This process is repeated at multiple times and an outcome sequence x_1, x_2, ... is generated. At a general time t, the model only knows the previous memory state and not the entire history of outcomes and memory states. The transition from the previous memory state s_{t−1} to the outcome x_t and the new state s_t is described by the transition probability tensor

T^x_{α,β} = P(x_t = x, s_t = α | s_{t−1} = β),    (1)

where Latin and Greek indices respectively run from 1 to d and from 1 to D. The above equation defines conditional probabilities, so in particular Σ_{x,α} T^x_{α,β} = 1 for each β. See Fig. 1 for a pictorial representation of the temporal process.
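As a concrete illustration of Eq. (1), an edge-emitting HMM can be sampled by repeatedly drawing the pair (outcome, next memory state) from the column of T selected by the current memory state. The following is a minimal numpy sketch; the two-outcome, two-state tensor at the end is a hypothetical toy example, not a process studied in this paper.

```python
import numpy as np

def sample_hmm(T, p0, L, rng=None):
    """Sample a length-L outcome sequence from an edge-emitting HMM.

    T has shape (d, D, D) with T[x, a, b] = P(x_t = x, s_t = a | s_{t-1} = b),
    so T summed over x and a gives 1 for every previous state b (Eq. (1)).
    """
    rng = rng or np.random.default_rng(0)
    d, D, _ = T.shape
    s = rng.choice(D, p=p0)                 # draw the initial memory state
    outcomes = []
    for _ in range(L):
        joint = T[:, :, s].ravel()          # joint law of (outcome, next state)
        k = rng.choice(d * D, p=joint)
        x, s = divmod(k, D)                 # recover outcome and new memory state
        outcomes.append(int(x))
    return outcomes

# hypothetical toy process: 2 outcomes, 2 memory states
T = np.zeros((2, 2, 2))
T[0, 0, 0], T[1, 1, 0] = 0.7, 0.3           # transitions out of state 0
T[0, 0, 1], T[1, 1, 1] = 0.4, 0.6           # transitions out of state 1
xs = sample_hmm(T, np.array([1.0, 0.0]), 20)
```

Note that the sampler only ever conditions on the current memory state, which is exactly the Markov assumption on the pair (x_t, s_t) discussed below.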
The main assumption in Eq. (1) is that T is independent of t, namely that, at each time, knowledge of the memory state is sufficient to predict the outcomes. Models with these assumptions are called Hidden Markov Models (HMMs), see e.g. [17], since the evolution of the memory state at time t only depends on the memory state at time t−1 (and on the outcome x_t). In spite of the name, the distribution of the observed outcomes may be highly non-Markovian, due to the hidden memory states.
Models described by (1) are called "edge-emitting" HMMs [18], while it is customary for standard HMMs to satisfy another property, namely that the state s_t is independent of the outcome x_t. For such processes, T factorises as T^x_{αβ} = J_{αβ} E^x_β, namely as a product of the transition matrix J_{αβ} = Σ_x T^x_{αβ} and an emission matrix E^x_β = P(x_t = x | s_{t−1} = β) = Σ_α T^x_{αβ}. Such a decomposition is particularly important in machine learning applications, since the E and J matrices can be reconstructed from data using the Expectation-Maximization (EM) algorithm [17]. In this paper, we will consider the more general definition from Eq. (1).
Ignoring the internal memory, the probability of observing the data sequence {x_1, ..., x_L} is

P(x_1, ..., x_L) = Σ_{α_0,...,α_L} T^{x_L}_{α_L,α_{L−1}} ··· T^{x_1}_{α_1,α_0} p^i_{α_0},    (2)

which takes the matrix product form

P(x_1, ..., x_L) = Σ_α (T^{x_L} ··· T^{x_1} p^i)_α,    (3)

where each T^x is the D × D matrix with indices as in Eq. (1), and (v)_α is the αth component of the vector v.
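The matrix product form can be evaluated by contracting one T^x matrix per observed symbol, at a cost linear in L rather than exponential. A minimal numpy sketch, with a hypothetical toy tensor for illustration:

```python
import numpy as np

def sequence_probability(T, p0, xs):
    """P(x_1, ..., x_L) from the matrix product form: apply one D x D
    matrix T^x per symbol to the initial memory distribution, then sum
    over the final memory index."""
    v = np.asarray(p0, dtype=float)
    for x in xs:
        v = T[x] @ v
    return float(v.sum())

# hypothetical toy tensor: 2 outcomes, 2 memory states
T = np.zeros((2, 2, 2))
T[0, 0, 0], T[1, 1, 0] = 0.7, 0.3
T[0, 0, 1], T[1, 1, 1] = 0.4, 0.6
p0 = np.array([1.0, 0.0])

# probabilities of all length-3 strings sum to one
total = sum(sequence_probability(T, p0, [a, b, c])
            for a in range(2) for b in range(2) for c in range(2))
```

Since Σ_{x,α} T^x_{α,β} = 1 for every β, the probabilities of all sequences of a given length sum to one, which is a useful sanity check on any implementation.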
The initial memory state may be known from first principles or may be reconstructed from previous data. Suppose for instance that the data sequence can be split into an observed past {x_t}_{t≤0} and an unknown future x_1, ..., x_L. For a given transition tensor T, the initial memory probability can be reconstructed from the rules of conditional probability,

P(x_1, ..., x_L | past) = P(past, x_1, ..., x_L) / P(past).    (4)

Indeed, using the matrix product form (3), it is simple to realize that the exact future distribution given the past can be obtained by replacing p^i in Eq. (3) with an initial probability p^{i|past} that, up to a normalization factor, can be reconstructed as

p^{i|past} ∝ T^{x_0} T^{x_{−1}} ··· p^{i_0},    (5)

where p^{i_0} is the initial probability of the first observed event in the past. Without prior knowledge, p^{i_0} can be taken to be uniform. The average state E_past[p^{i|past}], in the limit of many past observations, converges to the steady state π of the transition matrix J,

J π = π,    (6)

namely the right-eigenvector of J with largest eigenvalue; for simplicity we assume that it is unique.
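Both reconstructions described here are a few lines of numpy: the steady state π of Eq. (6) is the eigenvector of J with eigenvalue 1, and the conditional initial state p^{i|past} of Eq. (5) is obtained by propagating a prior through the observed past and renormalizing. The toy tensor below is a hypothetical example.

```python
import numpy as np

def steady_state(J):
    """Steady state of the transition matrix J (Eq. (6)): the right
    eigenvector with eigenvalue 1, normalized to a probability vector."""
    w, V = np.linalg.eig(J)
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return v / v.sum()

def conditional_memory(T, past, prior):
    """p^{i|past}: propagate a prior guess through the observed past
    outcomes with the matrices T^x and renormalize (Eq. (5))."""
    v = np.asarray(prior, dtype=float)
    for x in past:
        v = T[x] @ v
    return v / v.sum()

# hypothetical toy tensor: 2 outcomes, 2 memory states
T = np.zeros((2, 2, 2))
T[0, 0, 0], T[1, 1, 0] = 0.7, 0.3
T[0, 0, 1], T[1, 1, 1] = 0.4, 0.6
J = T.sum(axis=0)                        # J_{ab} = sum_x T^x_{ab}
pi = steady_state(J)
p_cond = conditional_memory(T, [0, 0, 1], np.full(2, 0.5))
```

In this toy example the last observed symbol fully reveals the memory state, so p^{i|past} collapses onto a single state regardless of the prior.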
If we are interested in predicting future observations from a known past, we should keep the information from past observations that is relevant for predicting the future, and nothing more. In the asymptotic sense, the information about the past can be quantified by the Shannon entropy H(past) = −Σ_α π_α log_2 π_α of the steady state (6), while the relevant information for predicting the future can be quantified by the mutual information I(future; past) or by the conditional entropy H(future|past) = H(future) − I(future; past). The provably optimal model [3,4] in this information-theoretic sense, dubbed ε-machine, is the one that minimises the stored entropy H(past) under the constraint that the memory is as predictive as the entire past, so that no predictive information is lost. The memory compression in ε-machines is based on the intuition that we should not distinguish different pasts if they produce the same future. Mathematically this is performed by introducing an equivalence relation ∼ such that two different past observations x_past and x′_past belong to the same equivalence class, namely x_past ∼ x′_past, if P(future|x_past) = P(future|x′_past) for arbitrary future observations. The equivalence class is the only information we need to save in memory to predict the future. For complicated stochastic processes, the memory requirement may be infinite, though in this paper we only consider processes with finite memory. Numerically, approximate ε-machines can be reconstructed from data using the Causal-State Splitting Reconstruction algorithm [6].
The mapping ε(x_past) from past observations to the corresponding equivalence class is deterministic. Hence, upon emitting a new observation x_{t+1}, the new memory state is deterministically obtained from the previous memory state and x_{t+1}. In other terms, for ε-machines the transition probability (1) takes the form

T^x_{α,β} = P(x_t = x | s_{t−1} = β) δ_{α,♯(x,β)},    (7)

where ♯(x, β) is a function mapping the previous memory state β and the emitted outcome x into a new memory state, and δ is the Kronecker delta. HMMs satisfying the above property are called unifilar [18].

A. Quantum Simulation
Quantum simulators of HMMs have been proposed, both for unifilar [7,12] and non-unifilar [9] processes. In both cases, the mapping exploits a quantum memory register, indexed by some normalized but non-orthogonal vectors {|σ_α⟩} in place of the classical memory, where α = 1, ..., D as before. In the unifilar case, the unitary simulator of an HMM with transition tensor T is defined as

U |0, σ_β⟩ = Σ_x √(P(x_t = x | s_{t−1} = β)) |x, σ_{♯(x,β)}⟩,    (8)

where the outcome register |x⟩ is initialized in a reference state (say |0⟩); see also Fig. 1. By iteratively applying such an operator U L times, each time measuring the outcome register and reinitializing it in |0⟩, we get a state that acts on L tensor copies of the outcome Hilbert space and a single memory space. More precisely, we can expand

|ψ⟩ = Σ_{x_1,...,x_L} |x_L, ..., x_1⟩ ⊗ A^{x_L} ··· A^{x_1} |σ⟩,    (9)

where we have defined

A^{x_L} ··· A^{x_1} |σ⟩ = Σ_{α_0,...,α_L} A^{x_L}_{α_L,α_{L−1}} ··· A^{x_1}_{α_1,α_0} c_{α_0} |σ_{α_L}⟩,   A^x_{α,β} = δ_{α,♯(x,β)} √(P(x_t = x | s_{t−1} = β)).    (10)

In the above equations we have assumed an initial memory state |σ⟩ = Σ_α c_α |σ_α⟩ and introduced the norm in the non-orthogonal basis,

∥Σ_α v_α |σ_α⟩∥²_σ = Σ_{α,β} v*_α ⟨σ_α|σ_β⟩ v_β,    (11)

so that the probability of a data sequence reads P_A(x_1, ..., x_L) = ∥A^{x_L} ··· A^{x_1} |σ⟩∥²_σ. For unifilar processes, after tracing out the memory index, the probability that we get from Eq. (11) is equal to the probability of the data sequence (3); in other terms, P(x_1, ..., x_L) = P_A(x_1, ..., x_L). This is a consequence of Eq. (7), namely that, for a given input memory state and emitted outcome, the next memory state is deterministic. Accordingly, the sums over the memory indices in Eqs. (2) and (10) are removed, as there is only one possible string of {α_t}, and after taking the square in Eq. (11), for a suitable input state, we recover exactly the expression of Eq. (2).
In order to define the mathematical properties of quantum simulators, it is more convenient to normalize the tensors in Eq. (10) so that they define normalized Kraus operators [19]. Let G_{αβ} = ⟨σ_α|σ_β⟩ be the Gram matrix of the memory states. From the unitarity of the operator U in Eq. (8) we get G_{αβ} = ⟨0, σ_α|U†U|0, σ_β⟩, which leads to the following equation,

G = Σ_x A^{x†} G A^x,    (12)

namely G is the fixed point of the map

E†_A[Y] = Σ_x A^{x†} Y A^x.    (13)

Since Gram matrices are positive, we may decompose them as G = W†W, where W_{αβ} = ⟨α|σ_β⟩ and the |α⟩ are orthonormal vectors. Accordingly, the operators K^x = W A^x W^{−1} define Kraus operators of a completely positive map, since from Eq. (12) we get Σ_x K^{x†} K^x = 1. Moreover, since in matrix product states the actions of W and W^{−1} cancel out, the probability of outcomes is the same. More precisely, using W we may write the norm in Eq. (11) as ∥|ψ⟩∥²_σ = ∥W|ψ⟩∥²_2 = ⟨ψ|G|ψ⟩. Accordingly, setting |ϕ⟩ = W^{−1}|σ⟩/∥W^{−1}|σ⟩∥ and defining

P_K(x_1, ..., x_L) = ∥K^{x_L} ··· K^{x_1} |ϕ⟩∥²_2,    (14)

we get P_K(x_1, ..., x_L) = P_A(x_1, ..., x_L). The above evolution can be generally modelled as in Fig. 1 with unitary interactions between the memory and outcome registers. A more general case is discussed in Appendix A.
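The normalization step can be sketched numerically: find the fixed point G of the map in Eq. (13) by power iteration, decompose G = W†W, and conjugate the tensors by W. The sketch below works for a generic (even unnormalized) tensor B by also dividing by the leading eigenvalue μ, in the spirit of the mpsnormalize routine used later in the paper; the random tensor is a hypothetical example.

```python
import numpy as np

def normalize_kraus(B, iters=2000):
    """Normalize a generic tensor B of shape (d, D, D) into Kraus operators
    with sum_x K^x† K^x = 1, via the fixed point G of Y -> sum_x B^x† Y B^x
    (power iteration), its decomposition G = W† W, and K^x = W B^x W^-1."""
    d, D, _ = B.shape
    G = np.eye(D, dtype=complex)
    mu = 1.0
    for _ in range(iters):                        # power iteration on the map
        G = sum(B[x].conj().T @ G @ B[x] for x in range(d))
        mu = np.linalg.norm(G)                    # converges to the eigenvalue
        G /= mu
    s, U = np.linalg.eigh((G + G.conj().T) / 2)   # G = U s U†, s >= 0
    W = np.diag(np.sqrt(np.abs(s))) @ U.conj().T  # so that G = W† W
    K = np.array([W @ B[x] @ np.linalg.inv(W) for x in range(d)]) / np.sqrt(mu)
    return K

rng = np.random.default_rng(1)
B = rng.normal(size=(2, 3, 3))                    # hypothetical unnormalized tensor
K = normalize_kraus(B)
completeness = sum(k.conj().T @ k for k in K)     # should equal the identity
```

The identity Σ_x K^{x†}K^x = 1 follows algebraically from the fixed-point property of G, so the final check holds regardless of the starting tensor (as long as G is full rank).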
We now focus on the memory state, given that we have obtained some observations {x_t}. From Eq. (9), given a trial initial state |ϕ⟩, we see that this is given by

|ϕ_{x_1:L}⟩ = K^{x_L} ··· K^{x_1} |ϕ⟩ / √(P_K(x_1, ..., x_L)),    (15)

appearing with probability P_K(x_1, ..., x_L). The average memory state after L observations is then

ρ_L = Σ_{x_1,...,x_L} P_K(x_1, ..., x_L) |ϕ_{x_1:L}⟩⟨ϕ_{x_1:L}| = E_K^L[|ϕ⟩⟨ϕ|] → Π,    (16)

where Π is the fixed point (steady state) of the map

E_K[Y] = Σ_x K^x Y K^{x†}.    (17)

This state can be obtained from the steady state (6) of the classical model by noting that, from the definition of A and Eq. (6), we get Σ_x A^x diag(π) A^{x†} = diag(π). This may in turn be rewritten as

Σ_x K^x [W diag(π) W†] K^{x†} = W diag(π) W†,    (18)

so the steady state of the map E_K takes the form

Π = W diag(π) W†.    (19)

Note that, since W is not a unitary matrix, the classical steady state does not define the eigenvalues of the quantum steady state in (19). This is the main difference responsible for the memory advantage in the quantum simulation of classical stochastic processes [7,12]. Indeed, since the entropy of a mixture of pure quantum states is smaller than the Shannon entropy of the mixing distribution,

H(Π) ≤ H(π),    (20)

where H(Π) = −Tr[Π log_2 Π] is the von Neumann entropy of the steady state. Equality in (20) arises only when the states |σ_α⟩ are orthonormal [19]. When this is not the case, quantum simulators can use significantly less memory to model the same classical stochastic process with exact accuracy [9,12]. The key intuition is that ε-machines define in many cases an irreversible process and, in many standard examples, H(π) is strictly greater than I(past; future), which quantifies the ultimate amount of needed information. Improvements in the quantum simulation arise from addressing the source of irreversibility within quantum dynamics [7].
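The inequality (20) is easy to check numerically: build Π = W diag(π) W† from a pair of non-orthogonal memory states and compare its von Neumann entropy with the Shannon entropy of π. The states and distribution below are hypothetical illustrative choices.

```python
import numpy as np

def shannon(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 1e-15]
    return float(-np.sum(p * np.log2(p)))

def von_neumann(rho):
    return shannon(np.linalg.eigvalsh(rho))

# hypothetical pair of normalized but non-orthogonal memory states
theta = 0.3
sigma = np.array([[1.0, np.cos(theta)],
                  [0.0, np.sin(theta)]])        # columns are |sigma_alpha>
pi = np.array([0.5, 0.5])                       # classical steady state
Pi = sigma @ np.diag(pi) @ sigma.T              # quantum steady state (Eq. (19))

Hq = von_neumann(Pi)                            # von Neumann entropy H(Pi)
Hc = shannon(pi)                                # Shannon entropy H(pi)
```

The closer the memory states are to each other, the larger the gap between the two entropies, which is precisely the origin of the quantum memory advantage.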
Beyond the unifilar case, unitary simulation of an HMM with transition tensor T is possible by using matrix product density operators, as shown in Appendix B. In this work we only consider quantum simulators based on matrix product states, where the formal mapping between classical and quantum simulators is valid for unifilar HMMs only. Nonetheless, we will assume a structure as in Eq. (10) and study the accuracy of the resulting simulator for general (possibly non-unifilar) stochastic processes.

ALGORITHM 1. Spectral compression of a quantum simulator.
1: function COMPRESS(A, D′)
2: Compute the steady state Π_A as the eigenoperator with eigenvalue 1 of the map E_A[Y] = Σ_x A^x Y A^{x†}.
3: Find the spectral decomposition Π_A = W λ W†, where W is unitary and λ is diagonal, with diagonal elements sorted in decreasing order.
4: Define the tensor B with elements B^x = P′ W† A^x W P′†, where P′ is a D′ × D projection operator with non-zero elements P′_ii = 1.
5: return MPSNORMALIZE(B)
6: end function

ALGORITHM 2. Normalization of a quantum simulator.
1: function MPSNORMALIZE(B)
2: Compute G as the eigenoperator of the map E†_B[Y] = Σ_x B^{x†} Y B^x whose corresponding eigenvalue μ has largest |μ|.
3: Find the spectral decomposition G = U s U†, where s is diagonal and positive semi-definite.
4: return the tensor K with elements K^x = s^{1/2} U† B^x U s^{−1/2} / √μ
5: end function

III. MEMORY COMPRESSION
Here we study whether the memory advantage of quantum simulators persists even when we relax the requirement of exact simulation. We introduce different compression methods for either classical or quantum simulators and define different measures of accuracy.

A. Memory compression of quantum simulators
Compression methods for matrix product states have been discussed in several papers, see e.g. [20][21][22]. Here we focus on adapting the spectral compression method from [20] in order to maintain a high overlap between the original and compressed states, which is expected to provide a large fidelity and, accordingly, an accurate prediction of future observations. Suppose that we have an exact quantum simulator |ψ[A]⟩, built from tensors A^x with memory dimension D, and that we want a compressed simulator |ψ[B]⟩, built from tensors B^x with memory dimension D′ < D, such that the properties of the original simulator are approximately maintained. For instance, calling Π_B the steady state of the map E_B[Y] = Σ_x B^x Y B^{x†}, we want the asymptotic states of the two simulators to share a similar amount of information, namely H(Π_A) ≃ H(Π_B). With these goals in mind, we define Algorithm 1. The main ideas behind such an algorithm are as follows: we first transform the tensors A with a change of basis, in such a way that the equilibrium memory state Π_A is diagonal, with diagonal elements λ_i sorted in decreasing order. In this basis we then truncate the memory indices, selecting the first D′ elements. The obtained tensors are then normalized using Algorithm 2. If the truncated elements λ_i are small and the normalization operation does not alter the tensors too much (which can be expected from perturbation theory [23]), then the steady state of the resulting tensor is expected to have a similar memory entropy.

ALGORITHM 3. Compression of a classical HMM.
1: function COMPRESS(T, D′)
2: Compute the transition matrix J_{αβ} = Σ_x T^x_{αβ}, the steady state Jπ = π, and the recovering and compression matrices from Eqs. (21)-(23).
3: return T′
4: end function
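The spectral truncation at the heart of Algorithm 1 can be sketched as follows: compute the steady state Π_A by iterating the map E_A, rotate the tensors into its eigenbasis (eigenvalues in decreasing order), and keep only the first D′ memory indices. The renormalization step (Algorithm 2) is omitted here; the two-state Kraus pair is a hypothetical example.

```python
import numpy as np

def steady_state_map(K, iters=2000):
    """Fixed point of the channel Y -> sum_x K^x Y K^x† (steady memory state)."""
    D = K.shape[1]
    Y = np.eye(D, dtype=complex) / D
    for _ in range(iters):
        Y = sum(k @ Y @ k.conj().T for k in K)
        Y /= np.trace(Y).real
    return Y

def spectral_truncate(K, Dp):
    """Rotate the tensors into the eigenbasis of the steady state (eigenvalues
    sorted in decreasing order) and keep the first Dp memory indices; the
    result still needs renormalization to define proper Kraus operators."""
    Pi = steady_state_map(K)
    lam, V = np.linalg.eigh(Pi)
    V = V[:, np.argsort(lam)[::-1]]          # columns sorted by decreasing weight
    return np.array([(V.conj().T @ k @ V)[:Dp, :Dp] for k in K])

# hypothetical two-outcome Kraus pair (satisfies sum_x K^x† K^x = 1)
K = np.array([[[np.sqrt(0.7), 0.0], [0.0, np.sqrt(0.4)]],
              [[0.0, np.sqrt(0.6)], [np.sqrt(0.3), 0.0]]])
Pi = steady_state_map(K)
B = spectral_truncate(K, 1)
```

Discarding the memory directions with the smallest steady-state weight is what keeps the truncation error small when the spectrum of Π_A decays quickly.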

B. Memory compression of classical simulators
Different methods have been proposed to reduce the memory requirements of classical Markov chains [24]. Inspired by Algorithm 1, we develop two memory compression methods for classical HMMs. The first one is based on an encoding/decoding protocol aimed at preserving the entropy of the steady state, while the second one, presented in Appendix C, is based on spectral compression methods adapted from [25,26].
Consider the steady state of the transition matrix defined in Eq. (6). As described in the previous section, Algorithm 1 has the desired property of compressing the state space while keeping as much information as possible about the asymptotic states. We can develop a classical compression method with a similar property as follows. Let D′ < D be the reduced dimension and assume that the π_α are in decreasing order. Setting

π′_{α′} = π_{α′} for α′ < D′,   π′_{D′} = Σ_{α≥D′} π_α,    (21)

for α′ = 1, ..., D′, as the compressed steady state, we aim at designing a coding and decoding strategy so that π′ is the steady state of the compressed HMM. We can define a recovering protocol as a D × D′ transition matrix R_{α,α′} = P(α|α′), so that Rπ′ = π, as follows:

R_{α,α′} = δ_{α,α′} for α′ < D′,   R_{α,D′} = π_α / π′_{D′} for α ≥ D′ (and zero otherwise).    (22)

Notice that, by construction, R defines a conditional distribution and Σ_α R_{α,α′} = 1. The encoding protocol can now be constructed according to the Bayes rule,

C_{α′,α} = R_{α,α′} π′_{α′} / π_α.    (23)

The resulting procedure is summarized in Algorithm 3.
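A minimal numpy sketch of this construction: build the compressed steady state, the recovering matrix R, the Bayes-rule encoder C, and the compressed tensor C T^x R. The random tensor is a hypothetical example; by construction R maps the compressed steady state back to π, and the compressed chain has π′ as its steady state.

```python
import numpy as np

def compress_hmm(T, pi, Dp):
    """Compress a classical HMM: compressed steady state pi', recovering
    matrix R (with R pi' = pi), Bayes-rule encoder C, and compressed
    transition tensor C T^x R. Assumes pi is sorted in decreasing order;
    the last compressed state absorbs all truncated states."""
    pi_c = np.concatenate([pi[:Dp - 1], [pi[Dp - 1:].sum()]])
    R = np.zeros((len(pi), Dp))
    R[np.arange(Dp - 1), np.arange(Dp - 1)] = 1.0         # kept states decode to themselves
    R[Dp - 1:, Dp - 1] = pi[Dp - 1:] / pi[Dp - 1:].sum()  # tail decoded proportionally to pi
    C = (R * pi_c[None, :]).T / pi[None, :]               # C_{a',a} = R_{a,a'} pi'_{a'} / pi_a
    Tc = np.array([C @ Tx @ R for Tx in T])
    return Tc, pi_c, R, C

# hypothetical random HMM with 2 outcomes and 4 memory states
rng = np.random.default_rng(0)
T = rng.random((2, 4, 4))
T /= T.sum(axis=(0, 1), keepdims=True)        # enforce sum_{x,a} T[x,a,b] = 1
J = T.sum(axis=0)
w, V = np.linalg.eig(J)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()
order = np.argsort(pi)[::-1]                  # reorder states so pi is decreasing
T, pi = T[:, order][:, :, order], pi[order]
Tc, pi_c, R, C = compress_hmm(T, pi, 2)
```

A quick design check: since the columns of C sum to one and Rπ′ = π, the compressed tensor is automatically a valid transition tensor with steady state π′.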
C. Quantifying the accuracy of a compressed quantum simulator

Suppose we have two quantum simulators |ψ[A]⟩ and |ψ[B]⟩, normalized in such a way that the A^x and B^x define two different sets of Kraus operators. In what follows we assume that |ψ[A]⟩ provides an exact simulation of the process, while |ψ[B]⟩ is an approximation using a reduced memory. The statistical distance (e.g. the Kullback-Leibler or Bhattacharyya distance) between the two distributions of future outcomes, P_A(x_{1:L}) and P_B(x_{1:L}), can be computed explicitly for a reasonably small L, but its numerical complexity grows exponentially with L. Because of this, we introduce a different figure of merit based on the fidelity between quantum states, which can be computed efficiently for matrix product operators [27], allowing us to study even the limit L → ∞. Similar measures have been previously considered in [28].
We first note that, ignoring the memory index in (9), we get a mixed state ρ[A], whose diagonal elements are exactly P_A(x_{1:L}). From the data processing inequality, F(ρ[A], ρ[B]) is smaller than the Bhattacharyya coefficient between P_A and P_B. Using Uhlmann's theorem and the matrix product state expansion, we may write

F(ρ[A], ρ[B]) = max_U |⟨ψ[B], 0_p|(1 ⊗ U)|ψ[A]⟩| = ∥E^L_{A,B}[|σ_A⟩⟨σ_B, 0_p|]∥_1,    (24)

where U is a unitary matrix acting on the memory space, |0_p⟩ is a "padding" ancillary state that makes the dimensions of the memory registers equal, ∥X∥_p = [Tr(X†X)^{p/2}]^{1/p} is the Schatten p-norm, |σ_{A/B}⟩ are the initial memory states of |ψ[A/B]⟩, and we have defined the linear map

E_{A,B}[Y] = Σ_x A^x Y B^{x†}.    (25)

Note that, in general, E_{A,B} maps rectangular operators into rectangular operators, since the A^x and B^x may have different dimensions. For large L the operator power E^L_{A,B}[Y] can be approximated as

E^L_{A,B}[Y] ≈ λ^L_{A,B} Tr(Λ_L† Y) Λ_R,    (26)

where λ_{A,B} is the eigenvalue of E_{A,B} with largest absolute value, and Λ_{L/R} are the corresponding left/right eigenoperators. Assuming that λ_{A,B} is unique, we then get

F(ρ[A], ρ[B]) ≈ |λ_{A,B}|^L |⟨σ_B, 0_p|Λ_L†|σ_A⟩| ∥Λ_R∥_1.    (27)

This derivation also provides a way of selecting the compressed memory state |σ_B⟩ given the initial state |σ_A⟩: indeed, it appears from Eq. (27) that the optimal state is the one maximising

|⟨σ_B, 0_p|Λ_L†|σ_A⟩|.    (28)

From the fidelity we may define the Rényi divergence of order 1/2 between two density matrices [29] as D_{1/2}(ρ∥ρ′) = −2 log F(ρ, ρ′). Our first measure of accuracy is then the divergence density (see also [28])

D̄_{1/2} = lim_{L→∞} D_{1/2}(ρ[A]∥ρ[B]) / L = −2 log |λ_{A,B}|,    (29)

where, as before, λ_{A,B} is the eigenvalue of the map (25) with largest absolute value. The above divergence measures the asymptotic decay rate of the fidelity, without considering the effect of the initial memory state. To consider the effect of past observations without making any assumptions on the initial guess and observed data, we focus on the average asymptotic behaviour described by the steady state (19). Calling |Π_{A/B}⟩ the purifications of the steady states of the maps E_{A/B} and extending the derivation of Eq. (27), we get

F(ρ_Π[A], ρ_Π[B]) = ∥(E_{A,B} ⊗ I)^L [|Π_A⟩⟨Π_B|]∥_1,    (30)

where I is an identity channel, and ρ_Π refers to a simulator with initial memory state described by the steady mixed state (18).
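The decay rate in Eq. (29) only requires the largest-magnitude eigenvalue of the mixed transfer map E_{A,B}, which can be estimated by power iteration even when the map acts on rectangular operators. A minimal sketch; the two Kraus families below are hypothetical two-state examples (for A = B the eigenvalue is 1 and the fidelity does not decay).

```python
import numpy as np

def mixed_transfer_eig(A, B, iters=5000):
    """Estimate |lambda_{A,B}|, the largest-magnitude eigenvalue of the mixed
    transfer map E_{A,B}[Y] = sum_x A^x Y B^x†, by power iteration on Y."""
    Y = np.ones((A.shape[1], B.shape[1]), dtype=complex)
    lam = 1.0
    for _ in range(iters):
        Y = sum(A[x] @ Y @ B[x].conj().T for x in range(A.shape[0]))
        lam = np.linalg.norm(Y)                # converges to |lambda_{A,B}|
        Y /= lam
    return float(lam)

# hypothetical Kraus families of two different two-state processes
K = np.array([[[np.sqrt(0.7), 0.0], [0.0, np.sqrt(0.4)]],
              [[0.0, np.sqrt(0.6)], [np.sqrt(0.3), 0.0]]])
c = np.sqrt(0.5)
Kp = np.array([[[c, 0.0], [0.0, c]], [[0.0, c], [c, 0.0]]])

lam_same = mixed_transfer_eig(K, K)    # identical simulators: no fidelity decay
lam_diff = mixed_transfer_eig(K, Kp)   # different simulators: |lambda| < 1
```

The divergence density would then be estimated as −2 log(lam_diff), without ever enumerating the d^L outcome strings.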

D. Quantifying the accuracy of a compressed classical simulator
For quantum simulators we use the fidelity F(ρ[A], ρ[B]) to quantify the accuracy between the exact and compressed quantum simulators. The closest classical measure is the Bhattacharyya coefficient B[T, T̃] = Σ_{x_{1:L}} √(P(x_{1:L}) P̃(x_{1:L})), where P̃ is defined as P in Eq. (2) but with the truncated transition tensor T̃. However, B[T, T̃] is difficult to compute for larger L, since we cannot make explicit use of the matrix product form (3) to avoid summing over the d^L outcomes. Other measures [30], based on the evaluation of the inner product P ∘ P̃ = Σ_{x_{1:L}} P(x_{1:L}) P̃(x_{1:L}), can be explicitly computed via matrix powers of the linear map E_{T,T̃}[Y] = Σ_x T^x Y T̃^{xT}. The latter can be done efficiently, in a time that scales linearly in L rather than exponentially. Examples of such similarity measures include the cosine similarity S = P ∘ P̃ / √((P ∘ P)(P̃ ∘ P̃)), which was also considered in [28].

IV. NUMERICAL RESULTS
We focus on stochastic processes with an exact representation as an ε-machine with finite memory, for which the quantum advantage in the exact simulation can be formally assessed [7], and we study whether such an advantage persists when we relax the assumption of exact simulation by reducing the available memory.
We consider the ε-machine of discrete renewal processes [31] of period N, a kind of stochastic clock where the number of 0s between two consecutive "ticks" (with outcome 1) is uniformly distributed between 0 and N−1. As N increases, the process becomes increasingly non-Markovian, as we need to store the last N outcomes to predict future statistics. It is clear, though, that it can be represented as an HMM with memory dimension N, since we can simply store the number of 0s since the last tick.
The transfer matrix of this process is given by

T^1_{1,k} = 1/(N−k+1),   T^0_{k+1,k} = (N−k)/(N−k+1),    (31)

where k = 1, ..., N, and all other elements vanish. The normalized Kraus operators of the quantum simulator admit an analytic form [12]. In Fig. 2 we compare the performance of the exact and compressed quantum simulators, obtained using Algorithm 1. We initialize both simulators in the memory state Σ_i √λ_i |λ_i⟩, where λ_i and |λ_i⟩ are, respectively, the eigenvalues and eigenvectors of their steady state (18), and, for each simulator, we compute 10^8 samples. Getting so many samples is easy, because the probability distribution (14) can be factorized. The full procedure is illustrated in Algorithm 4.
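A sketch of the classical side of this construction: building the renewal-process transition tensor for a small N and checking that the induced gap distribution between ticks is uniform, as the process definition requires.

```python
import numpy as np

def renewal_transition_tensor(N):
    """Transition tensor of the discrete renewal process of period N: the
    memory state k = 1, ..., N counts the 0s emitted since the last tick,
    and a tick occurs with probability 1/(N - k + 1)."""
    T = np.zeros((2, N, N))
    for k in range(1, N + 1):
        p_tick = 1.0 / (N - k + 1)
        T[1, 0, k - 1] = p_tick              # tick: emit 1, reset the counter
        if k < N:
            T[0, k, k - 1] = 1.0 - p_tick    # no tick: emit 0, advance the counter
    return T

N = 8
T = renewal_transition_tensor(N)

# starting right after a tick, the number of 0s before the next tick
# should be uniform on {0, ..., N - 1}
gap_probs = []
for g in range(N):
    v = np.zeros(N)
    v[0] = 1.0                               # memory state right after a tick
    for _ in range(g):
        v = T[0] @ v                         # emit g zeros
    gap_probs.append(float((T[1] @ v).sum()))  # then a tick
```

The telescoping product of the no-tick probabilities, times the tick probability, gives exactly 1/N for every gap length, confirming the uniform distribution.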
As shown in Fig. 2, the exact simulator reproduces almost perfectly the expected flat distribution of the distance between ones, with the differences due to the finite number of samples. The compressed quantum simulator, where the memory dimension has been cut to half the original one, still captures the main features of the distribution, though with some fluctuations. As expected for the chosen stochastic process, errors tend to be larger when the distance between ones is close to the maximum value N. We note in particular that the distribution oscillates around the expected value, and that the amplitude of the oscillations increases for larger distances between ones. Around the maximum distance N, there is a significant drop, with a large deviation from the expected probability.

ALGORITHM 4. Sampling from a simulator.
1: function SAMPLE(L)
2: for t = 1, ..., L: draw the outcome x_t and update the memory state
3: end for
4: return observations x_1, ..., x_L
5: end function
The entropies of the exact and compressed simulators are respectively 1.26 and 1.16 for N = 64, and 1.23 and 1.07 for N = 32. Therefore, although the compressed simulators have a halved memory space, the entropy of the steady states is reduced by just 8% to 13%. This shows that the spectral compression method introduced in Algorithm 1 achieves the desired goal of reducing the number of memory states, while keeping a similar amount of entropy and maintaining the ability to predict future outcomes. To put things into perspective, we also compute the entropy of the truncated steady states: keeping half of the largest eigenvalues of the steady states, normalizing them, and computing the entropy, we get 1.19 for N = 64 and 1.13 for N = 32. The further reduction provided by Algorithm 1 is therefore due to the normalization procedure of Algorithm 2, which reduces the entropy by a further 5% compared to the optimal case.
In Fig. 3 we study how entropy and accuracy, as quantified by the fidelity (27) or by its asymptotic decay rate (29), change as a function of the truncated memory. In Fig. 3(b) we see that the divergence D_{1/2} = −2 log F, where F is the fidelity between the exact and truncated simulators, is negligible, so we can focus on the decay rate shown in Fig. 3(a). For instance, with a decay rate of 0.004 the fidelity behaves as F ≈ e^{−0.004L} and reaches F ≈ 50% for L ≃ 170. Therefore, we need to make predictions far ahead in the future to observe a significant reduction in accuracy. In Fig. 3(c) we study the decay rate for increasing N, and observe an almost linear decrease as a function of −log_2(M/N), where M is the compressed memory dimension. When M = N the simulation is exact, while for M = 2 (rightmost points for every color) we use the minimum amount of memory. From the monotonic behaviour of Figs. 3(b) and (c), we understand that the entropy vs accuracy trade-off curve shown in Fig. 3(a) is explored from left to right by reducing M from N to 2. This curve shows the performance of the compression Algorithm 1 in keeping the accuracy of quantum simulators while reducing the memory cost.
Finally, in Fig. 4 we compare the performance of Algorithms 1 and 3 in compressing, respectively, the exact quantum and classical simulators of discrete renewal stochastic processes with N = 32. Since for classical simulators we cannot compute the fidelity, we focus on the Bhattacharyya coefficient B and the similarity measure S introduced in Sec. III D. The Bhattacharyya coefficient is computed by first generating a dataset of 1000 samples using Algorithm 4, setting the memory according to the Bayes rule, Eq. (5), and then predicting the probability of future observations x_{1:L} for L = 10. On the other hand, the similarity is expected to decay as S ∝ e^{−λL} with the number of future predictions L, like the fidelity. Therefore, we plot the rate λ, which can be estimated from the largest eigenvalues of the maps E_{T,T̃} introduced in Sec. III D, and which is independent of the memory state. We call discrepancy the quantity −log B, which is different from zero when B ≠ 1. As shown in Figs. 4(a,b), the compressed quantum simulator has an almost zero discrepancy for large M, and a negligible decay rate for M ≳ 12, while the classical simulator always displays a larger discrepancy and decay rate, which both start to deviate from zero already at M = 31. In Fig. 4(c) we compare λ with the entropy of the steady state, showing in both cases an increase of the decay rate for reduced entropy. For quantum simulators, there is a significant increase when the memory has approximately 1 bit of information, while for classical simulators, which normally display a higher entropy, there is a plateau followed by a decrease when the memory has more than 4 bits. From the above analysis we can conclude that, at least for the considered compression techniques, the quantum advantage in simulating the discrete renewal process persists even when the memory dimension is not enough for exact simulation.

A. Learning data sequences
In the previous section we have shown that quantum simulators have the potential to model stochastic processes with higher accuracy for a given memory or, in other terms, to reach the same accuracy with less memory. This outcome was obtained by employing tensor-network-inspired compression methods applied to exact quantum and classical simulators, and is therefore algorithm-dependent.
In this section we test the performance of quantum and classical algorithms on real datasets, namely starting from a training data sequence and learning the optimal classical and quantum simulators. In the classical case, we employ the standard expectation-maximization algorithm (Baum-Welch) [17], while in the quantum case we adapt gradient-based techniques [32,33] to work with exponentially small probabilities. We focus on the log-likelihood cost function [11,13]

L = −log P_K(x_1, ..., x_L),    (33)

where P_K was defined in Eq. (14) and x_{1:L} defines the training set of L past observations. For simplicity, we set the initial memory state as |ϕ⟩ = |0⟩. When L is large, the probabilities P_K can be extremely small and cause numerical instabilities and overflows. To fix this, we note that the direction G^x of the steepest descent can be expressed as

G^x = −∂L/∂(K^x)*,    (34)

where we used that L is a real function of the complex matrices K^x, so the direction of the steepest descent corresponds to the gradient with respect to the complex conjugate [32,34], and, inspired by the classical Baum-Welch algorithm, we have defined the normalized "forward" states and "backward" operators

|f_t⟩ = K^{x_t}|f_{t−1}⟩/√c_t,   b_t = K^{x_{t+1}†} b_{t+1} K^{x_{t+1}}/d_{t+1},    (35)

with c_t = ∥K^{x_t}|f_{t−1}⟩∥², |f_0⟩ = |ϕ⟩, b_L = 1, and d_{t+1} a convenient normalization (e.g. the trace), in terms of which the gradient reads

G^x = Σ_{t: x_t=x} b_t K^x |f_{t−1}⟩⟨f_{t−1}| / (c_t ⟨f_t|b_t|f_t⟩).    (36)

Since the above states are iteratively normalized, the numerical instabilities due to the small probabilities are removed, and all terms in (36) are numerically well-defined.
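The same normalization trick makes the log-likelihood itself computable without underflow: accumulate the logs of the normalization coefficients c_t while renormalizing the forward state at each step. A minimal numpy sketch with a hypothetical two-state Kraus pair:

```python
import numpy as np

def neg_log_likelihood(K, xs, phi0):
    """-log P_K(x_1, ..., x_L) from iteratively normalized forward states:
    each step contributes log c_t, so no exponentially small number
    is ever formed."""
    f = np.asarray(phi0, dtype=complex)
    logp = 0.0
    for x in xs:
        f = K[x] @ f
        c = np.vdot(f, f).real       # squared norm of the unnormalized state
        logp += np.log(c)
        f /= np.sqrt(c)              # renormalize to avoid underflow
    return -logp

# hypothetical two-state Kraus pair with sum_x K^x† K^x = 1
K = np.array([[[np.sqrt(0.7), 0.0], [0.0, np.sqrt(0.4)]],
              [[0.0, np.sqrt(0.6)], [np.sqrt(0.3), 0.0]]])
phi0 = np.array([1.0, 0.0])
nll = neg_log_likelihood(K, [0, 1, 0, 0, 1], phi0)
```

Because the Kraus operators satisfy the completeness relation, exponentiating the negative of this quantity over all sequences of a fixed length recovers a normalized probability distribution.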
From the normalization coefficients we can also express the log-likelihood in a form suitable for checking convergence, without having to deal with the exponentially small probabilities of Eq. (33). By naively following the steepest-descent direction, the updated Kraus operators do not define a completely positive map. There are two possible solutions to this problem. The simplest one is the conjugate gradient method [33]: after updating the parameters as K_x → K_x + ηG_x with a suitably small learning rate η, the new states are normalized to satisfy ∑_x K_x† K_x = 1. In practice, here we use the mpsnormalize function from Algorithm 2. For the second approach, we note that, by concatenating the Kraus operators K = [K_1, . . . , K_d] into a dD × D matrix K, the latter satisfies K†K = 1. A single K is therefore an isometry, and the set of isometries defines the Stiefel manifold [35]. Using tools from Riemannian geometry, the update rule can be defined as a curve K(η) on the Stiefel manifold, which starts from K and follows the direction defined by the projection of G_x onto the tangent space at K. Although there are several choices for this curve, here we follow the recipe of Wen and Yin [35]. Note that, when d is large, it is convenient to manipulate the relevant formulae so that they depend on smaller matrices [35]. Numerical results are displayed in Fig. 5 and show the superior performance of quantum simulators, confirming similar results observed for other stochastic processes [11,13]. In particular, in Fig. 5(a), where the fitted classical and quantum models have the same memory dimension as the exact simulator, we see that the classical fit has long tails, while the quantum one reproduces the expected behaviour, though with some fluctuations. In Fig. 5(b), where the memory dimensions of the learnt classical and quantum models are constrained to be half that of the exact simulator, the differences are less dramatic, though both classical simulators, i.e. the fitted one and the compressed one, keep having longer tails and overall display a larger deviation from the true distribution. These larger deviations are summarized in Fig. 5(c), where we observe that the Bhattacharyya coefficients of the quantum simulators are always significantly higher. We expect that better generalization capabilities, namely better abilities to predict the future given the past, can be obtained by introducing regularization terms in the cost function, e.g. based on the information bottleneck [14,15]. These regularization terms should enforce a better exploitation of the memory, with the ultimate aim of letting the learnt model retain from the history of past observations only the features relevant for predicting future outcomes.
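A minimal sketch of such a Stiefel-manifold update, using the Cayley-transform curve of Wen and Yin [35] with the skew-Hermitian generator W = G K† − K G† (the gradient G below is a random placeholder; the step-size selection and the small-matrix reformulation for large d are omitted):

```python
import numpy as np

def stiefel_update(K, G, eta):
    """One step along the Cayley-transform curve of Wen & Yin:
    K(eta) = (1 + eta/2 W)^(-1) (1 - eta/2 W) K, with the skew-Hermitian
    generator W = G K^dagger - K G^dagger.  The Cayley transform of a
    skew-Hermitian matrix is unitary, so K(eta) is again an isometry."""
    W = G @ K.conj().T - K @ G.conj().T
    I = np.eye(K.shape[0])
    return np.linalg.solve(I + 0.5 * eta * W, (I - 0.5 * eta * W) @ K)

rng = np.random.default_rng(1)
K, _ = np.linalg.qr(rng.normal(size=(6, 3)))  # a dD x D isometry (d=2, D=3)
G = rng.normal(size=(6, 3))                   # placeholder gradient direction
K_new = stiefel_update(K, G, eta=0.1)
print(np.allclose(K_new.conj().T @ K_new, np.eye(3)))
```

Unlike the additive update followed by renormalization, this step preserves the constraint K†K = 1 exactly at every iteration; for large d one would avoid forming the dD × dD matrix W explicitly [35].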
Finally, we apply the fitting technique to real-world data, using the Lymphography dataset [36]. We fit the lymphatics feature using either classical or quantum models of different dimensions, and use the model to predict the last 10 values given the history of past observations. Since gradient descent converges to a local optimum, we repeat each numerical experiment 100 times with different random configurations. The results are shown in Table I, where the quantum advantage is clear.
In particular, the quantum models display consistent results, while the classical ones often converge to poor optima, as shown by the low value of the median. For D = 64 the classical models always converged to poor optima, whereas for quantum models increasing D always increases both the best and median values.
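The evaluation protocol behind Table I can be sketched as follows (the fitting routine is abstracted as a callable returning a final score; the uniform toy objective is a hypothetical stand-in):

```python
import numpy as np

def multistart(fit_once, n_restarts=100, seed=0):
    """Run a local fitting routine from many random initial
    configurations and report the best and median final scores,
    mirroring the best (median) entries of Table I."""
    rng = np.random.default_rng(seed)
    scores = np.array([fit_once(rng) for _ in range(n_restarts)])
    return float(scores.max()), float(np.median(scores))

# toy stand-in for a fit: a score landscape with many poor local optima
best, median = multistart(lambda rng: rng.uniform(0.0, 1.0))
```

A large gap between best and median signals frequent convergence to poor local optima, which is the behaviour observed for the classical models.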
Code to reproduce the results is provided in [37].

V. CONCLUSIONS
We have studied whether the quantum advantage in the simulation of classical stochastic processes persists when we relax the assumption of exact simulation, which is unlikely to be achievable with real-world datasets.
We focused on the trade-off between accuracy and asymptotic memory storage. We introduced different compression algorithms, inspired by tensor-network techniques, for classical and quantum simulators of one-dimensional (e.g. temporal) stochastic processes, and analyzed different figures of merit to assess how the compressed model deviates from the ideal one when making predictions. We found that quantum simulators can display a higher prediction accuracy for a given memory or, alternatively, can achieve the same accuracy while retaining less information from the training sequence.
As for future prospects, given that the information available in the training data bounds the generalization error [14], it is tempting to expect that quantum simulators may be able to learn a model with less data. A detailed study of this possibility is left to future investigations. Another prospect concerns the use of quantum hardware capable of implementing mid-circuit measurements [38]. Classical tensor-network simulations are limited to matrices with a small memory dimension, while quantum computers are capable of simulating circuits like the one in Fig. 1 with a memory dimension that increases exponentially with the number of memory qubits. However, unlike in tensor-network simulations, the unitary operations are less general and constrained, e.g., to shallow quantum circuits. Furthermore, the overall depth of the circuit should be small, though longer depths can be obtained with circuit-cutting techniques [39]. It is therefore interesting to explore the use of quantum hardware for two reasons: i) to see whether the quantum advantage in memory use persists even with the above constraints, and ii) to explore the possibility of a computational advantage enabled by quantum algorithms.

Algorithm 1: Compress the quantum simulator |ψ[A]⟩. Require: a normalized d × D × D tensor A and the dimension D′ < D of the compressed memory. function mpscompress(A, D′) ... return mpsnormalize(B); end function.

Algorithm 2: Normalize the quantum simulator |ψ[A]⟩. Require: a d × D × D tensor A. function mpsnormalize(A) ...

Algorithm 3: Compress a hidden Markov model. Require: a normalized d × D × D transition tensor T and the dimension D′ < D of the compressed memory. function hmmcompress(T, D′) ...

Given a quantum simulator |ψ[A]⟩, where the tensor A has memory dimension D, we want to find a new simulator |ψ[B]⟩ whose tensor B has reduced dimension D′ < D, such that the similarity of future observations is as high as possible.
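A possible realization of the mpsnormalize routine named in Algorithm 2, assuming that normalization means mapping the stacked Kraus matrix to its closest isometry via the polar decomposition (this is our reconstruction, since the listing's steps are not reproduced here, not the paper's exact code):

```python
import numpy as np

def mpsnormalize(A):
    """Map a d x D x D Kraus tensor A to the closest tensor B satisfying
    sum_x B_x^dagger B_x = 1, by stacking the Kraus operators into a
    dD x D matrix M and taking the isometric factor M (M^dagger M)^(-1/2)
    of its polar decomposition."""
    d, D, _ = A.shape
    M = A.reshape(d * D, D)
    w, V = np.linalg.eigh(M.conj().T @ M)   # Hermitian, positive definite
    H_inv_sqrt = (V / np.sqrt(w)) @ V.conj().T
    return (M @ H_inv_sqrt).reshape(d, D, D)

B = mpsnormalize(np.random.default_rng(2).normal(size=(2, 3, 3)))
S = sum(b.conj().T @ b for b in B)          # should equal the identity
print(np.allclose(S, np.eye(3)))
```

The polar factor is the isometry closest to M in Frobenius norm, which makes it a natural choice after a truncation step such as mpscompress.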

FIG. 2. Distribution of the distance between consecutive ones in the exact and compressed quantum simulation of a discrete renewal process of period N, for N = 32 (a) or N = 64 (b).

FIG. 4. Compressed memory dimension M vs. Bhattacharyya coefficient (a) or vs. the similarity decay rate for L → ∞ (b) in the simulation of a discrete renewal process with N = 32 and M ≤ N. (c) Memory entropy vs. similarity decay rate in the same setting as (a,b).

FIG. 5. Distribution of the distance between consecutive ones in the exact and fitted classical (c-fit) and quantum (q-fit) simulators, and in the compressed classical (c-comp) and quantum (q-comp) simulators. The exact simulation uses N = 4, while the fitted ones use either N = 4 (a) or N = 2 (b). Fitting is done using a training sequence of L = 10^4 observations, while the histogram is generated using 100 million samples. Table (c) shows the Bhattacharyya coefficients of the resulting quantum and classical simulators, either fitted from data or compressed from the exact ones, in predicting the next 10 future observations.

TABLE I. Probability of observing the last 10 outcomes in the Lymphography dataset [36], given the previous observations, for fitted quantum and classical models. Each table entry shows the best value over 100 repetitions of the fitting procedure for random initial configurations, and the median in parentheses.