Learning stochastic filtering

We quantify the performance of approximations to stochastic filtering by the Kullback-Leibler divergence to the optimal Bayesian filter. Using a two-state Markov process that drives a Brownian measurement process as a prototypical test case, we compare two stochastic filtering approximations: a static low-pass filter as baseline, and machine learning of Volterra expansions using nonlinear Vector Auto-Regression (nVAR). We highlight the crucial role of the chosen performance metric, and present two solutions to the specific challenge of predicting a likelihood bounded between 0 and 1.

Introduction. -Stochastic filtering is the estimation of the current state of a stochastic process based on a history of noisy measurements. The optimal filter uses Bayes formula with the true measurement probabilities to continuously update the likelihood of states. If the stochastic dynamics of the underlying process is linear and measurement noise is Gaussian, this optimal filter is the celebrated Kalman filter [1]. For nonlinear problems, numerical approximations are available in the statistics literature [2]. These approximations become especially important if the true measurement probabilities are not known. However, approximations often lack interpretability and it may be difficult to evaluate their performance.
Given a generic stochastic filtering scheme, the uncertainty of its estimates can be decomposed into two contributions: the entropy of the optimal filter, and the Kullback-Leibler (KL) divergence [3,4] from the optimal filter. This second term was already suggested as a general measure of model quality in [5]; here, we propose it as a performance measure for stochastic filtering approximations that track a time-dependent probability distribution.
From an engineering point of view, stochastic filtering can be viewed as a problem of system identification: find an approximative model of a dynamical system from its time series [6,7]. A priori knowledge about the system allows one to make an informed guess, i.e., a heuristic model with few parameters that can be tuned to an optimum. In the absence of such knowledge, one may harness machine learning to learn a 'black-box' approximation with many parameters [8,9]. We will demonstrate the applicability of both approaches on a test problem, and highlight the specific challenges of predicting a probability that is bounded between zero and one, as well as possible solutions.
Stochastic filtering. -We start by reviewing the basic features of stochastic filtering. Let x_j be the unknown state of a stochastic dynamical system at discrete time j, and m_j a (cumulative) measurement process, whose stochastic increments ∆m_j = m_j − m_{j−1} depend on the state x_j of the hidden process. The likelihood p(x_j) formalizes the knowledge at time t_j about the current state of the hidden process x_j [conditional on all past measurements and possibly an initial prior p(x_0)].
The likelihood evolution is decomposed into a prediction step, p(x_{j+1}) = ∫ dx_j p(x_{j+1}|x_j) p(x_j), which replicates the stochastic dynamics p(x_{j+1}|x_j) of the hidden process, and an update step, p(x_{j+1}|∆m_{j+1}) = p(∆m_{j+1}|x_{j+1}) p(x_{j+1}) / p(∆m_{j+1}), which uses Bayes formula to incorporate the latest measurement ∆m_{j+1}. Here, the measurement probability p(∆m_{j+1}) = ∫ dx_{j+1} p(∆m_{j+1}|x_{j+1}) p(x_{j+1}) acts as a normalization factor. If p(∆m_{j+1}|x_{j+1}) is the true measurement probability, the filter p is optimal. The posterior p(x_{j+1}|∆m_{j+1}) serves as the new prior for the next time step j + 2. In the limit of a vanishing discretization, this procedure yields a time-continuous version of stochastic filtering. Below, we will assume that the dynamics of the hidden process is stationary, and that the influence of the initial prior vanishes in the long-time limit.
Model quality. -In many application cases, one has to resort to approximations of optimal stochastic filtering. The measurement process may not be known, or agents may lack the elaborate machinery needed for optimal filtering, as is common for biological cells performing chemical sensing [10,11]. In these cases, we have a sub-optimal filter, whose likelihood estimates q(x_j) will in general differ from those of the optimal filter, q(x_j) ≠ p(x_j). The performance of this filter q(x_j) can be sub-optimal due to reduced dimensionality, intrinsic noise, or systematic errors.
To quantify differences in performance, we examine relative entropies. The stochastic entropy for the optimal filter, s(x) = − ln p(x), is the negative log-likelihood of the actual state x. Its ensemble average defines the macroscopic entropy, S[p] = ⟨s(x)⟩_p = ∫ dx p(x) s(x), which quantifies the current uncertainty of the optimal filter. The stochastic entropy can also be defined for the sub-optimal estimate, s^(q)(x) = − ln q(x). Its ensemble average with respect to the true density is the cross-entropy, H[p, q] = ⟨s^(q)(x)⟩_p = − ∫ dx p(x) ln q(x), which quantifies the true uncertainty of the sub-optimal filter. In general, H[p, q] is different from the macroscopic entropy S[q] = ⟨s^(q)(x)⟩_q of q(x), because the sub-optimal filter has an erroneous estimate of its own uncertainty.
Optimal and sub-optimal estimates are related by a fluctuation theorem, ⟨exp[s(x) − s^(q)(x)]⟩_p = 1, which follows from the normalization condition for q(x). Using Jensen's inequality, we obtain the standard upper bound on the entropy by the cross-entropy, S[p] ≤ H[p, q], with equality only if p = q. This means that the log-likelihood of the actual state evaluated by the optimal filter is on average greater than or equal to the log-likelihood of any other scheme, which we would indeed name sub-optimal. Still, the fluctuation theorem allows that in some realizations one can observe apparent violations p(x) < q(x).
The excess uncertainty of the sub-optimal filter is the Kullback-Leibler divergence D_KL[p‖q] = H[p, q] − S[p], which gives D_KL[p‖q] = ⟨ln(p/q)⟩_p. The Kullback-Leibler divergence was already proposed to measure model quality in [5]. Here, we propose to use a time-average ⟨D_KL[p‖q]⟩ to quantify the performance of sub-optimal filters.
Below, we will consider a discrete state space for the hidden process x composed of just two states x = ±1. By slight abuse of notation, we can identify p with the scalar p(x=1), and similarly q with q(x=1), and write

D_KL[p‖q] = p ln(p/q) + (1 − p) ln[(1 − p)/(1 − q)] .  (1)

Next, we introduce a minimal bistable model of stochastic filtering, and apply eq. (1) to quantify the performance of sub-optimal approximations.
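For readers who wish to reproduce the metric, the two-state divergence of eq. (1) translates directly into code. The following minimal sketch is our illustration (function names are ours); it evaluates the divergence pointwise and its time average along a trajectory:

```python
import numpy as np

def kl_binary(p, q, eps=1e-12):
    """Two-state Kullback-Leibler divergence D_KL[p||q] of eq. (1),
    where p and q are the likelihoods of state x = +1 under the optimal
    and the sub-optimal filter; eps guards the logarithms."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    q = np.clip(np.asarray(q, dtype=float), eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# time-averaged divergence over a (toy) trajectory of likelihood estimates
p_traj = np.array([0.9, 0.8, 0.95])
q_traj = np.array([0.7, 0.8, 0.90])
mean_dkl = float(kl_binary(p_traj, q_traj).mean())
```

Each term is non-negative by Gibbs' inequality, so the time average is a meaningful excess-uncertainty measure.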
Bistable model. -Consider a continuous-time Markov process x that jumps with symmetric rate r ≥ 0 between two states, x ∈ {−1, +1}, see fig. 1(a). Let us assume that we can perform measurements of the hidden state x_t through a continuous measurement process m_t described by the stochastic differential equation

dm = γ x dt + √(2D) dW ,  (2)

where γ ≥ 0 is the signal strength and √(2D) dW denotes Gaussian white noise with noise strength D. The measurement model of eq. (2) describes, e.g., activation of receptors in a biological cell exposed to a time-varying, bistable ligand concentration x in the presence of stochastic ligand-receptor interactions subsumed as measurement noise [11,12].
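The bistable model is straightforward to simulate. The following sketch (ours, not part of the original analysis) generates a trajectory of the hidden state and the discretized measurement increments of eq. (2) with an Euler-Maruyama scheme:

```python
import numpy as np

def simulate(r=1.0, gamma=1.0, D=0.5, dt=1e-3, n_steps=10_000, seed=0):
    """Simulate the two-state Markov process x(t) in {-1, +1} with
    symmetric switching rate r, and the measurement increments
    dm = gamma * x * dt + sqrt(2*D) * dW of eq. (2)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    dm = np.empty(n_steps)
    state = 1
    for j in range(n_steps):
        # jump with probability r*dt per time step (first order in dt)
        if rng.random() < r * dt:
            state = -state
        x[j] = state
        dm[j] = gamma * state * dt + np.sqrt(2 * D * dt) * rng.normal()
    return x, dm

x, dm = simulate()
```

The discretization of the jump process is first order in dt, consistent with the continuum limit dt → 0 used in the text.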
We denote by p the likelihood of state x = 1. The optimal Bayesian estimation scheme allows us to derive evolution equations for quantities of interest, such as the time-dependent likelihood p(x, t), its invariant density ρ(p), or the expected entropy variation ⟨dS⟩/dt. The calculation methods are standard and can be found, e.g., in [11,12] (likelihood evolution), [13,14] (invariant density), and [15] (entropy variation).
Here, we report analytical results for the specific application problem embodied in eq. (2); for the convenience of the reader, details of the calculation are provided in the Supplementary Information (SI). The likelihood evolution reads

dp = r(1 − 2p) dt + (γ/D) p(1 − p) (dm − ⟨dm⟩_p) .  (3)

Here, ⟨dm⟩_p = γ(2p − 1) dt, and products with the measurement process dm are to be interpreted in the Itō scheme of non-anticipative stochastic calculus. The invariant density of this likelihood reads

ρ(p) = K [p(1 − p)]^{−2} exp[ −(Dr/γ²) / (p(1 − p)) ] ,  (4)

where K is the normalization constant, see SI Notes. This invariant density is bimodal; implications of this bimodality for cellular signaling were studied in [16]. Similarly, eq. (3) also implies an evolution equation for the entropy S[p], and, in particular, for the expected entropy variation with respect to the current likelihood,

⟨dS⟩/dt = r(1 − 2p) ln[(1 − p)/p] − (γ²/D) p(1 − p) .  (5)

The first term in eq. (5) describes the expected increase of entropy in the prediction step, while the second term describes the expected decrease of entropy in the update step. Both terms depend on p in a nonlinear fashion, and are invariant under the symmetry operation p → 1 − p.
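The likelihood evolution can be integrated numerically alongside a simulated trajectory. The sketch below (our illustration) integrates the filter in the form dp = r(1 − 2p)dt + (γ/D) p(1 − p)(dm − ⟨dm⟩_p), with ⟨dm⟩_p = γ(2p − 1)dt, adding a small numerical safeguard at the boundaries:

```python
import numpy as np

def optimal_filter(dm, r, gamma, D, dt, p0=0.5):
    """Euler (Ito) integration of the optimal filter, eq. (3)."""
    p = np.empty(len(dm))
    pj = p0
    for j, dmj in enumerate(dm):
        dm_expected = gamma * (2 * pj - 1) * dt  # <dm>_p
        pj += r * (1 - 2 * pj) * dt + (gamma / D) * pj * (1 - pj) * (dmj - dm_expected)
        pj = min(max(pj, 1e-12), 1 - 1e-12)  # keep p inside (0, 1) numerically
        p[j] = pj
    return p

# driver: simulate the hidden process and measurement increments of eq. (2)
rng = np.random.default_rng(1)
r, gamma, D, dt, n = 0.5, 2.0, 0.1, 1e-3, 20_000
x = np.empty(n)
dm = np.empty(n)
state = 1
for j in range(n):
    if rng.random() < r * dt:
        state = -state
    x[j] = state
    dm[j] = gamma * state * dt + np.sqrt(2 * D * dt) * rng.normal()

p = optimal_filter(dm, r, gamma, D, dt)
```

With the information rate γ²/D much larger than the switching rate r, the likelihood p(t) locks onto the hidden state between switches.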
The prefactor γ 2 /D of the second term can be interpreted as a rate constant of information gain, similar to [17,18]. The thermodynamic interpretation of entropy variations in stochastic systems with measurement components has been discussed in [19][20][21][22].
Approximation I: Low-pass filter. -To construct our first approximation, consider a low-pass filter of the measurement process, dξ = dm − β ξ dt, with inverse relaxation time β. We want to estimate a likelihood q(ξ) for x_t = +1 based on ξ_t. In the limit of rare switching, r ≪ β, the probability density p(ξ|x) is given by the steady-state density of an Ornstein-Uhlenbeck process: a normal distribution with mean γx/β and variance D/β. By treating the hidden state as static, we introduce a systematic bias.

[Fig. 1: (a) Stochastic filtering for the bistable model, eq. (2). Bayesian inference yields an optimal estimate of the likelihood p(t) of x(t) = +1 (lower panel, red curve), see eq. (3). As a sub-optimal approximation q(t), we consider a low-pass filter (blue curve), eq. (6). Histograms of these likelihood estimates are shown to the right. (b) Mean Kullback-Leibler divergence ⟨D_KL[p‖q]⟩ between optimal and sub-optimal filter as a function of the inverse relaxation time β of the sub-optimal low-pass filter for different noise strengths γ and switching rates r of x(t) (where β, γ, and r are measured relative to 2D = 1). (c) There exists an optimal inverse time-scale β* minimizing ⟨D_KL[p‖q]⟩, which increases with r. Results are shown for two values of γ, corresponding to low signal strength (γ = 0.1, solid) and high signal strength (γ = 1, dashed). Error bars given by standard error of the mean (n = 5 realizations) are virtually invisible.]
We can now apply Bayes formula using this steady-state density and the prior p(x=+1) = 1/2, and obtain the approximated likelihood

q(ξ) = [1 + exp(−2γξ/D)]^{−1} ,  (6)

which describes a logistic curve. Applying Itō's Lemma to eq. (6), we obtain an evolution equation for the approximated likelihood,

dq = (2γ/D) q(1 − q) (dm − ⟨dm⟩_q) + q(1 − q) [ (2γ²/D)(1 − 2q) − β ln(q/(1 − q)) ] dt ,  (7)

where ⟨dm⟩_q = γ(2q − 1)dt is the expectation value of the measurement increment according to the approximated likelihood q. We numerically determined the performance of this approximative stochastic filter using the Kullback-Leibler divergence, see fig. 1(b). Higher signal strengths γ reduce the performance relative to the optimal filter (higher D_KL), because likelihoods close to zero and one become more frequent, and these are more difficult to approximate. Similarly, a lower switching rate r makes the estimation problem easier for both the sub-optimal and the optimal filter, but ⟨D_KL[p‖q]⟩ may nonetheless increase, as only the optimal filter has the correct functional dependence on the measurement process. Lastly, the inverse relaxation time β of the low-pass filter marks a trade-off between responding fast (small β) and responding precisely (high β), resulting in an optimum β* that minimizes the Kullback-Leibler divergence. This optimal β* increases with the switching rate r as expected, with approximately linear scaling for low γ, see fig. 1(c).
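The baseline is equally simple to implement. The sketch below (ours) integrates the low-pass variable ξ and maps it to a likelihood through the logistic curve q(ξ) = [1 + exp(−2γξ/D)]^{−1} obtained from the Ornstein-Uhlenbeck steady state; for brevity, the driver assumes a hidden state frozen at x = +1:

```python
import numpy as np

def lowpass_likelihood(dm, beta, gamma, D, dt):
    """Low-pass filter d(xi) = dm - beta*xi*dt, mapped to a likelihood
    via the logistic curve from the OU steady state (mean gamma*x/beta,
    variance D/beta), as in our reconstruction of eq. (6)."""
    q = np.empty(len(dm))
    xi = 0.0
    for j, dmj in enumerate(dm):
        xi += dmj - beta * xi * dt
        q[j] = 1.0 / (1.0 + np.exp(-2.0 * gamma * xi / D))
    return q

# driver (assumption for brevity): measurements from a hidden state fixed at x = +1
rng = np.random.default_rng(2)
gamma, D, beta, dt, n = 1.0, 0.5, 1.0, 1e-3, 50_000
dm = gamma * dt + np.sqrt(2 * D * dt) * rng.normal(size=n)
q = lowpass_likelihood(dm, beta, gamma, D, dt)
```

Since ξ relaxes toward γ/β > 0 for x = +1, the estimated likelihood should spend most of its time above 1/2.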
Approximation II: machine learning. -As a second approximation of stochastic filtering, we consider nonlinear vector auto-regression (nVAR) [23], a state-of-the-art machine learning approach that learns, e.g., Volterra expansions from time series data. After briefly reviewing the theory behind nVAR, we propose how nVAR can be adapted for the problem of stochastic filtering, and demonstrate its performance for this task.
Generally, any dynamical system with input u(t) and output y(t) can be thought of as a functional relationship y(t) = H[u(t)] for some functional H that operates on time-dependent functions. For linear and time-invariant systems, one can show that this input-output relationship is just a convolution with a constant h_0 and a suitable kernel h_1(τ) (sometimes called linear response function or susceptibility),

y(t) = h_0 + ∫ dτ h_1(τ) u(t − τ) .  (8)

Causality implies h_1(τ) = 0 for τ < 0. For non-linear systems, eq. (8) generalizes to the Volterra series expansion, a well-known description in system identification [24],

y(t) = Σ_{n=0}^∞ H_n[u(t)] .  (9)

Here, H_0[u(t)] = h_0, and H_n is the nth-order Volterra operator, which may be written as a "higher-order convolution"

H_n[u(t)] = ∫ dτ_1 ⋯ ∫ dτ_n h_n(τ_1, …, τ_n) u(t − τ_1) ⋯ u(t − τ_n) .  (10)

Theoretically, inputs from a distant past and higher-order kernels influence the output y(t). For practical purposes, however, we can only use a finite input history and must truncate the Volterra expansion at some order. Despite these limitations, learning Volterra kernels using nVAR can approximate even non-polynomial systems surprisingly well [23].

[Fig. 2: (a) The nVAR machine learns the dynamic relation between an input u(t) and an output y(t) by learning the kernels of a truncated Volterra expansion, eq. (9). During training, for each time point t_j, feature vectors u_{n,j} are constructed from monomials of order n in the time-delayed inputs u(t_j), u(t_{j−1}), …, u(t_{j−k}). These feature vectors (up to maximal order n ≤ N) are combined into a single feature vector u_j. All u_j are then concatenated horizontally into one feature matrix U. Linear regression with regularization (ridge regression), eq. (11), yields an optimal weight vector w such that the vector y of outputs y(t_j) is approximated as y ≈ w · U, i.e., y(t_j) ≈ w · u_j for all j.]

Fig. 2 outlines the nVAR algorithm: The training data consists of discrete input and output time series {u(t_j)} and {y(t_j)}.
For each time point t_j = j ∆t, a feature vector u_j is constructed, whose components comprise all unique monomials (up to a given order N) built from the k + 1 delay terms u(t_j), u(t_{j−1}), …, u(t_{j−k}). The computational cost scales as O(k^N). The feature vectors u_j are then concatenated into a matrix U, which, in direct analogy to the Volterra expansion, should be linearly related to the output vector y_target (whose components are the y(t_j)) by a yet unknown weight vector w as w · U ≈ y_target. The optimal weights w are found by the minimization

w = argmin_w ( ‖w · U − y_target‖² + α ‖w‖² ) ,  (11)

where ‖·‖ stands for the Euclidean norm. Eq. (11) includes a penalty for large weights with so-called ridge parameter α to reduce over-fitting. The optimization problem eq. (11) represents a standard task of quadratic programming, and can be efficiently solved by reduction to a linear-algebra problem with the formal solution [25]

w = y_target U^T (U U^T + α I)^{−1} .  (12)
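The feature construction and the ridge solution can be condensed into a few lines. The following sketch is our illustration (function names are ours); it builds the monomial feature matrix U and computes w = y_target U^T (U U^T + αI)^{−1}, demonstrated on a toy linear input-output relation:

```python
import numpy as np
from itertools import combinations_with_replacement

def nvar_features(u, k, N):
    """Feature matrix U of the nVAR machine: a constant feature plus all
    unique monomials up to order N in the k+1 delayed inputs
    u(t_j), u(t_{j-1}), ..., u(t_{j-k}); one column per usable time point."""
    T = len(u)
    delays = np.stack([u[k - d : T - d] for d in range(k + 1)])  # shape (k+1, T-k)
    feats = [np.ones(T - k)]  # constant feature, corresponding to h_0
    for n in range(1, N + 1):
        for idx in combinations_with_replacement(range(k + 1), n):
            feats.append(np.prod(delays[list(idx)], axis=0))
    return np.vstack(feats)

def ridge_weights(U, y, alpha):
    """Ridge regression, eq. (11), via the formal solution of eq. (12):
    w = y U^T (U U^T + alpha I)^{-1}. G is symmetric, so we solve G w = U y."""
    G = U @ U.T + alpha * np.eye(U.shape[0])
    return np.linalg.solve(G, U @ y)

# toy usage (our example): recover a known linear input-output relation
rng = np.random.default_rng(0)
u = rng.normal(size=200)
y = 2.0 * u[1:] + 0.5 * u[:-1]      # target: y(t_j) = 2 u(t_j) + 0.5 u(t_{j-1})
U = nvar_features(u, k=1, N=1)
w = ridge_weights(U, y, alpha=1e-8)
```

For k = 1 and N = 1 the feature matrix has 3 rows (constant plus two delays); for k = 2 and N = 2 it has 1 + 3 + 6 = 10 rows, illustrating the polynomial growth of the feature count.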
After training, the performance of the nVAR machine can be evaluated using an independent test data set: multiplying the learned weight vector with the new feature matrix constructed from the test input yields a predicted output y_pred, which is then compared to the true output y_true of this test data.

We apply nVAR to the specific problem of stochastic filtering. As an example application, we consider the Brownian measurement process from eq. (2): the input time series is given by the series of increments u(t_j) = ∆m_j ≡ m(t_j) − m(t_{j−1}), while the output corresponds to the likelihoods y(t_j) = p(x(t_j)=+1).
From the learned weights w, we can read off the Volterra kernels h_n(τ_1, …, τ_n): The constant term h_0 is close to the theoretically expected value 1/2. The weights associated with the linear monomials u_1 = {u(t_j), u(t_{j−1}), …} yield, after multiplying with ∆t, a time-discrete approximation of the first-order kernel h_1(τ), see fig. 2(b). The approximated kernel follows an exponential decay, h_1(τ) ≈ (γ/2) exp(−βτ) θ(τ), as expected. The bistable process considered as example of application obeys the symmetry (x, m) → (−x, −m), which maps p → 1 − p. This symmetry implies that h_2(τ_1, τ_2) (and in fact all Volterra kernels of even order) must vanish identically. In agreement with this symmetry, all weights associated with the second-order monomials are nearly zero. Moreover, these small weights do not reflect any apparent functional relation, suggesting that they result from the finite size of the training data. In line with this interpretation, inclusion of second-order terms does not improve the performance of the stochastic filtering approximation, whereas third-order terms provide a statistically significant improvement, see fig. 3.
A peculiarity of stochastic filtering is that likelihoods must always lie in the interval [0, 1]. Per se, the nVAR algorithm does not respect this property. To solve this issue, we suggest two different solutions. The first, simple solution is to clip predicted outputs above 1 and below 0 to the interval boundaries. This makes computed mean-squared errors more meaningful, and is even a prerequisite for computing a Kullback-Leibler divergence.
As a second solution, we applied a nonlinear logit-transformation to the likelihoods p,

φ = logit(p) = ln[p/(1 − p)] ,  (13)

which maps 0 < p < 1 bijectively to −∞ < φ < ∞. We then use nVAR to predict φ(t_j) with input ∆m_j; the predicted output φ is then transformed back to the corresponding likelihood p using the inverse transformation

p = logit^{−1}(φ) = [1 + exp(−φ)]^{−1} .  (14)

It is worth noting the formal similarity between eq. (14) and the simplified approximation eq. (6). On a practical note, to avoid large values of φ, which may cause problems during learning, we clipped inputs p to the interval [ε, 1 − ε] for a small number ε = 10^{−8} before computing the transformed input φ.

Fig. 3 compares these two approaches to deal with the fact that likelihoods are bounded, using two different performance measures: the conventional mean-squared error (mse), and the mean Kullback-Leibler divergence. Generally, the first approach (clipping) performs slightly better than the second approach (logit) in terms of the mean-squared error, which is expected because the first approach directly minimizes this measure. In contrast, the logit-approach is superior when the mean Kullback-Leibler divergence should be minimized. This observation can be rationalized as follows: the nonlinear logit-transformation eq. (13) "stretches out" likelihoods close to 0 or 1; as the Kullback-Leibler divergence is rather sensitive to these likelihoods, the logit-approach generally performs better in terms of the Kullback-Leibler divergence. As discussed above, including second-order terms does not improve performance for either of the two approaches, as expected by symmetry, whereas third-order terms improve performance. This improvement is stronger for the clipping-approach compared to the logit-approach, presumably because the nonlinear logit-transformation already partially accounts for the nonlinear dependence of p on the input dm.
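Both remedies are short in practice. The sketch below (ours) implements clipping, the logit-transformation with the ε-clipping described above, and its inverse:

```python
import numpy as np

def clip_predictions(y_pred):
    """Solution 1: clip raw nVAR outputs to the admissible interval [0, 1]."""
    return np.clip(y_pred, 0.0, 1.0)

def to_logit(p, eps=1e-8):
    """Solution 2, training side: clip likelihoods to [eps, 1 - eps] to avoid
    diverging logits, then apply the transformation of eq. (13)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def from_logit(phi):
    """Solution 2, prediction side: inverse transformation, eq. (14)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(phi, dtype=float)))
```

The pair (to_logit, from_logit) is a round trip for likelihoods away from the clipping boundaries, and from_logit always returns values strictly inside (0, 1).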
These observations were confirmed for various parameter choices (not shown). In the limit of strong noise (small γ), likelihoods p will deviate only little from 1/2, rendering the dependence of p on dm approximately linear, which reduces the benefit of including third-order terms. Remarkably, the clipping-approach using only linear terms (h_1) outperforms the simple low-pass filter even if an optimal inverse relaxation time is chosen for the latter. This highlights the shortcomings of the indirect approach of approximation I, which first computes the auxiliary variable ξ_t, which is then converted into an estimate for the likelihood using the steady-state probability distribution, eq. (6), instead of learning likelihoods directly as in approximation II.
For machine learning tasks, it is pivotal to choose hyperparameters judiciously. The number k of delay steps should satisfy k > (β* ∆t)^{−1} to cover the time window where the expected Volterra kernels h_1(τ) ∼ exp(−β* τ) are large. Fig. 1(c) suggests β* ∼ r, where r is the switching rate of the two-state Markov process.
Next, under-fitting or over-fitting will occur when the number of weights and thus fit parameters is either much smaller or much bigger than the size of the training data, respectively. Intriguingly, performance is worst when the number of weights exactly matches the size of the training data [26]. If the number of weights increases further and exceeds the number of training data points, learned weights become partially randomized, which reduces the adverse effect of over-fitting [27], see also SI Notes. For our application example, performance was best when the number of weights equaled about 20% of the number of data points of the training data.
The common problem of over-fitting and resultant abnormal weights caused by noise in finite-size input data can be effectively mitigated by means of regularization, i.e., introducing a regularization term with ridge parameter α in the optimization problem, eq. (11). Prediction performance for various choices of α is shown in the SI Notes. Interestingly, the logit-approach of applying a nonlinear transformation to the output before learning is rather sensitive to the choice of α if third-order terms are included, whereas the performance of the simpler clipping-approach is robust and virtually independent of the choice of α.

Conclusion. -Stochastic filtering is the estimation of a hidden state from noisy measurements. Because the optimal filter based on Bayesian inference can be analytically calculated only for simple examples, efficient approximations are needed. Here, we propose nonlinear vector auto-regression (nVAR) [23] as an efficient method to learn the dynamic relationship between the likelihood of hidden states and noisy, time-dependent input. Specifically, nVAR allows us to learn a Volterra expansion relating input and output, in our case, noisy measurements as input and likelihoods of hidden states as output. As a baseline, we additionally consider a simple low-pass filter. We quantify the performance of these approximations of stochastic filtering in terms of the Kullback-Leibler divergence from the optimal filter.
Using higher-order terms in the Volterra expansion or longer delays in nVAR increases the accuracy of the predictor, but also increases the computational cost. We show that nVAR with reasonable delays and only linear terms already performs better than the simple low-pass filter. Including third-order terms slightly improves performance further, whereas second-order terms are dispensable because of the symmetry of the problem.
A key issue is that likelihoods must always be bounded between zero and one, a property not automatically respected by common approximation schemes. We propose two possible solutions to this specific challenge of stochastic filtering: (i) clipping of predicted likelihoods outside the admissible range, and (ii) applying a nonlinear logit-transformation before training, and back-transformation of predicted outputs. The second approach displays higher fidelity when it comes to predicting likelihoods close to zero or one, and, concomitantly, shows a lower mean Kullback-Leibler divergence than the first approach. At the same time, the mean-squared error resulting from the second approach is still acceptable. As a drawback, the second approach is sensitive to proper regularization, and requires a judicious choice of the ridge parameter. One might think that yet a third solution would be to learn a nonlinear model, where the output is given, e.g., in terms of a sigmoidal function applied to a weighted sum of the input features. However, learning such a non-linear model would be considerably harder than learning a linear model like nVAR, and convergence to a global optimum cannot be guaranteed (and in fact is unlikely for the larger number of fit parameters used here).
In conclusion, the logit-transformation seems a viable approach that minimizes the mean Kullback-Leibler divergence, while the mean-squared error remains acceptable. As a caveat, the logit-transformation is more susceptible to over-fitting and requires suitable regularization, while the simpler clipping approach is more robust.
While we restricted ourselves in nVAR to learning nonlinear terms up to third order, the method could be applied up to arbitrary order or using longer delays, with the availability of sufficient training data and computational resources being the only bottlenecks. Computational complexity could be reduced by careful design of the feature vector, e.g., using coarser time-sampling for inputs with longer delays, or problem-specific basis functions constructed from the inputs [8]. Lasso regression allows one to shrink less important weights to zero [28]. As an alternative to nVAR, multilayer neural networks have been proposed to learn Volterra expansions, with a direct relation between the internal weights of the network and Volterra kernels [9,29,30]. We have shown how learning Volterra expansions can be adapted to find stochastic filtering approximations, enabling future applications for on-line decision making.

* * *

We thank all members of the Biological Algorithms group as well as Marc Timme for stimulating discussions. ROR was supported by the Ministry of Science and Art of the Federal State of Saxony, Germany through grant 100400118 to Marc Timme and BMF, financed with tax funds on the basis of the budget adopted by the Saxon State Parliament (Forschungsprojektförderung Titelgruppe 70 des Sächsischen Staatsministerium für Wissenschaft und Kunst). AA was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) through grant FR3429/3-1 to BMF. BMF is supported by the DFG through FR3429/4-1 (Heisenberg grant), and under Germany's Excellence Strategy (EXC-2068, 390729961). ROR, AA, and BMF acknowledge support through the Center for Advancing Electronics Dresden (cfaed).

1 Likelihood evolution
Here, we derive eq. (3) of the main text. Analogous derivations for minimal models of stochastic filtering can be found, e.g., in [1,2,3].
Here the hidden state is a Markov process that jumps with symmetric rate r between the two states x ∈ {−1, +1}. The measurement process is dm = γ x dt + √(2D) dW, where dW is the increment of a standard Brownian motion. Consider a regular time discretization of the process with time step ∆t. The prior likelihood is p(x_j) ≡ p(x_j |m_{−∞:j}). Note that we consider all probabilities here as implicitly conditional on the measurement process trajectory up to time j.
The predicted likelihood is simply

p(x_{j+1}=+1) = p + r∆t (1 − 2p) + O(∆t²) .  (S1)

The discretized measurement process increment ∆m_{j+1} ≡ m_{j+1} − m_j is written

∆m_{j+1} = γ x_{j+1} ∆t + √(2D) ∆W_{j+1} + O(∆t²) ,  (S2)

where ∆W_{j+1} = W_{j+1} − W_j is the Brownian increment. The correction O(∆t²) holds in a statistical sense. Indeed, the probability of a hidden process jump within the time interval (t_j, t_{j+1}] is r∆t + O(∆t²), and the upper bound on the correction is 2γ∆t, corresponding to a jump close to t_j. Then the product of the correction times the probability is upper bounded by a term of order O(∆t²), and is negligible in the limit ∆t → 0 with respect to the ∆t and ∆W terms, which results in a SDE. (Alternatively, we could just define the measurement to depend only on x_{j+1} and have no corrections.) The conditional measurement distribution in discrete time is the Gaussian N(γx∆t, 2D∆t), which we write as

p(∆m_{j+1}|x_{j+1} = ±1) = K(∆m_{j+1}) exp(±γ∆m_{j+1}/(2D)) ,  (S3)

where K(∆m) = (4πD∆t)^{−1/2} exp[−∆m²/(4D∆t) − γ²∆t/(4D)]. The unconditional measurement probability is

p(∆m_{j+1}) = K(∆m_{j+1}) [ p(x_{j+1}=+1) e^{γ∆m_{j+1}/(2D)} + p(x_{j+1}=−1) e^{−γ∆m_{j+1}/(2D)} ] .  (S4)

Then we apply Bayes formula to update the predicted likelihood p(x_{j+1}) into the posterior p(x_{j+1}|∆m_{j+1}) ≡ p(x_{j+1}|m_{−∞:j+1}), and find, using eqs. (S1), (S3), and (S4),

p(x_{j+1}=+1|∆m_{j+1}) = p + r(1 − 2p)∆t + (γ/D) p(1 − p) ∆m_{j+1} − 2 (γ/(2D))² p(1 − p)(2p − 1) ∆m²_{j+1} + h.o.t. ,  (S5)

where we denoted p ≡ p(x_j=+1) and expanded expressions. Now we use the formal expression dW dW = dt of Itō stochastic calculus, which means that ∆m²_{j+1} converges to 2D dt under the integral sign [4,5]. Then in the limit ∆t → 0 and in the sense of an Itō integral, we obtain the likelihood evolution equation

dp = r(1 − 2p) dt + (γ/D) p(1 − p) (dm − ⟨dm⟩_p) ,  (S6)

where ⟨dm⟩_p = γ(2p − 1) dt.

2 Steady-state density
Here, we review the derivation of the steady-state invariant density, which is based on the innovation process technique [6,7,8]. Let us define the innovation process as

dW̃ = dm − ⟨dm⟩_p = dm − γ(2p − 1) dt ,  (S7)

where the average is meant with respect to the current likelihood p. The innovation process increments have zero expectation, ⟨dW̃⟩ = 0, and a linear quadratic variation, dW̃ dW̃ = 2D dt. Then from Lévy's theorem [4] it follows that the innovation process W̃ is a Brownian motion. In light of this, we can rewrite the likelihood evolution, eq. (3) of the main text, as

dp = r(1 − 2p) dt + (γ/D) p(1 − p) dW̃ .  (S8)

Let us now introduce the Wald ratio as f ≡ ln[p/(1 − p)], and write its evolution equation with Itō's Lemma (using df/dp = 1/[p(1 − p)] and d²f/dp² = (2p − 1)/[p(1 − p)]²),

df = [ −2r sinh f + (γ²/D) tanh(f/2) ] dt + (γ/D) dW̃ .  (S9)

We see that the transformation has led to a non-multiplicative noise term. The corresponding Fokker-Planck equation reads

∂_t P(f, t) = −∂_f { [ −2r sinh f + (γ²/D) tanh(f/2) ] P(f, t) } + (γ²/D) ∂_f² P(f, t) ≡ −∂_f Φ(f, t) ,  (S10)

where in the last line we wrote the continuity equation in terms of the probability current Φ(f, t). The steady-state equation ∂_t P*(f) = 0 in 1D without jumps implies that the probability current is zero [9], Φ*(f) = 0. Solving this equation, we get the steady-state distribution of f,

P*(f) ∝ cosh²(f/2) exp[ −(2Dr/γ²) cosh f ] ,  (S11)

and then we transform it back to obtain the invariant density for the likelihoods, ρ(p) = P*(f(p)) |df/dp|, which is eq. (4) in the main text, where K is the normalization constant ensuring ∫₀¹ dp ρ(p) = 1.
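As a quick numerical sanity check (ours, not part of the original SI), the invariant density in the form ρ(p) ∝ [p(1 − p)]^{−2} exp[−(Dr/γ²)/(p(1 − p))] can be normalized by quadrature; the sketch below also allows checking the symmetry ρ(p) = ρ(1 − p) and the bimodality for small r:

```python
import numpy as np

def rho_unnormalized(p, r, gamma, D):
    """Unnormalized invariant density of the likelihood, eq. (4), in the
    form rho(p) ~ [p(1-p)]^{-2} exp[-(D*r/gamma^2) / (p(1-p))]."""
    s = p * (1.0 - p)
    return np.exp(-(D * r / gamma**2) / s) / s**2

def normalization_constant(r, gamma, D, n=200_001):
    """Normalization K of eq. (4) by trapezoidal quadrature on (0, 1);
    the density vanishes at the boundaries, so the endpoints are harmless."""
    p = np.linspace(1e-6, 1.0 - 1e-6, n)
    f = rho_unnormalized(p, r, gamma, D)
    integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(p))
    return 1.0 / integral

K = normalization_constant(0.1, 1.0, 0.5)
```

For small switching rate r the density piles up near the boundaries, so interior points like p = 0.1 carry more weight than p = 1/2.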

3 Volterra kernels from the learned weights
Here, we visualize the Volterra kernels h_2 and h_3, obtained from the learned weights. The feature vector of the nVAR machine is designed in a peculiar manner, as outlined in fig. 2(a) of the main text. The weights associated with the linear monomials u_1 characterize (as a time-discrete approximation when multiplied by ∆t) the first-order kernel h_1(τ), as shown in fig. 2(b) of the main text. Similarly, one can identify the weights associated with the second-order monomials u_2 as the second-order kernel h_2(τ_1, τ_2), see fig. S1. This kernel is almost zero and does not represent any apparent functional relation. Moreover, including h_2 in nVAR does not improve the performance of the resultant stochastic filtering approximation. Indeed, symmetry implies h_2 = 0 in our case. Finally, we identify the weights associated with third-order monomials u_3 as the third-order kernel h_3(τ_1, τ_2, τ_3), see fig. S2.

4 Performance dependence on ridge parameter

In ridge regression, eq. (12) of the main text, the ridge parameter α has to be set judiciously to control the size of the optimal weights, as a measure to avoid over-fitting. In this section, we show how the ridge parameter affects the performance in our specific example. We considered two approaches in our study. Performance measures for the clipping approach are shown in fig. S3. The performance virtually did not change over a wide range of α.

The nVAR machine gives acceptable results even for shorter training data sets. However, its performance strongly depends on the number of weights that have to be optimized. To highlight this point, we consider the ratio

λ = (size of weight matrix) / (size of the training data) .  (S12)

Fig. S5 shows the mean squared error (mse) as a function of this ratio λ. Here, the size of the training data was kept fixed, while the delay k was varied. Note that the number of weights increases polynomially with k, see also fig. 2(a) of the main text. The training error monotonically decreases if λ increases, approaching virtually zero in the over-fitting regime with λ > 1. In contrast, the testing error changes non-monotonically with λ: the testing error initially decreases to a minimum, then increases and attains a peak close to λ = 1. In the over-fitting regime for λ > 1, the testing error once again decreases. We refer to this non-monotonic dependence on λ as a "double descent curve", similar to [10].

Figure S5: The training error (blue) decreases monotonically as a function of the ratio λ between the number of fit parameters (weights) and the size of the training data. In contrast, the testing error (red) displays a local minimum, and attains a peak close to λ = 1, before it descends again in the over-fitting regime for λ > 1. Parameters: γ = 5 s⁻¹, r = 2 s⁻¹, D = 0.5 s⁻¹, duration of training data = 6 s, duration of testing data = 40 s, ∆t = 0.005 s, N = 2, ridge parameter α = 10⁻³. Reported results represent mean ± s.e.m. from 5 realizations.