Backpropagation in hyperbolic chaos via adjoint shadowing

To generalise the backpropagation method to both discrete-time and continuous-time hyperbolic chaos, we introduce the adjoint shadowing operator S acting on covector fields. We show that S can be equivalently defined as: S is the adjoint of the linear shadowing operator S; S is given by a ‘split then propagate’ expansion formula; S(ω) is the only bounded inhomogeneous adjoint solution of ω. S is the adjoint of the linear shadowing operator S; S is given by a ‘split then propagate’ expansion formula; S(ω) is the only bounded inhomogeneous adjoint solution of ω. By (a), S adjointly expresses the shadowing contribution, a significant part of the linear response, where the linear response is the derivative of the long-time statistics with respect to system parameters. By (b), S also expresses the other part of the linear response, the unstable contribution. By (c), S can be efficiently computed by the nonintrusive shadowing algorithm in Ni and Talnikar (2019 J. Comput. Phys. 395 690–709), which is similar to the conventional backpropagation algorithm. For continuous-time cases, we additionally show that the linear response admits a well-defined decomposition into shadowing and unstable contributions.

The adjoint method is a method for efficiently computing the gradient of an objective (or observable) function Φ with respect to many system parameters γ; it typically involves computing some adjoint operators, and it is typically evaluated on one or a few orbits of a dynamical system.Its cost is almost independent of the number of parameters.Conventionally, the adjoint method solves one inhomogeneous adjoint equation, which is the pullback of covectors, and it runs backwards in time.Hence, adjoint method is also called the backpropagation method.
The adjoint method is the adjoint of the pathwise perturbation method.Hence, the conventional adjoint method works only for stable systems or short-time.For unstable systems with positive Lyapunov exponents, we run into the so-called 'gradient explosion' phenomenon, that is, the adjoint solutions (or the pullback of covectors) grow exponentially fast as we increase the orbit length.As a result, the derivative of the objective at a specific time is poorly defined when the orbit is long, and we need to consider the perturbation of an averaged objective over a long-time or a measure.The derivative of averaged objective is called the linear response, and conventional adjoint method does not work on this case.Pioneering attempts to fix the adjoint method, such as the gradient clipping, do not have good math explanations and not always work well [52].
The linear response is theoretically studied in hyperbolic systems, where all Lyapunov exponents are non-zero [58,18,4,25,28,29,5].The chaotic hypothesis of Gallavotti and Cohen states that many high-dimensional physical systems are hyperbolic [26,10,64].It seems that this hypothesis does not hold strictly, since Wormell and Gottwald recently gave a counter example [64,63].Also, weather systems seem to have a significant amount of non-hyperbolic directions [14], but still seem to have linear responses.However, even with these counter examples, the morale of the chaotic hypothesis seems to be still valid: real physical systems may have a large ratio of hyperbolic directions or a large hyperbolic region, so theories based on hyperbolicity should still be a very important part of a full solution.
The most well-known linear response formulas are the pathwise perturbation formula, the divergence formula, and the kernel differentiation formula, all having adjoint versions.The kernel differentiation formula, also known as the likelihood ratio method in probability context [57,55,27,32,46], requires that the system must be stochastic, and the cost is large when the noise is small.The other two formulas work for both deterministic and stochastic systems; they are not hindered by small noise, but both have their own unique shortcomings.In this paper, we consider only the deterministic case with hyperbolicity: this was the path for classical dynamical systems theory.
The pathwise perturbation formula averages conventional adjoint formula over a lot of orbits, which is also known as the ensemble method or stochastic gradient method; however, it is cursed by the gradient explosion [38,20,41,30].The divergence formula, also known as the transfer operator formula computes the perturbation operator, which is not pointwisely defined, and we have to partition the full phase space to obtain some mollified approximation: this is cursed by dimensionality [36,21,39,16,54,23,24,64,13,2,62,22,31,3,66].A promising direction is to blend the two formulas, but the obstruction was that the split lacks smoothness [1].There are pointwisely defined formulas such as [58,29], but they still involve terms whose expressions are not obvious; moreover, they are not recursive, so can not be computed efficiently; even if we use our tools to make those formulas recursive, the number of recursive relations would be large.
Our recent work, the fast response formula and algorithm, solve the smoothness issue for the blended approach, and computes the linear response of discrete-time hyperbolic systems on a sample orbit [43].In some sense the formula is the ergodic theorem for linear response: it computes the pushforward of 2u + 2 M -dimensional vectors on an orbit, and then the average of some pointwise functions of these vectors converge to the linear response.Here M is the phase space dimension, and u is the number of unstable Lyapunov exponents.This number of recursive relations seems to be close to the least possible, since we need at least u many relations to capture all the unstable perturbative behaviors.
In the discrete-time fast response algorithm, the linear response is decomposed into shadowing and unstable contributions.The shadowing contribution (SC) is given by the shadowing vector, which is the difference between two orbits close to each other but with perturbed parameters [11,53].Due to the similarity between this characterization and the conventional pathwise perturbation method, the shadowing vector can be efficiently computed by the nonintrusive shadowing algorithm [50,51,42].When the ratio of unstable directions is low, with some additional statistical assumptions, shadowing can sometimes be a good approximation of the entire linear response [44].Nonintrusive shadowing is also necessary for efficiently computing the unstable contribution in the fast response algorithm, which is given by a tangent version of the equivariant divergence formula [43].
Our works above are not 'adjoint', and their costs are linear to the number of parameters.Adjoint theories and algorithms are nontrivial: we need more work finding good characterizations than just transposing matrices.Finding a neat characterization is especially difficult in continuous-time, where the flow direction is a Lyapunov vector with zero exponent and can be easily captured.Hence, we need to design special treatment for the flow direction, which also assembles well with other directions.
Moreover, there are some lingering questions about shadowing methods for continuous-time.In particular, it was not clear that, in continuous-time cases, different previous shadowing algorithms were computing the same quantity.It was also not clear whether shadowing methods compute a significant part of the linear response.We shall give positive answers to these questions.

Main results.
As part of the task to develop adjoint methods for the parameter-gradient of long-time statistics of hyperbolic chaos, this paper develops the adjoint shadowing theory for both discrete-and continuous-time systems.Shadowing is of particular interest because: • The shadowing contribution SC can sometimes be a good approximation of the entire linear response; • Shadowing is also necessary for computing the other part of linear response, the unstable contribution U C; • The shadowing covector looks very similar to the conventional backpropagation solution, so it can be implemented relatively easily.
We shall also answer some lingering questions for shadowing methods in continuoustime.
For the discrete-time system of a map f parameterized by γ, section 3 first shows that the adjoint system on the covector space, given by recursively applying f * , is also hyperbolic, where the pullback operator f * is basically the transposed Jacobian matrix.So we can define oblique projections onto the unstable and stable cotangent subspace, denoted by P u and P s .Section 3.2 and section 3.3 prove lemma 4 and theorem 1, the equivalence of three characterizations of the adjoint shadowing operator S for discrete-time.Lemma 4 is pathwise, whereas theorem 1 is stated on an attractor with the physical measure.First, fix a C ∞ observable (also called objective) function Φ.For an orbit on the attractor K, the long-time-average statistic typically coincides with the SRB measure ρ supported on the attractor.Let ρ(Φ) denote the integration of Φ according to ρ.The linear response is δρ(Φ), where γ is some parameter of the system, Let X(K), X α (K), X * (K), and X * α (K) be the spaces of continuous or Holder-continuous vector and covector fields on K. Let S be the linear shadowing operator to be defined in section 2.1.Denote (⋅) n ∶= (⋅)(f n x).Also, for any vector X and covector ω, we denote the product Xω ∶= ωX ∶= ω(X).
Theorem 1 (adjoint shadowing lemma for discrete-time).On a compact mixing axiom A attractor with physical measure ρ, the adjoint shadowing operator S ∶ X * α (K) → X * (K) is equivalently defined by the following characterizations: (a) S is the linear operator such that Hence, if X = δf ○ f −1 , ν = S(dΦ), then the shadowing contribution is Here the last equality holds for almost all initial conditions on the basin of the attractor according to the Lebesgue measure.(b) S(ω) has the expansion formula given by a 'split-propagate' scheme, (c) The shadowing covector ν = S(ω) is the unique solution of the inhomogeneous adjoint equation, Moreover, S preserves Holder continuity.
For the continuous-time systems, the shadowing contribution was not clearly defined before.To do this, first, section 4.2 defines the quotient space where the equivalent relation ∼ is defined as We define the (tangent) shadowing operator S ∶ X α (K) → A(K) as the Here [v, η] is the equivalent class of (v, η) according to ∼.Let F (ψ) be the derivative of ψ along F , define the space of pairs of a covector field and a scalar function, Section 5.1 shows that, for (ω, ψ) ∈ A(K), ⟪S(X); ω, ψ⟫ is well-defined, where ⟪v, η; ω, ψ⟫ ∶= ρ(ωv) − ρ(ηψ).
Hence we can well-define the shadowing contribution SC as These definitions can all be made pathwise.Section 5.2 shows that the other part of the linear response, the unstable contribution U C, is indeed only in the unstable direction.Hence, there is a chance that U C is small when the unstable dimensional is low, using arguments similar to [44].Section 6.2 proves lemma 9, the pathwise adjoint shadowing lemma in continuoustime; section 6.3 proves its attractor version, theorem 2. First, we define some notations.Let f be the flow of the vector field F + γX, where F and X are two fixed vector fields.Denote (⋅) t ∶= (⋅)(f t x).Let ε c be the covector in the center subspace such that ε c (F ) = 1.Let L (⋅) (⋅) be the Lie derivative.Let ∇ (⋅) (⋅) be a Riemannian covariant derivative; so in R M , ∇ F is ∂/∂t.Define the derivative along a covector, say ∇ ν F , as ∇ F ν − L F ν; by eq. ( 11) in section 6.1, Theorem 2 (adjoint shadowing lemma for continuous-time).On a compact mixing axiom A attractor with physical measure ρ and decay of correlations in eqs.(6) and (7), the adjoint shadowing operator S ∶ A(K) → X * (K) is equivalently defined by the following characterizations: (a) S is the linear operator such that ⟪S(X); ω, ψ⟫ = ρ(XS(ω, ψ)) for any X ∈ X α , Hence, if X = δf , ν = S(dΦ, Φ − ρ(Φ)), then the shadowing contribution is (c) The shadowing covector ν = S(ω, ψ) is the unique solution of the inhomogeneous adjoint ODE, Moreover, S preserves Holder continuity.
Section 3.4 and section 6.4 explain and discuss the utilities of each characterization in the adjoint shadowing lemma, for both discrete-and continuous-time.First, the expression of SC in (a) by S gives an 'adjoint' method for SC, which means that the main term ν can be computed by a few recursive relations on an orbit, and does not depend on the parameter, or X: this typically requires some non-trivial characterizations of some adjoint operators, hence the name.The integration in SC can also be sampled on an orbit.
The governing equation in (c) is the same as the conventional backpropagation method, but here we added boundedness.This allows the shadowing covector and hence the shadowing contribution be efficiently computed via the nonintrusive shadowing algorithm in [47] (also see [8], which is explained at the end of section 6.4).Nonintrusive means to compute O(u) many vectors recursively on one orbit, where u is the unstable dimension.The nonintrusive adjoint shadowing algorithm was used on a 10 6 dimensional fluid problem; the cost was on the same order of simulating the flow, and the shadowing contribution is a useful approximation of the entire linear response.
Somewhat surprisingly, using (b), we can see that the other part of the linear response, the unstable contribution, can also be expressed by S applied on an equivariant divergence.This allows the unstable contribution be also computed with O(u) cost per step.The adjoint theory and algorithm for the unstable contribution of discrete-time systems are in [49,45].The theory of continuous-time case uses adjoint shadowing lemma in continuous-time [48].
Section 7 discusses a plausible future line of study: adding randomness.It is known that many systems do not have linear response due to lacking hyperbolicity.However, engineers can still ask for approximate derivatives.The most plausible solution seems to be adding local noise to bad regions with poor hyperbolicity.Then we can use the kernel differentiation trick to circumvent the bad properties of the dynamics.However, this potential program still misses a few key techniques, in particular, we need formulas for the random and deterministic regions to communicate.

Hyperbolicity and tangent shadowing.
We assume uniform hyperbolicity.Let f be a smooth diffeomorphism on a smooth Riemannian manifold M, whose dimension is M .Assume that K is a compact hyperbolic set, that is, T K M (the tangent bundle restricted to K) has a continuous f * -invariant splitting into stable and unstable subspaces, T K M = V s ⊕ V u , such that there are constants C > 0, 0 < λ < 1, and Here f * is the pushforward operator on vectors, when M = R M , f * is the Jacobian matrix ∂f /∂x.Define the oblique projection operators P u and P s , such that We further assume that K is an attractor, that is, there is an open neighborhood U , called the basin of the attractor, such that ∩ n≥0 f n U = K.
Then we review tangent theories and nonintrusive shadowing algorithms.In this paper, 'tangent' means the pushward of vectors, which runs forward in time, and the cost is typically independent of the number of observables Φ. Adjoint solvers typically involve the pullback of covectors, it runs backward and compute the gradient of one objective with respect to many parameters γ.
We define the shadowing vector by its characterizing properties.Take an orbit O ∶= {x n = f (x n−1 )} n∈Z on the attractor with discrete topology on all steps, in particular, if x 1 = x 2 , they are still counted as two disjoint points.Let X(O) be the space of bounded sequences of vectors.For any X ∈ X(O), or {X n ∈ T xn M} n∈Z , the shadowing vector v ∈ X(O) is the only bounded solution of the inhomogeneous tangent equation, Here the subscript (⋅) n ∶= (⋅)(x n ).Since the inhomogeneous tangent equation governs perturbations of an orbit due to δf , shadowing vector is just the first-order difference between shadowing orbits, which are two orbits with slightly different governing equations, but are always close to each other [11].Define shadowing operator We can further define the linear shadowing operator S ∶ X α (K) → X α (K), where X α (K) is the space of Holder-continuous vector fields on K.The appendix A proves that S preserves Holder continuity.Our definition of S is comparable to characterization (c) of the adjoint operator S.
We give the expansion formula of the shadowing vector that fits the definition.First, we write out the conventional inhomogeneous tangent solution with zero initial condition, v ′ .Propagate previous perturbations to step k, then sum up, The expression of the shadowing vector can be obtained similarly, but we should first decompose X, then propagate stable components to the future, unstable components to the past.Both procedures are shown in figure 1.The expansion formula for v is This formula works both on an orbit and on the entire attractor.
We call this procedure to obtain v a 'split-propagate' scheme; it occurs very frequently in orbit-based linear response theory.Since we split X into the stable and unstable vectors, their exponential decay makes v uniformly bounded.The 'propagation' is a variant linear superposition law, which guarantees that v is still an inhomogeneous tangent solution: the proof is similar to that of lemma 4 in section 3.2.
Inhomogeneous tangent solution with zero initial condition (left), and the shadowing vector (right).
We compute the shadowing vector by recovering properties in its definition.Currently, the most efficient algorithm is the nonintrusive shadowing algorithm [50,51].Nonintrusive means to compute the pushforward or pullback of only O(u) many vectors or covectors; here we compute only u many vectors.The algorithm solves the nonintrusive shadowing problem on an orbit, Here v ′ is a particular inhomogeneous adjoint solution.Intuitively, the unstable modes are removed by the orthogonal projection at the last step, where the unstable modes are the most significant [50].Among all algorithms for stable or shadowing part of the linear response, nonintrusive shadowing is faster than some algorithm, including the stable part of blended response [1,60,9]; and it seems to be faster than the others, where a direct comparison is difficult [7,37,34].

Physical measure and linear response.
Many hyperbolic sets, say mixing axiom A attractors, admit physical SRB measures, denoted by ρ.That is, fix any C ∞ observable function Φ ∶ M → R, then for almost all x in the attractor basin U , lim In other words, physical measures give the long-time statistic of the chaotic system.The physical measure has several other characterizations [65].
Assume that the system is parameterized by a parameter γ, the parameterization γ ↦ f is C 1 as a map R → C r , where r ≥ 2. The linear response formula expresses δρ by δf ∈ C 2 , where δ(⋅) ∶= ∂(⋅)/∂γ at γ = 0 may as well be regarded as small perturbations.It is proved that [58], Linear response formulas are proved to give the correct derivative for various hyperbolic systems [58,18,33,15], yet it fails for some cases [4,64].
Different linear response formulas may have very different numerical cost behaviors.But no previous algorithms could afford to run on high-dimensional deterministic hyperbolic chaos.To get a formula with better cost, decompose the linear response, which we call the shadowing and the unstable contribution; this decomposition is straightforward in discrete-time.Like a Leibniz rule, the shadowing contribution accounts for the change of the location of the attractor, while the unstable contribution accounts for the change of the measure.
The shadowing vector gives the shadowing contribution of the linear response for chaotic systems, This utility is similar to conventional tangent solutions, which computes the linear response, or the parameter derivative, of stable dynamical systems.

Adjoint shadowing for discrete-time
This section first shows that the system of pulling-back covectors is also hyperbolic.Then we prove the equivalent characterizations of the adjoint shadowing operator S, first on a hyperbolic path, then on an axiom A attractor with the physical measure.Then we discuss the utility of each characterization.

Adjoint hyperbolicity.
This section shows that the iteration of the pullback operator f * is also hyperbolic.Here f * is the adjoint of f * , that is, for any x ∈ M, w ∈ T x M, and η ∈ T * f x M, f * is the operator such that Define the adjoint projection operators, P u and P s , such that for any Define the image space of P u and P s as V * u and V * s , then for any The next lemma shows that V * u and V * s are in fact unstable and stable subspaces for the adjoint system.Hence, the adjoint system is also hyperbolic, and the adjoint covariant subspaces are dual to the tangent ones.

Lemma 3 (adjoint hyperbolicity).
There are C > 0, 0 < λ < 1, such that for any Remark.(a) Notice that f * moves covectors backwards in time.(b) By this lemma, if we pullback u-many randomly-initialized covectors for many steps, they automatically occupy the unstable subspace, since their unstable parts grow while stable parts decay.
Proof.Since K is compact, there is constant C > 0, such that for any x −n ∈ M, any Here each C is uniform on K, but its exact value may change from line to line.The statement on P u can be proved similarly.□

Pathwise adjoint shadowing lemma.
In classical dynamical system theory, the shadowing lemma can be stated on a path on a hyperbolic set [61].In particular, it does not assume axiom A nor the existence of physical measures.In theorem 1 we assume axiom A so that characterization (a) is more informative and we can prove (a) ⇒ (b).This subsections states a pathwise version of the lemma which does not assume axiom A, but then (a) is less informative.For generality, we use the discrete topology on all steps of an orbit O, in particular, if x 1 = x 2 , they are still counted as two disjoint points.
(b) ν has the expansion formula given by a 'split-propagate' scheme, (c) ν is the unique bounded solution of the inhomogeneous adjoint equation, We define the pathwise adjoint shadowing operator, a linear operator S ∶ X * (O) → X * (O), by S(ω) = ν in (b) or (c).To get (b) from (a), we need more assumptions on the recurrence of all points on O.For example, if O is periodic, or if O is a regular path for the physical measure and ν is continuous, then we can prove (a) ⇒ (b): this is related to the statement of theorem 1.
The expansion of the shadowing covector in (b) is the 'split-propagate' scheme applied on covectors.That is, first decompose ω according to the adjoint hyperbolicity, propagate stable components to the past, unstable components to the future; then ν is the summation of the covectors propagated to the step of interest.This is similar to the conventional adjoint solution, ν ′ , which is the sum of ω from only the future.Both procedures are illustrated in figure 2.
Illustrations for inhomogeneous adjoint solution with zero initial condition (left), and the shadowing covector (right).
The characterization in (c) is very similar to the conventional adjoint method.The governing equation is the same, and the only difference is the extra boundedness requirement.Boundedness is free in stable systems where all perturbations decay; in chaos boundedness requires more cost to achieve.The similarity indicates that the adjoint shadowing covector is also related to the parameter derivatives of the dynamical system, which shall be more clear once we introduce the physical measure.
Proof.(b) ⇒ (c).ν is bounded on the entire attractor K and on every orbit, due to the exponential decay in the adjoint hyperbolicity.For the governing equation, (c) ⇒ (b).For uniqueness, pick any orbit and any two bounded inhomogeneous adjoint solutions, and let ν ′′ be their difference.Then ν ′′ is a bounded homogeneous adjoint solution.If ν ′′ is not zero, then it grows exponentially fast either in the positive or negative time direction, so ν ′′ can not be bounded.
(b) ⇒ (a).We want to show lim P 1 − P 2 = 0, where Define a pivot quantity P 3 , which is a sum involving only ω n and X n for |n| ≤ T , Substitute the expression of S(X) into P 1 Hence The difference between P 2 and P 3 is similar.□

Adjoint shadowing lemma.
Once we have the physical measure ρ, we can define the product by integration of ρ, then we can prove (a) ⇒ (b), so all three characterizations are equivalent.Moreover, the new product gives an expression of a part of linear response.
Theorem 1 (adjoint shadowing lemma for discrete-time).On a compact mixing axiom A attractor with physical measure ρ, the adjoint shadowing operator S ∶ X * α (K) → X * (K) is equivalently defined by the following characterizations: (a) S is the linear operator such that Here the last equality holds for almost all initial conditions on the basin of the attractor according to the Lebesgue measure.(b) S(ω) has the expansion formula given by a 'split-propagate' scheme, (c) The shadowing covector ν = S(ω) is the unique solution of the inhomogeneous adjoint equation, Moreover, S preserves Holder continuity.
The unique existence of the adjoint operator in (a) follows from its equivalence to (b) and (c).Since the operator in (a) is not in Hilbert spaces, we can not get its unique existence from the textbook knowledge on adjoint operators.We may as well define S and S as operators on L 2 (ρ), but many statistical application still require Holder.
Proof.The proof of (b) ⇒ (c) is the same as the pathwise lemma 4.
(c) ⇒ (b).For any point x ∈ K, if another vector field satisfies the adjoint equation but does not equal S(ω), then it must go to infinity either in the past or the future of x.But this contradict our overall assumption that S(ω) is continuous on K so it must be bounded.
(b) ⇒ (a).First prove the duality.We can prove via the pathwise version lemma, or somewhat similarly but more easily, substitute the expansion formula of S in eq. ( 1) into the shadowing contribution in eq. ( 2), then move all operations to ω, apply the invariance of SRB measures, Characterization (c) can be efficiently and conveniently recovered by nonintrusive shadowing algorithms [47].More specifically, to compute shadowing covector ν of ω, we solve the nonintrusive adjoint shadowing problem on an orbit of step 0, 1, . . ., N , Here ν ′ is the inhomogeneous adjoint solution; ε is a matrix with u columns of homogeneous adjoint solutions with ε N randomly generated, Intuitively, the main cause of the growth of ν ′ is that ν ′ − ν contains unstable components.During the pullback, the span of ε becomes a good approximation of the unstable subspace, so we can tame the growth of ν ′ by subtracting a linear combination of columns of ε.There are various ways to find the linear combination, for example, if ν ⊥ V * u at step 0, then ν can not have a significant unstable part most of the time.The proof of the convergence of the algorithm is in [44].
The nonintrusive shadowing algorithm is easy to implement and efficient.It is very similar to conventional backpropagation algorithms, where only one inhomogeneous adjoint solution, ν ′ , is solved.Nonintrusive shadowing runs the conventional backpropagation solver u more times to get ε.Solving extra adjoint equations are not as expensive as solving the first one, since the most expensive cost for solving these linear ODEs is loading the Jacobian matrix f * .
Characterization (a) in the adjoint shadowing lemma tells us that we can use S to compute SC 'adjointly' in the sense of utility, which means that major computations are away from X, so a new parameter (hence a new X) does not require much recomputation.In numerical practice, 'adjoint' also sometimes means the computation involves only the pull back of covectors: that is also true for this paper, but not for the equivariant divergence formula, which is the unstable part of the linear response.
When u ≪ M , it is likely that the unstable projection tends to be small, so the unstable part of the perturbation tends to be small; moreover, if the system has fast decay of correlation, then U C tends to be small, and SC can be a good approximation of the entire linear response [44], which can be expressed by S.However, we must notice that this estimation is based on statistical assumptions which are not universal, and we can construct many examples which violate these assumptions.This estimation is mainly an explanation and expectation of what we observe in practice, for example in [42].The morale is that, when u ≪ M , we should notice that we can perhaps already get something useful with a simple extension of the numerical tools we already have.When the unstable contribution is large, or when better accuracy is desired, for example near design optimal, we need to further compute the unstable contribution [12,37,7].
Because the 'split-propagate' scheme in (b) also appears in the expansion formula of U C, the adjoint shadowing lemma is also important for computing the unstable contribution.Our work on shadowing will not be wasted if we later want a precise algorithm for the entire linear response.More specifically, Here σ is the conditional measure on unstable manifolds, and δ u Lu σ is the unstable perturbation of the transfer operator on σ, which has the equivariant divergence formula, Roughly speaking, div v , the so-called 'equivariant divergence', is the contraction by the unit unstable cube and its co-cube; the covector div v f * is the divergence of the Jacobian matrix f * .This formula is also 'adjoint' in the utility sense, since major computations are away from X and ∇X.This formula can be computed by 2u recursive relations on an orbit: this is in some sense the ergodic theorem or Monte-Carlo formula for the linear response.The detailed theory for the equivariant divergence formula for discrete-time is in [49]; the discrete-time algorithm and numerical examples are in [45].

Hyperbolicity and tangent shadowing.
For continuous-time, let f be the flow of F + γX, where both F and X are smooth vector fields, is the base flow, X is the perturbation, and the parameter γ has base value zero.Assume that Since the attractor K is compact, |F | is uniformly bounded away from zero and infinity.Now T K M has a continuous f * -invariant splitting into stable, unstable, and center subspaces, Where V c is the one-dimensional subspace spanned by F .Should we have F = 0 at some points, then the hyperbolicity of the entire system may be ruined and our works can be affected to various degrees: this requires some new techniques, which we shall discuss in section 7.
The notations in continuous-time is more convoluted.For an orbit x t = f t (x 0 ), a homogeneous tangent equation of e t ∈ T xt M can be written in three ways: Here L is the Lie-derivative and ∇ is the Riemannian derivative.The last expression is an ODE since ∇ F is typically denoted by ∂/∂t in R M .
To define the tangent shadowing operator, for any orbit O with topology of R, (that is, a self-cross is counted as two points), define A(O) as the quotient space Here X 1 (O) is the space of bounded vector fields continuously differentiable along O.The equivalent relation ∼ is defined as We can further define A over K as the quotient space

Note that A(K) has higher regularity than A(O), since we can talk about better regularities on K.
We define the shadowing operator We call (v, η) a 'shadowing pair' of X.Here [v, η] is the equivalent class of (v, η) according to ∼.The map is well-defined, due to the definition of ∼.We sometimes omit the subscript of S and bracket following A and X, when there is no ambiguity, or when the argument applies to both cases.Intuitively, v points to the shadowing orbit and η is the time-rescaling: this is intuitively explained in [50, appendix C], whose η is the negative of this paper.In flows, the shadowing orbit, which lies close to the original orbit, is unique as a set of points.However, the points on shadowing orbits may not move at the same speed, and we need to reparameterize time, so that the points on shadowing orbits are close for all time.This time reparameterization is bounded but not unique [35, theorem 18.1.7].Hence the pair (v, η) is not unique.For example, adding F to v gives another shadowing pair.
We write out the expression of a specific shadowing pair that fits the definition, It follows that η = ε c X so it is Holder on K, where ε c is the unit center covector; v is Holder due to appendix A. Then The nonintrusive shadowing algorithm has been applied to a 4 × 10 6 dimensional system in computational Navier-Stokes equations [50,51,42].It does not solve for the shadowing pair in eq. ( 4), rather, it solves for the orthogonal projection of v, and the integration of the corresponding η.More specifically, it solves Here v ′ is a particular inhomogeneous adjoint solution, and (⋅) ⊥ is the orthogonal projection onto the subspace perpendicular to F .It was not very clear that its result is the same with other shadowing pairs and gives a significant part of (but not the whole) the linear response.Section 5 shall provide positive answers.

Physical measure and linear response.
Under similar assumptions as discrete-time, the attractor admits a physical SRB measure ρ, and similar linear response theorem has been proved [59,18].In particular, the linear response in continuous-time is the same, assuming the decay of correlation, ( 6) Here Φ t = Φ ○ f t , C = div cu σ X cu is the submanifold divergence on the center-unstable manifold under conditional SRB measure.
The shadowing/unstable decomposition of the linear response in continuous-time is a little tricky since it is not apparent which part should have the center direction.This issue is resolved in section 5. We shall also assume another decay of correlation lim T →∞ ρ (ηΦ T ) = ρ(η)ρ(Φ) for any Holder continuous η. (7) Both decorrelation can be proved under various assumptions [40,17], but we shall just assume them in this paper.

Shadowing/unstable decomposition for continuous-time
We define the shadowing/unstable decomposition for the linear response of flows.Due to the presence of the center direction and the non-uniqueness in the shadowing pair, this decomposition is not straightforward and was not previously defined.When defining the shadowing contribution, we want to give positive answers to the two historically lingering questions: (a) Do different shadowing pairs in the same equivalent class in A, defined in eq. ( 3), give the same shadowing contribution?(b) Is the shadowing contribution a significant part of the linear response?

Well-definedness of a product between A and A.
On a hyperbolic set, for an orbit O on with real-line topology, define the space of pairs of a covector field and a scalar function differentiable along F , Here X * is the space of continuous bounded vector fields on O.We can also define A over K with higher regularity The main case in this paper satisfies ω = dψ and hence the constraint in A, but there are cases where the differential holds only in the flow direction, for example in U C: this will be clear in a forthcoming paper.
The products between [v, η] ∈ A defined in section 4.1, and (ω, ψ) ∈ A are We first prove that the product is well-defined.Hence, very roughly speaking, A and A are dual under the product: their 'dimensions' match, since A is X × C modulo an equivalent relation, and A is X × C with a constraint.However, for A and A to be true dual, we have to reduce the regularity from Holder to dual regularities; we shall not do that in this paper.Lemma 5.For (ω, ψ) ∈ A, all (v, η) in a equivalent class in A give the same product ⟪v, η; ω, ψ⟫ on O and on K. Hence ⟪S(X); ω, ψ⟫ is also well-defined.
By definition, v ′′ and η ′′ are both bounded, and so L F v su = 0.However, due to the exponential growth, v su must be zero, otherwise it grows unbounded.Hence, v ′′ ∈ V c , so we can write it as v ′′ = −hF, where ∂h/∂t = η ′′ .By assumptions, v ′′ is bounded, |F | > C > 0, so h is bounded.This basically means that η ′′ has a zero average, otherwise h would be unbounded.
To prove the lemma on an orbit O, note the difference in the product is since h is bounded.
To prove the lemma on K, take an orbit such that its long-time-average of ωv and ψη equal their integrations to ρ, and change the lower bound of integration from −T to time zero.□ 5.2.Unstable contribution is in unstable subspace.Definition 6.On a compact mixing axiom A attractor with physical measure ρ and decay of correlations in eqs.( 6) and (7), define the shadowing contribution in the linear response of flows as where X = δf .Lemma 5 shows that this is well-defined.
We want the leftover part to be in the unstable direction.If so, we can recycle the dimension argument in [44], which is also reviewed in section 3.4, to say that the shadowing contribution could be a good approximation of the entire linear response when the unstable ratio is low.We do not want U C to involve the flow direction, which could be more easily captured than unstable and stable directions.For example, a perturbation in the time equals a perturbation in the flow direction.If U C precludes the flow direction, it is more likely to be small, and SC alone is more likely to be a good approximation of the linear response: sometimes this is indeed the case in practice, such as in [42].

Theorem 7 (shadowing/unstable decomposition of flows). Definition 6 equals
Hence, the unstable contribution, the leftover part of the linear response in eq. ( 5), is Proof.First let v and η be the specific shadowing pair in eq. ( 4).By the invariance of ρ, the stable and unstable part of the expression of SC in the lemma equals For the center direction, note F = ∂/∂t and the decorrelation in eq. ( 7), Hence the expression in the lemma equals the definition of SC in definition 6 with the specific v and η.By lemma 5, we further know that the choice of the shadowing pair does not matter.□ By eq. ( 9), it turns out the center direction has zero average Hence, U C has another expression involving the center direction, which is convenient for deriving the equivariant divergence formula of continuous-time: this will be clear in a forthcoming paper.

Adjoint shadowing for continuous-time
This section develops the adjoint shadowing theory for continuous-time systems.The main issue is the extra center subspace given by the flow direction.There is no exponential decay in the center subspace, but we can still obtain similar results.

Adjoint hyperbolicity.
With some extra notations, we have the same results as the discrete-time.Define the Riemannian derivative along a covector by: where L is the Lie derivative.By this definition, Since ωY is a scalar function, its Lie derivative equals Riemannian derivative, hence, For an orbit x t = f t (x 0 ), a homogeneous adjoint equation of ε t ∈ T * xt M can be written in three ways: Here the pullback operator f * t is the adjoint of f t * .The last expression is an ODE.Define the adjoint projection operator, P u , P s , and P c , similarly as discrete-time.As lemma 3, we can also show that their image spaces hyperbolically split the cotangent space.In particular, since F is non-zero on a compact set K, |f * t P c η| ≤ C|P c η|, for any t ∈ R.
The expression in (b) is similar to the discrete-time case but has the extra center direction.It is not apriorily clear that the extra center direction assembles well with the other directions, as now we have in (c).
In R M the ODE in (c) is the conventional adjoint equation, − dν dt = DF T ν + ω.To see the similarity of (c) with discrete-time, note that f * ν 1 − ν is the time-1 approximation of L F ν.
Proof.(b) ⇒ (c).Boundedness is due to the adjoint hyperbolicity.For the governing equation, first write the expression of ν at τ , Note that for fixed t, f * t−τ ω s t is a homogeneous adjoint solution, so We may also use Lie derivative to write this proof, but the notations with ∇ is more consistent with the backpropagation, or the pathwise perturbation equation, in R M .Further note that the lower bound of the integration interval is τ , so Similarly, For the center direction, ε c is also homogeneous, For the last term, note that an element in the one-dimensional center subspace V * c is uniquely determined by its product with F .Since ω(F )ε c ∈ V * c and ω(F To summarize, ν satisfies the ODE To check the extra constraint, just notice that f * t ω s ∈ V * s , so the integrations in ν has zero product with F .So ν(F ) = ψε c (F ) = ψ for all time: this is the stronger version of (c).
(c) ⇒ (b).For uniqueness, assume the weaker version of (c), that is, pick any orbit and any two bounded solutions of the ODE and ν τ (F τ ) = ψ τ at an arbitrary time τ , Then their difference ν ′′ is a bounded homogeneous adjoint solution of L F ν ′′ = 0 such that ν ′′ τ (F τ ) = 0.If ν ′′ is not zero in the stable or unstable direction, then it can not be bounded.Hence, ν ′′ = Cε c for some constant C.But ν ′′ = 0 at τ , so the difference is always zero.
(b) ⇒ (a).Use the specific shadowing pair in eq. ( 4).The stable and unstable part is the same as discrete time.For the center part, notice Also note ν c = −ψε c , so −ηψ = ν c X for all time.□ 6.3.Adjoint shadowing lemma.
By (a), we can say that S is the adjoint of S. We still have unique existence of S, but we do not prove that from from the standard adjoint operator theory for two reasons.First, the standard theory is a bit technical to invoke for A and A to be rigorously dual; neither does it provide much benefit, in particular, it does not immediately give the Holder continuity.Since we have (b) and (c), we have a more straightforward way to get unique existence and better regularity.
(c) ⇒ (b).Assume ν ∈ X * (K) satisfies the weaker version of (c), that is, L F ν = −ω on K and νF = −ψ for one x ∈ K. First, we can check that S(ω, ψ) indeed satisfies all the conditions in (c), so the set of potential ν's is not empty.Here S temporarily refers to only the expression in (b).
Since K is mixing axiom A, let O be a transitive orbit on K, that is, all points on K are limit points of O.We can see that ν = S(ω, ψ) + Cε c for some constant C on O. Otherwise ν would grow unbounded, violating the overall assumption that ν ∈ X * (K), or ν would not satisfy the center direction of the ODE (see the proof of the pathwise case).Hence, νF = −ψ + C on O.
Note that O does not necessarily contain x.But x is a limit point of O, so there are {x n } n≥0 ⊂ O such that x n → x.Note that by assumptions, ν, F and ψ are continuous on K, so By assumption, C = 0.So ν could only be S(ω, ψ).
The Holder condition of ν su follows from appendix A; ν c = −ψε c is Holder, so ν is also Holder.□ 6.4.Applications of adjoint shadowing lemma and discussions.
By section 5.2 and characterization (a), S adjointly expresses the shadowing contribution, a significant part of the linear response.
By (c), to numerically compute S(ω, ψ), solve the nonintrusive adjoint shadowing problem in continuous-time, where ε is u-many homogeneous adjoint solutions forming a basis of V * u , and ν ′ is an inhomogeneous solution such that ν ′ T (F T ) = ψ T .This problem mimics characterization (c).There are other ways to mimic, for example we previously used a least-squares version of this problem.The nonintrusive adjoint shadowing algorithm was used on fluid examples with u = 8, and M ≈ 3 × 10 6 ; we only computed SC, not the full linear response; the cost was on the same order of simulating the flow.The result is not accurate since we did not compute UC, but the result approximately reflects the relation between the averaged observable ρ(Φ) and the parameter γ, so only computing SC could sometimes gives something useful for engineering purposes [47].
By (b), the adjoint shadowing lemma is also important for computing U C on an orbit, where Here σ is the conditional measure on center-unstable manifolds, and δ cu Lcu σ is the center-unstable perturbation of the transfer operator on σ.In a forthcoming paper, we shall show that if Note that here div v ∇F is no longer the differential of div v F , but they still satisfy the constraint in A. This justifies our seemingly unnecessary involvement of ψ in the continuous-time shadowing theory.A 'discrete-adjoint' nonintrusive algorithm for computing SC of continuous-time linear response [8] was given before our nonintrusive adjoint shadowing algorithm [47].The discrete adjoint is obtained by transposing the linear system in our tangent nonintrusive shadowing algorithm [50].However, its cost is twice larger than ours, since it requires computing and saving and reloading all unstable tangent solutions, which are forward-propagating vectors, at checkpoints.Moreover, its theory was missing or less satisfactory.For example, its solution was not continuous; it did not define a dual operator; the boundedness and well-definedness (as a covector field) of its solution were unclear; its description is more complicated than ours and less similar to the conventional backpropagation method; it can not give a SC + U C decomposition for the linear response of finitely-long time system; finally, it can not be used in the unstable contribution of the linear response.

A future direction: adding randomness
The main drawback of our current works, including shadowing and equivariant divergence formulas, is that they involve the Jacobian matrices.When the Jacobian matrix has bad properties, for example non-hyperbolicity, both formulas are affected, although the algorithm for SC is more robust than U C. U C is more fragile since the equivariant divergence formula has the term S(div v ∇f * ), and both S and div v ∇f * are affected by non-hyperbolicity.For non-hyperbolic problems such as Lorenz 96, we can still compute SC in practice, but not U C.
The main thing to do next should be to overcome the non-hyperbolicity in general.On one end, it is known that some non-hyperbolic systems have no linear response, so we can not expect to compute it accurately [4,64].However, looking at the observable-parameter plot of such systems, it seems that there is still some trend between the γ and ρ(Φ) [46].So, for practical purposes, we may still ask how to compute an approximate linear response which reflects this trend, but also with the least approximation error.Such an approximation would still be very useful for practical purposes such as computational design optimization.
It seems that a plausible idea is to add random noise when the hyperbolicity is bad.This has two benefits, the first is that the linear response of the random dynamical system has another formula, the kernel differentiation formula (also called the likelihood ratio method by probabilists).This formula differentiates only the kernel but not the Jacobian matrix, so the bad properties of the Jacobian, including non-hyperbolicity, does not affect the algorithm at all [57,55,27,32].This formula admits an 'ergodic theorem' version, which runs on only one orbit, and the size of the integrand does not grow with the orbit length [46].Another benefit is that the random noise helps to quickly move out of non-hyperbolic regions.In particular for singular hyperbolic flows, non-hyperbolicity is typically associated with a stagnant point, and it takes a long to move out of a neighborhood (which is actually the cause of the non-hyperbolicity).
We should also try to add minimal amount of noise to reduce the associated error.However, the kernel differentiation formula is expensive when the noise is small, so we can not add a small noise everywhere.So it seems that we have to add noise only to local regions where the deterministic formulas do not work, and we use the kernel formulas only occasionally.Since the area of 'local' regions tends to be small when the dimension is high, it seems that we can have a small noise-induced error by adding a large noise to only a small region.
It seems that there are two major challenges in this random-deterministic switching program.The first is to extend the pathwise perturbation formula and divergence formulas to random dynamical systems.The continuous-time case is more difficult since we need to deal with stochastic calculus.In particular, it requires extra care to recover the time-reparametrization trick which we used for continuous-time shadowing.The second difficulty is to let the stochastic and deterministic formulas communicate.The Bismut-Elworthy-X.Li formula is an example where the pathwise perturbation formula communicates information to the kernel differentiation formula under perturbation to initial conditions [6,19,56], but we need more formulas for other types of communications for parameter perturbations, which might be more difficult.
Moreover, there should be other problem-specific ways to add noise.For dynamical systems obtained from PDE, we might attempt to add space-time noise according to features in the physical space.For example, we might try to add noise only to large vortices.any x, y with d(x, y) ≤ δ, there is a chart containing both x and y, the goal is to show that v is Holder under this chart.
Let Bound the terms in the second sum by triangle inequality, so |f n * P s X −n (x) − f n * P s X −n (y)| ≤ Cλ n where C depends on the hyperbolicity constant, the sup operator norm of P s , and the sup of X, all taken on the entire K.
For the nth term, n ≤ N , for any 0 ≤ k ≤ n, we can write f s (x −k ) and f s (y −k ) as matrices under the same chart U −k , where f s ∶= f * P s .We can also write X(x −n ) and X(y −n ) as Euclidean vectors in chart U n .Apply the finite-difference version of Leibniz rule, Since f s is a Holder continuous operator on K, there are two constants C and α on all U i 's, such that |f s (x) − f s (y)| ≤ Cd α (x, y) for any x, y in the same chart.Let α, C also be the Holder constants for X, and let λµ α ≠ 1.Note that Here C changes from line to line, but each C is uniform for all x, y in K if d(x, y) ≤ δ.As we can see, this proof works on a somewhat more general formula, which not only applies to shadowing formulas, but also applies to the tangent version of the equivariant divergence formula in [43].We say that a map g ∶ T M → T M covers f on K, if for all x ∈ K, g x is a map from T x M to T f x M; we sometimes omit x and write only g.We say g is Holder if it is holder under all coordinate charts.We say g exponentially decays if there is 0 < λ < 1 and C > 0, such that |g n | < Cλ n .For example, f s = f * P s in the proof above is exponentially decaying and Holder continuous.
Lemma 11 (Holder for 'decay-sum' scheme).If f is a bijection map on a compact set K, f and f −1 are Lipschitz continuous; g covers f , h covers f −1 , g and h are Holder continuous and exponentially decaying; Y is Holder continuous.Then the following vector fields on K are Holder continuous: To prove this, just replace f * P s by g, X by Y , v by S 1 in the proof above.For S 2 , repeat the proof with f −1 * P u replaced by h.

3 . 4 .
(a) ⇒ (b).If there is another S ′ satisfying the duality in (a), then ρ((S(ω) − S ′ (ω))X) = 0 for any X ∈ X α .Hence S(ω) = S ′ (ω) ρ-almost everywhere.Appendix A proves that S preserves Holder continuity.□ Applications of adjoint shadowing lemma.The three characterizations are useful for different purposes.Roughly speaking, (a) and (b) tell us what S could be used for once we can compute it, whereas (c) tells us how to compute S.
µ ∶= max{|f * |, |f −1 * |}, where | ⋅ | is the operator norm.Let N (x, y) be the number such that µ N d(x, y) < δ ≤ µ N +1 d(x, y).So for each −N ≤ k ≤ N , x k ∶= f k x and y k are in the same chart, denoted by U k .Partition the sequence in the expansion ofv, v = ∑ 0≤n≤N f n * P s X −n + ∑ n≥N +1 f n * P s X −n − ∑ −N ≤n≤−1 f n * P u X −n − ∑ n≤−N −1 f n * P u X −n .We only look at terms with n ≥ 0; the n ≤ −1 part is similar.Take v(x) − v(y) under the chart containing x and y,v(x) − v(y) = ∑ 0≤n≤N f n * P s X −n (x) − f n * P s X −n (y) + ∑ n≥N +1 f n * P s X −n (x) − f n * P s X −n (y) − ∑ n≤0 ⋯.