A quantum inspired approach to learning dynamical laws from data—block-sparsity and gauge-mediated weight sharing

Recent years have witnessed an increased interest in recovering dynamical laws of complex systems in a largely data-driven fashion under meaningful hypotheses. In this work, we propose a scalable and numerically robust method for this task, utilizing efficient block-sparse tensor train representations of dynamical laws, inspired by similar approaches in quantum many-body systems. Low-rank tensor train representations have been previously derived for dynamical laws of one-dimensional systems. We extend this result to efficient representations of systems with K-mode interactions and controlled approximations of systems with decaying interactions. We further argue that natural structure assumptions on dynamical laws, such as bounded polynomial degrees, can be exploited in the form of block-sparse support patterns of tensor-train cores. Additional structural similarities between interactions of certain modes can be accounted for by weight sharing within the ansatz. To make use of these structure assumptions, we propose a novel optimization algorithm, block-sparsity restricted alternating least squares with gauge-mediated weight sharing. The algorithm is inspired by similar notions in machine learning and achieves a significant improvement in performance over previous approaches. We demonstrate the performance of the method numerically on three one-dimensional systems (the Fermi-Pasta-Ulam-Tsingou system, rotating magnetic dipoles, and point particles interacting via modified Lennard-Jones potentials), observing a highly accurate and noise-robust recovery.


Introduction
Discovering the dynamical laws that govern the time evolution of dynamical systems has been a central task in physics and engineering for centuries, from practical as well as fundamental perspectives. Historically, this task has been approached from two directions: first, by using expert physical knowledge and intuition, and second, by using data obtained by measuring the time evolution of the dynamical system in question. With the ever-growing availability of large amounts of computational power and data, the second approach is becoming increasingly accessible [SL09, BPK16, GKES19, GGR+20, IMW+20, CPSW21, KBK22, CDA+21]. However, since a crucial aspect of using data to learn dynamical laws is choosing the right hypothesis class, physical intuition about the system is essential in developing efficient data-driven algorithms.
One prominent recent approach is the sparse identification of non-linear dynamics (SINDy) algorithm [BPK16, SBK21, dSCQ+20]. Here, the learning task is phrased as a linear inversion problem with respect to a chosen function dictionary. To arrive at a physically motivated hypothesis class, the authors impose sparsity of the recovered dynamical laws with respect to this dictionary, implementing the principle of Occam's razor.
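To make the linear-inversion viewpoint concrete, the following is a minimal sketch of sequentially thresholded least squares, the workhorse behind SINDy; the dictionary functions, threshold value and iteration count are illustrative choices, not the exact settings of the cited references.

    import numpy as np

    def sindy_fit(X, Y, dictionary, threshold=0.1, n_iter=10):
        """Sequentially thresholded least squares, a minimal SINDy-style sketch.

        X: (M, d) states, Y: (M, d) time derivatives, dictionary: list of
        functions mapping the (M, d) sample matrix to an (M,) feature column.
        """
        Phi = np.column_stack([psi(X) for psi in dictionary])   # (M, p)
        Xi, *_ = np.linalg.lstsq(Phi, Y, rcond=None)            # (p, d) coefficients
        for _ in range(n_iter):
            small = np.abs(Xi) < threshold
            Xi[small] = 0.0                                     # enforce sparsity
            for k in range(Y.shape[1]):
                big = ~small[:, k]
                if big.any():                                   # refit surviving terms
                    Xi[big, k], *_ = np.linalg.lstsq(Phi[:, big], Y[:, k], rcond=None)
        return Xi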
The remarkable success of the SINDy algorithm demonstrates that imposing structure in learning dynamical laws is immensely powerful. This is in spite of the fact that Occam's razor is a general heuristic principle that is not linked to any specific physical properties of the system at hand.
Building on the ideas of SINDy and its variant MANDy [GKES19], in ref. [GGR+20] the authors proposed to use locality in one dimension as the structure-imposing physical principle. They have shown that the resultant hypothesis class consists of low-rank tensor trains (TTs) [BSU16], a specific type of efficient tensor network representation of multivariate functions [CSS15, SS16, LYCS17]. A similar result is widely known in the quantum many-body literature [CPGSV21, Oru14, VC04, HMOV14], where it has been shown that low-rank TTs (known in this context as matrix product states) parametrize ground states of one-dimensional local Hamiltonians [VC06], as well as other states of physical importance [ECP10, SWVC08, VMC08].
Another structural observation made by the quantum many-body community is that the TT representations of quantum states that are symmetric under the local action of some symmetry group have tensor cores with a characteristic sparse support, a property dubbed block-sparsity [SPV10]. Recently, it has been proven that functions with bounded polynomial degree (or an equivalent notion suitable for the chosen function dictionary) also admit block-sparse TT representations [GST21, BGP21]. In the context of learning dynamical laws, being able to control the polynomial degree of non-linear functions is a promising primitive. On a technical level, limiting the total polynomial degree offers a natural truncation of a multivariate function space via Taylor's theorem. On a conceptual level, high-degree polynomials in dynamical laws correspond to terms that vary quickly with respect to many modes. The appearance of such terms in dynamical laws is generally a sign that a better choice of coordinates parametrizing the state space could be made. Similar arguments also apply to other dictionaries; e.g., for trigonometric functions, limiting the total "degree" corresponds to neglecting fast oscillating terms in the dynamical law.
With a notion of total degree suitable for the chosen multivariate function dictionary bounded, block-sparse TTs arise as a natural, physically well-motivated, efficient restriction of the ansatz class for dynamical laws. In another context, block-sparsity has been observed to improve the performance of tensor network optimization methods in terms of computational, memory and sample complexity [GST21, BGP21].
The contribution of this work is three-fold: (1) We broaden the range of physical principles that lead to efficient TT representations of dynamical laws. In particular, we show that systems with K-mode interactions admit an efficient TT representation and that systems with algebraically decaying interactions can be approximated with bounded error by a system with an efficient TT representation. (2) We utilize block-sparsity in the context of learning dynamical laws, demonstrating a significantly improved performance of the resulting method compared to previous work. (3) We use self-similarity between certain modes in the system to further restrict the search space and develop a new optimization algorithm for the resultant hypothesis class, referred to as ALS optimization with gauge-mediated weight sharing (ALS-GMWS), which is inspired by similar notions in machine learning [RCR+20]. We show numerically that (2) and (3) improve the scalability of the method, allowing us to learn dynamical laws of systems more than three times larger than the systems presented in ref. [GGR+20].
The remainder of this work is structured as follows: Our setting of learning dynamical laws is formally introduced in Section 2. In Section 3 we give a brief primer on tensor networks. Section 4 shows how a natural truncation of the function space leads to an efficient parametrization via block-sparse TTs. In Section 5 we prove that low-rank TTs parametrize dynamical laws of systems obeying generic physical principles. The concept of self-similarity and the ALS-GMWS algorithm are presented in Section 6. Finally, we numerically demonstrate the performance of our method in Section 7, before concluding in Section 8.

Setup
We will use the notation [N] := {1, . . ., N} for any integer N. Consider a dynamical system with state space S ⊂ R^d, such that its state is described by a real d-dimensional vector x = (x^(1), . . ., x^(d)). We call each of the degrees of freedom x^(i) a mode. The time evolution is a smooth curve x(t) : R → S, which is, given initial conditions x_0, generated by

ẋ(t) = f(x(t))   or   ẍ(t) = f(x(t)),   (1)

where we have restricted our attention to time-independent systems. We assume that it is known which of these forms governs the dynamical system of interest, and we will refer to both eq. (1) and the function f : S → R^d as the dynamical law. The first form of eq. (1) can arise, e.g., from Hamilton's equations, where f(x) = {x, H} with H the Hamiltonian and {•, •} the Poisson bracket, while the second form appears, e.g., in Newton's equations, where f_k(x) is the total force acting on the k-th mode in the state x. By learning a dynamical law we mean identifying f(x) given data pairs

D = {(x_i, y_i)}_{i∈[M]},  y_i ≈ f(x_i).   (2)

The data may come from time-series measurements of a trajectory, with the gradients y_i approximated, e.g., by the method of finite differences.
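As an aside, here is a minimal sketch of how such data pairs could be assembled from a sampled trajectory using central finite differences; the sampling step dt and the second-order stencil are illustrative assumptions, not prescribed by the text.

    import numpy as np

    def data_pairs_from_trajectory(traj, dt, second_order=False):
        """Build (x_i, y_i) pairs from a trajectory array of shape (T, d).

        y_i approximates x'(t_i) (or x''(t_i) for the second form of eq. (1))
        by central finite differences; the boundary points are dropped.
        """
        if second_order:
            y = (traj[2:] - 2 * traj[1:-1] + traj[:-2]) / dt**2
        else:
            y = (traj[2:] - traj[:-2]) / (2 * dt)
        x = traj[1:-1]
        return x, y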
To learn f, we choose a function dictionary {Ψ_i : R → R}_{i∈[p]} of linearly independent functions. Forming a product basis, we obtain the space

A = span{ Ψ_{i_1} ⊗ · · · ⊗ Ψ_{i_d} : (i_1, . . ., i_d) ∈ [p]^d }   (3)

with elements mapping R^d to R, so that A^d can be used as a search space for f : R^d → R^d.
Elements g(x) ∈ A are labeled by tensors Θ ∈ R^{p^d} via

g(x) = Σ_{i∈[p]^d} Θ_i Ψ_{i_1}(x^(1)) · · · Ψ_{i_d}(x^(d)).   (4)

Hence, the dimension of A is p^d, exponential in the system size d, limiting the scalability of learning a function within A. This is an example of the notorious curse of dimensionality. In order to overcome the curse, we use structural constraints on Θ due to known physical properties of the system to identify a physically relevant subspace H ⊂ A with dim(H) ∈ O(poly(d)), defining the physical corner of solutions. Specifically, in this work, we show that for many dynamical systems of interest a suitable H is the set of low-rank block-sparse TTs, as defined in Section 3.
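To make the exponential scaling concrete, the following minimal sketch evaluates g(x) from a dense coefficient tensor Θ; storing Θ alone takes p^d floats, which is already prohibitive for modest d. The monomial dictionary and the problem sizes are illustrative choices.

    import numpy as np
    from itertools import product

    p, d = 4, 6                                     # already p**d = 4096 coefficients
    theta = np.random.randn(*([p] * d))             # dense coefficient tensor
    psi = [lambda t, j=j: t**j for j in range(p)]   # monomial dictionary

    def g(x):
        # Naive sum over all p**d multi-indices -- exponential in d.
        return sum(
            theta[idx] * np.prod([psi[i](x[l]) for l, i in enumerate(idx)])
            for idx in product(range(p), repeat=d)
        )

    print(g(np.random.rand(d)))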

A primer to tensor networks
To set the stage and notation, we start with a brief introduction to tensor networks. For a more thorough treatment of this topic, see refs. [BSU16, BC17]. The curse of dimensionality, as discussed in the previous section, renders high-order tensors hard to work with. Not only are they hard to optimize over, even storing them and performing basic operations on them quickly becomes impractical as the order grows. Tensor networks are a way of decomposing high-order tensors in terms of contractions of smaller tensors. Limiting the ranks of these contractions, we define subspaces of tensor spaces with dimensions scaling polynomially in the order, averting the curse of dimensionality. Crucially for us, such subspaces turn out to contain physically relevant tensors in many cases of interest.
Tensor networks can be visually represented by tensor network diagrams. Consider an order-d tensor T, which is a contraction of tensors {C_i}_i, called tensor cores. To draw a tensor network diagram for T, we draw a node for each tensor core C_i. For each of the indices of the tensor cores we draw an edge, and connect those edges that represent indices that are contracted over in T. The indices represented by connected edges are referred to as virtual indices. The remaining d unconnected edges represent the d physical indices of T.
A widely studied example of such a network is the tensor train. In matrix notation, this contraction of tensors can be written as

T^{i_1,...,i_d} = C_1^{i_1} C_2^{i_2} · · · C_d^{i_d},  C_ℓ^{i_ℓ} ∈ R^{r_{ℓ−1}×r_ℓ},  r_0 = r_d = 1,   (5)

where the superscripts on the tensor cores are the physical indices, while the subscripts are the virtual indices. Any tensor T ∈ R^{p^d} can be represented as a TT if we allow the r_k to scale exponentially with d. To obtain an efficient ansatz class that does not suffer from the curse of dimensionality, we have to bound the r_k by a polynomial in d. We call r = max_{k∈[d]} r_k the TT rank of such a decomposition of T.
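A minimal sketch of this contraction in Python; the core layout (r_prev, p, r_next) is a common convention assumed here, and the evaluation costs O(d p r^2) operations instead of O(p^d) storage.

    import numpy as np

    def tt_contract(cores, indices):
        """Evaluate T^{i_1,...,i_d} for a TT with cores of shape (r_prev, p, r_next)."""
        v = np.ones((1,))
        for C, i in zip(cores, indices):
            v = v @ C[:, i, :]           # multiply the chain of matrices C_l^{i_l}
        return v.item()                  # r_0 = r_d = 1, so the result is a scalar

    # Example: a random rank-2 TT on d = 5 modes with p = 3 physical values.
    d, p, r = 5, 3, 2
    ranks = [1] + [r] * (d - 1) + [1]
    cores = [np.random.randn(ranks[l], p, ranks[l + 1]) for l in range(d)]
    print(tt_contract(cores, [0, 2, 1, 1, 0]))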
An important observation is that the TT decomposition is not unique. The gauge transformation

C_ℓ^{i} ↦ A_{ℓ−1}^{−1} C_ℓ^{i} A_ℓ,  ℓ ∈ [d],  A_0 = A_d = 1,   (7)

applying a matrix multiplication with invertible matrices A_ℓ on the virtual indices of the tensor cores, leaves the TT invariant. Part of this gauge freedom is removed by imposing the so-called left-canonical condition

Σ_{i∈[p]} (C_ℓ^{i})^T C_ℓ^{i} = Id_{r_ℓ}  ∀ℓ ∈ {2, . . ., d − 1}.   (8)

The remaining gauge freedom can be shown to be a unitary transformation.
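A standard way to bring a TT into left-canonical form is a left-to-right QR sweep; the following is a minimal sketch under the same (r_prev, p, r_next) core layout as above.

    import numpy as np

    def left_canonicalize(cores):
        """Gauge a TT into left-canonical form by a left-to-right QR sweep."""
        for l in range(len(cores) - 1):
            r0, p, r1 = cores[l].shape
            Q, R = np.linalg.qr(cores[l].reshape(r0 * p, r1))
            cores[l] = Q.reshape(r0, p, -1)
            # Absorb R into the next core; the represented tensor is unchanged.
            cores[l + 1] = np.einsum('ab,bpc->apc', R, cores[l + 1])
        return cores

    # Verify eq. (8): contracting a canonized core with itself gives the identity.
    cores = left_canonicalize(
        [np.random.randn(r0, 3, r1) for r0, r1 in [(1, 2), (2, 2), (2, 1)]])
    C = cores[1]
    print(np.allclose(np.einsum('apb,apc->bc', C, C), np.eye(C.shape[2])))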

Block-sparsity
In choosing the finite local dictionary, we truncate the univariate function space, e.g., by bounding the polynomial degree or a similar notion relevant to the dictionary at hand. However, if we now take the d-fold tensor product A of the univariate dictionaries as the search space for a multivariate f_k, we introduce terms that are of higher total degree than the imposed local truncation. Thus, from the perspective of multivariate Taylor series, the multivariate dictionary appears inconsistently truncated. Here, it is more natural to work with complete function spaces of bounded total degree. In this section we show that such a natural restriction can be conveniently captured by enforcing a certain support pattern of the TT cores, yielding so-called block-sparse TTs. This structure allows us to allocate resources much more efficiently: we can enlarge the dictionary while keeping the problem-relevant expressivity of the ansatz class and the required computational resources constant.
More formally, we define a degree map w : [p] → N_0, which assigns a degree to each element of a given function dictionary. Without loss of generality we will assume that w is non-decreasing. The degree map provides us with a natural definition of the Laplace-like multivariate degree operator L : R^{p^d} → R^{p^d},

L = Σ_{ℓ=1}^{d} Id_p^{⊗(ℓ−1)} ⊗ Ω ⊗ Id_p^{⊗(d−ℓ)},   (9)

which acts on tensor representations of multivariate functions, where Ω = diag(w(1), . . ., w(p)) and Id_p is the p × p identity matrix. This operator is analogous to the bosonic particle-number operator in quantum physics.
Consider a tensor ϕ with a TT representation with cores {C_ℓ}_{ℓ=1}^d. For each interface at the ℓ-th tensor core, we define the left and right interface operators

L^{<ℓ} = Σ_{k=1}^{ℓ−1} Id_p^{⊗(k−1)} ⊗ Ω ⊗ Id_p^{⊗(ℓ−1−k)},  L^{>ℓ} = Σ_{k=ℓ+1}^{d} Id_p^{⊗(k−ℓ−1)} ⊗ Ω ⊗ Id_p^{⊗(d−k)},   (10)

and the left and right interface tensors ϕ^{<ℓ} and ϕ^{>ℓ}, obtained by contracting the cores C_1, . . ., C_{ℓ−1} and C_{ℓ+1}, . . ., C_d, respectively. When using matrix notation for the interface tensors, we will think of them as linear operators ϕ^{<ℓ} : R^{r_{ℓ−1}} → R^{p^{ℓ−1}} and ϕ^{>ℓ} : R^{p^{d−ℓ}} → R^{r_ℓ}. In ref. [GST21], the following theorem has been shown.

Theorem 1 (Block-sparsity). Consider a TT ϕ with tensor cores {C_ℓ}_{ℓ∈[d]} in left-canonical form with minimal ranks, satisfying Lϕ = λϕ, where L is as in eq. (9) and λ ∈ N_0. Then there exists a unitary gauge transformation {A_ℓ}_{ℓ∈[d−1]}, acting via eq. (7), such that for each interface ℓ ∈ [d − 1] the transformed interface tensors satisfy

ϕ^{>ℓ} L^{>ℓ} = Λ^{>ℓ} ϕ^{>ℓ},   (14)
L^{<ℓ} ϕ^{<ℓ} = ϕ^{<ℓ} Λ^{<ℓ},   (15)

for a set of diagonal matrices {Λ^{>ℓ}}_{ℓ∈[d−1]} with non-increasing diagonal entries.
We provide a slightly simplified proof in Appendix D.
Let us discuss the consequences of Theorem 1 and see how it implies the block-sparse structure of the tensor cores. First, notice that eq. (14) is an eigenvalue equation, i.e., it states that the rows of ϕ^{>ℓ} are left eigenvectors of L^{>ℓ} with eigenvalues given by the corresponding diagonal elements of Λ^{>ℓ}. Similarly, eq. (15) states that the columns of ϕ^{<ℓ} are eigenvectors of L^{<ℓ} with eigenvalues given by the corresponding diagonal elements of Λ^{<ℓ} := λ Id − Λ^{>ℓ−1}.
Choosing any interface ℓ ∈ {2, . . ., d − 1}, we can decompose the degree operator as

L = L^{<ℓ} ⊗ Id_p ⊗ Id + Id ⊗ Ω ⊗ Id + Id ⊗ Id_p ⊗ L^{>ℓ},   (16)

with Ω = diag(w(1), . . ., w(p)), where the double line in the diagrammatic notation collects multiple indices into a single edge. Since we assume that the TT ranks are minimal, ϕ^{<ℓ} (ϕ^{>ℓ}) has full column (row) rank, so the eigenvalue equation Lϕ = λϕ descends to an equation for the core C_ℓ alone (17). If we fix the remaining physical index to i ∈ [p], we obtain the matrix equation

Λ^{<ℓ} (C_ℓ)^i + (C_ℓ)^i Λ^{>ℓ} + w(i) (C_ℓ)^i = λ (C_ℓ)^i.   (18)

Therefore, for all i ∈ [p], we can assign each block-row of (C_ℓ)^i to an eigenvalue of L^{<ℓ}, by which it is multiplied in the first term of eq. (18). Similarly, we can assign each block-column of (C_ℓ)^i to an eigenvalue of L^{>ℓ}, by which it is multiplied in the second term of eq. (18). Now, eq. (18) tells us that a block of (C_ℓ)^i which corresponds to eigenvalues λ^{<ℓ}, λ^{>ℓ} of L^{<ℓ} and L^{>ℓ}, respectively, can be non-zero only if

λ^{<ℓ} + w(i) + λ^{>ℓ} = λ.   (19)

Hence, each non-zero block of C_ℓ connects eigenvectors of L^{<ℓ} with eigenvectors of L^{>ℓ} to fulfil the eigenvalue equation Lϕ = λϕ. In this way, Theorem 1 implies block-sparsity of the tensor cores of an eigenvector of L.
In order to analyze the maximum block sizes, consider a block of C_ℓ that corresponds to the eigenvalues λ^{<ℓ} and λ^{>ℓ}. The block size is limited by the number of times each eigenvalue appears in the spectrum of L^{<ℓ} and L^{>ℓ}, respectively. The interface operators are diagonal, with diagonal elements given by

Σ_{k=1}^{ℓ−1} w(i_k)  and  Σ_{k=ℓ+1}^{d} w(i_k),  (i_1, . . ., i_d) ∈ [p]^d,   (20)

respectively. Note that if we allowed w to take negative values, the maximum block sizes and the number of blocks would become very large.
To obtain a low-rank block-sparse TT, we enforce the block-sparse structure and limit the block sizes to some maximum value ρ.
Example 1: Monomial dictionary. Consider the monomial dictionary

Ψ_i(x) = x^{i−1},  i ∈ [p],   (21)

and the polynomial degree map w : i ↦ i − 1, so that Image(w) = {0, . . ., p − 1}. For physical index i at the ℓ-th mode, there are λ − i + 2 solutions (λ^{<ℓ}, λ^{>ℓ}) to eq. (19), so the matrix (C_ℓ)^i has λ − i + 2 non-zero blocks. A combinatorial argument counting the solutions of eq. (20) gives the maximum block sizes (22). For concreteness, choose λ = 3 and p = 3. The block-sparse structure becomes

(C_ℓ)^1 = [∗ 0 0 0; 0 ∗ 0 0; 0 0 ∗ 0; 0 0 0 ∗],
(C_ℓ)^2 = [0 ∗ 0 0; 0 0 ∗ 0; 0 0 0 ∗; 0 0 0 0],   (23)
(C_ℓ)^3 = [0 0 ∗ 0; 0 0 0 ∗; 0 0 0 0; 0 0 0 0],

where ∗ indicates the non-zero blocks. Here the i-th block-row corresponds to λ^{<ℓ} = i − 1 and the j-th block-column to λ^{>ℓ} = 4 − j.
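The support pattern of eq. (23) can be generated mechanically from the selection rule of eq. (19); the following is a minimal sketch for the monomial dictionary (function and variable names are illustrative).

    import numpy as np

    def monomial_block_support(lam, p):
        """Boolean block-support patterns of (C_l)^i for w(i) = i - 1.

        Block-row r corresponds to lambda^{<l} = r and block-column c to
        lambda^{>l} = lam - c, so by eq. (19) a block is allowed iff
        r + (i - 1) + (lam - c) == lam.
        """
        masks = []
        for i in range(1, p + 1):
            m = np.zeros((lam + 1, lam + 1), dtype=bool)
            for row in range(lam + 1):
                col = row + i - 1          # solve eq. (19) for the column
                if col <= lam:
                    m[row, col] = True
            masks.append(m)
        return masks

    # Reproduces the patterns of eq. (23) for lam = 3, p = 3.
    for i, m in enumerate(monomial_block_support(lam=3, p=3), start=1):
        print(f"physical index {i}:\n{m.astype(int)}")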
Limiting the degree. Theorem 1 states that fixed-degree functions admit a block-sparse TT representation. An often more natural ansatz class are functions with bounded degree. Fortunately, such functions also admit a block-sparse TT representation, as can be seen from the following argument.
Consider a function f : R^d → R with bounded degree λ. It can be written as a linear combination

f = Σ_{j=1}^{n(λ)} f_j,   (25)

where each f_j has a fixed degree at most λ and, hence, admits a block-sparse decomposition. The number of terms n(λ) in this sum is bounded by the number of admissible degrees, n(λ) ≤ λ + 1. Suppose that each f_j has a block-sparse TT decomposition {C_ℓ^{(j)}}_{ℓ∈[d]}. Now we can represent the function f by a TT with tensor cores {C_ℓ}_{ℓ∈[d]}, where for each ℓ ∈ {2, . . ., d − 1} and each value i ∈ [p] of the physical index, the corresponding matrix (C_ℓ)^i = ⊕_j (C_ℓ^{(j)})^i is the block-diagonal direct sum, the tensor core C_1 is given by concatenating the cores C_1^{(j)} along the right virtual index, and the tensor core C_d contains C_d^{(j)} as its j-th block, with all other blocks zero. We sum over the blocks corresponding to each f_j by contracting the right index of C_d with the vector of all ones 1_{n(λ)}. Hence, the TT representation of f becomes

f = C_1 C_2 · · · C_{d−1} (C_d 1_{n(λ)}),   (28)

which is a block-sparse TT.
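The following is a minimal sketch of this direct-sum construction, assuming each TT is stored as a list of cores of shape (r_prev, p, r_next) with boundary ranks 1; the layout and names are illustrative assumptions.

    import numpy as np
    from scipy.linalg import block_diag

    def bounded_degree_tt(tts):
        """Combine fixed-degree TTs into one bounded-degree TT (sketch of eq. (28))."""
        d, p = len(tts[0]), tts[0][0].shape[1]
        # First core: concatenate along the right virtual index.
        cores = [np.concatenate([tt[0] for tt in tts], axis=2)]
        # Middle cores: block-diagonal direct sum for each physical index i.
        for l in range(1, d - 1):
            cores.append(np.stack(
                [block_diag(*[tt[l][:, i, :] for tt in tts]) for i in range(p)],
                axis=1))
        # Last core: stack the C_d^{(j)} as blocks, then sum the branches
        # by contracting the right index with the all-ones vector.
        last = np.stack(
            [block_diag(*[tt[d - 1][:, i, :] for tt in tts]) for i in range(p)],
            axis=1)                                  # shape (sum of ranks, p, n)
        cores.append(last @ np.ones((last.shape[2], 1)))
        return cores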

Efficient TT representation of dynamical laws
In this section we will show how generic properties of dynamical systems imply efficient TT representations of their dynamical laws. We write the function f : S → R^d in eq. (1) using the decomposition (4) as

f_k(x) = Σ_{i∈[p]^d} (Θ_k)_i Ψ_{i_1}(x^(1)) · · · Ψ_{i_d}(x^(d)),  k ∈ [d].   (29)

We call the TT decomposition of Θ_k the TT representation of f_k. Combining the tensors for k ∈ [d], we obtain the tensor Θ ∈ R^{d×p^d}, which can be written as a single TT via

Θ = Σ_{k∈[d]} e_k ⊗ Θ_k,   (30)

where e_k ∈ R^d is the k-th standard basis vector, providing a TT representation of the dynamical law.
In order to show that a function admits an efficient representation, we will bound its separation rank, defined as follows.
Definition 2 (Separation rank). Consider a multivariate function h(x) : R^d → R and a bipartition P = (P_left, P_right) of the modes {x^(1), . . ., x^(d)}. We say that h(x) has separation rank s with respect to P if s is the smallest integer such that h(x) = Σ_{j=1}^{s} u_j v_j, where each u_j depends only on the modes in P_left and each v_j only on the modes in P_right.
It can be shown that the minimal ranks r_k of a TT representation of a function h(x) are equal to its separation ranks with respect to the bipartitions P_k = ({x^(1), . . ., x^(k)}, {x^(k+1), . . ., x^(d)}). For a formal proof of this statement, see ref. [HRS12].
The first generic property of dynamical systems that has been used in ref. [GGR+20] to show efficient TT representations of dynamical laws is locality in one dimension.

Definition 3 (One-dimensional interacting system, Definition 4 in ref. [GGR+20]). A dynamical system, governed by the dynamical law f : S ⊂ R^d → R^d, is one-dimensional with interaction length L and separation rank N if there exists a set of functions {g_{j,i}} such that for each k ∈ [d]

f_k(x) = Σ_{j=1}^{N} Π_{i=k−L}^{k+L} g_{j,i}(x^(i)),   (32)

where we set g_{j,i} = 1 for i ≤ 0 and i ≥ d + 1.
By bounding the separation ranks of the dynamical laws of one-dimensional interacting systems with respect to the bipartitions P_k, the authors prove the following theorem.

Theorem 4 (Efficient TT decomposition of one-dimensional interacting systems, Theorem 5 in ref. [GGR+20]). Consider a one-dimensional interacting system with interaction length L and separation rank N, governed by the dynamical law f(x) : S ⊂ R^d → R^d. Then we have the following. 1. Each f_k(x) admits a TT representation with rank r ≤ N, and 2. the function f admits a TT representation with ranks r_k ≤ k − L + 1 + 2NL.

For many systems of interest, exact locality as in Definition 3 is too strong an assumption. We would like to be able to use TTs also for systems that do not have a sharp bound on the interaction length, but where instead the interactions decay with distance. This is formalized by the following definition.

Definition 5 ((χ, g)-algebraically decaying interactions). A dynamical system with state space S = [0, 1)^d, governed by the dynamical law f : [0, 1)^d → R^d, has (χ, g)-algebraically decaying interactions with separation rank N if each component can be written as

f_k(x) = Σ_{L∈[d]} g_{k,L}(x^(k−L), . . ., x^(k+L)),   (33)

where each g_{k,L} has separation rank at most N with respect to any bipartition and satisfies ∥g_{k,L}∥_2 ≤ g L^{−χ}.
Note that in this definition we demand that the state space be S = [0, 1)^d. This is important so that ∥g_L∥_2 does not change with d. For these systems, we can show the following theorem.

Theorem 6 (Approximate locality). Consider a one-dimensional system with (χ, g)-algebraically decaying interactions and separation rank N, governed by the dynamical law f : [0, 1)^d → R^d, written as in eq. (33) with {g_{k,L}}_{k,L∈[d]} satisfying the assumptions of Definition 5. If χ > 1, then for any L ∈ [d] there exists a one-dimensional interacting system with interaction length L and separation rank NL, governed by a dynamical law f^(L), such that

max_{k∈[d]} ∥f_k − f_k^(L)∥_2 ≤ ε(L),  where  ε(L) = (g/(χ − 1)) L^{1−χ}.   (34)

The proof is given in Appendix A.
Theorem 6 allows us to approximate systems with algebraically decaying interactions by strictly local systems with a bounded error that is independent of d. This is formalized in the following corollary.
Corollary 7 (Low-rank TTs for algebraically decaying interactions). Consider a system with (χ, g)-algebraically decaying interactions and separation rank N, governed by the dynamical law f(x) : S ⊂ R^d → R^d, and suppose that χ > 1. Then, for any ε > 0, each f_k admits an ε-approximate TT representation (in L_2 norm) with rank

r ≤ N ⌈(g/((χ − 1) ε))^{1/(χ−1)}⌉,   (35)

independent of d. The details of the proof are given in Appendix B.
Many systems of physical interest are K-mode interacting systems, formalized by the following definition.

Definition 8 (K-mode interacting systems). A dynamical system, governed by the dynamical law f : S ⊂ R^d → R^d, is K-mode interacting with separation rank N if for each k ∈ [d] there exists a set of functions {g_{k,ℓ}}_ℓ and distinct index sets J_ℓ^(k) ⊂ [d] with |J_ℓ^(k)| = K and k ∈ J_ℓ^(k), such that

f_k(x) = Σ_ℓ g_{k,ℓ}(x_{J_ℓ^(k)}),   (36)

where each g_{k,ℓ} depends only on the modes in J_ℓ^(k) and has separation rank at most N.

An example of a 2-mode interacting system is, e.g., a collection of gravitationally interacting particles, where the total force on each particle is given by the sum of pairwise forces with respect to the remaining particles. We will now show that if a system is K-mode interacting with K ⪅ 5, it admits an efficient TT representation.

Theorem 9 (Efficient TT representation of K-mode interacting systems). Consider a K-mode interacting system with separation rank N, governed by a dynamical law f(x) : S ⊂ R^d → R^d. Then we have the following. 1. Each f_k(x) admits a TT representation with rank r ≤ N (d−1 choose K−1), and 2. the function f admits a TT representation with ranks r_k ≤ c_2(N, d, k), where c_2(N, d, k) is polynomial in d of degree at most K.

To prove Theorem 9, we adapt the techniques of the proof of Theorem 4; the details are given in Appendix C.
Already for K ⪆ 5, although c_2(N, d, k) is polynomial in d, the polynomial degree makes working with such TTs prohibitively expensive even for modest d. However, many systems of physical interest are known to have K = 2, 3, 4 and hence admit an efficient TT representation. It is important to note that this result does not rely on the underlying system being one-dimensional. Finally, note that if a dynamical system has at most K-mode interactions, it still admits an efficient TT decomposition, since the rank of a TT is sub-additive.
General conditions for the approximability of multivariate functions by low-rank TTs, in terms of tail control of the singular value spectra of the matrix unfoldings of their coefficient tensors, are derived in ref. [BSU16]. As we discuss in Appendix E, these results are in a precise sense analogous to the control of matrix product state approximations of quantum states based on entropy scaling conditions derived in the quantum many-body literature [SWVC08, VC06].
Gauge-mediated weight sharing (ALS-GMWS)

Self-similarity

Additional structure in the system, known prior to learning, can cause certain modes to play the same role in the dynamical laws of multiple modes, implying that we would like the corresponding tensor cores to be equal. We call this self-similarity. The different roles each mode can play are referred to as activation types. For example, in the case of a one-dimensional dynamical system with interaction length L, Definition 3, the j-th mode plays the same role in all functions f_k with k < j − L, namely that the mode is to the left of and outside the interaction range. Similarly, the role of the mode is the same for all k > j + L. Hence, such systems have 2L + 3 activation types.
Self-similar systems with α activation types can be described by a set of dα tensor cores {C_ℓ^{(j)}}_{ℓ∈[d], j∈[α]}. The recipe for building the corresponding tensor train representations of the dynamical laws is encoded in a selection table S ∈ [α]^{d×d}: the TT representation of f_k uses the core C_ℓ^{(S_{k,ℓ})} at mode ℓ, where the block-sparse representation of functions with bounded degree, eq. (28), is used. We call such systems S-self-similar.
In the case of one-dimensional interacting systems with interaction length L, the selection table takes the form (up to a relabeling of the activation types)

(S_L)_{k,ℓ} = 1 if ℓ < k − L,  (S_L)_{k,ℓ} = ℓ − k + L + 2 if |ℓ − k| ≤ L,  (S_L)_{k,ℓ} = 2L + 3 if ℓ > k + L.   (41)

ALS optimization of self-similar systems
In learning dynamical laws, given data D = {(x_i, y_i)}_{i∈[M]}, we would like to identify an element f̃ of some ansatz class that minimizes the empirical loss

L_emp(f̃) = (1/M) Σ_{i∈[M]} ∥f̃(x_i) − y_i∥_2^2.   (42)

In the previous sections we have shown that a natural choice of ansatz class for many dynamical systems of interest are block-sparse low-rank TTs with self-similarity given by the selection table of eq. (41).
Previously, in ref. [GGR+20], alternating least squares (ALS) optimization (and a rank-adaptive variant [GK19]) has been used to minimize L_emp over low-rank TTs. In the ALS procedure, the tensor cores are iterated over in sweeps, at each step solving a linear least-squares problem to minimize the empirical loss as a function of the given core, until convergence. The ALS algorithm can be adapted to block-sparse tensor trains by restricting each contraction in the algorithm to indices labeling elements that are non-zero in the block-sparse structure [GST21].
However, optimization over systems with self-similarity requires more care. In ref. [GGR+20], a selection tensor approach has been taken, where the dynamical law is written as a contraction of the TT representations θ_k of the f_k with a selection tensor T that, given k ∈ [d], picks the correct activation type (eq. (43)). In this form, block-sparse ALS can be directly applied. However, this method leads to numerical instabilities in the optimization, limiting its scalability to d ⪅ 20.
The core problem is that the unitary gauge freedom is not properly taken care of. To see this, suppose that the ALS sweeps are performed from left to right and that the tensor core C_ℓ^{(j)}, corresponding to the mode ℓ ∈ [d] and activation type j ∈ [α], is being optimized. The left neighbour of this tensor core can have a different activation type for different choices of k. Since each of the left-neighbour activation types has been optimized in a separate ALS step, they are each written in a different gauge. Using the form of eq. (43) does not allow the gauge differences to be corrected for, and we therefore have to rely on the gauges converging as more ALS sweeps are performed.
We will now introduce a new algorithm for the optimization of block-sparse low-rank TTs with self-similarity, called ALS with gauge-mediated weight sharing (ALS-GMWS), that allows the left-neighbour gauge to be adjusted in each step. It will be useful to define the restricted empirical cost, which, given a subset E ⊂ [d], is given by

L_emp^E(f̃) = (1/M) Σ_{i∈[M]} ∥(f̃(x_i) − y_i)_E∥_2^2,   (44)

where the notation (v)_E for v ∈ R^d denotes the restriction of v to the subspace defined by the set of index values E.
Assume without loss of generality that the ALS sweeps are performed left to right. Suppose we want to optimize the core C_ℓ^{(j)}, and let E = {k ∈ [d] : S_{k,ℓ} = j} be the set of indices of functions where this core is used, as determined by the selection table S. If ℓ = 1, we do not need to worry about the gauge at all, since we are optimizing left to right. We simply perform a block-sparse ALS step to find the C_ℓ^{(j)} that minimizes L_emp^E(f̃) with all the other tensor cores fixed. Now consider ℓ ≥ 2. Among the functions labeled by e ∈ E, we can only be sure that the gauge is the same within the equivalence classes defined by the values of S_{e,ℓ−1}, or, in other words, only if the left neighbour, the tensor core of the (ℓ − 1)-th mode, has the same activation type. Hence, we divide E into disjoint sets {E_a ⊂ E}_{a∈[α]} labeled by the activation type of the (ℓ − 1)-th mode, such that, for each a ∈ [α], S_{e,ℓ−1} = a for all e ∈ E_a. In order to use the most information available, we find ã = argmax_{a∈[α]} |E_a| and perform the block-sparse ALS step to find the C_ℓ^{(j)} that minimizes L_emp^{E_ã}(f̃), with all the other tensor cores fixed. Now that we have found the optimal C_ℓ^{(j)} with respect to E_ã, we optimize the gauge of C_{ℓ−1}^{(a)} for a ≠ ã, so that using the newly found core in the corresponding functions is justified. Hence, for each a ∈ [α] such that a ≠ ã and |E_a| ≠ 0, we want to find a gauge-fixing unitary U_a that minimizes L_emp^{E_a}(f̃) under the transformation

C_{ℓ−1}^{(a)} ↦ C_{ℓ−1}^{(a)} U_a,   (45)

with all other tensor cores fixed. In order to preserve the block-sparse structure, U_a needs to be block-diagonal, with block sizes given by the block-row heights of C_ℓ. In practice, we optimize over all block-diagonal matrices, although optimization over block-diagonal unitaries is possible and could lead to improvements.
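To make the gauge step concrete, the following is a minimal sketch (not the authors' implementation) of the least-squares fit of a block-diagonal gauge matrix. It assumes that everything to the left of the virtual index carrying U_a has been contracted into a feature matrix G, and everything to its right (including the new core C_ℓ^{(j)} and the data) into a matrix H, so that the model output is linear in the entries of U_a; the names G, H and the shapes are illustrative assumptions.

    import numpy as np

    def fit_block_diagonal_gauge(G, H, y, blocks):
        """Least-squares fit of a block-diagonal gauge matrix U.

        Sketch of the step around eq. (45): the model prediction for sample n
        is assumed to be sum_{r,s} G[n, r] * U[r, s] * H[n, s], linear in U.
        `blocks` lists index ranges [(lo, hi), ...] of the allowed diagonal
        blocks, matching the block structure of the virtual index.
        """
        r = G.shape[1]
        columns, positions = [], []
        for lo, hi in blocks:
            for i in range(lo, hi):
                for j in range(lo, hi):
                    columns.append(G[:, i] * H[:, j])   # design column for U[i, j]
                    positions.append((i, j))
        A = np.stack(columns, axis=1)
        u, *_ = np.linalg.lstsq(A, y, rcond=None)
        U = np.zeros((r, r))
        for (i, j), val in zip(positions, u):
            U[i, j] = val
        return U

Restricting the design matrix to the block-diagonal entries preserves the block-sparse support by construction; enforcing unitarity on top of this could be done, e.g., by a polar decomposition of each block.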
The ALS-GMWS algorithm is summarized in Algorithm 1.

Numerical experiments
We will demonstrate our method on three example dynamical systems: the Fermi-Pasta-Ulam-Tsingou (FPUT) system, a one-dimensional chain of rotating magnetic dipoles, and a chain of atoms interacting via a modified Lennard-Jones interaction.

Algorithm 1: ALS optimization with gauge-mediated weight sharing.
FPUT system. The FPUT system is a chain of non-linear springs with spring constants κ_ℓ. The dynamical laws are given by

ẍ^(k) = κ_k (x^(k+1) − x^(k)) − κ_{k−1} (x^(k) − x^(k−1)) + β_k (x^(k+1) − x^(k))^3 − β_{k−1} (x^(k) − x^(k−1))^3,   (46)

which is a one-dimensional system with interaction length L = 1 and, using the monomial dictionary of eq. (21), separation rank N = 4, such that each f_k can be represented by a TT with rank bounded by r = 4. Furthermore, the polynomial degree of the equations is bounded by 3 and, hence, we can use TTs of the form (28), with block-sparse structure given by eq. (23) with λ = 3 and block sizes bounded by 4, to represent the system exactly.
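A minimal sketch of eq. (46) as a right-hand-side function, assuming fixed boundary conditions x^(0) = x^(d+1) = 0 (the boundary handling and parameter values are assumptions, not fixed by the text); such a function can both generate training data and serve as the ground truth f in the residuum benchmark below.

    import numpy as np

    def fput_law(x, kappa, beta):
        """Evaluate eq. (46) for displacements x of shape (d,).

        kappa and beta hold the d + 1 spring parameters; fixed walls are
        assumed at both ends of the chain.
        """
        xp = np.concatenate(([0.0], x, [0.0]))
        dr = np.diff(xp)                       # dr[l] = x^(l+1) - x^(l)
        return (kappa[1:] * dr[1:] - kappa[:-1] * dr[:-1]
                + beta[1:] * dr[1:] ** 3 - beta[:-1] * dr[:-1] ** 3)

    d = 50
    print(fput_law(np.random.rand(d),
                   kappa=np.ones(d + 1), beta=0.7 * np.ones(d + 1)).shape)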
Rotating magnetic dipoles. Here, we have a chain of magnetic dipoles at positions X_ℓ with magnetic dipole moments M_ℓ and moments of inertia I_ℓ. They are free to rotate in the plane perpendicular to the chain, and their angles of rotation are x^(ℓ) ∈ [0, 2π). The dynamical laws take the form of pairwise dipole-dipole torques,

I_k ẍ^(k) = Σ_{ℓ≠k} (μ_0 M_k M_ℓ / (4π |X_k − X_ℓ|^3)) sin(x^(k) − x^(ℓ)),   (47)

and the positions X_ℓ are fixed and equally spaced. This is a 2-mode interacting system, Definition 8, with separation rank N = 2, using the trigonometric dictionary of eq. (24). Hence, it suffices to use TTs of rank r = 2(d − 1). Since the degree, given by w(1) = 0, w(2) = w(3) = 1, is bounded by 2, we can use the block-sparse structure with block sizes bounded by 2(d − 1) and represent the laws exactly. However, this is also a system with (3, (1/2)√(2π))-algebraically decaying interactions (after rescaling x^(ℓ) → (2π)^{−1} x^(ℓ), so that x^(ℓ) ∈ [0, 1)) and separation rank 2, Definition 5. Hence, Corollary 7 allows us to limit the block sizes to a constant (in d) and obtain an approximation of the dynamical law with block-sparse TTs.
Lennard-Jones chain. The final example is a chain of particles of masses m_ℓ that interact via a modified Lennard-Jones potential. This is the hardest of the three examples to learn. The dynamical laws for the positions x^(ℓ) of the particles along the chain are given by eq. (49), where ε_{k,ℓ}, R_{k,ℓ} are parameters of the interaction between the modes k, ℓ and we set q = 2. Since it is hard to approximate inverse functions with polynomial dictionaries, we learn an accordingly transformed version of eq. (49) from the correspondingly transformed data. This is a 2-mode interacting system with separation rank N = p^2 with respect to polynomial dictionaries, since polynomial expansions of inverse functions contain an infinite number of terms. By Theorem 9, we can represent the dynamical laws by TTs with rank bounded by r = p^2 (d − 1). For good approximations of the inverse function we require large p, so for practical use-cases we would like to limit the rank further. This is justified by Corollary 7, since this is also a system with (2, g)-algebraically decaying interactions and separation rank p^2, provided there is a finite amount of energy in the system and the initial conditions are chosen such that the state space is S ⊂ ×_{ℓ∈[d]} I_ℓ, where the I_ℓ are finite intervals for all ℓ ∈ [d].
The polynomial degree of eq. (49) is bounded by 2p, such that we can use block-sparsity, where each core (C_ℓ)^i has non-zero blocks only on the i-th block-diagonal.
For completeness, we include the formula for the total energy in the system in eq. (51).

Results
For each of the three example systems, we randomly draw data D = {x_i}_{i∈[M]} from the corresponding S and compute y_i = f(x_i) + η_i^σ, where the elements of η_i^σ are drawn from a Gaussian distribution with standard deviation σ. Unless stated otherwise, we use noiseless data with σ = 0. Therefore, the data that we use for learning do not come from trajectories of the dynamical systems; instead, they are random (potentially noisy) evaluations of the dynamical law f(x). This somewhat simplifies the setting, especially since we do not have to approximate ẋ(t) or ẍ(t) in eq. (1) from the trajectory, using, e.g., finite differences. However, since we can hope to learn the dynamical laws only from trajectories that sufficiently explore the state space S, sampling S at random is not too different from using such trajectories.
Given D, we use ALS-GMWS to find an S_L-self-similar low-rank block-sparse TT representation of an estimate f̃ of the dynamical law. We benchmark the quality of the estimate with respect to the true dynamical law f via the residuum

r(f̃) = ( Σ_{i∈[M′]} ∥f̃(x′_i) − f(x′_i)∥_2^2 / Σ_{i∈[M′]} ∥f(x′_i)∥_2^2 )^{1/2},   (52)

where {x′_i}_{i∈[M′]} are random samples from S, different from the samples used for training. In particular, we use M′ = 2 × 10^4.
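A minimal sketch of this benchmark, assuming the relative-error form of eq. (52) given above; f_hat and f_true are callables mapping a state to R^d, and their names are illustrative.

    import numpy as np

    def residuum(f_hat, f_true, test_samples):
        """Relative L2 error of the estimate on held-out random samples."""
        err = sum(np.linalg.norm(f_hat(x) - f_true(x)) ** 2 for x in test_samples)
        nrm = sum(np.linalg.norm(f_true(x)) ** 2 for x in test_samples)
        return np.sqrt(err / nrm)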
All experiments have been conducted on consumer-grade hardware, and the code has not been optimized for speed.
The block-sparsity used for polynomial dictionaries is such that the non-zero blocks of (C_ℓ)^i are on the i-th block-diagonal, while for the trigonometric dictionary, eq. (24), it is such that (C_ℓ)^0 is block-diagonal, while (C_ℓ)^1 and (C_ℓ)^2 have non-zero blocks on the first block-diagonal. In both cases, if we bound the degree by λ, the number of block-rows and block-columns is λ + 1. The maximum block size is ρ. For all experiments we have used self-similarity given by the selection table S_L defined in eq. (41).
FPUT system. To recover FPUT systems, eq. (46), of size d = 50, we use the degree-3 Legendre polynomial dictionary. In Fig. 1 we show the recovery of the translationally invariant FPUT system and of an FPUT system with randomly sampled spring constants. We plot the residuum achieved after 10 ALS-GMWS sweeps, using varying numbers M of training samples. Both systems are successfully recovered using around 2 × 10^3 training samples.
Rotating magnetic dipoles. We perform three experiments on the rotating magnetic dipole chain.
First, we use the trigonometric dictionary to recover chains with 10 ≤ d ≤ 50. In Fig. 2 we plot the residua achieved using varying numbers M of training samples. We show the results for L = 5, 9. For d = 50, successful recovery requires around 4 × 10^2 training samples for L = 5 and 9 × 10^2 training samples for L = 9, where the latter however achieves an around 10 times smaller residuum.
Second, we use the degree-9 Legendre polynomial dictionary to recover chains with d = 10, 20, 30. The residua for varying numbers M of training samples are shown in Fig. 3. For d = 30, we require around 1.7 × 10^4 samples. Compared with the previous experiment, this demonstrates the importance of choosing an appropriate dictionary.
Finally, we use the trigonometric dictionary to recover chains with d = 20, 50 from noisy data with varying noise levels σ. We plot the residua achieved using varying numbers M of training samples in Fig. 4. The results show recovery of the system down to the noise level, demonstrating the noise robustness of the proposed method.
Lennard-Jones chain. We recover the Lennard-Jones chain, eq. (49), for d = 10, using the degree-8 Legendre polynomial dictionary, and set L = 5. Fig. 5 shows the residua after 8 ALS-GMWS sweeps for various maximum block sizes, as a function of the training set size. This is clearly the hardest example, requiring around 6 × 10^3 samples for the successful recovery of even such a small system. Furthermore, the plot shows three initialization instances where the algorithm has not converged well. In practice it is therefore sometimes beneficial to run the algorithm multiple times with different initializations.

Conclusions
Learning dynamical laws is a key task that has been moving into the focus of attention. The well-known and immensely popular SINDy approach introduces a data-driven algorithm for obtaining dynamical systems from data. This approach allows for a reliable recovery for small systems, but is not scalable to a large number of degrees of freedom: for this, a meaningful, physically motivated restriction of the hypothesis class is necessary. In this work, we have overcome this obstacle. We have shown that block-sparse tensor trains (TTs) with self-similarity provide a suitable efficient ansatz class for learning dynamical laws from data in many contexts of practical interest. In particular, we have proven that these include local one-dimensional systems, one-dimensional systems with algebraically decaying interactions, and systems with K-mode interactions in any number of dimensions. For learning dynamical laws within this class, we have developed a new variant of the alternating least squares (ALS) algorithm for block-sparse TTs, which we refer to as ALS with gauge-mediated weight sharing (ALS-GMWS) and which is suitable for self-similar systems. The method has been successfully demonstrated on three physically relevant one-dimensional dynamical systems, and robustness to additive Gaussian noise in the data has been demonstrated.

A Proof of Theorem 6
Theorem 6 bounds, for χ > 1, the error that we incur by representing the dynamical law of a system with (χ, g)-algebraically decaying interactions and separation rank N by a low-rank TT. The original system is governed by the dynamical law

f_k(x) = Σ_{L′∈[d]} g_{k,L′}(x^(k−L′), . . ., x^(k+L′)).   (53)

Consider the approximation of this system obtained by truncating the interactions at range L,

f_k^(L)(x) = Σ_{L′=1}^{L} g_{k,L′}(x^(k−L′), . . ., x^(k+L′)).   (54)

The approximation error is bounded by

∥f_k − f_k^(L)∥_2 ≤ Σ_{L′=L+1}^{d} ∥g_{k,L′}∥_2 ≤ g Σ_{L′=L+1}^{∞} (L′)^{−χ} ≤ (g/(χ − 1)) L^{1−χ},   (55)

where we use the sub-additivity of the L_2 norm to get the first inequality, and the last inequality comes from using the upper bound

Σ_{L′=L+1}^{∞} (L′)^{−χ} ≤ ∫_L^∞ t^{−χ} dt = L^{1−χ}/(χ − 1).   (56)

It remains to be shown that the truncated dynamical system corresponding to the dynamical law f^(L) is one-dimensional with interaction length L and separation rank NL. Since all of the terms in the sum of eq. (54) depend trivially on x^(i) with i ∉ [k − L, k + L], the system has interaction length L by definition. Moreover, since it is a sum of L functions with separation rank at most N with respect to any bipartition P_k, the total separation rank is bounded by NL. □

B Proof of Corollary 7
Corollary 7 shows that there exists an ε-approximate low-rank TT representation of a dynamical system with (χ, g)-algebraically decaying interactions and separation rank N, if χ > 1. To prove this, use Theorem 6 to find that there exists a one-dimensional dynamical system with interaction length L and separation rank NL, governed by a dynamical law f̃, such that, for all k ∈ [d],

∥f_k − f̃_k∥_2 ≤ ε.   (57)

This will certainly be satisfied if

(g/(χ − 1)) L^{1−χ} ≤ ε.   (58)

Rearranging, we obtain

L ≥ (g/((χ − 1) ε))^{1/(χ−1)}.   (59)

From Theorem 4 we know that in this approximation each f̃_k admits a TT representation with rank r ≤ NL, so the claim follows. □

C Proof of Theorem 9

Theorem 9 bounds the TT rank of the TT representation of the dynamical law f(x). In order to do this, we need to bound the separation ranks with respect to the bipartitions P_k = ({x_1, . . ., x_k}, {x_{k+1}, . . ., x_d}). To prove claim 1., first we need to show that the number of distinct index sets is bounded,

|{J_ℓ^(k)}_ℓ| ≤ (d−1 choose K−1).   (60)

To see this, note that we require each J_ℓ^(k) to be distinct and to contain k. Hence, the upper bound is the number of ways of selecting the remaining K − 1 elements of J_ℓ^(k). Now, rewrite the decomposition of the dynamical law of a K-mode interacting system with separation rank N as a sum of tensor products of single-variable functions of each mode, padding each term with the trivial function Id(x) = 1 on the modes outside J_ℓ^(k). The separation rank of f_k with respect to any P_k is then upper bounded by the number of terms in the sum, which is N (d−1 choose K−1). Since the minimal ranks of the TT representation of f_k are equal to the corresponding separation ranks, we get that r ≤ N (d−1 choose K−1).

The proof of claim 2. proceeds along similar lines. We write the total dynamical law f(x), which we think of in terms of eq. (30), as a sum over k′ ∈ [d] of terms e_{k′} ⊗ g̃_{k′,ℓ}(x), where e_{k′} ∈ R^d is the vector with a one at the k′-th element and zeroes elsewhere. Here each g̃_{k′,ℓ} represents a function that depends non-trivially only on the x^(i) with i ∈ J_ℓ^(k′), and which has separation rank at most N with respect to any bipartition of [d], due to the assumption that f(x) is the dynamical law of a K-mode interacting system with separation rank N. Given a bipartition P_k = (P_k^left, P_k^right) = ({x_1, . . ., x_k}, {x_{k+1}, . . ., x_d}), we can decompose f(x) into three contributions, (71), (72) and (73), according to how the supports J_ℓ^(k′) lie relative to the bipartition, where we abuse the subset notation, writing J_ℓ^(k′) ⊂ P_k^{left,right} to indicate that x_j ∈ P_k^{left,right} for all j ∈ J_ℓ^(k′). Furthermore, we denote by g′_{k′,ℓ}, g′′_{k′,ℓ} the restrictions of g_{k′,ℓ} onto the modes x_{k+1}, . . ., x_d and x_1, . . ., x_k, respectively. Note that this is well defined, since we always use this notation in the cases where g depends trivially on the modes that we throw away.
Each term (71), (72), (73) is now written in such a way that we can read off a bound on its separation rank with respect to P_k. The term (71) has separation rank at most 1 with respect to P_k. The term (72) vanishes if k′ ≥ k + 1, since then the condition on the second sum cannot be satisfied. For each k′ ≤ k, the number of terms in the sum, each of which has separation rank 1 with respect to P_k, is upper bounded by (k−1 choose K−1), so the bound on the separation rank of eq. (72) is k (k−1 choose K−1). Finally, in the last term (73), we know that each g̃_{k′,ℓ} has separation rank at most N with respect to any bipartition, and the number of contributing ℓ is again bounded by the counting argument of eq. (60). Putting all the bounds together, we recover claim 2. of the theorem. □

D Proof of Theorem 1
Theorem 1 states that functions with fixed degree, as given by the degree map w, admit a TT representation such that the left and right interface tensors satisfy the eigenvalue equations

ϕ^{>ℓ} L^{>ℓ} = Λ^{>ℓ} ϕ^{>ℓ},   (75)
L^{<ℓ} ϕ^{<ℓ} = ϕ^{<ℓ} Λ^{<ℓ}.   (76)

We have a TT ϕ with tensor cores {C_i}_{i∈[d]}. Without loss of generality, we can assume that this TT is in left-canonical form, so that the tensor cores satisfy eq. (8), and that the ranks are minimal. If this is not the case, we can always find a gauge transformation that puts the TT into this form. We also assume that ϕ is an eigenvector of L, i.e., that Lϕ = λϕ. To prove the theorem, we will inductively gauge transform each tensor core, starting at C_d and proceeding one by one towards C_1, such that after transforming C_ℓ, the right interface tensor satisfies eq. (75) for all ℓ′ ≥ ℓ − 1. Finally, we will show that eq. (76) follows from eq. (75).
Base case. First, we will find an appropriate gauge transformation for C_d. Contracting the first d − 1 physical indices of the eigenvalue equation Lϕ = λϕ with ϕ^{<d*} and using the assumption that the TT is written in left-canonical form, we can write, in matrix notation,

ϕ^{<d†} L^{<d} ϕ^{<d} C_d = C_d (λ Id − Ω),   (82)

where Ω = diag(w(1), . . ., w(p)). Since ϕ^{<d†} L^{<d} ϕ^{<d} is Hermitian, there exists a unitary U_d and a diagonal matrix Λ^{<d} with non-decreasing diagonal entries such that ϕ^{<d†} L^{<d} ϕ^{<d} = U_d^† Λ^{<d} U_d. Now we can write eq. (82) as

Λ^{<d} C̃_d = C̃_d (λ Id − Ω),   (83)

where C̃_d = U_d C_d. Since U_d is unitary, it defines a gauge transformation which leaves ϕ invariant and preserves its left-canonical form. Rewriting eq. (83) as C̃_d Ω = (λ Id − Λ^{<d}) C̃_d and noticing that Ω = L^{>d−1} and C̃_d = ϕ^{>d−1}, we get

ϕ^{>d−1} L^{>d−1} = Λ^{>d−1} ϕ^{>d−1},   (85)

where we defined Λ^{>d−1} = λ Id − Λ^{<d}, which has non-increasing entries. This is eq. (75) for ℓ = d − 1.

Induction step. The inductive step proceeds analogously to the base case: assuming that eq. (75) holds at the interfaces to the right of the ℓ-th core, contracting the eigenvalue equation with ϕ^{<ℓ*} and with the eigenvectors of L^{>ℓ} again yields a Hermitian matrix, whose diagonalizing unitary U_ℓ defines a gauge transformation of C_ℓ that extends eq. (75) to the next interface.

Conclusion.
We have found a gauge transformation that puts the tensor train into a form such that eq. ( 75) is satisfied.
We will now show that this in fact implies eq. (76). For any ℓ ∈ [d − 1], we can write the eigenvalue equation, decomposed at the ℓ-th interface, as

L^{<ℓ+1} ϕ^{<ℓ+1} ϕ^{>ℓ} + ϕ^{<ℓ+1} ϕ^{>ℓ} L^{>ℓ} = λ ϕ^{<ℓ+1} ϕ^{>ℓ},

which, inserting eq. (75) and rearranging, can be written in matrix notation as

(L^{<ℓ+1} ϕ^{<ℓ+1} − ϕ^{<ℓ+1} (λ Id − Λ^{>ℓ})) ϕ^{>ℓ} = 0,

which, since ϕ^{>ℓ} has full row rank by the assumption of minimal ranks, implies

L^{<ℓ+1} ϕ^{<ℓ+1} = ϕ^{<ℓ+1} Λ^{<ℓ+1},

which is eq. (15). □

E Conditions on low-rankness
We here connect the conditions on low-rank approximate TT representations of multivariate functions [BSU16] with entanglement conditions on low-rank TT approximations of quantum states [SWVC08, ECP10], known in this context as matrix product states (MPS).


Figure 1: FPUT system. Recovery of a d = 50 system with constant spring constants (blue crosses) and spring constants κ, β drawn from the uniform distributions on [0, 2] and [0, 1.4], respectively (orange plus signs). The residuum after 10 ALS-GMWS sweeps is plotted against the size of the learning set. We use the degree-3 Legendre polynomial dictionary, maximum block size ρ = 2 and selection table interaction length L = 5.

Figure 2: Magnetic dipole chain with trigonometric dictionary. Residuum after 8 ALS-GMWS sweeps on varying system sizes d as a function of the size of the learning set. The maximum block size is set to ρ = 3 and the selection table interaction length to (a) L = 5 and (b) L = 9.

Figure 3: Magnetic dipole chain with Legendre polynomial dictionary. Residuum after 8 ALS-GMWS sweeps on system sizes d = 10, 20, 30 as a function of the learning set size. The maximum block size is set to ρ = 3, the dictionary is truncated at degree 9, and the selection table interaction length is set to L = 5.

Figure 4: Magnetic dipole chain from noisy data. Residuum after 8 ALS-GMWS sweeps using noisy data with varying levels of noise σ for system sizes d = 20 (crosses) and d = 50 (plus signs). We use the trigonometric dictionary, maximum block size ρ = 3 and selection table interaction length L = 5.

Figure 5: Lennard-Jones chain. Residuum after 8 ALS-GMWS sweeps for various maximum block sizes, as a function of the training set size. We use the degree-8 Legendre polynomial dictionary and set the selection table interaction length to L = 5.