Exact learning dynamics of deep linear networks with prior knowledge

Abstract Learning in deep neural networks is known to depend critically on the knowledge embedded in the initial network weights. However, few theoretical results have precisely linked prior knowledge to learning dynamics. Here we derive exact solutions to the dynamics of learning with rich prior knowledge in deep linear networks by generalising Fukumizu’s matrix Riccati solution (Fukumizu 1998 Gen 1 1E–03). We obtain explicit expressions for the evolving network function, hidden representational similarity, and neural tangent kernel over training for a broad class of initialisations and tasks. The expressions reveal a class of task-independent initialisations that radically alter learning dynamics from slow non-linear dynamics to fast exponential trajectories while converging to a global optimum with identical representational similarity, dissociating learning trajectories from the structure of initial internal representations. We characterise how network weights dynamically align with task structure, rigorously justifying why previous solutions successfully described learning from small initial weights without incorporating their fine-scale structure. Finally, we discuss the implications of these findings for continual learning, reversal learning and learning of structured knowledge. Taken together, our results provide a mathematical toolkit for understanding the impact of prior knowledge on deep learning.

Original Content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Introduction
A hallmark of human learning is our exquisite sensitivity to prior knowledge: what we already know affects how we subsequently learn (Carey 1985). For instance, having learned about the attributes of nine animals, we may learn about the tenth more quickly (McClelland et al 1995, Murphy 2004, McClelland 2013, Flesch et al 2018). In machine learning, the impact of prior knowledge on learning is evident in a range of paradigms including reversal learning (Erdeniz and Atalay 2010), transfer learning (Taylor and Stone 2009, Thrun and Pratt 2012, Lampinen and Ganguli 2018, Gerace et al 2022), continual learning (Kirkpatrick et al 2017, Zenke et al 2017, Parisi et al 2019), curriculum learning (Bengio et al 2009), and meta learning (Javed and White 2019). One form of prior knowledge in deep networks is the initial network state, which is known to strongly impact learning dynamics (Saxe et al 2014, Pennington et al 2017, Bahri et al 2020). Even random initial weights of different variance can yield qualitative shifts in network behaviour between the lazy and rich regimes (Chizat et al 2019), imparting distinct inductive biases on the learning process. More broadly, rich representations such as those obtained through pretraining provide empirically fertile inductive biases for subsequent fine-tuning (Raghu et al 2019). Yet while the importance of prior knowledge to learning is clear, our theoretical understanding remains limited, and fundamental questions remain about the implicit inductive biases of neural networks trained from structured initial weights. A better understanding of the impact of initialisation on gradient-based learning may lead to improved pretraining schemes and illuminate pathologies like catastrophic forgetting in continual learning (McCloskey and Cohen 1989).
Here, we address this gap by deriving exact solutions to the dynamics of learning in deep linear networks as a function of network initialisation, revealing an intricate and systematic dependence. We consider the setting depicted in figure 1(A), where a network is trained with standard gradient descent from a potentially complex initialisation. When trained on the same task, different initialisations can radically change the network's learning trajectory (figures 1(B)-(D)). Our approach, based on a matrix Riccati formalism (Fukumizu 1998), provides explicit analytical expressions for the network output over time (figures 1(B)-(D), dotted). While simple, deep linear networks have a non-convex loss landscape and have been shown to recapitulate several features of nonlinear deep networks while retaining mathematical tractability.

Contributions
• We derive an explicit solution for the gradient flow of the network function, internal representational similarity, and finite-width neural tangent kernel (NTK) of over- and under-complete two-layer deep linear networks for a rich class of initial conditions (section 3).
• We characterise a set of random initial network states that exhibit fast, exponential learning dynamics and yet converge to rich neural representations, dissociating fast and slow learning dynamics from the rich and lazy learning regimes (section 4).
• We analyse how weights dynamically align to task-relevant structure over the course of learning, going beyond prior work that has assumed initial alignment (section 5).
• We provide exact solutions to continual learning dynamics, reversal learning dynamics and to the dynamics of learning and revising structured representations (section 6).

Related work
Our work builds on analyses of deep linear networks (Baldi and Hornik 1989, Fukumizu 1998, Saxe et al 2014, 2019, Lampinen and Ganguli 2018, Arora et al 2018a, Tarmoun et al 2021, Atanasov et al 2022), which have shown that this simple model nevertheless has intricate fixed point structure and nonlinear learning dynamics reminiscent of phenomena seen in nonlinear networks. A variety of works have analysed convergence (Arora et al 2018b, Du and Hu 2019), generalisation (Lampinen and Ganguli 2018, Poggio et al 2018, Huh 2020), and the implicit bias of gradient descent (Gunasekar et al 2018, Ji and Telgarsky 2018, Laurent and Brecht 2018, Arora et al 2019a). These works mostly consider the tabula rasa case of small initial random weights, for which exact solutions are known (Saxe et al 2014). By contrast, our formalism describes dynamics from a much larger class of initial conditions and can describe alignment dynamics that do not occur in the tabula rasa setting. Most directly, our results build on the matrix Riccati formulation proposed by Fukumizu (1998). Connecting this formulation to matrix factorisation problems yields a better characterisation of the convergence rate (Tarmoun et al 2021). We extend and refine the matrix Riccati result to obtain the dynamics of over- and under-complete networks; to obtain numerically stable forms of the matrix equations; and to more explicitly reveal the impact of initialisation.
A line of theoretical research has considered online learning dynamics in teacher-student settings (Biehl and Schwarze 1995, Saad and Solla 1995, Goldt et al 2019), deriving ordinary differential equations for the average learning dynamics even in nonlinear networks. However, solving these equations requires numerical integration. By contrast, our approach provides explicit analytical solutions for the more restricted case of deep linear networks.
Other approaches for analysing deep network dynamics include the NTK (Jacot et al 2018, Lee et al 2019, Arora et al 2019b) and the mean field approach (Mei et al 2018, Rotskoff and Vanden-Eijnden 2018, Sirignano and Spiliopoulos 2020). While the former can describe nonlinear networks but not the learning dynamics of hidden representations, the latter yields a description of representation learning dynamics in wide networks in terms of a partial differential equation. Our work is similar in seeking a subset of more tractable models that are amenable to analysis, but we focus on the impact of initialisation on representation learning dynamics and explicit solutions.
A large body of work has investigated the effect of different random initialisations on learning in deep networks. The role of initialisation in the vanishing gradient problem and proposals for better initialisation schemes have been illuminated by several works drawing on the central limit theorem (Glorot and Bengio 2010, Saxe et al 2014, He et al 2015, Pennington et al 2017, Xiao et al 2018), reviewed in Carleo et al (2019), Arora et al (2020), Bahri et al (2020). These approaches typically guarantee that gradients do not vanish at the start of learning, but do not analytically describe the resulting learning trajectories. Influential work has shown that network initialisation variance mediates a transition from rich representation learning to lazy NTK dynamics (Chizat et al 2019), which we analyse in our framework.

Preliminaries and setting
Consider a supervised learning task in which input vectors x_n ∈ R^{Ni} from a set of P training pairs {(x_n, y_n)}_{n=1...P} have to be associated with their target output vectors y_n ∈ R^{No}. We learn this task with a two-layer linear network model (figure 1(A)) that produces the output prediction

ŷ = W2 W1 x

with weight matrices W1 ∈ R^{Nh×Ni} and W2 ∈ R^{No×Nh}, where Nh is the number of hidden units. The network's weights are optimised using full batch gradient descent with learning rate η (or respectively time constant τ = 1/η) on the mean squared error loss

L = (1/2) ⟨ ||y − W2 W1 x||² ⟩,

where ⟨•⟩ denotes the average over the dataset. The input and input-output correlation matrices of the dataset are

Σxx = ⟨x x^T⟩ and Σyx = ⟨y x^T⟩.

Finally, the gradient optimisation starts from an initialisation W2(0), W1(0). Our goal is to understand the full time trajectory of the network's output and internal representations as a function of this initialisation and the task statistics. Our starting point is the seminal work of Fukumizu (Fukumizu 1998), which showed that the gradient flow dynamics can be written as a matrix Riccati equation with known solution. In particular, defining

Q = [W1^T; W2] ∈ R^{(Ni+No)×Nh} and F = [[0, Σyx^T], [Σyx, 0]],

the dynamics of QQ^T admit the closed-form solution

QQ^T(t) = e^{Ft/τ} Q(0) [ I + (1/2) Q(0)^T F^{-1} (e^{2Ft/τ} − I) Q(0) ]^{-1} Q(0)^T e^{Ft/τ},    (5)

if the following four assumptions hold:

Assumption 2.1. The dimensions of the input and target vectors are identical, that is Ni = No.

Assumption 2.2. The input data is whitened, that is Σxx = I.

Assumption 2.3. The network's weight matrices are zero-balanced at the beginning of training, that is W1(0) W1(0)^T = W2(0)^T W2(0).

Assumption 2.4. The matrix Q(0) is full rank.

For completeness, we include a derivation of this solution in appendix A.
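Zero-balanced weights can be constructed at any overall scale, which matters below when large balanced initialisations are discussed. The following minimal sketch is our own illustration (all layer sizes are hypothetical): it draws an arbitrary first-layer matrix and builds a matching second layer via an eigendecomposition of W1 W1^T.

```python
import numpy as np

rng = np.random.default_rng(0)
N_i, N_h, N_o = 8, 5, 8   # hypothetical sizes; N_o >= N_h so O below has orthonormal columns

# Arbitrary first-layer weights, of any scale
W1 = 3.0 * rng.standard_normal((N_h, N_i))

# W1 W1^T = P diag(lam) P^T; then W2 = O diag(sqrt(lam)) P^T for any O with O^T O = I
lam, P = np.linalg.eigh(W1 @ W1.T)
O, _ = np.linalg.qr(rng.standard_normal((N_o, N_h)))       # orthonormal columns
W2 = O @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ P.T   # clip guards round-off negatives

# Zero-balanced condition of assumption 2.3: W1 W1^T = W2^T W2
assert np.allclose(W1 @ W1.T, W2.T @ W2)
```

Any orthonormal-column O works, so this gives a large family of balanced initialisations at every scale.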
Rather than tracking the weights' dynamics directly, this approach tracks several key statistics collected in the matrix

QQ^T = [[W1^T W1, W1^T W2^T], [W2 W1, W2 W2^T]],

which can be separated into four quadrants with intuitive meaning: the off-diagonal blocks contain the network function W2 W1, while the on-diagonal blocks contain the correlation structure of the weight matrices. These permit calculation of the temporal evolution of the network's internal representations, including the task-relevant representational similarity matrices (RSMs) (Kriegeskorte et al 2008), i.e. the kernel matrix φ(x)^T φ(x′) of the neural representations in the hidden layer,

X^T W1^T W1 X and Y^T (W2 W2^T)^+ Y,

where + denotes the pseudoinverse; and the network's finite-width NTK (Jacot et al 2018, Lee et al 2019, Arora et al 2019b),

Θ(X) = (W2 W2^T) ⊗ (X^T X) + I_{No} ⊗ (X^T W1^T W1 X),

where I is the identity matrix and ⊗ is the Kronecker product. For a derivation of these quantities see appendix B. Hence, the solution in equation (5) describes important aspects of network behaviour.
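The quadrant structure of QQ^T, the hidden-layer RSM, and the NTK can all be checked numerically. The sketch below is our own illustration (all sizes are arbitrary): it stacks Q = [W1^T; W2], verifies each quadrant, and checks the hidden-layer kernel and the per-pair form of the two-layer linear NTK, Θ(x, x′) = (x^T x′) W2 W2^T + (x^T W1^T W1 x′) I, against an explicitly assembled Jacobian.

```python
import numpy as np

rng = np.random.default_rng(1)
N_i, N_h, N_o, P = 4, 6, 3, 5        # hypothetical sizes
W1 = rng.standard_normal((N_h, N_i))
W2 = rng.standard_normal((N_o, N_h))
X = rng.standard_normal((N_i, P))    # columns are training inputs

# Stack the weights into Q = [W1^T; W2]; QQ^T then collects the key statistics
Q = np.concatenate([W1.T, W2], axis=0)
QQt = Q @ Q.T
assert np.allclose(QQt[:N_i, :N_i], W1.T @ W1)   # on-diagonal: input-side correlations
assert np.allclose(QQt[N_i:, N_i:], W2 @ W2.T)   # on-diagonal: output-side correlations
assert np.allclose(QQt[N_i:, :N_i], W2 @ W1)     # off-diagonal: network function

# Hidden-layer RSM from the inputs: phi(x)^T phi(x') with phi(x) = W1 x
H = W1 @ X
assert np.allclose(H.T @ H, X.T @ QQt[:N_i, :N_i] @ X)

def jacobian(x):
    """Jacobian of f(x) = W2 W1 x with respect to all weights, one row per output unit."""
    J1 = np.stack([np.outer(W2[k], x) for k in range(N_o)])   # df_k/dW1_ij = W2_ki x_j
    J2 = np.zeros((N_o, N_o, N_h))
    for k in range(N_o):
        J2[k, k] = W1 @ x                                     # df_k/dW2_mj = delta_km (W1 x)_j
    return np.concatenate([J1.reshape(N_o, -1), J2.reshape(N_o, -1)], axis=1)

# Finite-width NTK between two inputs, written per pair
x, xp = X[:, 0], X[:, 1]
ntk = (x @ xp) * (W2 @ W2.T) + (x @ W1.T @ W1 @ xp) * np.eye(N_o)
assert np.allclose(jacobian(x) @ jacobian(xp).T, ntk)
```

The Jacobian check makes the kernel's two contributions explicit: one term from the second-layer weights, one from the first.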
J. Stat. Mech. (2023) 114004 (https://doi.org/10.1088/1742-5468/ad01b8)

However, in this form, the solution has several limitations. First, it relies on general matrix exponentials and inverses, which are a barrier to explicit understanding. Second, when evaluated numerically, it is often unstable. And third, the equation is only valid for equal input and output dimensions. In the following section we address these limitations.

Implementation and simulation. Simulation details are in appendix H. Code to replicate all simulations and plots is available online under a GPLv3 licence and requires <6 h to execute on a single AMD Ryzen 5950x.
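Before moving to the stable form, equation (5) can be compared against simulated gradient descent directly. The sketch below is our own numerical illustration under assumptions 2.1-2.4 (sizes, seeds and the time horizon are arbitrary choices): it evaluates the Fukumizu form via the eigendecomposition of the symmetric matrix F, and checks it against Euler-integrated gradient flow at a moderate time where the expression is still numerically well conditioned.

```python
import numpy as np

rng = np.random.default_rng(2)
N, tau, dt, T = 3, 1.0, 1e-4, 3.0    # equal input/output dimensions (assumption 2.1)
Syx = rng.standard_normal((N, N))    # input-output correlations; inputs whitened (2.2)

# Zero-balanced initialisation (assumption 2.3)
W1 = rng.standard_normal((N, N))
lam, P = np.linalg.eigh(W1 @ W1.T)
O, _ = np.linalg.qr(rng.standard_normal((N, N)))
W2 = O @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ P.T

Q0 = np.concatenate([W1.T, W2], axis=0)   # Q = [W1^T; W2]
F = np.block([[np.zeros((N, N)), Syx.T], [Syx, np.zeros((N, N))]])
lamF, PF = np.linalg.eigh(F)              # F is symmetric

def expF(c):
    """Matrix exponential e^{cF} via the eigendecomposition of F."""
    return (PF * np.exp(c * lamF)) @ PF.T

def qqt_closed_form(t):
    """QQ^T(t) = e^{Ft/tau} Q0 [I + (1/2) Q0^T F^{-1} (e^{2Ft/tau} - I) Q0]^{-1} Q0^T e^{Ft/tau}."""
    E = expF(t / tau)
    Finv = (PF / lamF) @ PF.T             # generic Syx: F has no zero eigenvalues
    M = np.eye(N) + 0.5 * Q0.T @ Finv @ (expF(2 * t / tau) - np.eye(2 * N)) @ Q0
    return E @ Q0 @ np.linalg.solve(M, Q0.T @ E)

# Euler-integrated gradient flow: tau dW1/dt = W2^T (Syx - W2 W1), and symmetrically for W2
W1t, W2t = W1.copy(), W2.copy()
for _ in range(int(T / dt)):
    D = Syx - W2t @ W1t
    W1t, W2t = W1t + dt / tau * W2t.T @ D, W2t + dt / tau * D @ W1t.T

Qt = np.concatenate([W1t.T, W2t], axis=0)
err = np.linalg.norm(Qt @ Qt.T - qqt_closed_form(T)) / np.linalg.norm(Qt @ Qt.T)
assert err < 1e-2
```

For larger times the factor e^{2Ft/τ} grows along the positive eigenvalues of F, which is exactly the numerical instability the main text refers to.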

Exact learning dynamics with prior knowledge
In this section we derive an exact and numerically stable solution for QQ^T that better reveals the learning dynamics, convergence behaviour and generalisation properties of two-layer linear networks with prior knowledge. Further, we alter the equations to be applicable to equal and unequal input and output dimensions, overcoming assumption 2.1.
To place the solution in a more explicit form, we make use of the compact singular value decomposition. Let the compact singular value decompositions of the initial network function and the input-output correlation of the task be

W2(0) W1(0) = U S V^T and Σyx = Ũ S̃ Ṽ^T.

Here, U and Ũ ∈ R^{No×Nm} denote the left singular vectors, S and S̃ ∈ R^{Nm×Nm} the square matrices with ordered, non-zero singular values on their diagonals, and V and Ṽ ∈ R^{Ni×Nm} the corresponding right singular vectors, where Nm = min(Ni, No). For unequal input-output dimensions (Ni ≠ No) the right and left singular vectors are therefore not generally square and orthonormal. Accordingly, for the case No > Ni, we define Ũ⊥ ∈ R^{No×(No−Ni)} as a matrix containing orthogonal column vectors that complete the basis, i.e. make [Ũ Ũ⊥] orthonormal. Conversely, for the case Ni > No, we define Ṽ⊥ ∈ R^{Ni×(Ni−No)} such that [Ṽ Ṽ⊥] is orthonormal. In terms of these quantities, theorem 3.1 states the exact solution for QQ^T(t) in closed form (equation (11)). For a proof of theorem 3.1 please refer to appendix C. With this solution we can calculate the exact temporal dynamics of the loss, network function, RSMs and NTK (figures 2(A) and (B)). As the solution contains only negative exponentials, it is numerically stable and provides high precision across a wide range of learning rates and network architectures (figures 2(C) and (D)).
We note that a solution for the weights W1(t) and W2(t), i.e. Q(t), can be derived up to a time varying orthogonal transformation, as demonstrated in appendix C. Further, as time-dependent variables only occur in matrix exponentials of diagonal matrices of negative sign, the network approaches a steady state solution.
Theorem 3.2. Under the assumptions of theorem 3.1, the network function converges to the global minimum Ũ S̃ Ṽ^T and acquires a rich task-specific internal representation, that is W1^T W1 = Ṽ S̃ Ṽ^T and W2 W2^T = Ũ S̃ Ũ^T.

The proof of theorem 3.2 is in appendix C. We now turn to several implications of these results.

Rich and lazy learning regimes and generalisation
Recent results have shown that large deep networks can operate in qualitatively distinct regimes that depend on their weight initialisations (Chizat et al 2019, Flesch et al 2022), the so-called rich and lazy regimes. In the rich regime, learning dynamics can be highly nonlinear and lead to task-specific solutions thought to lead to favourable generalisation properties (Chizat et al 2019, Saxe et al 2019, Flesch et al 2022). By contrast, the lazy regime exhibits simple exponential learning dynamics and exploits high-dimensional nonlinear projections of the data produced by the initial random weights, leading to task-agnostic representations that attain zero training error but possibly lower generalisation performance (Jacot et al 2018, Lee et al 2019, Arora et al 2019b). Traditionally, the rich and lazy learning regimes have been respectively linked to low and high variance initial weights (relative to the network layer size).
To illustrate these phenomena, we consider a semantic learning task in which a set of living things have to be linked to their position in a hierarchical structure (figure 3(A)) (Saxe et al 2014). The representational similarity of the input of the task (Ṽ S̃ Ṽ^T) reveals its inherent structure (figure 3(B)). For example, the representations of the two fish are most similar to each other, less similar to birds and least similar to plants. Likewise, the representational similarity of the task's target values (Ũ S̃ Ũ^T) reveals the primary groups among which items are organised. As a consequence, one can for example predict from an object being a fish that it is an animal and from an object being a plant that it is not a bird. Reflecting these structural relationships in internal representations can allow the rich regime to generalise in ways the lazy regime cannot. Crucially, QQ^T(t) contains the temporal dynamics of the weights' representational similarity and therefore can be used to study whether a network finds a rich or lazy solution.
When training a two-layer network from small random initial weights, the weights' input and output RSMs (figure 3(C), upper left and lower right quadrants) are identical to the task's structure at convergence. However, when training from large initial weights, the RSM reveals that the network has converged to a lazy solution (figure 3(D)). We emphasise that the network function in both cases is identical (figures 3(C) and (D), lower left quadrant). And while their final loss is identical too, the learning dynamics evolve slowly and step-wise in the case of small initial weights, and fast and exponentially in the case of large initial weights (figure 3(F)), as predicted by previous work (Chizat et al 2019).
However, from theorem 3.2 it directly follows that our setup is guaranteed to find a rich solution in which the weights' RSM is identical to the task's RSM, i.e. W1^T W1 = Ṽ S̃ Ṽ^T and W2 W2^T = Ũ S̃ Ũ^T. Therefore, as zero-balanced weights may be large, there exist initial states that converge to rich solutions while evolving as rapid exponential learning curves (figures 3(E) and (F)). Crucially, these initialisations are task-agnostic, in the sense that they are independent of the task structure (see Mishkin and Matas 2015). This finding applies to any learning task with well defined input-output correlation. For additional simulations see appendix D. Hence our equation can describe the change in dynamics from step-like to exponential with increasing weight scale, and separate this dynamical phenomenon from the structure of internal representations.

Decoupling dynamics
The learning dynamics of deep linear networks depend on the exact initial values of the synaptic weights. Previous solutions studied learning dynamics under the assumption that initial network weights are 'decoupled', such that the initial state of the network and the task share the same singular vectors, i.e. that U = Ũ and V = Ṽ (Saxe et al 2014). Intuitively, this assumption means that there is no cross-coupling between different singular modes, such that each evolves independently. However, this assumption is violated in most real-world scenarios. As a consequence, most prior work has relied on the empirical observation that learning from tabula rasa small initial weights occurs in two phases: first, the network's input-output map rapidly decouples; then subsequently, independent singular modes are learned in this decoupled regime. Because this decoupling process is fast when training begins from small initial weights, the learning dynamics are still approximately described by the temporal learning dynamics of the singular values assuming decoupling from the start. This dynamic has been called a silent alignment process (Atanasov et al 2022). Here we leverage our matrix Riccati approach to analytically study the dynamics of this decoupling process. We begin by deriving an alternate form of the exact solution that eases the analysis.
Theorem 5.1. Let the weight matrices of a two-layer linear network be initialised as W1 = A(0) Ṽ^T and W2 = Ũ A(0)^T, where A(0) ∈ R^{Nh×Ni} is an arbitrary, invertible matrix. Then, under the assumptions of equal input-output dimensions 2.1, whitened inputs 2.2, zero-balanced weights 2.3 and full rank 2.4, the temporal dynamics of QQ^T are fully determined by

A^T A(t) = [ e^{−S̃t/τ} (A(0)^T A(0))^{-1} e^{−S̃t/τ} + S̃^{-1} (I − e^{−2S̃t/τ}) ]^{-1}.    (13)

For a proof of theorem 5.1, please refer to appendix E. We remark that this form is less general than that in theorem 3.1, and in particular implies U V^T = Ũ Ṽ^T. Here the matrix A^T A represents the dynamics directly in the SVD basis of the task. Off-diagonal elements represent counterproductive coupling between different singular modes (for instance, [A^T A]_{21} is the strength of the connection from input singular vector 1 to output singular vector 2, which must approach zero to perform the task perfectly), while on-diagonal elements represent the coupling within the same mode (for instance, [A^T A]_{11} is the strength of the connection from input singular vector 1 to output singular vector 1, which must approach the associated task singular value to perform the task perfectly). Hence the decoupling process can be studied by examining the dynamics by which A^T A becomes approximately diagonal.
The outer inverse in equation (13) renders it difficult to study high dimensional networks analytically. Therefore, we focus on small networks with input and output dimension Ni = 2 and No = 2, for which a lengthy but explicit analytical solution is given in appendix E. In this setting, the structure of the weight initialisation and task are encoded in the matrices

A(0)^T A(0) = [[a1(0), b(0)], [b(0), a2(0)]] and S̃ = diag(s1, s2),

where the parameters a1(0) and a2(0) represent the component of the initialisation that is aligned with the task, and b(0) represents cross-coupling, such that taking b(0) = 0 recovers previously known and more restricted solutions for the decoupled case (Saxe et al 2014). We use this setting to demonstrate two features of the learning dynamics.
Decoupling dynamics. First, we track decoupling by considering the dynamics of the off-diagonal element b(t) (figures 4(D)-(F), red lines). At convergence, the off-diagonal element shrinks to zero, as shown in appendix E. However, strikingly, b(t) can exhibit non-monotonic trajectories with transient peaks or valleys partway through the learning process. In particular, in appendix E we derive the time of the peak magnitude (figure 4(F), green dotted line), which coincides approximately with the time at which the on-diagonal element is half learned. If initialised from small random weights, the off-diagonal remains near-zero throughout learning, reminiscent of the silent alignment effect (Atanasov et al 2022). For large initialisations, no peak is observed and the dynamics are exponential. At intermediate initialisations, the maximum of the off-diagonal is reached before the singular mode is fully learned (appendix E). Intuitively, a particular input singular vector can initially project appreciably onto the wrong output singular vector, corresponding to initial misalignment. This is only revealed when this link is amplified, at which point corrective dynamics remove the counter-productive coupling, as schematised in figure 4(B). We report further measurements of decoupling in appendix E.
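The transient rise and eventual decay of the cross-coupling can be observed directly in simulation. The sketch below is our own illustration (the 2×2 task, its singular values and A(0) are arbitrary choices): it initialises as in theorem 5.1 with b(0) ≠ 0 and reads off b(t) = [A^T A]_{12} = [Ũ^T W2 W1 Ṽ]_{12} during training.

```python
import numpy as np

rng = np.random.default_rng(4)
tau, dt, T = 1.0, 1e-3, 10.0

# Hypothetical 2x2 task built from random orthogonal singular vectors
Uq, _ = np.linalg.qr(rng.standard_normal((2, 2)))
Vq, _ = np.linalg.qr(rng.standard_normal((2, 2)))
Syx = Uq @ np.diag([3.0, 1.5]) @ Vq.T

# Initialisation of theorem 5.1 with cross-coupling b(0) != 0 (zero-balanced by construction)
A0 = np.array([[0.3, 0.25], [0.0, 0.3]])
W1, W2 = A0 @ Vq.T, Uq @ A0.T

bs = []
for _ in range(int(T / dt)):
    M = Uq.T @ W2 @ W1 @ Vq            # A^T A: the dynamics in the task's SVD basis
    bs.append(M[0, 1])
    D = Syx - W2 @ W1
    W1, W2 = W1 + dt / tau * W2.T @ D, W2 + dt / tau * D @ W1.T
bs = np.array(bs)

# Cross-coupling rises transiently, then decays to zero as the modes decouple
assert np.abs(bs).max() > 2 * np.abs(bs[0])
assert np.abs(bs[-1]) < 1e-3
```

Consistent with the text, the off-diagonal is amplified while the on-diagonal elements are still small, and the corrective dynamics remove it only once the modes are appreciably learned.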
Effect of initialisation variance. Next, we revisit the impact of initialisation scale on the on-diagonal dynamics. As shown in figures 4(D)-(F), as the initialisation variance grows the learning dynamics change from sigmoidal to exponential, possibly displaying more complex behaviour at intermediate variance (appendix E). In this simple setting we can analyse this transition in detail. Taking s1 = s2 = s as in figure 4(F), for small initial values |a1(0)|, |a2(0)|, |b(0)| ≈ 0 the dynamics of the on-diagonal element a1 are sigmoidal, while for |a1(0)|, |a2(0)|, |b(0)| ≫ 0 the dynamics of the on-diagonal element a1 are close to exponential (figures 4(D)-(F), left and right columns). We examine larger networks in appendix E.
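For a single decoupled mode, the small-initialisation sigmoid has the classic logistic form u(t) = s / (1 + (s/u0 − 1) e^{−2st/τ}) for the product u = w2 w1 (Saxe et al 2014). A minimal check of this closed form against Euler-integrated gradient flow, with arbitrary values of s and u0:

```python
import numpy as np

tau, dt, s, u0 = 1.0, 1e-4, 2.0, 0.01
w1 = w2 = np.sqrt(u0)    # balanced scalar network; u = w2 * w1 starts at u0

T = 2.0
for _ in range(int(T / dt)):
    d = s - w2 * w1
    w1, w2 = w1 + dt / tau * w2 * d, w2 + dt / tau * w1 * d

# Logistic closed form for the decoupled single-mode product u = w2 w1
u_exact = s / (1 + (s / u0 - 1) * np.exp(-2 * s * T / tau))
assert abs(w2 * w1 - u_exact) < 5e-3
```

For u0 ≪ s this curve is sigmoidal with a long initial plateau; for u0 of order s or larger, the same expression describes a fast, essentially exponential approach to s.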

Applications
The solutions derived in sections 3 and 5 provide tools to examine the impact of prior knowledge on dynamics in deep linear networks. So far we have traced general features of the behaviour of these solutions. In this section, we use this toolkit to develop accounts of several specific phenomena.
Continual learning. Continual learning (see Parisi et al 2019 for a review) and the pathology of catastrophic forgetting have long been a challenge for neural network models (McCloskey and Cohen 1989, Ratcliff 1990, French 1999). A variety of theoretical work has investigated aspects of continual learning (Tripuraneni et al 2020, Asanuma et al 2021). Training on later tasks can overwrite previously learned knowledge, a phenomenon known as catastrophic forgetting (McCloskey and Cohen 1989, Ratcliff 1990, French 1999). From theorem 3.2 it follows that from any arbitrary zero-balanced initialisation 2.3, the network converges to the global optimum, such that the initialisation is completely overwritten and forgetting is truly catastrophic. In particular, the loss on any other task T_i after training to convergence on task T_j is L_i(T_j) = (1/2) ||Σyx_i − Σyx_j||²_F + c, where c is a constant that only depends on the training data of task T_i (appendix F). As a consequence, the amount of forgetting, i.e. the relative change of loss, is fully determined by the similarity structure of the tasks and thus can be fully determined for a sequence of tasks before the onset of training (figures 5(B) and (E), appendix F). For example, the amount of catastrophic forgetting in task T_a, when training on task T_c after having trained the network on task T_b, is L_a(T_c) − L_a(T_b). As expected, our results depend on our linear setting, and tanh or ReLU nonlinearities can show different behaviour, typically increasing the amount of forgetting (figures 5(B), (D) and (E)). Further, in nonlinear networks, weights rapidly become unbalanced, and forgetting values calculated before the onset of training do not predict the actual outcome (figures 5(B)-(E)). In summary, our results link exact learning dynamics with catastrophic forgetting and thus provide an analytical tool to study the mechanisms of, and potential counter measures against, catastrophic forgetting.
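The claim that forgetting is determined by task similarity before training ever starts can be checked in simulation. The sketch below is our own illustration (task statistics and sizes are arbitrary; the loss is measured as the function-space distance (1/2)||Σyx − W2 W1||²_F, i.e. dropping the data-dependent constant c): it trains sequentially on two tasks and compares the resulting loss on the first task against a prediction computed from the task statistics alone.

```python
import numpy as np

rng = np.random.default_rng(5)
N, tau, dt, T = 3, 1.0, 1e-3, 60.0

def random_task(sv):
    """Task with fixed singular values and random orthogonal singular vectors."""
    U, _ = np.linalg.qr(rng.standard_normal((N, N)))
    V, _ = np.linalg.qr(rng.standard_normal((N, N)))
    return U @ np.diag(sv) @ V.T

S_a = random_task([3.0, 2.0, 1.0])   # input-output correlations of task A (whitened inputs)
S_b = random_task([2.5, 1.5, 1.0])   # task B

def train_on(Syx, W1, W2):
    for _ in range(int(T / dt)):
        D = Syx - W2 @ W1
        W1, W2 = W1 + dt / tau * W2.T @ D, W2 + dt / tau * D @ W1.T
    return W1, W2

W1 = 0.1 * rng.standard_normal((N, N))
W2 = 0.1 * rng.standard_normal((N, N))
W1, W2 = train_on(S_a, W1, W2)   # learn task A ...
W1, W2 = train_on(S_b, W1, W2)   # ... then task B completely overwrites it

# Forgetting on task A is predictable from the two tasks' statistics alone
loss_a = 0.5 * np.sum((S_a - W2 @ W1) ** 2)
predicted = 0.5 * np.sum((S_a - S_b) ** 2)
assert abs(loss_a - predicted) < 1e-3
```

Because training on task B converges to its global optimum, nothing of task A survives in the network function: the residual loss on A reflects only how dissimilar the two tasks are.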
Reversal learning. During reversal learning, pre-existing knowledge has to be relearned, overcoming a previously learned relationship between inputs and outputs. For example, reversal learning occurs when items of a class are mislabelled and later corrected. We show analytically that reversal learning in fact does not succeed in deep linear networks (appendix G). The pre-existing knowledge lies exactly on the separatrix of a saddle point, causing the learning dynamics to converge to zero (figure 6(A)). In practice, however, learning still succeeds numerically, as any noise will perturb the dynamics off the saddle point, allowing learning to proceed (figure 6(A)). However, the dynamics still slow in the vicinity of the saddle point, providing a theoretical explanation for catastrophic slowing in deep linear networks (Lee et al 2022). We note that the analytical solution requires an adaptation of theorem 3.1, as B is generally not invertible in the case of reversal learning (appendix G). Further, as revealed by the exact learning dynamics (appendix G), shallow networks do succeed without exhibiting catastrophic slowing during reversal learning (figure 6(B)).
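The saddle-point picture can be illustrated with scalar networks. In the sketch below (our own illustration; s, the perturbation size and thresholds are arbitrary choices), a two-layer scalar network pretrained to +s is retrained to −s: exactly balanced weights lie on the separatrix and stall at zero, a tiny imbalance (standing in for noise) lets learning escape but only after a long plateau, while a shallow network relaxes exponentially with no slowing.

```python
import numpy as np

tau, dt, s = 1.0, 1e-3, 2.0

# Exactly balanced weights lie on the separatrix: the function decays to zero and stalls
v1 = v2 = np.sqrt(s)
for _ in range(10000):
    d = -s - v2 * v1
    v1, v2 = v1 + dt / tau * v2 * d, v2 + dt / tau * v1 * d
assert 0 <= v2 * v1 < 1e-3

# A tiny imbalance perturbs the dynamics off the saddle, but passage is slow
w1, w2 = np.sqrt(s), np.sqrt(s) + 1e-10
deep_steps = 0
while w2 * w1 > -0.9 * s:
    d = -s - w2 * w1
    w1, w2 = w1 + dt / tau * w2 * d, w2 + dt / tau * d * w1
    deep_steps += 1

# A shallow network w relaxes exponentially, with no saddle to pass
w, shallow_steps = s, 0
while w > -0.9 * s:
    w = w + dt / tau * (-s - w)
    shallow_steps += 1

# Catastrophic slowing: the deep network takes far longer to reverse
assert deep_steps > 2 * shallow_steps
```

Shrinking the imbalance towards zero makes the plateau arbitrarily long, matching the analytical result that the exactly balanced trajectory never reverses at all.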
Revising structured knowledge. Knowledge is often organised within an underlying, shared structure, many instances of which can be learned and represented in deep linear networks (Saxe et al 2019). For example, spatial locations can be related to each other using the same cardinal directions, or varying semantic knowledge can be organised using the same hierarchical tree. Here, we investigate whether deep linear networks benefit from shared underlying structure. To this end, a network is first trained on the three-level hierarchical tree of section 4 (eight items of the living kingdom, each with a set of eight associated features), and subsequently trained on a revised version of the hierarchy. The revised task varies the relation of inputs and outputs while keeping the same underlying tree structure. If the revision involves swapping two neighbouring nodes on any level of the hierarchy, e.g. the identity of the two fish on the lowest level of the hierarchy (figure 6(C), top), the task is identical to reversal learning, leading to catastrophically slowed dynamics (figures 6(E) and (F), top). When training the network on a new hierarchical tree with identical items but a new set of features, like a colour hierarchy (figure 6(C), bottom), there is no speed advantage in comparison to a random initialisation with similar initial variance (figures 6(E) and (F), bottom). Importantly, from theorem 3.2 it follows that the learning process can be sped up significantly by initialising from large zero-balanced weights, while converging to a global minimum with identical generalisation properties as when training from small weights (figures 6(G) and (H)). In summary, having incorporated structured knowledge before revision does not speed up, and can even slow down, learning in comparison to learning from random zero-balanced weights. Notably, that is despite the tasks' structure being almost identical (figures 3(B) and 6(D)).

Discussion
We derive exact solutions to the dynamics of learning with rich prior knowledge in a tractable model class: deep linear networks. While our results broaden the class of two-layer linear network problems that can be described analytically, they remain limited and rely on a set of assumptions (2.1)-(2.4). In particular, weakening the requirement that the input covariance be white and the weights be zero-balanced would enable analysis of the impact of initialisation on internal representations. Nevertheless, these solutions reveal several insights into network behaviour. We show that there exists a large set of initial values, namely zero-balanced weights 2.3, which lead to task-specific representations; and that large initialisations lead to exponential rather than sigmoidal learning curves. We hope our results provide a mathematical toolkit that illuminates the complex impact of prior knowledge on deep learning dynamics.

Appendix A. Fukumizu approach
For completeness, we reproduce the derivation from Fukumizu (1998) of equation (5). We consider the learning setting described in section 2. The gradient flow dynamics of the weights are

τ dW1/dt = W2^T (Σyx − W2 W1 Σxx),    τ dW2/dt = (Σyx − W2 W1 Σxx) W1^T.

Under the assumption of whitened inputs 2.2, the dynamics simplify to

τ dW1/dt = W2^T (Σyx − W2 W1),    τ dW2/dt = (Σyx − W2 W1) W1^T.

We introduce the variables

Q = [W1^T; W2] ∈ R^{(Ni+No)×Nh}, so that QQ^T = [[W1^T W1, W1^T W2^T], [W2 W1, W2 W2^T]].

We compute the time derivative τ d/dt (QQ^T) = τ (dQ/dt Q^T + Q dQ^T/dt) and evaluate the four quadrants separately, using the assumption of zero-balanced weights 2.3, W1 W1^T = W2^T W2 (which is conserved under the gradient flow), to simplify the on-diagonal quadrants. Defining

F = [[0, Σyx^T], [Σyx, 0]],

the gradient flow dynamics of QQ^T(t) can then be written as a differential matrix Riccati equation

τ d/dt (QQ^T) = F QQ^T + QQ^T F − (QQ^T)².    (39)
Assuming that Q(0) is full rank, the continuous differential equation (39) has a unique solution for all t ⩾ 0.

Appendix B. Network's internal representations B.1. Representational similarity analysis
The task-relevant representational similarity matrix (Kriegeskorte et al 2008) of the hidden layer, calculated from the inputs H = W1 X, is

H^T H = X^T W1^T W1 X.

Similarly, the representational similarity matrix of the hidden layer, calculated from the outputs H = W2^+ Y, where + denotes the pseudoinverse, is

H^T H = Y^T (W2 W2^T)^+ Y.

B.2. Finite-width neural tangent kernel
In the following, we derive the finite-width neural tangent kernel (Jacot et al 2018) for a two-layer linear network. Starting with the network function at time t, F(X) = W2(t) W1(t) X, the discrete time gradient descent dynamics of the next time step yield the network function's gradient flow. Substituting the partial derivatives of the loss with respect to the weights, introducing the identity matrix I_No of size No, and applying row-wise vectorisation vec_r(F(X)) := f(X) together with the identity vec_r(ABC) = (A ⊗ C^T) vec_r(B), we derive the neural tangent kernel

Θ(X) = (W2 W2^T) ⊗ (X^T X) + I_No ⊗ (X^T W1^T W1 X).

Appendix C. Proof of theorem 3.1

In the following, we prove that equation (11) is in fact a solution to the matrix Riccati equation arising from gradient flow (equation (39)). We prove the theorem by directly substituting our solution for QQ^T(t) into the matrix Riccati equation.

C.1.1. Unequal input-output dimension.
We start with the following equation, which is identical to equation (11) in the main text, as we verify in section C.2 (by reversing the derivation from equation (152) to equation (128)). Substituting our solution into the matrix Riccati equation yields equation (70). Then, using the chain rule ∂(AB) = (∂A)B + A(∂B) and the stated identities, we compute the required derivatives. Finally, substituting equations (82), (86) and (90) into the left hand side of equation (70) proves equality. □

C.2. Derivation of the exact learning dynamics
In the following, we outline how the solution to the matrix Ricatti equation can be acquired.Let the input and output dimension of a two-layer linear network (equation ( 1)) be denoted by N i and N o respectively.Further, let N m = min(N i , N o ) denote the smaller one of the two.The compact singular value decomposition of the initial network function and the input-output correlation of the task is then Here, U and Ũ ∈ R No×Nm denote the left singular vectors, S and S ∈ R Nm×Nm the square matrix with ordered, non-zero eigenvalues on its diagonal and V and Ṽ ∈ R N i ×Nm the corresponding right singular vectors.Please note that when using compact singular value decomposition, in the case of unequal input-output dimensions (N i ̸ = N o ) the right and left singular vectors are not generally square and orthonormal.More specifically, in the case of In this case, we use Ũ⊥ ∈ R No×(No−N i ) to denote the matrix that contains orthogonal column vectors such that the concatenation Ũ Ũ⊥ is orthonormal and Ṽ⊥ ∈ R N i ×(No−N i ) to denote a matrix of zeros.
Conversely, in the case of N_i > N_o, we use Ṽ_⊥ ∈ R^{N_i×(N_i−N_o)} to denote the matrix containing orthogonal column vectors such that [Ṽ Ṽ_⊥] is orthonormal, and Ũ_⊥ ∈ R^{N_o×(N_i−N_o)} to denote a matrix of zeros.
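The orthonormal completion described above can be built numerically, for instance via a QR decomposition; the following sketch for the case N_o > N_i uses assumed dimensions and a random stand-in for the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
No, Ni = 7, 5                          # assumed dims with No > Ni, so Nm = Ni
Syx = rng.normal(size=(No, Ni))        # random stand-in for Sigma^yx

U, s, Vt = np.linalg.svd(Syx, full_matrices=False)   # compact SVD: U is No x Nm
assert np.allclose(U @ np.diag(s) @ Vt, Syx)         # compact SVD reconstructs Sigma^yx

# Complete U to an orthonormal basis of R^No; the extra columns form U_perp.
Q, _ = np.linalg.qr(np.hstack([U, rng.normal(size=(No, No - Ni))]))
U_perp = Q[:, Ni:]                     # No x (No - Ni), orthogonal to the columns of U
V_perp = np.zeros((Ni, No - Ni))       # zero padding on the input side

U_cat = np.hstack([U, U_perp])
assert np.allclose(U_cat.T @ U_cat, np.eye(No))      # concatenation is orthonormal
```

The QR factorisation leaves the already-orthonormal columns of U unchanged up to sign, so its trailing columns provide the required Ũ_⊥.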

C.2.1. Inverse and matrix exponential of F.
The solution to the matrix Riccati equation as provided by Fukumizu (1998) requires calculation of the inverse F^{-1} and the matrix exponential e^{Ft/τ}. To this end, we diagonalise F, completing its basis by incorporating zero eigenvalues as illustrated below. Note that P^TP = PP^T = I and therefore P^T = P^{-1}. We then use the diagonalisation of F to rewrite the matrix exponential. As the inverse F^{-1} = PΓ^{-1}P^T is not well defined for a Γ with zero eigenvalues, we study eigenvalues of value zero by analysing the limiting behaviour for a single mode, which reveals the time-dependent contribution of zero eigenvalues. We continue by substituting the above results into Fukumizu's equation.
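As an aside, both the diagonalisation route to the matrix exponential and the zero-eigenvalue limit can be checked numerically; in this sketch the basis, eigenvalues and time constant are assumed values.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
tau, t = 5.0, 2.0                      # assumed time constant and time point

# A symmetric F with an explicit zero eigenvalue, F = P Gamma P^T.
P, _ = np.linalg.qr(rng.normal(size=(4, 4)))        # random orthonormal basis
gammas = np.array([1.5, 0.7, -0.3, 0.0])            # Gamma, including a zero mode
F = P @ np.diag(gammas) @ P.T

# The matrix exponential through the diagonalisation agrees with scipy's expm.
expF = P @ np.diag(np.exp(gammas * t / tau)) @ P.T
assert np.allclose(expF, expm(F * t / tau))

# F^{-1} is undefined with a zero eigenvalue, but the per-mode combination
# (e^{gamma t/tau} - 1)/gamma appearing in the solution has the finite limit t/tau.
g = 1e-10
assert np.isclose((np.exp(g * t / tau) - 1.0) / g, t / tau)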

J. Stat. Mech. (2023) 114004
Then, matrix multiplication on the left side of the equation yields the intermediate products, such that the expression simplifies. We continue by calculating the remaining terms. Next, we define B = U^TŨ + V^TṼ and C = U^TŨ − V^TṼ and rewrite the inverse. Working from the centre out, and finally using AB^{-1} = (BA^{-1})^{-1} (and A^{-1}B = (B^{-1}A)^{-1}) to move terms into the inverse, we rewrite the expression.

C.3. Proof of theorem 3.2: Limiting behaviour
As training time increases, all terms in equation (11) that include a matrix exponential with negative exponent vanish, since S̃ is a diagonal matrix with entries larger than zero. Therefore, in the temporal limit, equation (11) reduces to its asymptotic form. The solution for the weights W_1(t) and W_2(t) can be derived up to a time-varying orthogonal transformation, as demonstrated by Yan et al (1994).
Under the assumptions of whitened inputs 2.2, zero-balanced weights 2.3, full rank 2.4, and equal input-output dimensions, the temporal dynamics of Q(t) are given as stated, where D(t) is an orthogonal matrix of size N_h × N_h. From this definition, computing Q(t)Q(t)^T, we recover equation (45). Equation (157) shows that the individual weight matrices are not directly described by parts of the Q(t)Q(t)^T solution. Instead, they are fixed only up to a time-dependent orthogonal transformation. To verify this, we numerically compute D(t) as D(t) = q(t)^+ Q_sim(t), where Q_sim(t) denotes weights obtained from numerical simulations of gradient descent and + denotes the pseudoinverse (q^+(t) = (q^T(t)q(t))^{-1}q^T(t), as q(t) is rectangular). We numerically show in figure 7(D), right panel, that D(t) generally changes over time. Letting Q_d(t) denote the estimated Q(t) using the numerically recovered D(t), figure 7(D), left and centre panels, show that both the dynamics of Q_d(t) and Q_d(t)Q_d(t)^T match the temporal dynamics of the simulation. The small deviation between the simulation and the analytical solution at later time points is due to the imprecision of the pseudoinverse.

Figure 7. (A) Loss under gradient descent learning two random input-output correlation tasks with learning rate η = 0.001 up to precision 1 × 10^{−7}. The green dotted line marks the time at which the target is switched from task 1 to task 2. (B) Numerical (coloured line) and analytical (black dotted line) temporal dynamics of QQ^T(t) as given by equation (159). (C) Numerical (coloured line) and analytical (black dotted line) temporal dynamics of q(t) and q(t)q(t)^T (equation (158)). (D) Temporal dynamics of D(t). Numerical (coloured line) and analytical (black dotted line) temporal dynamics of Q_d(t)Q_d(t)^T and Q_d(t) as given by equation (157), where D(t) was computed numerically.
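The gauge freedom in Q(t) and its numerical recovery via the pseudoinverse can be illustrated directly; in this sketch the factor q(t) and the orthogonal D(t) are random stand-ins rather than the analytic solution.

```python
import numpy as np

rng = np.random.default_rng(3)
q = rng.normal(size=(8, 3))            # analytic factor q(t) at one time point (tall, full rank)
D0, _ = np.linalg.qr(rng.normal(size=(3, 3)))       # an arbitrary orthogonal gauge D(t)
Q_sim = q @ D0                         # weights a gradient-descent simulation might produce

# The product QQ^T is invariant under the orthogonal transformation ...
assert np.allclose(Q_sim @ Q_sim.T, q @ q.T)

# ... while the factor itself is fixed only up to D(t), recoverable via the
# pseudoinverse D(t) = q(t)^+ Q_sim(t).
D_rec = np.linalg.pinv(q) @ Q_sim
assert np.allclose(D_rec, D0)
assert np.allclose(D_rec @ D_rec.T, np.eye(3))      # the recovered D(t) is orthogonal
```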
In figure 7(C), we report the implementation of equation (158). As expected, the analytical solution does not match the numerical temporal dynamics. However, the solution for q(t)q(t)^T recovers the correct dynamics.
Appendix E. Decoupling dynamics

E.1. Proof of theorem 5.1

Let the input and output dimensions of a two-layer linear network (equation (1)) be equal, i.e. N_i = N_o, in which case equation (11) simplifies. Further, let the singular value decomposition of the input-output correlation of the task be given, and suppose that the initial state of the network can be written in the form

Substituting the new values of B and C into equation (159) then yields the simplified dynamics, which can thus be written in decoupled form. We consider small networks with input and output dimension N_i = 2 and N_o = 2. In this setting, the structure of the weight initialisation and task are encoded in matrices whose parameters a_1(0) and a_2(0) represent coupling within a singular mode, while b(0) represents counterproductive cross-coupling between different singular modes. From equation (13), we obtain the dynamics of these elements. We use equation (189) and simplify the denominator. The diagonal element a_1(t) is given by equation (194), and interchanging subscripts 1 and 2 yields a_2(t). As a check on this result, setting b(0) = 0 recovers the expression from Saxe et al (2019).
We further simplify the denominator to

E.3. Off-diagonal decoupling dynamics
We track the decoupling by considering the dynamics of the off-diagonal element b(t).
As t tends to infinity, the off-diagonal element shrinks to zero, lim_{t→∞} b(t) = 0. We can further simplify the off-diagonal expression. Equation (198) can exhibit non-monotonic trajectories with transient peaks, as shown in figure 4. The qualitative observations for the 2 × 2 network hold for larger target matrices, as shown in figure 9. For large initialisations, the dynamics are exponential.
At intermediate and small initialisation, the maximum of the off-diagonal is reached before the singular mode is fully learned. In the small initialisation scheme, the peak is of negligible size. The target matrices for the panels of figure 9 are the dense matrix with rows (5 6 3 0 1), (4 1 0 1 2), (3 0 2 4 0), (3 4 0 3 2), (2 0 1 3 5) and the equal-diagonal matrix 5I_5 = diag(5, 5, 5, 5, 5). We characterise these dynamics by considering the case s_1 = s_2 = s for the two-by-two solution (i.e. equal diagonal target), for which we can compute the time of the peak. In this particular case, we can further simplify the off-diagonal and find the time of the maximum of the off-diagonal elements to be t_peak = (τ/4s) ln[s(s − a_1(0) − a_2(0))/(a_1(0)a_2(0) − b(0)^2)]. The presence of a peak in the off-diagonal values indicates the decoupling but, as shown in figures 4(D)-(F), the peak size is negligible in comparison to the size of the on-diagonal values for small initial weights. This difference is reminiscent of the silent alignment effect described by Atanasov et al (2022). We further note that the time scale of decoupling is of the same order as the one reported for the silent alignment effect, t_sa = 1/s.

E.4. On-diagonal dynamics and the effect of initialisation variance
In this section, we revisit the impact of initialisation scale on the on-diagonal dynamics.
The observations for the 2 × 2 network hold for larger target matrices, as shown in figure 9. For large-variance initialisations, the dynamics are exponential. At intermediate-variance initialisations, we observe more complex behaviour, while at small-variance initialisations the on-diagonal element describes a sigmoidal trajectory.

Appendix F. Continual learning
We consider the case of training a two-layer deep linear network on a sequence of tasks T_a, T_b, T_c, ... with corresponding correlation matrices Σ^yx_a, Σ^yx_b, .... Then, the full-batch loss of the ith task can be evaluated at any point in training time. From theorem 3.2 it follows that after training the network to convergence on task T_j, the network function is W_2W_1 = ŨS̃Ṽ^T = Σ^yx_j. Further, using the assumption of whitened inputs 2.2 and the identities ||A||_F^2 = Tr(AA^T) and Tr(A) + Tr(B) = Tr(A + B), the full-batch loss of the ith task follows. Therefore, the amount of forgetting F on task T_i when training on task T_k after having trained the network on task T_j, i.e. the relative change of loss, is fully determined by the similarity structure of the tasks.

Appendix G. Reversal learning

G.1. Reversal learning in deep networks

In the reversal learning setting, both B = U^TŨ + V^TṼ and C = U^TŨ − V^TṼ are diagonal matrices. The diagonal entries of C are zero if the singular vectors are aligned and 2 if they are reversed. Similarly, the diagonal entries of B are 2 if the singular vectors are aligned and zero if they are reversed. Therefore, in the case of reversal learning, B is a diagonal matrix with zero entries and thus is not invertible. As a consequence, the learning dynamics cannot be described by equation (11). However, as B and C are diagonal matrices, the learning dynamics simplify. Let b_i, c_i, s_i and s̃_i denote the ith diagonal entries of B, C, S and S̃ respectively; then the network dynamics can be rewritten per mode. It follows that in the reversal learning case, i.e.
b_i = 0, for each reversed singular vector, the dynamics vanish to zero. Analytically, the learning dynamics are initialised on, and remain on, the separatrix of a saddle point until the corresponding singular value of the network function has vanished and remains zero, corresponding to convergence to the saddle point. When simulated numerically, the learning dynamics escape the saddle points due to the imprecision of floating-point arithmetic. However, numerical optimisation still suffers from catastrophic slowing (Lee et al 2022), as escaping the saddle point takes time (figure 6(A)). In contrast, in the case of aligned singular vectors (c_i = 0), we recover the equation for the temporal dynamics as described in Saxe et al (2014). Training succeeds, as the singular value of the network function converges to its target value.

In summary, in the case of aligned singular vectors, the learning dynamics can be described by the convergence of singular values. In the case of reversal learning, however, analytical training does not succeed. In simulations, the learning dynamics escape the saddle point due to numerical imprecision, but they are catastrophically slowed in the vicinity of the saddle point.
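A single-mode caricature makes the contrast concrete: for a balanced deep mode the gradient is gated by the mode strength itself, so a reversed target leaves it stranded at the saddle, while a shallow mode relaxes linearly. The dynamics below are scalar simplifications with assumed step size and values, not the full matrix solution.

```python
import numpy as np

eta, steps = 0.01, 5000
target = -5.0                          # reversed mode: the target singular value flips sign

a = 0.5                                # deep network mode strength (balanced weights, a = u*v)
b = 0.5                                # shallow network mode strength
for _ in range(steps):
    a += eta * 2 * a * (target - a)    # deep dynamics: the update is gated by the mode itself
    b += eta * (target - b)            # shallow dynamics: plain linear relaxation

# The deep mode converges to the saddle at zero and cannot cross it ...
assert abs(a) < 1e-3
# ... while the shallow mode converges exponentially to the reversed target.
assert abs(b - target) < 1e-3
```

Because the deep update is proportional to a itself, the sign of a can never flip under gradient flow, which is the analytical statement that training does not succeed.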

G.2. Exact learning dynamics in shallow networks
To provide a point of comparison with our deep linear network results, here we derive a solution for the temporal dynamics of reversal learning in a shallow network. The network's weights are optimised using full-batch gradient descent with learning rate η (or equivalently time constant τ = 1/η) on the mean squared error loss given in equation (2), yielding the first-task dynamics, where Σ^xx and Σ^yx are the input and input-output correlation matrices of the dataset. This motivates the change of variables W = UW̄V^T, projecting the weights into the basis of the initialisation. Under the assumption of whitened inputs 2.2, the dynamics simplify. We define W̄_ii = b_i, the diagonal elements of the matrix, encoding the strength of mode i transmitted by the input-to-output weights. Similarly, we write (U^TΣ^yxV)_ii = k_i. Assuming decoupled initial conditions, we obtain the scalar dynamics
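Under whitened inputs the flow is linear, so each mode relaxes exponentially, b_i(t) = k_i + (b_i(0) − k_i)e^{−t/τ}. The sketch below (dimensions, scales and seed assumed) compares discrete gradient descent on the matrix flow τ dW/dt = Σ^yx − W with this closed form.

```python
import numpy as np

rng = np.random.default_rng(7)
Ni, No = 3, 3                          # assumed dimensions
eta, steps = 0.001, 4000

Syx = rng.normal(size=(No, Ni))        # input-output correlation of the task
W0 = 0.1 * rng.normal(size=(No, Ni))

# Discrete gradient descent on the flow tau dW/dt = Sigma^yx - W (whitened inputs).
W = W0.copy()
for _ in range(steps):
    W += eta * (Syx - W)

# Closed-form solution W(t) = Sigma^yx + (W(0) - Sigma^yx) e^{-t/tau},
# with elapsed time t/tau = steps * eta.
W_closed = Syx + (W0 - Syx) * np.exp(-steps * eta)
assert np.allclose(W, W_closed, atol=1e-3)
```

Because the shallow dynamics contain no multiplicative gating, the reversed modes converge at the same exponential rate as the aligned ones.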

H.2. Tasks
In the following, we describe the different tasks that are used throughout the simulation studies.
H.2.1. Random regression task. In a random regression task, the inputs X ∈ R^{N_i×N} are sampled from a standard normal distribution, X ∼ N(µ = 0, σ = 1). The input data X are then whitened such that (1/N)XX^T = I. The target values Y ∈ R^{N_o×N} are also sampled from a random normal distribution, but with variance adjusted to the number of output nodes, Y ∼ N(µ = 0, σ = 1/√N_o). Thus, network inputs and target values are uncorrelated Gaussian noise, and therefore a linear solution does not always exist.
H.2.2. Teacher-student task. To guarantee that a linear solution exists, we use the teacher-student setup. First, inputs X are sampled as in the random regression task. Then, target values Y are generated by sampling a pair of random zero-balanced weights W_1 ∈ R^{N_h×N_i} and W_2 ∈ R^{N_o×N_h} and calculating Y = W_2W_1X. This ensures that a linear solution exists. The variance of the output is varied by changing the variance σ of the zero-balanced weights.
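One way to construct zero-balanced teacher weights is to share the hidden-layer singular basis between the two matrices; the construction below is our own sketch, chosen to satisfy the zero-balanced property W₁W₁ᵀ = W₂ᵀW₂ (assumption 2.3), with assumed dimensions and scales.

```python
import numpy as np

rng = np.random.default_rng(8)
Ni, Nh, No = 4, 6, 4                   # assumed dimensions
Nm = min(Ni, No)

# A shared hidden-layer basis R guarantees the zero-balanced property
# W1 W1^T = W2^T W2 at initialisation (assumption 2.3).
R, _ = np.linalg.qr(rng.normal(size=(Nh, Nm)))
V, _ = np.linalg.qr(rng.normal(size=(Ni, Nm)))
U, _ = np.linalg.qr(rng.normal(size=(No, Nm)))
s = rng.uniform(0.5, 1.5, size=Nm)     # singular values of the network function

W1 = R @ np.diag(np.sqrt(s)) @ V.T
W2 = U @ np.diag(np.sqrt(s)) @ R.T

assert np.allclose(W1 @ W1.T, W2.T @ W2)            # zero-balanced
# Teacher-student targets are then Y = W2 W1 X, with network function U diag(s) V^T.
X = rng.normal(size=(Ni, 20))
assert np.allclose(W2 @ W1 @ X, U @ np.diag(s) @ V.T @ X)
```

Splitting each singular value as √s across the two layers is what makes the per-layer Gram matrices coincide in the shared basis R.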
H.2.3. Semantic hierarchy. Input items in the semantic hierarchy task are encoded as one-hot vectors, i.e. X = I. The corresponding target vectors y_i encode the position in the hierarchical tree: a 1 encodes being a left child of a node, a −1 encodes being a right child of a node, and a 0 encodes that the item is not a child of that node. For example, the blue fish is a blue fish, a left child of the root node, a left child of the animal node, not part of the plant branch, a right child of the fish node, and not part of the bird, algae or flower branches, leading to the label [1, 1, 1, 0, −1, 0, 0, 0]. The labels for all objects in the semantic tree depicted in figure 3(A) follow accordingly. The singular value decomposition of the corresponding correlation matrix Σ^yx is not unique: the first two, the third and fourth, and the last four singular values are identical. To match the numerical and analytical solutions, this permutation invariance is removed by adding a small constant perturbation to each column y_i, i ∈ 1, ..., N, of the labels, leading to almost but not exactly identical singular values.

H.2.4. Colour hierarchy. Following the same procedure as described for the semantic hierarchy, the labels for the colour hierarchy are constructed as depicted in figure 6(C).

The network was then trained until convergence on the same task from the same initial weights for seven different learning rates η ∈ {0.05, 0.0232, 0.0107, 0.005, 0.0023, 0.0011, 0.0005}. The learning rate was η = 0.075.
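The labelling scheme of section H.2.3 can be made concrete; in this sketch the tree layout and sibling names are assumptions, with only the blue-fish path fixed by the text above.

```python
import numpy as np

# Internal nodes of the tree and their (left, right) children.  The node and
# leaf names are assumptions based on the description of figure 3(A); only the
# 'blue fish' path is taken from the text, sibling names are placeholders.
children = {'root':   ('animal', 'plant'),
            'animal': ('fish', 'bird'),
            'plant':  ('algae', 'flower'),
            'fish':   ('red fish', 'blue fish'),
            'bird':   ('red bird', 'blue bird'),
            'algae':  ('green algae', 'brown algae'),
            'flower': ('red flower', 'blue flower')}
internal = list(children)              # ordering: root, animal, plant, fish, ...

def label(item):
    """First entry marks the item itself; each remaining entry encodes, per
    internal node, left branch (+1), right branch (-1) or not a descendant (0)."""
    vec = {node: 0 for node in internal}
    def walk_up(name):                 # record the side taken at every ancestor
        for node, (left, right) in children.items():
            if name in (left, right):
                vec[node] = 1 if name == left else -1
                walk_up(node)
    walk_up(item)
    return np.array([1] + [vec[n] for n in internal])

# The blue fish: itself, left branch of root and animal, right branch of fish.
assert label('blue fish').tolist() == [1, 1, 1, 0, -1, 0, 0, 0]
```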
For panels (D) and (E), the same simulation was repeated three times: first with a linear, then with a tanh, and finally with a ReLU activation function. Each time, five random regression tasks with dimensions N_i = 15, N_h = 18, N_o = 21 and N = 50 were generated. Then a network with initial weight scale α = 0.025 was sequentially trained with learning rate η = 0.1 on the five random regression tasks.
H.8. Figure 6

Figure 6 panel (A) was generated by training a linear network with N_i = 4, N_h = 6, N_o = 4 on a reversal learning task (see section G.1), which was derived from a random regression task. The learning rate was η = 0.05 and the initial weights had a standard deviation of σ = 0.25. Panel (B) was generated by training a shallow linear network (see section G.2) on the same reversal learning task, with hyperparameters identical to those in panel (A).
For the top and bottom rows of panels (E) and (F), a linear network with N_i = 8, N_h = 14, N_o = 8 was trained on the semantic hierarchy task, followed by training on the adapted semantic hierarchy depicted in figure 6(C), top (a reversal learning task), and on the colour hierarchy, respectively. The learning rate was η = 0.05 and σ was set to 0.001 and 0.35, respectively.

Figure 1 .
Figure 1. Learning with prior knowledge. (A) In our setting, a deep linear network with N_i input, N_h hidden and N_o output neurons is trained from a particular initialisation using gradient descent. (B)-(D) Network output for an example task over training time when starting from (B) small random weights, (C) large random weights, and (D) the weights of a previously learned task. The dynamics depend in detail on the initialisation. Solid lines indicate simulations; dotted lines indicate the analytical solutions we derive in this work.

Figure 2 .
Figure 2. Exact learning dynamics. (A) The temporal dynamics of the numerical simulation (coloured lines) of the loss, network function, correlation of input and output weights, and the NTK (columns 1-5, respectively) are exactly matched by the analytical solution (black dotted lines) for small initial weight values and (B) large initial weight values. (C) Each line shows the deviation of the analytical loss L̂ from the numerical loss L for one of n = 50 networks with random architecture and training data (details in appendix H) across a range of learning rates η ∈ [0.0005, 0.05]. The deviation decreases with the learning rate. (D) Numerical and analytical learning curves for five randomly sampled example networks (coloured x in (C)).

Figure 3 .
Figure 3. Rich and lazy learning. (A) Semantic learning task, (B) SVD of the input-output correlation of the task (top) and the respective RSMs (bottom). Rows and columns in the SVD and RSMs are ordered identically to the items in the hierarchical tree. (C) Final QQ^T matrices after training converged when initialised from random small weights, (D) random large weights (note how the upper left and lower right quadrants differ from the task's RSMs) and (E) large zero-balanced weights. (F) Learning curves for the three different initialisations as in (C) (green), (D) (pink) and (E) (blue). While both large weight initialisations lead to fast exponential learning curves, the small weight initialisation leads to a slow step-like decay of the loss.

Figure 4 .
Figure 4. Decoupling dynamics. (A) Analytical (black dotted lines) and numerical (solid lines) temporal dynamics of the on- and off-diagonal elements of A^TA in blue and red, respectively. (B) Schematic representation of the decoupling process. (C) Three target matrices with dense, unequal-diagonal, and equal-diagonal structure. (D)-(F) Decoupling dynamics for the top (D), middle (E), and bottom (F) tasks depicted in panel (C). Row (F) contains analytical predictions for the time of the peak of the off-diagonal (dashed green). The network is initialised as defined in appendix E with small, intermediate and large variance.

Figure 5 .
Figure 5. Continual learning. (A) Top: network training from small zero-balanced weights on a sequence of tasks (coloured lines show simulation and black dotted lines analytical results). Bottom: evaluation loss for tasks of the sequence (dotted) while training on the current task (solid). As the network function is optimised on the current task, the loss of other tasks increases. (B) Comparison of the numerical and analytical amount of catastrophic forgetting on a first task after training on a second task for n = 50 linear (red), tanh (blue) and ReLU (green) networks. (C) Weight alignment before and after training on a sequence of two tasks for n = 50 networks in linear (red), tanh (blue) and ReLU (green) networks. Shaded area shows ±std. (D) Evaluation loss for each of five tasks during training of a linear (red), tanh (blue) and ReLU (green) network. (E) Same data as in (D) but evaluated as relative change (i.e. amount of catastrophic forgetting). The top half of each square shows the pre-computed analytical amount of forgetting and the bottom half the numerical value.
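The analytical forgetting values shown in figure 5 follow from the similarity structure of the task correlation matrices. A minimal sketch under whitened inputs (assumption 2.2), with assumed dimensions and the loss convention L = ||·||²_F/(2N), checks that the post-convergence evaluation loss on a previous task is determined by the correlation matrices alone.

```python
import numpy as np

Ni, No, N = 4, 3, 40                   # assumed task dimensions

def whitened_task(seed):
    rng = np.random.default_rng(seed)
    U, _, Vt = np.linalg.svd(rng.normal(size=(Ni, N)), full_matrices=False)
    X = np.sqrt(N) * U @ Vt            # whitened inputs: X X^T / N = I (assumption 2.2)
    Y = rng.normal(size=(No, N)) / np.sqrt(No)
    return X, Y, (Y @ X.T) / N         # Sigma^yx of the task

Xa, Ya, Sa = whitened_task(10)         # task T_a
Xb, Yb, Sb = whitened_task(11)         # task T_b

# After converging on task b the network function equals Sigma^yx_b (theorem 3.2),
# so the evaluation loss on task a depends on the tasks only through their
# correlation matrices, i.e. through the similarity term Tr(Sigma_a Sigma_b^T).
loss_a = 0.5 / N * np.linalg.norm(Ya - Sb @ Xa, 'fro') ** 2
loss_from_similarity = (0.5 / N * np.linalg.norm(Ya, 'fro') ** 2
                        - np.trace(Sa @ Sb.T)
                        + 0.5 * np.linalg.norm(Sb, 'fro') ** 2)
assert np.isclose(loss_a, loss_from_similarity)
```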

Figure 6 .
Figure 6. Reversal learning and revising structured knowledge. The scale of the x-axis varies between top and bottom rows. (A) Analytical (black dotted) and numerical (solid) learning dynamics of a reversal learning task. The analytical solution gets stuck at a saddle point, whereas the numerical simulation escapes the saddle point and converges to the target. (B) In a shallow network, training on the same task as in (A) converges both analytically (black dotted) and numerically (solid). (C) Semantic learning tasks: revised living kingdom (top) and colour hierarchy (bottom). (D) SVD of the input-output correlation of the tasks and respective RSMs. (E) Analytical (black dotted) and simulated (solid) loss and (F) learning dynamics of first training on the living kingdom (figure 3(A)) and subsequently on the respective task in (C). The analytical solution fails for the revised animal kingdom as it gets stuck in a saddle point, while the simulation escapes the saddle (top, green circle). Initial training on the living kingdom task from large initial weights and subsequent training on the colour hierarchy have similar convergence times (bottom). (G) Multidimensional scaling (MDS) of the network function for initial training on the living kingdom task from small (top) and large initial weights (bottom). Note how, despite the seemingly chaotic learning dynamics when starting from large initial weights, both simulations learn the same representation. (H) MDS of subsequent training on the respective task in (C).

Acknowledgments

L B was supported by the Woodward Scholarship awarded by Wadham College, Oxford and the Medical Research Council [MR/N013468/1]. C D and A S were supported by the Gatsby Charitable Foundation (GAT3755). Further, A S was supported by a Sir Henry Dale Fellowship from the Wellcome Trust and Royal Society (216386/Z/19/Z) and the Sainsbury Wellcome Centre Core Grant (219627/Z/19/Z). A S is a CIFAR Azrieli Global Scholar in the Learning in Machines & Brains program. J F was supported by the Howard Hughes Medical Institute.

Figure 8 .
Figure 8. (A) and (B) Mean and standard deviation of the internal representation error, defined as in section D, for learning the living kingdom task (figure 6(A)), a random 7 × 7 matrix (blue), a random 5 × 7 matrix (yellow), a 7 × 5 matrix (green), and an 8 × 8 matrix (red). All tasks were run with learning rate η = 0.001, enforcing zero-balanced initial weights 2.3 (dotted line) or breaking the assumption of zero-balanced initial weights 2.3 (solid line). N_h = 10 for all networks.

H.6. Figure 4
Figure 4 panel (A) was generated by training a linear network with N_i = 5, N_h = 10, N_o = 5 on the equal-diagonal target Y as shown in equation (198). The network was initialised with σ = 0.1. The learning rate was η = 0.01.

Figure 4 panels (D)-(F) were generated by training a linear network with N_i = 2, N_h = 10, N_o = 2 on the targets Y shown in figure 4(C) and input X = I. The network was initialised with small (σ = 0.00001), intermediate (σ = 0.3) and large (σ = 2) synaptic weights. The learning rate was η = 0.0001.