Phase diagram of stochastic gradient descent in high-dimensional two-layer neural networks

Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the crossover between these two regimes in the high-dimensional setting, and in particular the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.


Introduction
Descent-based algorithms such as stochastic gradient descent (SGD) and its variants are the workhorse of modern machine learning. They are simple to implement, efficient to run and, most importantly, they work well in practice. A detailed understanding of the performance of SGD is a major topic in machine learning. Quite recently, significant progress was achieved in the context of learning in shallow neural networks. In a series of works, it was shown that the optimisation of wide two-layer neural networks can be mapped to a convex problem in the space of probability distributions over the weights [1,2,3,4]. This remarkable result implies global convergence of two-layer networks towards perfect learning, provided that the number of hidden neurons is large, the learning rate is sufficiently small and enough data is available. This line of work is commonly referred to as the mean-field or hydrodynamic limit of neural networks. Mathematically, these works showed that one could describe the entire dynamics using a partial differential equation (PDE) in d dimensions.
In a different, and older, line of work, one-pass SGD for two-layer neural networks with a finite number p of hidden units, synthetic Gaussian input data and teacher-generated labels has been widely studied, starting with the seminal work of [5]. These works consider the limit of high-dimensional data and show, in particular, that the stochastic process driven by the gradient updates converges to a set of p² deterministic ordinary differential equations (ODEs) as the input dimension d → ∞ while the learning rate is proportional to 1/d. The validity of these ODEs in this limit was proven by [6]. However, the picture drawn from the analysis of these ODEs is slightly different from the mean-field/hydrodynamic one: in this case SGD can get stuck for a long time in minima associated with no specialization of the hidden units to the teacher hidden units, and even when it converges to specializing minima, it fails to learn perfectly (i.e. to achieve zero population risk). In fact, in this analysis, the interplay between the limit of the learning rate going to zero and d → ∞ appeared to be fundamental.
One should naturally wonder about the link between these two sets of works: on the one hand a d-dimensional PDE (with large p), and on the other a p²-dimensional ODE (with large d). In this work we aim to build a bridge between these two approaches for studying one-pass SGD.
Our starting point is the framework from [5], which we build upon and expand to a much broader range of choices of learning rate, time scales, and hidden layer width. This allows us to provide a sharp characterisation of the performance of SGD for two-layer neural networks in high dimensions. We show it depends on the precise way in which the limit is taken, and in particular on how the quantity of data, the hidden layer width, and the learning rate scale as d → ∞. For different choices of scaling, we can observe scenarios such as perfect learning, imperfect learning with an unavoidable error, or even no learning at all. As a consequence of our analysis, we provide a phase diagram (see Figure 1a) describing the possible scenarios arising in the high-dimensional setting. Our main contributions are as follows: C1 We rigorously show that the dynamics of SGD can be captured by a set of deterministic ODEs, considerably extending the proof of [6] to accommodate general time scalings defined by an arbitrary learning rate and a general range of hidden layer widths. We provide much finer non-asymptotic guarantees, which are crucial for our subsequent analysis.
C2 From the analysis of the ODEs, we derive a phase diagram of SGD for two-layer neural networks in the high-dimensional input layer limit d → ∞. In particular, scaling both the learning rate γ = γ0 d^{-δ} and the hidden layer width p = p0 d^{κ} with the input dimension d, we identify four different learning regimes, which are summarized in Figure 1a: • Perfect learning (green region, δ > −κ): we show that perfect learning (zero population risk) can be asymptotically achieved with n ∼ d^{1+κ+δ} samples, even for tasks with additive noise.
• Plateau (blue line, δ = −κ): learning reaches a plateau whose height is related to the noise strength. The point κ = δ = 0 goes back to the classical work of [5].
• Bad learning (orange region, δ < −κ): the noise term dominates the dynamics, the teacher-student overlaps remain stuck at their initial value, and the network generalizes poorly.
• No ODEs (red region, κ + δ < −1/2): the stochastic process associated with SGD is not guaranteed to converge to a set of deterministic ODEs. This region is thus outside the scope of our analysis.
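Since the regimes above are determined purely by the exponents δ (learning rate γ ∝ d^{-δ}) and κ (hidden layer width p ∝ d^{κ}), the phase diagram of Figure 1a can be sketched as a simple classifier. This is our own minimal illustration (function name and string labels are ours, not from the paper's code):

```python
def sgd_phase(delta: float, kappa: float) -> str:
    """Classify the (delta, kappa) scaling of one-pass SGD
    (learning rate ~ d^-delta, hidden width ~ d^kappa)
    into the regimes of Figure 1a."""
    if kappa + delta < -0.5:
        return "no ODEs"           # red: deterministic limit not guaranteed
    if delta > -kappa:
        return "perfect learning"  # green: noise term vanishes asymptotically
    if delta == -kappa:
        return "plateau"           # blue line: Saad & Solla-like plateau
    return "bad learning"          # orange: noise term dominates

print(sgd_phase(0.0, 0.0))  # → plateau (the classical Saad & Solla point)
```

Note that the green region allows δ < 0 (a learning rate growing with d) as long as the width exponent compensates, κ > −δ.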
To better illustrate this phase diagram we present in Figure 1b a solution of the ODEs in all three regimes.
Relation to previous work - Deterministic dynamical descriptions of one-pass stochastic gradient descent in high dimensions have a long tradition in the statistical physics community, starting with single- and two-layer neural networks with few hidden units [7,8,9,10,11]. The seminal work by [5] overcame previous limitations by constructing a set of deterministic ODEs for two-layer networks with any finite number of hidden units, paving the way for a series of important contributions [12,13,14,6]. This line of work corresponds to the κ = δ = 0 case of Figure 1a. One of our goals is to generalize this picture beyond fixed hidden layer size and learning rate. A more recent line of work investigating the dynamics of SGD is the so-called mean-field limit [1,15,2,3,4], which connects the SGD dynamics of large-width two-layer neural networks to a diffusion equation in the hidden layer weight density. In particular, [15] provides non-asymptotic convergence bounds for sufficiently small learning rates, corresponding to the green region of Figure 1a (with p → ∞). The mean-field approach computes the empirical distribution (in R^d) of the hidden layer weights, while we focus on the macroscopic overlaps between the teacher and student weights.

Setting
Consider a supervised learning regression task. The data set is composed of n pairs (x^ν, y^ν) ∈ R^{d+1}, ν ∈ [n], identically and independently sampled from P(x, y). The probability P(x) is assumed to be known, and P(y|x) is modelled by a two-layer neural network called the teacher. Given a feature vector x^ν ∈ R^d, the respective label y^ν ∈ R is defined as the output of a network with k hidden units, fixed weights W* ∈ R^{k×d} and an activation function σ: R → R:

y^ν = Σ_{m=1}^{k} σ(λ*^ν_m) + √Δ ζ^ν,

where λ*^ν_m ≡ w*_m · x^ν/√d ∈ R is the m-th component of the teacher local field vector λ*^ν ∈ R^k, w*_m ≡ [W*]_m ∈ R^d is the m-th row of W*, and ζ^ν ∼ N(0, 1) is an additive label noise of strength Δ ≥ 0. Given a new sample x ∼ P(x) outside the training data, the goal is to obtain an estimation f̂(x) for the respective label y. The error is quantified by a loss function L(y, f̂(x, θ)), where θ is an arbitrary set of parameters to be learned from data.
In this manuscript we are interested in the problem of estimating W* with another two-layer neural network with the same activation function, which we will refer to as the student. The student network has p hidden units and a matrix of weights W ∈ R^{p×d} to be learned from the data. Given a feature vector x ∼ P(x), the student prediction for the respective label is given as

f̂(x; W) = Σ_{j=1}^{p} σ(λ_j),

where w_j ≡ [W]_j ∈ R^d is the j-th row of the matrix W and λ_j ≡ w_j · x/√d ∈ R is defined as the j-th component of the student local field vector λ ∈ R^p.
One-pass gradient descent - Typically, one minimizes the empirical risk over the full data set. Instead, learning with one-pass gradient descent minimizes directly the population risk:

R(W, W*) ≡ E_{x,y∼P(x,y)} [ L(y, f̂(x; W)) ].

Given a single sample (x^ν, y^ν), the weights are updated sequentially by the gradient descent rule:

w_j^{ν+1} = w_j^ν − γ ∇_{w_j} L(y^ν, f̂(x^ν; W^ν)),   (6)

with ν ∈ [n] and j ∈ [p]. The parameter γ > 0 is the learning rate. Despite being a simplification with respect to batch learning, one-pass gradient descent is an amenable surrogate for the theoretical analysis of non-convex optimization, since at each step the gradient is computed with a fresh data sample, which is equivalent to performing SGD directly on the population risk.
In particular, in this manuscript we assume realizability p ≥ k, and focus our analysis on the square loss L(y, ŷ) = ½(y − ŷ)², leading to the update

w_j^{ν+1} = w_j^ν + (γ/√d) E^ν σ'(λ^ν_j) x^ν,   where E^ν ≡ y^ν − Σ_{j=1}^{p} σ(λ^ν_j),

with population risk given by

R(W, W*) = ½ E[ ( Σ_{m=1}^{k} σ(λ*_m) − Σ_{j=1}^{p} σ(λ_j) )² ] + Δ/2.

Therefore, from the above expression we can see that to monitor the population risk along the learning dynamics it is sufficient to track the joint distribution of the local fields (λ, λ*). For Gaussian data P(x) = N(x|0, I_d), one can replace the expectation E_{x,y∼P(x,y)}[•] by E_{λ,λ*∼N(λ,λ*|0,Ω)}[•] and fully describe the dynamics through the following sufficient statistics, known in the statistical physics literature as macroscopic variables:

Q ≡ E[λλᵀ] = WWᵀ/d,   M ≡ E[λλ*ᵀ] = WW*ᵀ/d,   T ≡ E[λ*λ*ᵀ] = W*W*ᵀ/d,

with matrix elements, called order parameters in the statistical physics literature, denoted by Q_{jl} ≡ [Q]_{jl}, M_{jm} ≡ [M]_{jm} and T_{nm} ≡ [T]_{nm}. The macroscopic state of the system at the learning step ν is given by the overlap matrix

Ω^ν ≡ [[Q^ν, M^ν], [(M^ν)ᵀ, T]] ∈ R^{(p+k)×(p+k)},

and the population risk is completely determined by the macroscopic state: R(W, W*) = R(Ω^ν). The training dynamics (6) defines a discrete-time stochastic process for the evolution of the overlap matrix, with T fixed and M^ν and Q^ν updated as:

M^{ν+1}_{jm} − M^ν_{jm} = (γ/d) E^ν σ'(λ^ν_j) λ*^ν_m,
Q^{ν+1}_{jl} − Q^ν_{jl} = (γ/d) E^ν [σ'(λ^ν_j) λ^ν_l + σ'(λ^ν_l) λ^ν_j] + (γ²/d) (E^ν)² σ'(λ^ν_j) σ'(λ^ν_l) ‖x^ν‖²/d.   (14)

In what follows, we will make the concentration assumption ‖x^ν‖² = d; this will be justified in the proof of Theorem 3.1.
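For concreteness, the stochastic process above can be simulated directly. The sketch below is our own minimal code (illustrative sizes and variable names; σ(z) = erf(z/√2) as used later in the paper): it runs one-pass SGD on fresh Gaussian samples and reads off the overlaps Q and M at the end.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d, p, k = 500, 2, 2            # input dimension, student and teacher widths
gamma, Delta = 0.5, 1e-3       # learning rate and label-noise strength

sigma  = lambda z: erf(z / np.sqrt(2))                        # activation
dsigma = lambda z: np.sqrt(2 / np.pi) * np.exp(-z**2 / 2)     # its derivative

Wstar = rng.standard_normal((k, d))        # fixed teacher weights
W = 0.1 * rng.standard_normal((p, d))      # student weights
W0 = W.copy()                              # kept for comparison

for nu in range(20 * d):                   # one fresh sample per step
    x = rng.standard_normal(d)
    lam_star = Wstar @ x / np.sqrt(d)      # teacher local fields
    lam = W @ x / np.sqrt(d)               # student local fields
    y = sigma(lam_star).sum() + np.sqrt(Delta) * rng.standard_normal()
    E = y - sigma(lam).sum()               # displacement E^nu
    W += (gamma / np.sqrt(d)) * np.outer(E * dsigma(lam), x)

Q = W @ W.T / d        # student-student overlaps
M = W @ Wstar.T / d    # student-teacher overlaps
```

Tracking Q and M along such a run (rather than the d×p weights themselves) is exactly what the deterministic ODEs of the next section describe.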
We emphasize in (14) the specific role played by each term on the right-hand side. The "learning" terms are the fundamental ones: they actually drive the learning of the teacher by the student. We show in Appendix C.3 that these "learning" terms are identical to those obtained in the gradient flow approximation of SGD, whose performance is the topic of many works [1,2,3,4]. Those are precisely the terms that draw the population risk towards zero. However, in our setting there is an additional variance term (so that this flow approximation is incomplete), corresponding to the fluctuations of L(y, f̂(x, W)) around its expected value R(W, W*). In particular, this is where the effects of the noise Δ can be felt. These terms were sometimes denoted as the γ² and γ⁴ terms in [16]. We shall see that this additional "variance" term is the one responsible for the plateau in the critical (blue) region of Figure 1a, while its contribution vanishes in the perfect learning (green) region.
Additionally, although our work particularizes to Gaussian input data, we believe our conclusions, and the phase diagram discussed in Figure 1a, hold beyond this restricted case. Indeed, while the Gaussian assumption is crucial to reach a particular set of ODEs and their analytic expression, the approach can be applied to more complex data distributions, as long as one can track the sufficient statistics required to have a closed set of equations. For instance, [17] obtained very similar equations for an arbitrary mixture of Gaussians (which would obey the same scaling analysis as ours), while [18,19,20] proved that many complex distributions behave as Gaussians in the high-dimensional setting, including, e.g., realistic GAN-generated data. We thus expect our conclusions to be robust in this respect.

Main results
Although the time scaling t = ν/d seems to be the most natural one in the high-dimensional limit d → ∞, if γ and p are allowed to vary with d the right-hand side (RHS) of Eqs. (14) can diverge and render the ODE approximation obsolete. Instead, for a given time scaling τ, we can rewrite Eqs. (14) on the continuous scale t = ντ as Eqs. (15). In Theorem 3.1 we prove that as d → ∞, Ω^ν converges to the solution Ω̄ of the ODE dΩ̄/dt = f(Ω̄), where f: R^{(p+k)×(p+k)} → R^{(p+k)×(p+k)} is the expected value of the RHS of Eqs. (15), provided that this solution stays bounded. This enhances the result of [6] by providing convergence rates to the ODEs encompassing all scalings adopted hereafter: Theorem 3.1 (Deterministic scaling limit of stochastic processes). Let T ∈ R be the continuous time horizon and τ = τ(d) be a time scaling factor such that the assumptions stated in Appendix A hold (in particular, Assumption 3 requires the drift f to be Lipschitz on the relevant domain). Then, there exists a constant C > 0 (depending on T and the parameters of the model) such that for any 0 ≤ ν ≤ T/τ, the following inequality holds:

E ‖Ω^ν − Ω̄(ντ)‖∞ ≤ C √τ log(d).

Our proof is based on techniques introduced in [21] (namely, their Lemma 2), which studies a different problem with related proof techniques. The proof involves decomposing Ω^{ν+1} into the sum of its previous value, a drift term, and a martingale increment, where the first two terms can be considered as a deterministic discrete process. The main challenge lies in showing that the martingale contribution stays bounded throughout the considered time period. Although the method is similar to [6], there are a number of differences between the two approaches. First, our proof fixes a number of holes in [6], in particular by bounding the diagonal overlaps Q^ν_{jj} by a sufficiently slowly diverging function of d. Additionally, the techniques used in this paper yield a dependency on p that is nearly negligible, while the previous methods imply bounds that are much too coarse for our needs.
The function f can be computed explicitly for various choices of σ, which allows one to check Assumption 3 directly. We provide in Appendix C the necessary computations for σ(z) = erf(z/√2); those for the ReLU unit can be found in [22]. It can be checked that in the ReLU case the function f is not Lipschitz around the degenerate matrices Ω for which two local fields become perfectly correlated for some j ≠ l. However, in every case we have a weaker square-root-Lipschitz property: there exists C ∈ R such that for any Ω, Ω′,

‖f(Ω) − f(Ω′)‖ ≤ C ‖√Ω − √Ω′‖.

Since the square root function is Lipschitz whenever the eigenvalues of Ω are bounded away from zero (see e.g. [23]), Assumption 3 is implied by a uniform lower bound on the eigenvalues of Ω; however, this assumption is much stronger, and becomes unrealistic in the specialization phase (as well as when p > k). Theorem 3.1 allows us to safely navigate through Figure 1a by keeping track of the convergence rates of the discrete process to a set of ODEs. The interplay between learning rate and hidden layer width defines the time scaling τ, and the trade-off between the linear contribution in E^ν and the quadratic one plays a central role in whether the network achieves perfect learning or not. Specifically, consider the following learning rate and hidden layer width scalings:

γ = γ0 d^{-δ},   p = p0 d^{κ},

where γ0 ∈ R+ and p0 ∈ N are constants. The exponent δ ∈ R can be either greater or smaller than zero, while κ ∈ R+. Replacing these scalings in Eqs. (14), we find Eqs. (21), where we have chosen γ0 = p0 without loss of generality.
Since the distribution of the label noise P(ζ) is such that E_{ζ∼P(ζ)}[ζ] = 0, the linear contribution in E^ν is noiseless in the high-dimensional limit d → ∞, and therefore we will refer to it as the learning term. The noise enters the equations through the variance computed on the quadratic contribution (E^ν)², which we will refer to as the noise term; intuitively, it is a high-dimensional variance correction which hinders learning. In order to satisfy (17), we shall take τ as the inverse of the largest prefactor appearing in Eqs. (21). When κ + δ ≠ 0, this implies that either the learning term or the noise term scales like a negative power of d, and is negligible with respect to the other term. It is then easy to check that at a finite time horizon T, the resulting ODEs behave as if the negligible term were not present. We refer to Theorem B.1 in the appendix for a quantitative proof of this phenomenon. Let us now describe the different regimes depicted in Figure 1a.
Blue line (plateau) - When γ and p are scaled such that δ = −κ, Eqs. (21) converge to Eqs. (23) with τ0 ≡ 1/d. This regime is an extension of [5], for which κ = δ = 0. The convergence rate to the ODEs scales with d^{-1/2} log(d), and the phenomenology we observe is consistent with previous works studying the setting κ = δ = 0, namely the existence of an asymptotic plateau proportional to the noise level. For instance, the asymptotic population risk R∞ is known to be proportional to γΔ [6] when κ = δ = 0 and the dynamics is driven by a rescaled version of Eqs. (23). Since the noise term does not vanish under this scaling, perfect learning to zero population risk is not possible: there is always an asymptotic plateau related to the noise level Δ and the learning rate γ.
Green region (perfect learning) - If δ > −κ we can define the time scaling τ_{κ+δ} ≡ 1/d^{1+κ+δ}. By Theorem 3.1, Eqs. (21) converge to the deterministic set of ODEs (24) at a rate proportional to d^{-(1+κ+δ)/2} log(d), where the noise term vanishes as d^{-(κ+δ)}. Hence, as long as δ > −κ the noise does not play any role in the dynamics. This setting can be understood by plugging an effective learning rate γ_eff ∝ d^{-(κ+δ)} into R∞ ∝ γΔ, which leads to zero population risk, i.e. perfect learning, in the high-dimensional limit d → ∞. We validate this claim by a finite-size analysis in the next section. As discussed, the time scaling determines the number of data samples required to complete one learning step on the continuous scale. The bigger κ + δ, the more attenuated the noise term, and thus the closer to perfect learning. The trade-off is that the bigger κ + δ, the larger the number of samples needed, since n = T d^{1+κ+δ}. Given a realizable learning task, one would thus choose the parameters to attain the perfect learning region while staying as close as possible to the plateau line, so as not to increase the needed number of samples too much. We remark that [15] provides an alternative deterministic approximation in this regime, with non-asymptotic bounds, whenever γ ≪ 1; this is the so-called mean-field approximation, with known convergence guarantees [2].
Orange region (bad learning) - We now step into the unusual situation where the learning rate grows faster with d than the hidden layer width: δ < −κ. In this case, by (22) the noise term dominates the dynamics. Defining the time scaling τ_{2(κ+δ)} ≡ 1/d^{1+2(κ+δ)}, we obtain the noisy ODEs (25). According to Theorem 3.1, the convergence rate of Eqs. (21) to Eqs. (25) scales with d^{-(1/2+κ+δ)} log(d). Therefore the existence of the noisy ODEs above is circumscribed to the region κ + δ > −1/2, and presents a convergence trade-off absent in the other regimes: the faster one of the contributions of Eqs. (21) goes to zero, the worse the convergence rate. In the present case, the more the learning term is attenuated, i.e. the more negative κ + δ is, the worse the dynamics is described by Eqs. (25). Although the weights are updated, the correlation between the teacher and the student weights, parametrized by the overlap matrix M, remains fixed at its initial value M0, which is a fixed point of the dynamics under this scaling. Unsurprisingly, this leads to poor generalization capacity.
Red region (no ODEs) - If κ + δ < −1/2, the stochastic process driven by the weight dynamics is not guaranteed to converge to deterministic ODEs under the assumptions of Theorem 3.1. We are therefore not able to make any claim about this regime.
Initialization and convergence - There are two additional features of the high-dimensional dynamics and its connection to the mean-field/hydrodynamic approach worth commenting on, regarding initialization and the specialization transition.
In the ODE approach we discuss here, we always observe a first plateau where the teacher-student overlaps are all the same. This means all the hidden layer neurons have learned the same linear separator. At this point, the two-layer network is essentially linear. This is called an unspecialized network in [16,5]. In fact, this is a perfectly normal phenomenon, as with few samples even the Bayes-optimal solution would be unspecialized [24]. Only by running the dynamics long enough do the student hidden neurons start to specialize, each of them learning a different sub-function so that the two-layer network can learn the non-trivial teacher.
Let us make two comments on this phenomenon: (i) while the "linear" learning in the unspecialized regime may remind the reader of the linear learning in the lazy regime [25,26] of neural nets, the two phenomena are completely different. In lazy training, the learning is linear because the weights change very little, so that the effective network is a linear approximation of the initial one. Here, instead, the weights change considerably, but each hidden neuron learns essentially the same function. (ii) If the ODEs are initialized with weights uncorrelated with the teacher, then the unspecialized regime is a fixed point of the ODEs: the student thus never specializes, at any time. Strikingly, such a condition arises as well in the analysis of mean-field equations (see e.g. Theorem 2 in [27], which discusses the need to have spread initial conditions with a non-zero overlap with the teacher) to guarantee global convergence.
This raises the question of the precise dependence of the learning on the initialization in the high-dimensional regime, where a random start gets a vanishing O(1/√d) overlap. This is a challenging problem that has only recently been studied (though in a simpler setting) in [28,29,30], where it was shown to yield an additional log(d) time dependence. Generalizing these results to high-dimensional two-layer nets is an open question which we leave for future work.

Discussion, special cases, and simulations
To illustrate the phase diagram of Figure 1a, we now present several special cases for which we can perform simulations or numerically solve the set of ODEs. Henceforth, we take σ(z) = erf(z/√2), for which the expectations of the ODEs and of the population risk, Eq. (12), can be calculated analytically [5]. The explicit expressions are presented in Appendix C.
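For σ(z) = erf(z/√2), the basic Gaussian expectation entering these expressions has the classical closed form E[σ(a)σ(b)] = (2/π) arcsin(Σ_{ab}/√((1+Σ_{aa})(1+Σ_{bb}))) for (a, b) centered Gaussian with covariance Σ. A quick Monte Carlo sanity check of this identity (our own code, not the paper's):

```python
import numpy as np
from scipy.special import erf

def I2(Saa, Sab, Sbb):
    """Closed form for E[sigma(a) sigma(b)], sigma(z) = erf(z / sqrt(2)),
    with (a, b) centered Gaussian of covariance [[Saa, Sab], [Sab, Sbb]]."""
    return (2 / np.pi) * np.arcsin(Sab / np.sqrt((1 + Saa) * (1 + Sbb)))

rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.3], [0.3, 0.8]])
a, b = rng.multivariate_normal([0, 0], cov, size=1_000_000).T
mc = np.mean(erf(a / np.sqrt(2)) * erf(b / np.sqrt(2)))
print(mc, I2(1.0, 0.3, 0.8))  # the two values should agree to roughly 1e-3
```

This is what makes the erf unit convenient: both the ODE drift and the population risk reduce to such arcsine expressions evaluated on the entries of the overlap matrix.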
Teacher weights are such that T_{nm} = δ_{nm}. The initial student weights are chosen such that the dimension d can be varied without changing the initial conditions Q0, M0, T, and consequently the initial population risk R0. A detailed discussion can be found in Appendix D.

Saad & Solla scaling 𝜅 = 𝛿 = 0
We start by recalling the well-known setting characterized by the point κ = δ = 0. The convergence of the stochastic process, for fixed learning rate and hidden layer width, to Eqs. (23) was first obtained heuristically by [5]. In Figure 2 we recall this classical result by plotting the population risk dynamics for different noise levels. Dots represent simulations, while solid lines are obtained by integration of the ODEs, Eqs. (23).
Learning is characterized by two phases after the initial decay. The first is the unspecialized plateau, where all the teacher-student overlaps M_{jm} are approximately the same. Waiting long enough, the dynamics reaches the specialization phase, where the student neurons start to specialize, i.e., their overlaps with one of the teacher neurons increase and consequently the population risk decreases. This specialization is discussed extensively in [5]. If Δ = 0, the population risk goes asymptotically to zero. Instead, if Δ ≠ 0, the specialization phase presents a second plateau related to the noise Δ.
The asymptotic population risk R∞ associated with the second plateau is proportional to γΔ [6] in the high-dimensional limit d → ∞ with p finite. As mentioned in the previous section, the expectation of the quadratic contribution (E^ν)² in Eq. (23a) prevents one from obtaining zero population risk for a noisy teacher.

Perfect learning for 𝜅 = 0
In this section we study the line κ = 0 with δ > 0 of Figure 1a, for which Eqs. (24) with κ = 0 hold. We show that perfect learning can be asymptotically achieved in the realizable setting for any finite hidden layer width p = p0. Keeping δ and Δ fixed, we have run simulations with increasing input dimension d. In Figure 3a we set δ = 1/2, Δ = 10^{-3} and vary the input dimension. The bigger d is, the closer we are to the ODE-derived noiseless result.
Gathering the asymptotic population risk from simulations for varying d and Δ, we perform a finite-size analysis to study the dependence of R∞ on d. This shows that the noise term goes to zero under this scaling. In Figure 3b we plot R∞ versus d from simulations (dots) for different noise levels. We fit lines in log-log scale, showing that R∞ ∝ d^{-δ}, as expected. Figure 4 draws the same conclusion for δ = 1/4.
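The finite-size extrapolation above amounts to a linear fit in log-log scale. A minimal sketch with synthetic plateau values standing in for the measured ones (the numbers below are illustrative, not the paper's data):

```python
import numpy as np

ds = np.array([200, 400, 800, 1600, 3200])
# illustrative plateau values decaying as R_inf ~ d^(-delta) with delta = 1/2,
# plus 1% multiplicative scatter to mimic simulation noise
rng = np.random.default_rng(2)
R_inf = 0.05 * ds**-0.5 * (1 + 0.01 * rng.standard_normal(ds.size))

# linear fit in log-log scale: log R_inf = slope * log d + intercept
slope, intercept = np.polyfit(np.log(ds), np.log(R_inf), 1)
print(f"fitted exponent: {slope:.2f}")  # should be close to -0.5 = -delta
```

The fitted slope recovers −δ, which is the check performed in Figures 3b and 4.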
As already stated, the interplay between the exponents directly affects the time scale. We end this subsection by graphically illustrating this fact through simulations. Setting the noise to Δ = 10^{-3}, we compare the cases δ = 0, 1/4, 3/8, 1/2 in Figure 5a. All simulations are rendered on the scale τ0 = 1/d to illustrate the trade-off between asymptotic performance and training time.
Population risk dynamics for κ = 0 and δ = 1/2 (Figure 3a) and δ = 1/4 (Figure 4a): fixed noise Δ = 10^{-3} and varying d up to d = 1000. Dots represent simulations, while the solid lines are obtained by integration of the ODEs given by Eqs. (24). The data are compatible with the claim that as d → ∞ the curves converge to zero population risk.

Bad learning for 𝜅 = 0
We now quickly discuss the uncommon case of γ growing with d within the orange region. In Figure 5b we compare simulations varying d with the solution of the ODEs given by Eqs. (25). Both lead to poor results compared to the green and blue regions. Moreover, this regime presents strong finite-size effects, making it harder to observe the asymptotic ODEs at small sizes. However, the trend as d increases is very clear from the simulations. As discussed in Section 3, the more the learning term is attenuated in the ODEs, the worse they describe the dynamics.

Large hidden layer: 𝜅 > 0
Finishing our voyage through Figure 1a with examples, we briefly discuss the case where both input and hidden layer widths are large. Although Theorem 3.1 provides non-asymptotic guarantees for κ > 0, the number of coupled ODEs grows quadratically with p, making the task of solving them numerically rather challenging. Thus, we present simulations that illustrate the regions of Figure 1a. Fixing d = 100, we show in Figure 6 learning curves for different values of δ and κ.
The colors are chosen to match their respective regions in the phase diagram. Despite the relatively small sizes used in Figure 6, the green dots seem to decrease towards perfect learning even when δ < 0, provided that κ is large enough, as predicted by the phase diagram in Figure 1a. Moreover, since d is not large enough, when the parameters are within the orange region the finite-size effects actually dominate, similarly to Figure 5b. The learning contribution still plays a role and the asymptotic population risk is similar to the case κ = δ = 0. Within the red region, which is out of the scope of our theory, the simulation gets stuck on a plateau with larger population risk.

Conclusion
Building on classical statistical physics approaches and extending them to a broad range of learning rates, time scales, and hidden layer widths, we have provided a sharp characterisation of the performance of SGD for two-layer neural networks in high dimensions. Our phase diagram describes the possible learning scenarios, characterizing learning regimes which had not been addressed by previous classical works using ODEs. Crucially, our key conclusions do not rely on an explicit solution, as our theory allows the characterization of the learning dynamics without solving the system of ODEs. The introduction of scaling factors is non-trivial and has deep implications: our generalized description illuminates the trade-off between learning rate and hidden layer width, which has also been crucial in mean-field theories.
A Proof of Theorem 3.1
Although the theorem was not originally proven in the d → ∞ setting, a glance at its proof shows that it still holds upon replacing c(T) by c(T, d) in Assumptions A.1.1 and A.1.2, as well as in Equation (A.3). We choose ‖·‖ to be the ℓ∞ norm, since it suits the d → ∞ scaling better. The scaling parameter in Theorem A.1 corresponds to 1/τ, where τ is defined in Theorem 3.1.
Following [21], we define, for i, j ∈ [p + k], the quantities entering the decomposition of the update. With that, we write Ω^{ν+1} as the sum of a deterministic drift part and a martingale part. The main obstacle to bounding these two contributions is the fact that the diagonal overlaps Q^ν_{jj} can a priori diverge to infinity. Our first task is therefore to show that this does not happen; as a proxy, we show a subgaussian-like moment bound. Equipped with this bound, controlling the second moments of the drift and of the martingale increments becomes fairly easy. All proof details are in the sections below.
A.1 Preliminaries: bounding the Q^ν_{jj}

Since σ is Lipschitz, the Cauchy-Schwarz inequality bounds each one-step increment of Q^ν_{jj} in terms of Q^ν_{jj} itself. Define the associated comparison sequences, where c1, c2 are absolute constants. Summing those inequalities yields a closed recursion, and finally a subgaussian-like moment bound. As a result, we have uniform control for any 0 ≤ ν ≤ T/τ. For simplicity, let X^ν denote any of the Q^ν_{jj}. We have, for all ν ≥ 0, a one-step inequality in which the remainder term has bounded expectation. Again, we write the corresponding comparison recursion. By Assumption 3, the diagonal entries are bounded from below by a constant, hence the increments are under control. This implies the claimed bound for any ν ≥ 0 and 0 ≤ t ≤ T.

A.2 Assumption A.1.1

We have, for all i, j ∈ [p + k], a bound on the squared drift. The term in (E^ν)⁴ is bounded by the same techniques as in the previous section. For the second term, a similar argument applies, and an analogous bound holds for the remaining entries; hence Assumption A.1.1 holds with c(T, d) = c(T) log(d).

A.3 Assumption A.1.2
Since σ is Lipschitz, for any i, j ∈ [p] the martingale increment can be bounded by a sum of two expectations. The first expectation is the variance of a χ² random variable, which is equal to 2, and the second expectation is bounded by the same methods as in the sections above. The term in brackets is therefore bounded by c1 √d, and Assumption A.1.2 follows. Finally, since for any x > 0 we have x² ≤ max(x, x²)^{3/2}, the claimed bound follows.

A.4 √ -Lipschitz property
Let Ω, Ω′ ∈ R^{(p+k)×(p+k)}. We can write the (i, j) coefficient of f(Ω) as φ_{ij}(√Ω), where φ_{ij} is a Gaussian expectation of the type computed in Appendix C. The same arguments as above show that the function φ_{ij} is Lipschitz, and hence for some constant C we have ‖f(Ω) − f(Ω′)‖ ≤ C ‖√Ω − √Ω′‖.

B A lemma on ODE perturbation
In this section, we prove a proposition that bounds the di erence between an ODE solution and a perturbed version, for a bounded time .
Theorem B.1. Let f, g: R^D → R^D be two L-Lipschitz functions, and consider the following differential equations in R^D:

ẋ(t) = f(x(t)),   ẏ(t) = f(y(t)) + ε g(y(t)),

where ε > 0, with the initial condition x(0) = y(0). Then, if T > 0 is fixed, we have for any 0 ≤ t ≤ T,

‖x(t) − y(t)‖ ≤ C ε,

with C a constant independent from ε and t.
Before proving this proposition, we begin with a small lemma: if u satisfies the differential inequality u̇ ≤ max(u, 1) + max(√u, 1) with u(0) = 0, and z solves ż = max(z, 1) + max(√z, 1) with z(0) = 0, then u(t) ≤ z(t) for all t ≥ 0. Since the RHS of the above equation is Lipschitz everywhere, we can apply the Picard-Lindelöf theorem and check the explicit form of the unique solution, where c1 and c2 are ad hoc constants. The lemma then follows from adjusting the constant C as needed.
We are now in a position to prove Theorem B.1. Proof. Assume for simplicity that x(0) = y(0) = 0. We begin by bounding y(t).
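The remainder of the argument is the standard Grönwall comparison; in its generic form (with generic constants, not the exact ones of Theorem B.1) it reads:

```latex
% Generic perturbation bound for \dot{x} = f(x), \dot{y} = f(y) + \varepsilon g(y),
% with f L-Lipschitz and x(0) = y(0).
\begin{align*}
  \frac{d}{dt}\,\|x(t) - y(t)\|
    &\le \|f(x(t)) - f(y(t))\| + \varepsilon\,\|g(y(t))\| \\
    &\le L\,\|x(t) - y(t)\| + \varepsilon \sup_{0 \le s \le T}\|g(y(s))\|,
\end{align*}
% so Gr\"onwall's lemma applied on [0, T] gives
\[
  \|x(t) - y(t)\|
    \;\le\; \varepsilon\,\frac{e^{Lt} - 1}{L}\,\sup_{0 \le s \le T}\|g(y(s))\|
    \;=\; O(\varepsilon).
\]
```

With T fixed, the prefactor is a constant independent of ε, which is exactly the form of the bound in Theorem B.1.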

C Expectations over the local elds
In this appendix we present the explicit expressions for the expectations over the local fields used to compute the population risk and the ODE terms.

C.1 Population risk
We write the population risk (12) as a sum of Gaussian expectations over pairs of local fields, where Ω^{αβ}_{jl} denotes an element of the covariance matrix of the corresponding local fields, given by Eq. (C.7). As examples, explicit expressions can be written for the lowest-order expectations. The quadratic contribution in (E^ν)² is given by analogous expectations over higher-order products of local fields. Recalling the definition of E^ν, the terms present inside the expectation are exactly those in the learning term of Eq. (14).

D Initial conditions and symmetric teacher
In this work we have constructed teacher matrices W* ∈ R^{k×d} in order to have w*_n · w*_m = d δ_{nm}, where w*_m ≡ [W*]_m ∈ R^d is the m-th row of the matrix W*. We started by sampling k vectors of dimension d uniformly on a ball of radius √d. Then we constructed an orthonormal basis using singular value decomposition. The initial student weights W0 ∈ R^{p×d} were taken as W0 = A W*, with each row of A ∈ R^{p×k} sampled uniformly on a ball of radius one. We acknowledge that choosing the initial student weights as linear combinations of the teacher rows can be artificial and shrinks the first plateau, but our focus in this work was the specialization phase. Nevertheless, this choice and Eq. (D.1) are particularly suitable for theoretical analysis: once A and p are fixed, the dimension d can be varied without changing Q0, M0 and T, thereby removing any influence of different initial conditions for different d and providing the reader better visualization of the learning curves. We chose to sample a_j ≡ [A]_j ∈ R^k on a ball of radius one both to introduce some randomness in the initialization and to keep the initial parameters bounded by one.
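A minimal sketch of this construction (our own variable names; the SVD orthonormalization makes T = W*W*ᵀ/d the identity, so that Q0 = AAᵀ and M0 = A are independent of d):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, p = 1000, 2, 2

# k random directions, orthonormalized via SVD and rescaled to norm sqrt(d),
# so that the teacher-teacher overlap matrix T is the k x k identity
G = rng.standard_normal((k, d))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
Wstar = np.sqrt(d) * (U @ Vt)

# student start: linear combinations of teacher rows, with coefficient
# vectors sampled uniformly in the unit ball (direction x radius^(1/k))
A = rng.standard_normal((p, k))
A /= np.linalg.norm(A, axis=1, keepdims=True)
A *= rng.uniform(0, 1, (p, 1)) ** (1 / k)
W0 = A @ Wstar

T  = Wstar @ Wstar.T / d    # = I_k by construction
Q0 = W0 @ W0.T / d          # = A A^T, independent of d
M0 = W0 @ Wstar.T / d       # = A,     independent of d
```

Because Q0, M0 and T depend only on A, rerunning this with a different d leaves the initial macroscopic state (and hence R0) unchanged, which is the point of the construction.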
We stress that we use these initial conditions to make the data comparable for varying dimension d in the numerical illustrations. Our conclusions do not depend on this particular choice of initial conditions. If one simply takes a random initialization with Gaussian entries of unit variance, the full picture we have presented in this manuscript remains unchanged. In Figure 7 we present an example of curves within the blue region (see Section 3 for the characterization of this regime) with unconstrained Gaussian initialization. Dots represent simulations, while solid lines are obtained by integration of the ODEs given by Eqs. (23), with initial conditions adjusted to match the simulations.
Although varying the initial population risk with d slightly changes the exact position where the specialization transition starts, the particular initial conditions adopted in this work do not affect whether the specialization transition takes place, compared to unconstrained Gaussian initialization.

Figure 1: Phase diagram (left) and typical behavior of the ODEs in each region (right).
with w*_m ≡ [W*]_m ∈ R^d the m-th row of the matrix W* and λ*_m ≡ w*_m · x/√d ∈ R the m-th component of the teacher local field vector λ* ∈ R^k. The parameter Δ ≥ 0 controls the strength of the additive label noise.
Simulations (d = 1000) for κ = 0 comparing different choices of the exponent δ. The final plateau is proportional to the learning rate: R∞ ∝ γΔ.
Population risk dynamics for κ = 0 and δ = −3/8. Dots represent simulations, while the solid line is obtained by integration of the ODEs given by Eqs. (25).
Define the vector λ^{αβ}_{jl} ≡ (λ^α_j, λ^β_l) ∈ R², where the upper indices on the components indicate whether they refer to student or teacher local fields. Consider the covariance matrix on the subspace spanned by λ^{αβ}_{jl}: Ω^{αβ}_{jl} ≡ E_{λ,λ*∼N(λ,λ*|0,Ω)} [ λ^{αβ}_{jl} (λ^{αβ}_{jl})ᵀ ] ∈ R^{2×2}.