An analytical theory of curriculum learning in teacher–student networks

Abstract In animals and humans, curriculum learning (presenting data in a curated order) is critical to rapid learning and effective pedagogy. A long history of experiments has demonstrated the impact of curricula in a variety of animals but, despite its ubiquitous presence, a theoretical understanding of the phenomenon is still lacking. Surprisingly, in contrast to animal learning, curriculum strategies are not widely used in machine learning, and recent simulation studies reach the conclusion that curricula are moderately effective or even ineffective in most cases. This stark difference in the importance of curriculum raises a fundamental theoretical question: when and why does curriculum learning help? In this work, we analyse a prototypical neural network model of curriculum learning in the high-dimensional limit, employing statistical physics methods. We study a task in which a sparse set of informative features is embedded amidst a large set of noisy features. We analytically derive average learning trajectories for simple neural networks on this task, which establish a clear speed benefit for curriculum learning in the online setting. However, when training experiences can be stored and replayed (for instance, during sleep), the advantage of curriculum in standard neural networks disappears, in line with observations from the deep learning literature. Inspired by synaptic consolidation techniques developed to combat catastrophic forgetting, we propose curriculum-aware algorithms that consolidate synapses at curriculum change points and investigate whether this can boost the benefits of curricula. We derive generalisation performance as a function of consolidation strength (implemented as an $L_2$ regularisation/elastic coupling connecting learning phases), and show that curriculum-aware algorithms can yield a large improvement in test performance.
Our reduced analytical descriptions help reconcile apparently conflicting empirical results, trace regimes where curriculum learning yields the largest gains, and provide experimentally-accessible predictions for the impact of task parameters on curriculum benefits. More broadly, our results suggest that fully exploiting a curriculum may require explicit adjustments in the loss.


Introduction
Presenting learning materials in a meaningful order according to a curriculum greatly helps learning in animals and humans [1,2,3,4], and is considered an essential aspect of good pedagogy [5]. For example, humans have been shown to learn visual discriminations faster when presented with examples that exaggerate the relevant difference between classes, a phenomenon known as "fading" [6,7,8]. Beyond humans, curricula in the form of "shaping" or "staircase" procedures are a near-universal feature of task designs in animal studies, without which training often fails entirely. For instance, the International Brain Laboratory task, a standardised perceptual decision-making training paradigm in mice, involves six stages of increasing difficulty before reaching final performance [9].
Building from this intuition, a seminal series of papers proposed a similar curriculum learning approach for machine learning (ML) [10,11,12]. In striking contrast to the clear benefits of curriculum in biological systems, however, curriculum learning has generally yielded equivocal benefits in artificial systems. Experiments in a variety of domains [13,14] have usually found modest speed and generalisation improvements from curricula. Recent extensive empirical analyses have found minimal benefits on standard datasets [15]. Indeed, a common intuition in deep learning practice holds that training distributions should ideally be as close as possible to testing distributions, a notion which runs counter to curriculum. Perhaps the only areas where curricula are actively used are large language models [16] and certain reinforcement learning settings [17].
This gap between the effect of curriculum in biological and artificial learning systems poses a puzzle for theory. When and why is curriculum learning useful? What properties of a task determine the extent of possible benefits? What ordering of learning material is most beneficial? And can new learning algorithms better exploit curricula? Compared to the empirical investigations of curriculum learning, theoretical results remain sparse. Most notably, [18,19] show that curriculum can lead to faster learning in a simple setting, but the effects of curriculum on asymptotic generalisation and the dependence on task structure remain unclear. A hint that curriculum learning might indeed lead to statistically different minima comes from a connection between constraint-satisfaction problems and physics results on flow networks [20], but to our knowledge no direct result has been reported in the modern theoretical ML literature.
In this work we study the impact of curriculum using the analytically tractable teacher-student framework and the tools of statistical physics [21,22,23,24]. High-dimensional teacher-student models are a popular approach for systematically studying learning behaviour in neural networks [25,26,22], and have recently been leveraged to analyse a variety of phenomena [27,28,29,30,31,32]. Using a simple model to build structured data [12], we examine the impact of ordering examples by increasing difficulty (curriculum), decreasing difficulty (anti-curriculum), or standard shuffled training. We derive exact expressions for the online learning dynamics and the performance of batch learning. In the batch case, however, curriculum confers no benefit under standard training in our model setting. Motivated by theories of synaptic consolidation and elastic weight consolidation [33,34], we introduce elastic penalties (Gaussian priors) that regularise training toward solutions obtained in prior curriculum phases, instantiating a long-term memory effect. With these priors, curriculum yields benefits both in the online (Sec. 3) and in the batch (Sec. 4) settings.
Further related work. The first empirical investigation of curriculum learning appeared in 1927 [35], consisting of a visual discrimination task for dogs under curriculum and no-curriculum paradigms. Later behavioural studies proved curricula to be beneficial independent of the animal (dogs, mice, rats, pigeons, humans) and the data modality (visual, auditory, or tactile stimuli) [36,1,2,37,38,6]. However, these effects were not reproduced in standard artificial neural networks (ANNs). Several ideas were proposed in the connectionist community in order to show curriculum effects in the learning dynamics of ANNs [39,40,10,11]. While these studies were able to match previous experimental data, they also required substantial changes in the architecture of the ANN and/or in the learning rule.
Except for very few instances [16,17], standard ML practice tends to avoid taking curricula into account. An obvious obstacle is the fact that most datasets do not provide meta-data about sample difficulties. An interesting line of research pointed out the possible relevance of implicit curricula, based on the observation that neural networks tend to consistently learn the samples in a certain order [41]. Thus, a possible way of addressing the lack of difficulty labels would be to use the natural learning order as indicative of the various difficulties of the training samples. However, a recent work [15], which compared several heuristics for curriculum learning (including implicit curricula) in a variety of settings, showed limited benefits with this strategy.
The picture that emerges from the literature seems contradictory: on the one hand, curricula appear fundamental to biological learning; on the other hand, curricula appear largely irrelevant in many machine learning settings. The core motivation behind our work is to reconcile these views and contribute to a theoretical understanding of curriculum learning.

Figure 1: Since the teacher network is sparse, its output depends only on a subset of relevant input features. (b) We consider curricula which order examples by difficulty, here taken to be the variance in the irrelevant feature dimensions. We refer to increasing, decreasing, and random difficulty order as curriculum, anti-curriculum, and no curriculum, respectively. (c) Example test error on hard examples for the student over training. The switch-point between easy and hard samples lies at α = 1/2. Solid lines show numerical simulations, while dashed lines show theoretical predictions derived in Section 3. For this particular parameter setting, curriculum speeds learning but only modestly improves final performance at α = 1. Parameters:

Model definition and overview of approach
In the following, we revisit a prototypical model of curriculum learning from [12] that finds correspondence to the fading literature [6] as highlighted in Sec. 5. Our setting is summarised in Fig. 1.
The model entails a simple teacher-student setup, where teacher and student are each single-layer neural networks of size $N$ (also known as perceptrons). The learning task for the student is a binary classification problem, with dataset $\mathcal{D} = \{(y^\mu, \mathbf{x}^\mu)\}_{\mu=1}^{M}$, where the ground-truth labels are produced by the teacher network, $y^\mu = \mathrm{sign}(\mathbf{W}_T \cdot \mathbf{x}^\mu)$. The student learns via empirical risk minimisation of an $L_2$-regularised convex loss.
A key feature of this model is that the teacher network is sparse, with only a fraction $\rho < 1$ of non-zero components, each drawn from $\mathcal{N}(0, 1)$. Therefore, in order to achieve a good test accuracy, the student has to guess which components should be set to zero and align the relevant weights in the correct direction. A large range of $0 < \rho < 1$ could give rise to the phenomenology we seek to analyse. In the remainder of the paper we will focus on the case $\rho = 0.5$.
We model the variable degree of difficulty in the samples by decomposing each input vector as $\mathbf{x}^\mu = (\mathbf{x}^\mu_r, \mathbf{x}^\mu_i)$, where $\mathbf{x}^\mu_r \in \mathbb{R}^{\rho N}$ denotes the relevant components of the input, and $\mathbf{x}^\mu_i \in \mathbb{R}^{(1-\rho)N}$ the irrelevant ones. Note that, crucially, the sparse teacher network is completely blind to the irrelevant part of the input: $y^\mu = \mathrm{sign}\left(\sum_{j=1}^{\rho N} W_{T,j}\, x^\mu_{r,j}\right)$. While the relevant components are i.i.d. standard Gaussian, $x^\mu_{r,j} \sim \mathcal{N}(0, 1)$ for all $\mu$, we take the variance of the irrelevant components to be sample-dependent, $x^\mu_{i,j} \sim \mathcal{N}(0, \Delta^\mu)$. A smaller variance in the irrelevant part induces a higher signal-to-noise ratio (SNR) in the student learning problem.
The dataset is partitioned according to difficulty levels given by the variances of the irrelevant inputs. For simplicity we consider only two partitions in most of our analysis, but generalisations to multiple difficulty levels follow straightforwardly. We thus have a dataset with $M = (\alpha_1 + \alpha_2) N = \alpha N$ samples in total. In the first $\alpha_1 N$ samples the irrelevant inputs have variance $\Delta_1$, while for the remaining $\alpha_2 N$ samples the variance is $\Delta_2 > \Delta_1$. In the curriculum learning condition we present the easy examples first, while in the anti-curriculum condition we present the hard examples first. Standard learning presents examples shuffled in random order.
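The data model described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`make_teacher`, `make_slice`, `make_dataset`), the choice to store the relevant components in the first block of each vector, and the default parameter values are all hypothetical.

```python
import numpy as np

def make_teacher(n, rho, rng):
    """Sparse teacher: rho*n weights drawn from N(0, 1), all others zero."""
    n_rel = int(rho * n)
    teacher = np.zeros(n)
    teacher[:n_rel] = rng.standard_normal(n_rel)  # assumption: relevant block first
    return teacher, n_rel

def make_slice(teacher, n_rel, n_samples, delta, rng):
    """Draw inputs whose irrelevant components have variance delta."""
    x = rng.standard_normal((n_samples, teacher.size))
    x[:, n_rel:] *= np.sqrt(delta)
    # The teacher only sees the relevant components of the input.
    y = np.sign(x[:, :n_rel] @ teacher[:n_rel])
    return x, y

def make_dataset(n=200, rho=0.5, alpha1=1.0, alpha2=1.0,
                 delta1=0.0, delta2=1.0, order="curriculum", seed=0):
    """Return (teacher, x, y) with samples arranged by the chosen protocol."""
    if order not in ("curriculum", "anti", "shuffled"):
        raise ValueError(f"unknown order: {order}")
    rng = np.random.default_rng(seed)
    teacher, n_rel = make_teacher(n, rho, rng)
    easy = make_slice(teacher, n_rel, int(alpha1 * n), delta1, rng)
    hard = make_slice(teacher, n_rel, int(alpha2 * n), delta2, rng)
    # Curriculum: easy then hard. Anti-curriculum: hard then easy.
    first, second = (hard, easy) if order == "anti" else (easy, hard)
    x = np.vstack([first[0], second[0]])
    y = np.concatenate([first[1], second[1]])
    if order == "shuffled":
        perm = rng.permutation(len(y))
        x, y = x[perm], y[perm]
    return teacher, x, y
```

For example, `make_dataset(n=100, order="anti")` returns 200 samples in which the hard slice comes first and the easy slice last.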

Online dynamical solution in the large input limit
We start by focusing on the same online learning setting explored in [12]. We consider a 1-layer student network with sigmoidal activation function, $\sigma(\cdot) = \mathrm{erf}(\cdot/\sqrt{2})$, that learns to minimise a mean square error loss with $L_2$ regularisation of intensity $\gamma$, using gradient descent. This yields the update rule of Eq. (1).

The dynamics of the model can be analysed in the high-dimensional limit $N, M \to \infty$ with $\alpha = M/N = O(1)$. Generalising the results of [26,42] on the online stochastic gradient descent dynamics in single-layer regression problems, we obtain a precise description of the performance at all times, as a function of several order parameters: the squared norms of the relevant and irrelevant parts of the student weights, $Q_r = \frac{1}{N} \mathbf{W}_r \cdot \mathbf{W}_r$ and $Q_i = \frac{1}{N} \mathbf{W}_i \cdot \mathbf{W}_i$, respectively; the overlap of the relevant weights of the student and teacher, $R = \frac{1}{N} \mathbf{W}_r \cdot \mathbf{W}_T$; and the squared norm of the teacher vector, $T = \frac{1}{N} \mathbf{W}_T \cdot \mathbf{W}_T$. In particular, given $Q_r$, $Q_i$, $R$ and $T$, both the test loss (i.e. the average loss on a new example) and the test accuracy (Eq. 2) on a dataset with variance $\Delta$ in the irrelevant inputs can be written in closed form. If the dataset contains a random mixture of different difficulty levels $\Delta_1, \Delta_2, \ldots$, the loss and accuracy can be obtained by taking a weighted average over the partitions.
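To make the role of the order parameters concrete, the sketch below computes them from a pair of weight vectors and evaluates the test accuracy. It is an illustration under assumptions rather than a transcription of the paper's formula. It relies on the standard result that two zero-mean jointly Gaussian fields with correlation $c$ share the same sign with probability $1 - \arccos(c)/\pi$, and on the fact that the sign of an erf student's output equals the sign of its pre-activation. The function names and the convention that relevant components come first are hypothetical.

```python
import numpy as np

def order_parameters(student, teacher, n_rel):
    """Return (Q_r, Q_i, R, T), all normalised by the input dimension N."""
    n = student.size
    w_r, w_i = student[:n_rel], student[n_rel:]
    q_r = w_r @ w_r / n              # squared norm of relevant student weights
    q_i = w_i @ w_i / n              # squared norm of irrelevant student weights
    r = w_r @ teacher[:n_rel] / n    # student-teacher overlap
    t = teacher @ teacher / n        # squared norm of the teacher
    return q_r, q_i, r, t

def accuracy_from_order_parameters(q_r, q_i, r, t, delta):
    """Sign-agreement probability of student and teacher fields.

    The student field has variance Q_r + delta * Q_i, the teacher field has
    variance T, and their covariance is R (assumes a non-zero student).
    """
    cos = r / np.sqrt(t * (q_r + delta * q_i))
    return 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
```

A student identical to the teacher reaches accuracy 1, a student with zero overlap sits at chance (0.5), and increasing `delta` at fixed weights lowers the accuracy whenever the student carries weight on irrelevant components.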
To understand how test performance changes through learning, we study the evolution of the order parameters. Combining their definition with the definition of the dynamics (1) and the fact that the random variables concentrate in high dimensions as $N \to \infty$, we obtain an analytic form for the updates of $Q_r$, $Q_i$ and $R$, expressed through functions $f_{Q_r}$, $f_{Q_i}$ and $f_R$, which are long but explicit expressions reported in the supplementary material (SM).
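The theoretical trajectories are compared against direct simulations of online learning. A minimal simulation sketch is given below. Several choices are assumptions made for illustration and may differ from the paper's Eq. (1): the pre-activation is scaled as $\lambda = \mathbf{W} \cdot \mathbf{x} / \sqrt{N}$, the gradient step is scaled by $\eta/\sqrt{N}$, weight decay by $\eta\gamma/N$, and the student starts from zero weights. All function names are hypothetical.

```python
import math
import numpy as np

def online_sgd(x, y, eta=1.0, gamma=0.0):
    """Single pass over (x, y) in the given order; each sample is used once."""
    n_samples, n = x.shape
    w = np.zeros(n)  # assumption: zero initialisation
    for mu in range(n_samples):
        lam = w @ x[mu] / math.sqrt(n)                    # student pre-activation
        err = y[mu] - math.erf(lam / math.sqrt(2.0))      # y - sigma(lam)
        dsigma = math.sqrt(2.0 / math.pi) * math.exp(-0.5 * lam * lam)
        # Gradient step on the squared error plus weight decay (assumed scalings).
        w = (1.0 - eta * gamma / n) * w + (eta / math.sqrt(n)) * err * dsigma * x[mu]
    return w

def hard_accuracy(w, teacher, n_rel, delta):
    """Accuracy on inputs with irrelevant variance delta, via the order parameters."""
    n = w.size
    q_r = w[:n_rel] @ w[:n_rel] / n
    q_i = w[n_rel:] @ w[n_rel:] / n
    r = w[:n_rel] @ teacher[:n_rel] / n
    t = teacher @ teacher / n
    cos = r / math.sqrt(t * (q_r + delta * q_i))
    return 1.0 - math.acos(max(-1.0, min(1.0, cos))) / math.pi

def run(order, n=200, rho=0.5, alpha1=1.0, alpha2=1.0,
        delta1=0.0, delta2=1.0, eta=1.0, gamma=0.0, seed=0):
    """Train one student under a given protocol and report hard-example accuracy."""
    rng = np.random.default_rng(seed)
    n_rel = int(rho * n)
    teacher = np.zeros(n)
    teacher[:n_rel] = rng.standard_normal(n_rel)

    def draw(m, delta):
        x = rng.standard_normal((m, n))
        x[:, n_rel:] *= math.sqrt(delta)
        return x, np.sign(x[:, :n_rel] @ teacher[:n_rel])

    easy, hard = draw(int(alpha1 * n), delta1), draw(int(alpha2 * n), delta2)
    first, second = (hard, easy) if order == "anti" else (easy, hard)
    x = np.vstack([first[0], second[0]])
    y = np.concatenate([first[1], second[1]])
    if order == "shuffled":
        perm = rng.permutation(len(y))
        x, y = x[perm], y[perm]
    w = online_sgd(x, y, eta=eta, gamma=gamma)
    return hard_accuracy(w, teacher, n_rel, delta2)
```

Calling `run("curriculum")`, `run("anti")` and `run("shuffled")` gives one noisy finite-size sample per protocol; averaging over seeds and tuning `eta` and `gamma` per protocol would be needed before comparing protocols, since a single run at small `n` carries sizeable fluctuations.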
Dynamical advantages of curriculum. With these theoretical results in hand, we can now characterise the performance of curricula in the online setting. We obtain a description of the learning trajectories for each learning protocol, yielding the evolution of training and test accuracies, and of other observables such as the norm of the student and its overlap with the teacher.
Solving the dynamical equations gives two key advantages relative to simulating models in this setting. First, they are free of finite size effects and stochastic fluctuations. Second, their evaluation is very fast (up to 6 orders of magnitude reduction in simulation time, see SM E), enabling systematic exploration of the parameter space of the problem, along with fine-grained optimisation over hyper-parameters such as learning rate, weight decay and scaling in the initialisation.
Optimising final test accuracy separately for each curriculum strategy, we find that curriculum learning is the optimal strategy, followed by baseline (no-curriculum) and lastly anti-curriculum. In Fig. 1c we show typical learning trajectories for a dataset with equal numbers of easy and hard samples. The results of the simulations (solid lines) are well-described by our theoretical equations (dashed lines), and show that the curriculum strategy leads to better performance throughout training. Fig. 1c shows the evolution during training of the test accuracy computed on the whole dataset.
Next, we systematically trace the effect of curriculum for a range of total dataset sizes ($\alpha_1 + \alpha_2$) and number of easy examples $\alpha_1$ in the phase diagram in Fig. 2. This diagram shows in panels (a) and (b) the accuracies on hard instances reached at the end of training, by curriculum learning and anti-curriculum learning respectively, normalised by the accuracy reached by the standard strategy.
The two heatmaps show that curriculum learning always outperforms standard learning and that, on the other hand, anti-curriculum learning outperforms standard learning only in part of the diagram.
Comparing the two strategies, in Fig. 2 (c), we can observe that there is a region for small $\alpha$ and $\alpha_1$ where anti-curriculum learning is the best strategy, while in the majority of the situations curriculum learning is best. Interestingly, there is a sizeable region of the diagram in which both curriculum and anti-curriculum help, possibly explaining why both have been recommended in prior work [12,14,43,44,45]. A possible intuition behind this counter-intuitive phenomenon highlighted by our analysis is that, in some settings, the large amount of noise contained in the hard data will always be too disruptive for effective learning. Thus, leaving the easy (cleaner) data for last could allow the model to better exploit it.
Further, we find that our setting, in which a small task-relevant signal is embedded in large task-irrelevant variation, is critical to the benefit of curriculum. Fig. 4 shows performance as a function of sparsity $\rho$; additional details are deferred to SM C. Non-sparse tasks do not benefit. Hence curriculum aids tasks with many irrelevant factors of variation. Interestingly, the literature from human psychology shows precisely this: no curriculum benefits for low-dimensional tasks or tasks with no variation in irrelevant dimensions [6].
Our results also highlight the intricate dependence of curriculum on parameters of the learning setup.
If not all parameters are correctly optimised, we can observe more complex scenarios. For instance, the initialisation condition for the norm of the weights of the student plays an important role. We explore this dependence by changing the variance of the normal distribution from which the initial weights are sampled. We observe that anti-curriculum learning becomes the best strategy when the variance is large, as shown in Fig. 3 for weights of order 1. In this case, curriculum learning shows an advantage only in the first phase when easy examples are shown, which is consistent with the results of [19]. However, in the next phase when hard examples are shown, the curriculum strategy does not extract enough information and it is outperformed by the other two strategies. The fact that curriculum or anti-curriculum can look better depending on the parameter setting might help explain the confusion in the literature over the best protocol [12,14,43,44,45]. At least in this model, better performance from anti-curriculum is a signature of a sub-optimal choice of the parameters.
To summarise our findings in this online learning setting, curriculum mainly offers a dynamical advantage: it speeds up learning but has minimal impact on asymptotic performance.

Batch learning solution
The previous section discussed the online case where each example is used once and then discarded. However, in common machine learning practice, neural networks typically revisit each sample repeatedly until convergence. Therefore an important question is: can curricula lead to a generalisation improvement when the network is trained on the same dataset until convergence?
We investigate this question by considering a student that learns from slices of a dataset in distinct optimisation phases, where in each phase the student optimises an $L_2$-regularised logistic loss. Without further modification, curriculum can have no effect in this setting: due to the convex nature of the teacher-student setup [22], the network is bound to converge to a minimum uniquely determined by the final slice of data, with no memory of the progress made at intermediate steps. This simple observation may help explain empirical observations on real data, such as [15], which find no benefit of curriculum in standard settings. In principle, curriculum could still influence non-convex problems [12], but empirical results in the ML field do not show clear signals of memory retention. A possible explanation of this is that relying on dynamical memory effects requires careful tuning of the learning rate and of the number of training epochs, while typical choices for these hyperparameters could lead to memory loss and performance inconsistencies. These observations raise the theoretical question of how to better implement curriculum learning to induce a non-vanishing effect also in batch learning settings.
To instantiate a long-term memory effect in our model, we propose biasing the optimisation landscape via a Gaussian prior, centred around the optimiser of the previous learning phase. The additional term in the loss acts as an elastic coupling between the successive phases, and the associated intensity $\gamma_{12}$ is then an additional hyper-parameter of the model. This scheme is similar to regularisation methods proposed against catastrophic interference in continual learning, such as Synaptic Intelligence [46]. Changing the loss according to the curriculum prescription effectively makes the learning algorithm aware of the different levels of difficulty in the dataset.
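The two-phase scheme with elastic coupling can be sketched as follows. This is a hedged illustration, not the authors' implementation: the loss is assumed to be a sum of logistic terms plus $\frac{\gamma}{2}\|\mathbf{W}\|^2$ and $\frac{\gamma_{12}}{2}\|\mathbf{W} - \mathbf{W}_1\|^2$, the optimiser is plain gradient descent with a step size set from a curvature bound, and the names `fit_phase` and `two_phase_fit` are hypothetical.

```python
import numpy as np

def fit_phase(x, y, gamma, anchor=None, gamma12=0.0, n_steps=2000, w_init=None):
    """Minimise sum_mu log(1 + exp(-y_mu * lam_mu)) + gamma/2 |w|^2
    + gamma12/2 |w - anchor|^2, with lam_mu = w . x_mu / sqrt(N)."""
    n_samples, n = x.shape
    anchor = np.zeros(n) if anchor is None else anchor
    w = np.zeros(n) if w_init is None else w_init.copy()
    xs = x / np.sqrt(n)
    # Upper bound on the largest Hessian eigenvalue gives a safe step size.
    lipschitz = 0.25 * np.linalg.norm(xs, 2) ** 2 + gamma + gamma12
    for _ in range(n_steps):
        margins = y * (xs @ w)
        # sigmoid(-margin), written with tanh for numerical stability
        sig = 0.5 * (1.0 - np.tanh(0.5 * margins))
        grad = -(xs.T @ (y * sig)) + gamma * w + gamma12 * (w - anchor)
        w = w - grad / lipschitz
    return w

def two_phase_fit(x1, y1, x2, y2, gamma=0.1, gamma12=1.0):
    """Phase 1 on the first slice; phase 2 on the second slice, elastically
    coupled to the phase-1 optimiser with intensity gamma12."""
    w1 = fit_phase(x1, y1, gamma)
    w2 = fit_phase(x2, y2, gamma, anchor=w1, gamma12=gamma12)
    return w1, w2
```

With `gamma12=0.0` the second phase converges to the unique minimiser of the second slice, whatever happened in the first phase, which is the convexity argument made above. With a very large `gamma12` the second-phase solution stays pinned to the first-phase one. Intermediate values interpolate between forgetting and freezing, which is why `gamma12` is treated as a hyper-parameter to be tuned.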
Tools from statistical physics can be used to analytically compute test performance under this scheme. In order to simplify the presentation, we first consider just two learning phases. It is natural to frame this setting as a 2-level problem, involving two systems with independent copies of the network weights $\mathbf{W}_1$ and $\mathbf{W}_2$. In a typical statistical physics approach, we associate a Boltzmann-Gibbs measure to the systems, with an energy function determined by the regularised logistic loss $L_\gamma$. While the statistical properties of the first system can be determined self-consistently, the added elastic interaction creates a dependence of the second measure on the configurations of the first system. In mathematical terms, the coupled system is represented by a partition function defined over the two dataset slices $\mathcal{D}_1$ and $\mathcal{D}_2$. This object represents the normalisation of the Boltzmann-Gibbs measure, and allows one to extract relevant information on the asymptotic behaviour of our model. The optimisations entailed in each learning phase can be described in the "low noise" limit of $\beta_1, \beta_2 \to \infty$, where the measures focus on the minimisers of the respective losses. In order to study a self-averaging quantity that does not depend on a specific realisation of the dataset, we aim to compute the associated average free-energy. This quantity can be seen as a special case of the so-called Franz-Parisi potential computation [47,48], and the entailed double average can be evaluated through the replica method. We refer to the SM for details.
Similar to the online case, in high dimensions the free-entropy concentrates on a deterministic function that depends on several order parameters that capture the geometrical distribution of teacher and student configurations. In addition to those already introduced in Sec. 3, we also have $\delta Q$, which is linked to the variance of the student norm. Moreover, for each order parameter we also need to introduce a conjugate parameter, denoted in the following with the hat symbol. The final expression for the free-energy is written in terms of two scalar functions, $g_S$ and $g_E$, often called entropic and energetic channels, which encode the dependence of the optimisation problem on the Gaussian prior and the logistic loss, respectively. The extremum condition for the free-energy yields a system of fixed-point equations whose solution gives an asymptotic prediction for the order parameters, comparable with the results of numerical simulations on large instances (Fig. 4). At convergence, the order parameters can be inserted again in Eq. 2 to obtain an estimate of the test accuracy. Note that this formalism is not limited to two phases, but can be extended to the case of a discrete number of sequential stages.
The importance of sparsity. Sparsity is a key ingredient in determining the impact of curriculum strategies. It naturally introduces a notion of relevant and irrelevant inputs, and defines a secondary learning goal: identifying what part of the presented data should be disregarded by the model. Curriculum learning can aid this identification process, since the easy samples are more transparent to this structure. This is also observed in human experiments [6]. However, the relative difficulty of the problem of inferring the support of the teacher and the problem of aligning with its non-zero components depends on the degree of sparsity $\rho$, so the effectiveness of curriculum can vary with it.
In the right panel of Fig. 4, we explore the interplay between the sparsity of the teacher $\rho$ and the fraction of easy samples in the dataset $\alpha_1$, comparing curriculum with the no-curriculum baseline. The phase diagram highlights the variability in the impact of the curriculum ordering:
• Curriculum is most effective at low values of $\rho$ and close to the diagonal, where the fraction of easy examples in the dataset is comparable to the fraction of relevant dimensions.
• When $\rho > 0.5$, the possible gain from ordering the samples according to difficulty is counterbalanced by the intrinsic cost of splitting the information content into two blocks, thus curriculum can become detrimental.
• When $\alpha_1$ is too small compared to $\rho$ (above diagonal), the first stage in the curriculum strategy can only help in the support identification problem, but will not allow a good estimation of the direction of the teacher. Because of the elastic prior, the second stage cannot improve much over it and the effect of curriculum is small.
• When $\alpha_1$ is larger than the sparsity (below diagonal), the easy examples contain sufficient information for solving both the support and the teacher estimation problems, and this information is also exploited by the baseline. Thus the improvement of curriculum becomes negligible.
We refer to the SM for an in-depth comparison with anti-curriculum.

Asymptotic advantages of curriculum. Contrary to the case of online SGD, if the fraction of relevant directions is small, batch learning with elastic coupling notably improves test accuracy of both curriculum and anti-curriculum above the baseline. This confirms the utility of curriculum strategies when the signal is partially "hidden in clutter" [49]. Fig. 5 shows similar phase diagrams to Fig. 2 but for the batch setting. At each point in the phase diagram the regularisation level $\gamma_1 = \gamma_2$ and the coupling $\gamma_{12}$ are optimised to yield the best accuracy. We find that the performance order is nearly always preserved: curriculum followed by anti-curriculum followed by baseline. In the SM we see similar improvements by applying the elastic coupling strategy both in the online setting and on real data. In summary, in the batch setting, splitting the learning process in stages might not be advantageous per se. However, our observations show that if the loss is modified to reduce memory loss between the learning stages, curriculum learning strategies can offer a measurable asymptotic advantage.

Connection with experimental literature
Recent work has suggested that curriculum learning could provide an important window into the learning algorithms at work in biology [51]. Our analysis makes several predictions for curriculum effects. In this section we assess these predictions based on connections to extant experiments and propose future experimental tests.
First, we find that a curriculum strategy yields a speed-up in learning in all the tested settings (see Fig. 1c). This acceleration is broadly consistent with the findings from cognitive science [1,2,6]. By contrast, our results show that the speed improvement does not necessarily translate into a sizeable generalisation error improvement, and the performance achieved at the end of training can even deteriorate when learning hyperparameters are not fully optimised (cf. Fig. 3). Deterioration due to curricula has generally not been reported in the psychology literature, though it has been observed in ML [15]. This fact may suggest that animals naturally learn with near-optimal hyperparameters such that curricula generally confer benefits.

Figure 6: The ratio shows non-monotonic behaviour. Bottom: The accuracy ratio obtained by [50]. Parameters $\rho = 0.5$, $\Delta_1 = 0.0$, $\Delta_2 = 1.0$, $\alpha_1 = 1$, $\alpha_2 = 1$ and optimal learning rate, norm at initialisation and weight decay intensity. (b) Top: Dependence on the sparsity of the generalisation gain of curriculum over no-curriculum, measured as ratio between final accuracy, for fixed total dataset size ($\alpha_1 + \alpha_2 = 1$). Bottom: The ratio obtained from experiments 3 and 4 of [6]. (c) Example cartoon stimuli from the "fading" paradigm used in [6], where participants distinguish daemons of the old world from daemons of the new world. The distinguishing feature (horn length) is diluted among many irrelevant features (colour, eye size, mouth size). Highlighting the relevant feature to participants leads to better and faster learning.

A more specific observation concerns the performance on different difficulties after learning. As reported in [50], human and rodent subjects trained in an auditory task using curricula showed the greatest improvement for intermediate levels of difficulty, as depicted in Fig. 6a bottom panel. The same conclusion can be drawn from the experiment of [7,8], where, surprisingly, subjects trained with curricula to classify medical images showed poor performance in hard tasks compared to the control group. To address this phenomenon, we calculate accuracy as a function of difficulty in the model in Fig. 6a top panel. Consistent with these experiments, we find regimes where the gap between curriculum learning and the baseline is non-monotonic, with the largest performance gain for intermediate difficulties. Contrary to [7,8], however, we do not observe negative effects of curriculum for high difficulties. Further experiments that more systematically manipulate training and transfer difficulties could provide a stronger test of these predictions.
A key ingredient in our model is the role of sparsity, such that a small signal is embedded amidst many irrelevant features. Experimentally, the importance of having many factors of variation to obtaining a curriculum effect has been documented in the "fading" experiments of [6]. Human subjects were trained on classification tasks involving stimuli with one task-relevant feature dimension and a variable number of task-irrelevant feature dimensions. Example cartoon "daemon" stimuli are depicted in Fig. 6c, where for instance horn height might be the distinguishing feature while colour, eye size, and mouth size might constitute task-irrelevant features. Without any irrelevant factors of variation ($\rho = 1$), they report no curriculum benefit. By contrast, when 75% of features are irrelevant ($\rho = 0.25$), they record a strong curriculum effect, as shown in Fig. 6b bottom. This qualitative trend is also observed in our model (Fig. 6b top). While these experiments tested only two sparsity levels, further experiments could sample this dimension more extensively and test for interactions with the fraction of easy and hard examples.

We note that while the connectionist literature has addressed the effect of curriculum in several settings [39,40,10,11], we found that easy-to-hard effects appear even in a simple setup without need for complex networks and/or dynamics.

Finally, our results may shed light on self-generated curricula during human development [52,53]. Children undergo a vocabulary spurt that coincides with their ability to grasp and centre objects in the visual field [53]. Quantitative estimates of the amount of clutter (irrelevant objects) in self-generated views decrease due to this grasping ability, yielding a self-generated curriculum [49,54]. Our model similarly predicts that reducing clutter should improve learning speed and performance.
Real-World Demonstration. To verify this prediction in a richer visual setting, we construct a simple cluttered object classification task from the CIFAR10 dataset [55] by patching two images together into a 32 × 64 input image (Fig. 7a). The task is to produce the class label of the image on the left. The right image is a distractor that is irrelevant to the classification. To vary difficulty, we scale the contrast of the irrelevant image (Fig. 7a-d). We train a single-layer network with the cross-entropy loss and the curriculum protocol with Gaussian prior between two curriculum stages, implemented in PyTorch Lightning to ensure that training parameters accord with standard practice. We optimised hyperparameters in each curriculum phase separately. We trained all combinations of five elastic penalties logarithmically spaced between 1e-3 and 1e2, and weight decay parameters {0, 0.2, 0.5}. We then compute the best performing model for five random seeds and take the mean over seeds. Further dataset, model and experimental details are given in Appendix D. As shown in Fig. 7b, curriculum improves performance, particularly when easy examples make up a large proportion of the dataset, confirming that curricula that reduce clutter can benefit learning.
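The construction of a cluttered input can be sketched as below. This is an illustrative sketch only: random arrays stand in for CIFAR10 images, and "scaling the contrast" is assumed here to mean shrinking pixel deviations around the distractor's mean intensity, which may differ from the exact transformation used in the experiments. The function name `patch_with_distractor` is hypothetical.

```python
import numpy as np

def patch_with_distractor(target, distractor, contrast):
    """Concatenate a target image (left) with a contrast-scaled distractor (right).

    Both images are arrays of shape (height, width, channels); the output has
    twice the width. contrast=1 keeps the distractor unchanged, contrast=0
    flattens it to its mean intensity (assumed definition of contrast scaling).
    """
    mean = distractor.mean()
    scaled = mean + contrast * (distractor - mean)
    return np.concatenate([target, scaled], axis=1)

# Stand-ins for two 32 x 32 RGB images with pixel values in [0, 1].
rng = np.random.default_rng(0)
target = rng.random((32, 32, 3))
distractor = rng.random((32, 32, 3))

easy_input = patch_with_distractor(target, distractor, contrast=0.1)  # little clutter
hard_input = patch_with_distractor(target, distractor, contrast=1.0)  # full clutter
```

The label of each patched input is the class of the left (target) image only, so lowering `contrast` reduces the variance of the task-irrelevant half of the input, mirroring the role of $\Delta$ in the synthetic model.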

Conclusions
We analysed a model of curriculum learning introduced by [12] and amenable to analytical treatment. This simple setting sheds light on results observed in the cognitive science and machine learning literature, and the theoretical tractability allows for exploration of a wide range of parameters that would be costly to obtain through experiments. Future work will need to move beyond models with simple loss landscapes to address the impact of curricula in complex tasks like reinforcement learning. Nevertheless, the model recapitulates a variety of observations in the literature [50,56,57], revealing that easy-to-hard effects can appear when a sparse signal is embedded in many irrelevant dimensions of variation. We find that making the algorithm curriculum-aware by modifying the loss can better exploit curricula, offering a potential route for improved practical algorithms. Other curriculum-aware approaches are possible, such as adapting the learning algorithm [58] or the architecture [10]. On the psychology side, our predictions can help in designing new experiments, for instance testing the counter-intuitive benefit of anti-curriculum learning for intermediate sparsity.
(A.13) where the expectation is taken with respect to all the stochastic variables. In order to obtain explicit formulae we need to evaluate those averages. The random variables in the equations, $\lambda_r$, $\lambda_i$ and $\rho$, are Gaussian with zero mean, so to characterise them we only need their covariance: In order to derive analytical expressions we must evaluate the expected values: and where $\sigma$ is the activation function of the student and $\varphi$ is the activation function of the teacher (in particular $\varphi(\cdot) = \mathrm{sign}(\cdot)$ for classification).
Finally, we can substitute those equations into Eqs. (A.10-A.12) and obtain the state evolution equations used in Sec. 3 of the main text: these two additional random variables need to be averaged together with the others. The joint distribution of $\lambda_r$, $\lambda_i$, $\tilde\lambda_r$, $\tilde\lambda_i$, $\rho$ is still Gaussian with zero mean and covariance Notice that, apart from a slight change to the existing equations, the coupling introduces only two additional integrals, $\mathbb{E}[\delta\,\sigma(\lambda_r + \lambda_i)\,\tilde\lambda_r]$ and $\mathbb{E}[\delta\,\sigma(\lambda_r + \lambda_i)\,\tilde\lambda_i]$. After long but straightforward computations we obtain

B Replica computation for the batch case
We present here the detailed replica computation employed to obtain the analytic description of curriculum learning in the batch case, in Section 4. As mentioned in the main text, we aim to study a coupled system, represented by the following partition function: This type of quantity is usually denoted as a "disordered" partition function in statistical physics jargon, meaning that it still depends on a given realisation of the datasets, i.e., the source of disorder in this model. We want to characterise a typical realisation of this object in the high-dimensional limit. However, because of its long-tailed statistics, the partition function turns out not to be a self-averaging quantity, i.e. its expectation over the dataset realisations will not correspond to the typical-case scenario we are after. It is instead better to focus on the computation of the associated average free-entropy: What is immediately apparent is that we have to take the expectation of a logarithm, which is not tractable with rigorous mathematical methods. Moreover, we also have to average over the measure for $W_1$, which is also a complicated operation.
Fortunately, replica theory offers a method for approaching this calculation [47,48]. The idea is to exploit two separate replica tricks:
• in order to evaluate the disorder average, the logarithm can be removed by replicating the second weight configuration, i.e. introducing $n$ identical replicas $\{W_2^a\}_{a=1}^{n}$, and extrapolating the final result from the $n \to 0$ limit. This is based on the mathematical identity $\log x = \lim_{n \to 0} \partial_n x^n$.
• the average over the teacher can instead be computed by introducing $\tilde n - 1$ non-interacting replicas and a single interacting replica of the first weight configuration, $\{W_1^c\}_{c=1}^{\tilde n}$. Thus, only the $c = 1$ replica will enter the Gaussian prior in the student measure. The sought statistical average is again recovered in the limit $\tilde n \to 0$.
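The identity used in the first replica trick follows from a first-order expansion in $n$. The short derivation below is the standard argument, written for a generic positive quantity $x$; it assumes that the limit and the derivative can be exchanged with the disorder average, which the replica method takes for granted rather than proves:

$$ x^n = e^{n \log x} = 1 + n \log x + O(n^2) \quad\Longrightarrow\quad \lim_{n \to 0} \partial_n x^n = \lim_{n \to 0} x^n \log x = \log x . $$

Applied to a partition function $Z$, this gives $\mathbb{E} \log Z = \lim_{n \to 0} \partial_n \mathbb{E} Z^n$, where $\mathbb{E} Z^n$ is evaluated for integer $n$ and then continued to $n \to 0$.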
Because of the high-dimensional limit we are considering, all typical realisations of the teacher vector with a given sparsity $\rho$ will yield an identical free-entropy. Thus, we can avoid averaging and instead fix a gauge $W_{T,i} = 1$ for $i = 1, \dots, \rho N$ and $W_{T,i} = 0$ elsewhere. In order to simplify the presentation, in the following we will assume that the datasets contain respectively $\alpha_1$ and $\alpha_2$ patterns, and that a curriculum ordering was employed, $\Delta_1 < \Delta_2$. Moreover, to avoid confusion with component and replica indices, we will write $\tilde{W} = W_1$ and $W = W_2$, so that all quantities with a tilde refer to the optimisation on the first dataset.
After the described replication procedures, we get the following expression for the average free-entropy: where $\ell(y, \hat y) = \log(1 + e^{-y \hat y})$ indicates the standard logistic loss. The next step is to explicitly compute the averages over the dataset realisations. Before doing that, we need to isolate the dependence of our expression on the patterns, and we achieve this by introducing Dirac's δ-functions for the pre-activations. We will use the integral representation of the δ-function, with integration variables $u$ for the teacher pre-activations and $\lambda$ for the student pre-activations: µ,a e − β 2 (sign(u2µ),σ(λ a 2µ )) .
Thus, the disorder average is now factorised and only involves exponential terms. Since the two datasets are independent now that we made the teacher explicit, we can take the averages over each one separately. In both cases we get: This expression suggests which order parameters capture the interactions of the model, namely:
• the teacher-student overlap at the end of the first learning phase: $\tilde{R}^c =$
• the norm of the student after the second stage, decomposed into relevant/irrelevant parts:
Therefore, after introducing these definitions by means of Dirac's δ-functions, we can rewrite our replicated expression as: where we introduced interaction, entropic and energetic potentials: $\ldots\, \frac{d\lambda^a\, d\hat\lambda^a}{2\pi}\, e^{i \lambda^a \hat\lambda^a} \ldots$ (B.10)

Replica Symmetric Ansatz
The replica trick allowed us to express the average free-entropy as a function of the overlap order parameters. However, these objects are $n \times n$ matrices or $n$-dimensional vectors and in principle we have to average over all their possible realisations. Fortunately, the integrand is exponential in $N$, so in the thermodynamic limit $N \to \infty$ the integrals are dominated by the extremisers of the action and can be approximated with the saddle-point method. Still, we need a guess for how to parametrise these order parameters. The simplest possible ansatz, which turns out to be the correct one in convex problems such as the one at hand, is the so-called Replica Symmetric (RS) ansatz, given by: We also perform a Wick rotation $-i\hat{Q}^{ac,bd} \to \hat{Q}^{ac,bd}$ in order to deal with real-valued conjugate parameters, and pose a similar ansatz for them. In the next paragraphs we will compute the three terms separately, and finally put them together in the expression for the RS free-entropy.

Interaction term
We start by evaluating the interaction term, or rather its normalised logarithm $g_i = \lim_{\tilde n \to 0} \log G_i / (nN)$: In order to recover the optimisation problems entailed in the curriculum procedure, we now have to consider the zero-temperature limit of this expression. When $\beta \to \infty$, the order parameters follow non-trivial scaling laws: and similarly for the tilde parameters. Intuitively, looking at the last scaling law, we see that as the measure becomes concentrated on the single minimiser of the loss, the overlap $q$ between different replicas rapidly converges to the norm $Q$. Moreover, the scaling of the conjugate parameters with the inverse temperature prevents the interaction term from becoming sub-dominant at the saddle point. If we substitute the rescaled parameters in the above expression we obtain:

Entropic term
We can now compute a similar quantity for the entropic potential, $g_S = \lim_{n \to 0} \partial_n \log G_S(R, \hat R, Q, \hat Q)$. The general expression we will obtain can be specialised to the two cases $(R, \hat R, Q_r, \hat Q_r)$ and $(0, 0, Q_i, \hat Q_i)$ appearing in the free-entropy. After substituting the RS ansatz we find: In the zero-temperature limit, we consider the same rescaling of the order parameters described above. The integrals over the weights become an extremum operation: where: and where: Finally, the Dz Dz integrations can also be carried out, giving: So, specialising to the two terms that appear in the free-entropy we get:

Energetic term
Since one of the two energetic terms appearing in the replicated free-energy depends on the ñ replicas of the first weight configuration, and there is no interaction, we can take the ñ → 0 limit directly.Therefore we only have to evaluate the other contribution (dependent on the n replicas of the second weight configuration).Defining Q So in the β → ∞ limit, with the proper rescalings, we get: where:

RS Free-entropy
Finally, assuming the we can write down the RS free-entropy for the curriculum ordering as: where $g_S$ is defined in equation (B.21) and $g_E$ is defined in equation (B.25). The order parameters for the teacher system are obtained independently from identical equations, after substituting $\lambda_1 \to 0$, $\lambda_2 \to \lambda_1$, $\lambda_{12} \to 0$, $\alpha_2 \to \alpha_1$ and $\Delta_2 \to \Delta_1$, and after adding a tilde to the remaining parameters.
The saddle-point equations, yielding at convergence the asymptotic prediction for the order parameters, can be found by posing stationarity conditions for the free-entropy with respect to all overlaps.
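Stationarity conditions of this kind are typically solved numerically by iterating the equations until they stop changing. The sketch below shows a generic damped fixed-point iteration on a scalar toy map; the function name, the damping value and the toy update are assumptions chosen for illustration, and the actual saddle-point equations for the order parameters are not reproduced here.

```python
import math

def damped_fixed_point(update, x0, damping=0.5, tol=1e-10, max_iter=10_000):
    """Iterate x <- (1 - damping) * update(x) + damping * x until the step is below tol."""
    x = x0
    for _ in range(max_iter):
        x_new = (1.0 - damping) * update(x) + damping * x
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")

# Toy stand-in for a saddle-point equation: solve x = cos(x).
x_star = damped_fixed_point(math.cos, x0=1.0)
```

Damping trades speed for stability: mixing the new iterate with the previous one often helps when the undamped map oscillates.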
Note that, if instead of the simple setting just considered, where the data slice in the second stage has homogeneous variance for the irrelevant components, there are multiple subsets with different sizes and variances, the only variation in the free-entropy is in the energetic contribution.In general one will have a sum: over each of these subsets.
Moreover, if instead of two stages we consider multiple learning stages, the free-entropy for each successive step has an identical form, and one only has to substitute the tilde parameters with the order parameters obtained at the previous step.Note that the simplicity of nesting stages in this problem is connected to the convexity of this learning setting.Generally, adding more steps would increase the complexity of the calculation considerably.

Generalisation error
With the saddle-point values for the order parameters, one can easily evaluate the generalisation error on new datapoints, which is the measure of performance we employ in the main text. This performance can be obtained as: where $\Delta$ is the variance of the irrelevant components for the new pattern. A shortcut for evaluating this expression is to insert the order parameters in the expression through Dirac's δ-functions. After a straightforward calculation, along the same lines as the one presented above, one obtains: Of course, the generalisation accuracy is just the complementary quantity $1 - \epsilon_g$.
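As a sanity check on this kind of expression, one can compare a Monte Carlo estimate of the sign-disagreement rate with the textbook result that two zero-mean jointly Gaussian variables with correlation $c$ have opposite signs with probability $\arccos(c)/\pi$. The sketch below assumes unit-variance relevant inputs, irrelevant inputs of variance $\Delta$, a teacher that is zero on the irrelevant components, and a sign read-out for the student; the function names are hypothetical and the normalisation need not match the order parameters used in this appendix.

```python
import numpy as np

def generalisation_error_formula(w_teacher, w_student, n_relevant, delta):
    """arccos(correlation) / pi for zero-mean, jointly Gaussian pre-activations.

    Assumes the teacher weights vanish on the irrelevant components.
    """
    rel = slice(0, n_relevant)
    irr = slice(n_relevant, None)
    cov = np.dot(w_teacher[rel], w_student[rel])
    var_teacher = np.dot(w_teacher[rel], w_teacher[rel])
    var_student = (np.dot(w_student[rel], w_student[rel])
                   + delta * np.dot(w_student[irr], w_student[irr]))
    corr = np.clip(cov / np.sqrt(var_teacher * var_student), -1.0, 1.0)
    return np.arccos(corr) / np.pi

def generalisation_error_monte_carlo(w_teacher, w_student, n_relevant, delta,
                                     n_samples, rng):
    """Fraction of fresh inputs on which teacher and student signs disagree."""
    x = rng.standard_normal((n_samples, len(w_teacher)))
    x[:, n_relevant:] *= np.sqrt(delta)  # irrelevant components have variance delta
    return np.mean(np.sign(x @ w_teacher) != np.sign(x @ w_student))
```

The two estimates should agree up to sampling noise, which shrinks as the inverse square root of `n_samples`.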

C Additional results on sparsity
We complement the discussion on the importance of sparsity in Sec. 4 with a comparison with other learning protocols. Observe that anti-curriculum suffers from the same issue as the curriculum method for sufficiently large fractions of relevant features ρ. In that regime, the splitting becomes sub-optimal because the solution found in one phase of the splitting does not provide enough information to help the other phase of learning. Consequently, the network is forced to neglect the information in the batch in favour of exploring solutions further away from that one. This is outperformed by standard learning, where all the bits of information are used.

D Simulations on CIFAR10
Task design. Because a sparse set of relevant features is crucial to observing curriculum effects in our model, we created a task based on real data that has this property. In particular, we create 32 × 64 pixel input examples by concatenating two images side-by-side from the CIFAR10 dataset. The correct output label is given by the label of the image on the left, while the image on the right is an irrelevant distractor. To vary difficulty, we scale the contrast of the irrelevant image. This dataset is meant to instantiate a simple example of learning an object classification amidst clutter. We emphasise that, as in our synthetic data model, each training sample always contains the same relevant and distractor images (i.e., we are not considering a data augmentation setting where each relevant image appears with many non-relevant images). To ensure no cross-contamination of training and testing samples, the distractor images for the training set are drawn only from the training set, and those for the test set only from the test set.
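The construction of one input can be sketched as follows. The helper name `make_cluttered_example` and the specific form of contrast scaling (shrinking the distractor towards its own mean intensity) are assumptions for illustration; the text above only states that the contrast of the irrelevant image is scaled.

```python
import numpy as np

def make_cluttered_example(relevant, distractor, contrast):
    """Place a relevant image on the left and a contrast-scaled distractor on the right.

    Both inputs are float arrays of shape (32, 32, 3); the output has shape (32, 64, 3).
    """
    mean = distractor.mean()
    scaled = mean + contrast * (distractor - mean)  # assumed form of contrast scaling
    return np.concatenate([relevant, scaled], axis=1)  # concatenate along the width
```

With `contrast = 0` the distractor becomes a uniform patch (an easy example), while `contrast = 1` leaves it unchanged (a hard example).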
Model architecture and training regime. We train a single-layer network with cross-entropy loss (i.e. softmax regression), implemented in PyTorch Lightning by modifying the MIT-licensed PyTorch_CIFAR10 repository (https://zenodo.org/record/4431043#.YLmz6zZKhsA) to ensure that training parameters accord with standard practice. Networks were trained with SGD and Nesterov momentum, under default parameters: a learning rate of 1e-2, momentum parameter 0.9, batch size 256, and 100 epochs. The learning rate was annealed according to the 'WarmUpCosine' schedule used in PyTorch_CIFAR10, which linearly increases the learning rate over the first 30% of training steps before switching to a cosine-shaped schedule on the remainder.
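A schedule of this warm-up-then-cosine type can be sketched in a few lines. The version below assumes that the warm-up ramps the learning rate up linearly to its base value and that the cosine phase decays it towards zero; it is an illustrative stand-in, not the 'WarmUpCosine' implementation from PyTorch_CIFAR10, and the function and argument names are hypothetical.

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr=1e-2, warmup_fraction=0.3):
    """Learning rate at a given step: linear warm-up, then cosine decay."""
    warmup_steps = int(warmup_fraction * total_steps)
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, with `total_steps = 100` the rate rises during steps 0 to 29, equals `base_lr` at step 30, and then decreases monotonically.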
Experiment details and hyperparameter optimisation. For the first phase of training, we used dataset sizes in 10 equal steps between 1000 and 50000. For the second phase, we used nine dataset sizes in equal steps between 5333 and 48000. We optimised hyperparameters in each phase separately. In the first phase, we evaluated all combinations of initialisation scales {0, 0.2, 0.5, 1}, weight decay parameters {0, 0.2, 0.5, 1, 2}, and curriculum policy, for five random seeds. In the second phase, for each random seed and curriculum condition, we continued training from the best-performing model obtained in the first phase. We trained all combinations of five elastic penalties log-spaced between 1e-3 and 1e2, and weight decay parameters {0, 0.2, 0.5}. We then compute the best-performing model for each seed and take the mean over seeds. Finally, to evaluate the no-curriculum performance, we train shuffled-dataset models with initialisation scales {0, 0.2, 0.5, 1} and weight decay parameters {0, 0.2, 0.5}. For visualisation purposes, we used nearest-neighbour interpolation in the phase portrait to provide values for all points used in the synthetic experiments. Experiments were run on V100 GPUs and required approximately 10000 GPU hours (including debugging and development), or ≈ 1110 kg CO2eq according to the Machine Learning Impact calculator of Lacoste et al., 2019.
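The second-phase search and the "best per seed, then mean over seeds" selection can be sketched as below. The variable names and the helper `mean_of_best` are hypothetical, and the scores passed to it would come from actual training runs, which are not reproduced here.

```python
import itertools
import numpy as np

# Five elastic penalties, log-spaced between 1e-3 and 1e2.
elastic_penalties = np.logspace(-3, 2, num=5)
weight_decays = [0.0, 0.2, 0.5]

# Every (elastic penalty, weight decay) combination tried in the second phase.
second_phase_grid = list(itertools.product(elastic_penalties, weight_decays))

def mean_of_best(scores_by_seed):
    """For each seed take the best score over the grid, then average over seeds."""
    return float(np.mean([max(scores) for scores in scores_by_seed]))
```

This yields 15 configurations per seed and curriculum condition; selecting the maximum before averaging means the reported number reflects optimally tuned models rather than a single fixed setting.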

E Speed-up theory vs simulations
As remarked in the main text, one advantage of the theoretical analysis is a large speed-up in the time needed to collect results, since no averaging is required to reduce fluctuations. In this section, we briefly report a comparison between the time required to produce the theoretical and the simulated curves shown in the main text.

Figure 1: Teacher-student setting for curriculum learning. (a) Illustration of teacher-student setting in which a "student" network is trained from i.i.d. inputs with labels from a "teacher" network. Since the teacher network is sparse, its output depends only on a subset of relevant input features. (b) We consider curricula which order examples by difficulty, here taken to be the variance in the irrelevant feature dimensions. We refer to increasing, decreasing, and random difficulty order as curriculum, anti-curriculum, and no curriculum, respectively. (c) Example test error on hard examples for the student over training. The switch-point between easy and hard samples lies at α = 1/2. Solid lines show numerical simulations, while dashed lines show theoretical predictions derived in Section 3. For this particular parameter setting, curriculum speeds learning but only modestly improves final performance at α = 1. Parameters: α 1 = 1, α 2 = 1, ∆ 1 = 0, ∆ 2 = 1, $\gamma = 10^{-5}$, η = 3.

Figure 2: Phase diagram of online learning performance gap with optimal parameters. The colour scale shows the ratio of the accuracy on hard instances reached by curriculum over no-curriculum (a), anti-curriculum over no-curriculum (b), and curriculum over anti-curriculum (c), as a function of the total dataset size (α 1 + α 2 ) and easy dataset size (α 1 ). Curriculum broadly benefits performance and anti-curriculum is effective in certain regions, but the size of the improvement is modest. Parameters: ρ = 0.50, ∆ 1 = 0, ∆ 2 = 1.

Figure 3: Performance gap starting from high initialisation norm. The first two panels show the accuracy gap on hard instances between curriculum learning and the baseline (a) and between anti-curriculum learning and the baseline (b). Contrary to the phase diagram in Fig. 2, curriculum learning is not always optimal and anti-curriculum is not always the worst strategy. The right panel shows the accuracy evaluated on the hard samples for α 1 = α 2 = 0.5.

Figure 4: Effect of elastic coupling (Gaussian prior) between curriculum phases. (a) Comparison between the asymptotic performance of curricula (full lines) and single-batch learning, at α 1 = 1, α 2 = 1, with a regularisation γ 1 that yields the best generalisation when learning the entire dataset (in principle not optimal for the other strategies). The points represent the results from 10 numerical simulations at size N = 2000. Parameters: ρ = 0.50, ∆ 1 = 0 and ∆ 2 = 1. (b) Ratio between the accuracy reached by curriculum learning and by anti-curriculum, as a function of the number of easy samples in a dataset of size α 1 + α 2 = 1 and of the sparsity level of the teacher ρ. Note that ρ can also be seen as the fraction of relevant components in the inputs. Parameters: ∆ 1 = 0 and ∆ 2 = 1. γ 1 = γ 2 and γ 12 were set to the values that optimise test performance.

Figure 5: Phase diagram for the performance gap in the batch setting.The colour scale shows the ratio of the accuracy on hard instances for curriculum over no-curriculum (a), anti-curriculum over no-curriculum (b), and curriculum over anti-curriculum (c), as a function of the total dataset size (α 1 + α 2 ) and easy dataset size (α 1 ).In contrast to the online case, performance benefits are greater and curriculum is strictly better than anti-curriculum.Both γ 1 = γ 2 and γ 12 are optimised point-wise, in order to yield the best test accuracy.Parameters: ρ = 0.50, ∆ 1 = 0, ∆ 2 = 1.

Figure 6: Connection with psychology experiments. (a) Top: Accuracy ratio of different strategies in the model, with curriculum/no-curriculum in green and curriculum/anti-curriculum in orange. The ratio shows non-monotonic behaviour. Bottom: The accuracy ratio obtained by [50]. Parameters: ρ = 0.5, ∆ 1 = 0.0, ∆ 2 = 1.0, α 1 = 1, α 2 = 1 and optimal learning rate, norm at initialisation and weight decay intensity. (b) Top: Dependence on the sparsity of the generalisation gain of curriculum over no-curriculum, measured as the ratio between final accuracies, for fixed total dataset size (α 1 + α 2 = 1). Bottom: The ratio obtained from experiments 3 and 4 of [6]. (c) Example cartoon stimuli from the "fading" paradigm used in [6], where participants distinguish daemons of the old world from daemons of the new world. The distinguishing feature (horn length) is diluted among many irrelevant features (colour, eye size, mouth size). Highlighting the relevant feature to participants leads to better and faster learning.

Figure 7: Experimental setting on CIFAR10-derived data. (a) Input samples combine a task-relevant image with a distractor image, and become progressively harder from left to right. (b) Ratio between final accuracy on hard instances for curriculum learning versus no curriculum. η, γ, γ 12 , init, and stopping time are optimised.

Figure A.1: Effect of elastic coupling in the curriculum. Panels showing the teacher-student cosine, the validation loss, and the accuracy of the three learning strategies. The two rows show the performance in the presence (above) and absence (below) of elastic coupling. The dashed lines are obtained from the theoretical analysis; the full lines come from the average of 500 simulations. The parameters η, γ and the initialisation are set to the optimal values for each protocol. Parameters: ρ = 0.5, α 1 = 0.2, α 2 = 0.2, ∆ 1 = 0, ∆ 2 = 1.

Finally, all the expected values are known and we can obtain the analytic updates Eqs. (A.26-A.30) with the coupling. Fig. A.1a shows an instance of the problem at α 1 = 0.2 and α 2 = 0.2, a situation that is particularly adversarial for curriculum according to the phase diagram in Fig. 2. This situation is addressed by the introduction of Gaussian priors, Fig. A.1b, consistent with the phase diagram in Fig. 7c.
where the examples in $D_1$ and $D_2$ are characterised by different variances in the irrelevant components.
• the teacher-student overlap at the end of the second learning phase: $R^a = \frac{1}{N} \sum_{i=1}^{\rho N} W_i^a$
• the norm of the student after the first stage, decomposed into relevant/irrelevant parts: $\tilde{Q}_r^{cd} = \sum_{i=1}^{\rho N} \ldots$

Figure C.1: Effect of sparsity. Phase diagram on the effect of sparsity, Fig. 4b, extended for all learning protocols.