The dynamics of representation learning in shallow, non-linear autoencoders

Autoencoders are among the simplest neural networks for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations, a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders on realistic datasets such as CIFAR10.


Introduction
Autoencoders are a class of neural networks trained to reconstruct their inputs by minimising the distance between an input x ∈ X and the network output x̂ ∈ X. The key idea for learning good features with autoencoders is to make the intermediate layer in the middle of the network (significantly) smaller than the input dimension. This "bottleneck" forces the network to develop a compressed representation of its inputs, together with an encoder and a decoder. Autoencoders of this type are called undercomplete, see fig. 1, and provide an ideal framework to study feature learning, which appears to be a key property of deep learning, yet remains poorly understood from a theoretical point of view.
How well can shallow autoencoders with a single hidden layer of neurons perform on a reconstruction task? For linear autoencoders, Eckart & Young (1936) established that the optimal reconstruction error is obtained by a network whose weights are given by (a rotation of) the K leading principal components of the data, i.e. the K leading eigenvectors of the input covariance matrix. The reconstruction error of such a network is therefore called the PCA error. How linear autoencoders learn these components when trained using stochastic gradient descent (SGD) was first studied by Bourlard & Kamp (1988) and Baldi & Hornik (1989). A series of recent works analysed linear autoencoders in more detail, focusing on the loss landscape (Kunin et al., 2019), the convergence and implicit bias of SGD dynamics (Pretorius et al., 2018; Saxe et al., 2019; Nguyen et al., 2019), and on recovering the exact principal components (Gidel et al., 2019; Bao et al., 2020; Oftadeh et al., 2020).
Adding a non-linearity to autoencoders is a key step towards more realistic models of neural networks, which crucially rely on non-linearities in their hidden layers to extract good features from data. On the other hand, non-linearities pose a challenge for theoretical analysis. In the context of high-dimensional models of learning, an additional complication arises from analysing non-linear models trained on inputs with non-trivial correlations, which are a fundamental prerequisite for unsupervised learning. The first step in this direction was taken by Nguyen (2021), who analysed overcomplete autoencoders where the number of hidden neurons grows polynomially with the input dimension, focusing on the special case of weight-tied autoencoders, where encoder and decoder have the same weights.
Here, we leverage recent developments in the analysis of supervised learning on complex input distributions to study the learning dynamics of undercomplete, non-linear autoencoders with both tied and untied weights. While their reconstruction error is also bounded from below by the PCA reconstruction error (Bourlard & Kamp, 1988; Baldi & Hornik, 1989), we will see that their learning dynamics are rich and raise a series of questions: can non-linear autoencoders trained with SGD reach the PCA error, and if so, how do they do it? How do the representations they learn depend on the architecture of the network, or on the learning rule used to train them?
We describe our setup in detail in section 2. Our main contributions can be summarised as follows:
1. We derive a set of asymptotically exact equations that describe the generalisation dynamics of shallow, non-linear autoencoders trained using online stochastic gradient descent (SGD) (section 3.1);
2. Using these equations, we show how autoencoders learn important features of their data sequentially (section 3.2); we analyse the long-time dynamics of learning and highlight the importance of untying the weights of decoder and encoder (section 3.3); and we show that training the biases is necessary for ReLU autoencoders (section 3.4);
3. We illustrate how a modification of vanilla SGD breaking the rotational symmetry of neurons yields the exact principal components of the data (section 3.5);
4. We finally show that the equations also capture the learning dynamics of non-linear autoencoders trained on more realistic datasets such as CIFAR10 with great accuracy, and discuss connections with recent results on Gaussian universality in neural network models in section 4.
Reproducibility We provide code to solve the dynamical equations of section 3.1 and to reproduce our plots at https://github.com/mariaref/NonLinearShallowAE.

Setup
Architecture We study the performance and learning dynamics of shallow non-linear autoencoders with K neurons in the hidden layer. Given a D-dimensional input x = (x_i), the output of the autoencoder is given by

x̂_i = Σ_k v_i^k g(w^k · x / √D),   (1)

where w^k, v^k ∈ R^D are the encoder and decoder weights of the kth hidden neuron, resp., and g(·) is a non-linear function. We study this model in the thermodynamic limit, where we let the input dimension D → ∞ while keeping the number of hidden neurons K finite, as shown in fig. 1(a).

[Figure 1. (a) A shallow autoencoder with K hidden neurons. (b) Spectrum of the input covariance for data drawn from eq. (3), which can be divided into a bulk and a finite number of outliers. Bottom: example inputs drawn from CIFAR10 (Krizhevsky et al., 2009), a benchmark data set we use for our experiments with realistic data in section 4.]
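As a concrete reference, the forward pass just described can be sketched in a few lines (a minimal illustration; the function name and the tanh default are our choices, with the 1/√D scaling of the pre-activations taken from the setup above):

```python
import numpy as np

def forward(x, W, V, g=np.tanh):
    """Shallow autoencoder: x_hat_i = sum_k v^k_i g(w^k . x / sqrt(D)).

    W: (K, D) encoder weights, V: (K, D) decoder weights."""
    D = x.shape[0]
    lam = W @ x / np.sqrt(D)   # local fields lambda^k, kept O(1) by the 1/sqrt(D) scaling
    return V.T @ g(lam)        # reconstruction x_hat, shape (D,)
```

With g the identity and suitably matched encoder and decoder weights, the map reduces to a rank-K linear projection, which is the setting of the linear results cited above.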
The performance of a given autoencoder is measured by the population reconstruction mean-squared error,

pmse ≡ 1/D Σ_i E (x_i − x̂_i)²,   (2)

where the expectation E is taken with respect to the data distribution. Nguyen (2021) analysed shallow autoencoders in a complementary mean-field limit, where the number of hidden neurons K grows polynomially with the input dimension. Here, by focusing on the case K < D, we study the case where the hidden layer is a bottleneck. Nguyen (2021) considers tied autoencoders, where a single set of weights is shared between encoder and decoder, w^k = v^k. We will see in section 3.3 that untying the weights is critical to achieve a non-trivial reconstruction error in the regime K < D.

Data model
We derive our theoretical results for inputs drawn from a spiked Wishart model (Wishart, 1928; Potters & Bouchaud, 2020), which has been widely studied in statistics to analyse the performance of unsupervised learning algorithms. In this model, inputs are sampled according to

x^µ = A c^µ + σ ξ^µ,   (3)

where µ is an index that runs over the inputs in the training set, ξ^µ is a D-dimensional vector whose elements are drawn i.i.d. from a standard Gaussian distribution, and σ > 0.
The matrix A ∈ R D×M has elements of order O(1) and is fixed for all inputs, while we sample c µ ∈ R M from some distribution.Different choices for A and the distribution of c allow modelling of Gaussian mixtures, sparse codes, and non-negative sparse coding.
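A minimal sampler for this generative model (the function name is ours; we take orthogonal spike directions with entries of order one, as assumed for A):

```python
import numpy as np

def sample_inputs(P, A, rho_tilde, sigma=1.0, rng=None):
    """Draw P inputs x = A c + sigma * xi from the spiked Wishart model.

    A: (D, M) fixed matrix; c_m ~ N(0, rho_tilde_m); xi_i ~ N(0, 1) i.i.d."""
    rng = rng if rng is not None else np.random.default_rng(0)
    c = rng.normal(size=(P, A.shape[1])) * np.sqrt(rho_tilde)  # latent coefficients
    xi = rng.normal(size=(P, A.shape[0]))                      # isotropic Gaussian noise
    return c @ A.T + sigma * xi
```

If the columns of A are orthogonal with squared norm D, the variance of x along the mth column is σ² + D ρ̃_m, i.e. the spikes are O(D), as required below.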

Spectral properties of the inputs
The spectrum of the covariance of the inputs is determined by A and c. It governs the dynamics and the performance of autoencoders (by construction, we have E x_i = 0). If we set c^µ = 0 in eq. (3), inputs would be i.i.d. Gaussian vectors with an average covariance of E x_i x_j = σ² δ_ij, where δ_ij is the Kronecker delta. The empirical covariance matrix of a finite data set with P ∼ O(D) samples has a Marchenko-Pastur distribution with a scale controlled by σ (Marchenko & Pastur, 1967; Potters & Bouchaud, 2020). By sampling the mth element of c as c_m ∼ N(0, ρ̃_m), we ensure that the input-input covariance has M eigenvalues ρ_m = D ρ̃_m with the columns of A as the corresponding eigenvectors. The remaining D − M eigenvalues of the empirical covariance, i.e. the bulk, still follow the Marchenko-Pastur distribution.
We further ensure that the reconstruction task for the autoencoder is well-posed by letting the outlier eigenvalues scale as ρ_m ∼ O(D) (or ρ̃_m ∼ O(1)), resulting in spectra such as the one shown in fig. 1(b). If the largest eigenvalues were of order one, an autoencoder with a finite number of neurons K could not obtain a pmse better than random guessing as the input dimension D → ∞.
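The bulk-plus-outliers structure of the spectrum can be verified numerically; a minimal sketch (all constants are our choices) that samples from the model and inspects the empirical covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, P, sigma = 200, 2, 1000, 1.0
rho_tilde = np.array([1.0, 0.5])

# Orthogonal spike directions with squared norm D, so entries are O(1).
Q, _ = np.linalg.qr(rng.normal(size=(D, M)))
A = np.sqrt(D) * Q

c = rng.normal(size=(P, M)) * np.sqrt(rho_tilde)
X = c @ A.T + sigma * rng.normal(size=(P, D))   # inputs x = A c + sigma * xi

eigs = np.linalg.eigvalsh(X.T @ X / P)          # empirical covariance spectrum
# The M = 2 largest eigenvalues sit near sigma^2 + D * rho_tilde, i.e. 201 and 101,
# far above the Marchenko-Pastur bulk, whose edge is near sigma^2 (1 + sqrt(D/P))^2 ~ 2.1.
```
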
Training We train the autoencoder on the quadratic error in the one-pass (or online) limit of stochastic gradient descent (SGD), where a previously unseen input x is presented to the network at each step of SGD. This limit is extensively used to analyse the dynamics of non-linear networks for both supervised and unsupervised learning, and it has been shown experimentally to capture the dynamics of deep networks trained with data augmentation well (Nakkiran et al., 2021). The SGD weight increments for the network parameters are then given by

dv^k = η_V [ g(λ^k) ∆ − κ v^k ],   dw^k = η_W [ g′(λ^k) (v^k · ∆) x/√D − κ w^k ],   (4)

where ∆_i ≡ (x_i − x̂_i) and κ ≥ 0 is the ℓ2 regularisation constant. In order to obtain a well-defined limit as D → ∞, we rescale the learning rates as η_W = η/D, η_V = η for some fixed η > 0.
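As an illustration, a minimal implementation of one online SGD step with the rescaled learning rates η_W = η/D, η_V = η (our own transcription of the gradient of the quadratic loss; prefactor conventions may differ slightly from the update above):

```python
import numpy as np

def sgd_step(x, W, V, eta=0.1, kappa=0.0, g=np.tanh, dg=lambda z: 1 - np.tanh(z) ** 2):
    """One online SGD step on L = 0.5 ||x - x_hat||^2 with weight decay kappa."""
    D = x.shape[0]
    lam = W @ x / np.sqrt(D)                 # local fields lambda^k
    delta = x - V.T @ g(lam)                 # reconstruction error Delta
    # Decoder: eta_V = eta; encoder: eta_W = eta / D, with the extra 1/sqrt(D)
    # coming from the chain rule through lambda^k.
    dV = eta * (np.outer(g(lam), delta) - kappa * V)
    dW = (eta / D) * (np.outer(dg(lam) * (V @ delta), x) / np.sqrt(D) - kappa * W)
    return W + dW, V + dV
```

Iterating this step on fresh samples at each iteration is exactly the one-pass limit used throughout the paper.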

Dynamical equations to describe feature learning
The starting point for deriving the dynamical equations is the observation that the pmse defined in eq. (2) depends on the inputs only via their low-dimensional projections on the network's weights,

λ^k ≡ w^k · x / √D,   ν^k ≡ v^k · x / D.   (5)

This allows us to replace the high-dimensional expectation over the inputs in eq. (2) with a K-dimensional expectation over the local fields λ = (λ^k) and ν = (ν^k). For Gaussian inputs drawn from (3), (λ, ν) are jointly Gaussian, and their distribution is entirely characterised by their second moments,

Q_1^{kℓ} ≡ E λ^k λ^ℓ = w^k Ω w^ℓ / D,   R_1^{kℓ} ≡ E λ^k ν^ℓ = w^k Ω v^ℓ / D^{3/2},   T_1^{kℓ} ≡ E ν^k ν^ℓ = v^k Ω v^ℓ / D²,

where Ω_ij = E x_i x_j is the covariance of the inputs. These macroscopic overlaps, often referred to as order parameters in the statistical physics of learning, together with the decoder-decoder overlap v^k · v^ℓ / D, are sufficient to evaluate the pmse. To analyse the dynamics of learning, the goal is then to derive a closed set of differential equations which describe how the order parameters evolve as the network is trained using SGD (4). Solving these equations then yields the performance of the network at all times. Note that feature learning requires that the weights of the encoder and the decoder move far away from their initial values. Hence, we cannot resort to the framework of the neural tangent kernel or lazy learning (Jacot et al., 2018; Chizat et al., 2019).
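The second moments characterising (λ, ν) can be computed directly from the weights and the input covariance Ω; a small helper (names ours, with Q, R, T the λ-λ, λ-ν and ν-ν moments) makes the different powers of D explicit:

```python
import numpy as np

def order_parameters(W, V, Omega):
    """Second moments of the local fields lambda = W x / sqrt(D) and nu = V x / D.

    Returns Q = E[lam lam^T], R = E[lam nu^T], T = E[nu nu^T]; the different
    powers of D are inherited from the scalings of lambda and nu."""
    D = Omega.shape[0]
    Q = W @ Omega @ W.T / D
    R = W @ Omega @ V.T / D ** 1.5
    T = V @ Omega @ V.T / D ** 2
    return Q, R, T
```

A Monte Carlo estimate of the same moments from Gaussian samples x ~ N(0, Ω) matches these expressions, which is the content of the Gaussian closure used below.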
Instead, here we build on an approach pioneered by Saad & Solla (1995a;b) and Riegler & Biehl (1995) to analyse supervised learning in two-layer networks with uncorrelated inputs; Saad (2009) provides a summary of the ensuing work. We face two new challenges when analysing autoencoders in this setup. First, autoencoders cannot be formulated in the usual teacher-student setup, where a student network is trained on inputs whose label is provided by a fixed teacher network, because it is impossible to choose a teacher autoencoder which exactly implements the identity function x → x with K < D. Likewise, a teacher autoencoder with random weights, the typical choice in supervised learning, is not an option, since such an autoencoder does not implement the identity function either. Instead, we bypass introducing a teacher and develop a description starting from the properties of the data. The second challenge is then to handle the non-trivial correlations in the inputs (3), where we build on recent advances in supervised learning on structured data (Yoshida & Okada, 2019; Goldt et al., 2020; 2022; Refinetti et al., 2021).

DERIVATION: A SKETCH
Here, we sketch the derivation of the equation of motion for the order parameter T_1; a complete derivation for T_1 and all other order parameters is given in appendix B. We define the eigen-decomposition of the covariance matrix as Ω_rs = 1/D Σ_{τ=1}^D Γ_rτ Γ_sτ ρ_τ, and the rotation of any vector z ∈ {w^k, v^k, x} onto this basis as z_τ ≡ 1/√D Σ_s Γ_τs z_s. The eigenvectors are normalised as Σ_i Γ_τi Γ_τ'i = D δ_ττ' and Σ_τ Γ_τi Γ_τj = D δ_ij. Using this decomposition, we can rewrite T_1^{kℓ} and its update at step µ in terms of the rotated decoder weights v_τ^k. We then introduce the order-parameter density t^{kℓ}(ρ, s), which depends on the eigenvalue ρ and on the normalised number of steps s = µ/D, which we interpret as a continuous time variable in the limit D → ∞:

t^{kℓ}(ρ, s) ≡ lim_{ρ̄→0} 1/(D ρ̄) Σ_τ v_τ^k v_τ^ℓ 1(ρ_τ ∈ [ρ, ρ + ρ̄]),

where 1(·) is the indicator function and the limit ρ̄ → 0 is taken after the thermodynamic limit D → ∞. Similarly, r(ρ, s) and q(ρ, s) are introduced as the order-parameter densities corresponding to R_1 and Q_1, respectively. Now, inserting the update for v (4) into the expression for t^{kℓ} and evaluating the expectation over a fresh sample x, we obtain the dynamical equation (10) for t^{kℓ}(ρ, s), where we write k ↔ ℓ to denote the same term with the indices k and ℓ permuted. The order parameter is recovered by integrating the density over the spectral density µ_Ω of the covariance matrix, T_1^{kℓ} = ∫ dµ_Ω(ρ) ρ/D t^{kℓ}(ρ, s). We can finally close the equations by evaluating the remaining averages, such as E λ^ℓ g(λ^k). For some activation functions, such as the ReLU or the sigmoidal activation g(x) = erf(x/√2), these integrals have an analytical closed form. In any case, the averages only depend on the second moments of the Gaussian random variables (λ, ν), which are given by the order parameters, allowing us to close the equations on the order parameters.
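For the sigmoidal activation g(x) = erf(x/√2), the averages needed to close the equations have classical closed forms (they go back to the analysis of soft committee machines); a sketch of the two basic ones (function names ours):

```python
import numpy as np

def avg_gg(c11, c12, c22):
    """E[g(u) g(v)] for g(x) = erf(x / sqrt(2)) and zero-mean jointly Gaussian
    (u, v) with covariance [[c11, c12], [c12, c22]]."""
    return 2 / np.pi * np.arcsin(c12 / np.sqrt((1 + c11) * (1 + c22)))

def avg_ug(c12, c22):
    """E[u g(v)] for the same Gaussian pair; follows from Stein's lemma."""
    return np.sqrt(2 / np.pi) * c12 / np.sqrt(1 + c22)
```

Both depend only on the second moments of the pair, which is why the equations close on the order parameters.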
We derived these equations using heuristic methods from statistical physics. While we conjecture that they are exact, we leave it to future work to prove their asymptotic correctness in the limit D → ∞. We anticipate that techniques used to prove the correctness of dynamical equations for supervised learning (Goldt et al., 2019; Veiga et al., 2022) should extend to this case after controlling for the spectral dependence of the order parameters.

SEPARATION BETWEEN BULK MODES AND PRINCIPAL COMPONENTS
We can exploit the separation of bulk and outlier eigenvalues to decompose the integral (9) into an integral over the bulk of the spectrum and a sum over the outliers. This results in a decomposition of the overlap as T_1 = T_bulk + Σ_{m=1}^M ρ_m/D t_m, with t_m ≡ t(ρ_m, s), and likewise for the other order parameters. Leveraging the fact that the bulk eigenvalues are of order O(1) while we take the D → ∞ limit, the equation of motion for T_bulk simplifies considerably. On the other hand, the M outlier eigenvalues, i.e. the spikes, are of order O(D); we thus need to keep track of all the terms in their equations of motion, which, similarly to eq. (10), remain of the same form as the full density equations. Note that the evolution of {t_m}_{m=1,...,M} is coupled to that of T_bulk only through the expectations of the local fields. A similar derivation, which we defer to appendix B.3, can be carried out for R_1 and Q_1.

EMPIRICAL VERIFICATION
In fig. 2, we plot the pmse of an autoencoder with K = 3 hidden neurons during training on a dataset drawn from eq. (3) with M = 3. We show the pmse both as obtained from training the network using SGD (4) (crosses) and as obtained from integrating the equations of motion that we just derived (lines). The theoretical description in terms of bulk and spike modes captures the training dynamics well, even at an intermediate input dimension of D = 1000. In the following, we analyse these equations, and hence the behaviour of the autoencoder during training, in more detail.

Nonlinear autoencoders learn principal components sequentially
The dynamical equation of the encoder-encoder overlap density q^{kℓ}, eq. (B.28), takes a form in which every term is rescaled by the corresponding eigenvalue ρ. The learning rate of each mode of q^{kℓ} is thus effectively rescaled by ρ, suggesting that the principal components corresponding to the different modes are learnt at different speeds, with the modes corresponding to the largest eigenvalues being learnt the fastest. For the other order-parameter densities r^{kℓ} and t^{kℓ}, all the terms in the equations of motion, eq. (B.24) and eq. (B.29), are rescaled by ρ except for one. Indeed, the pmse of the sigmoidal autoencoder shown in fig. 2 goes through several sudden decreases, preceded by plateaus with quasi-stationary pmse. Each transition is related to the network "picking up" an additional principal component. By comparing the error of the network on the plateaus to the PCA reconstruction error with an increasing number of principal components (solid horizontal lines in fig. 2), we confirm that the plateaus correspond to stationary points where the network has learnt (a rotation of) the first leading principal components. Whether or not the plateaus are clearly visible in the curves depends on the separation between the eigenvalues of the leading eigenvectors.
This sequential learning of principal components has also been observed in several other models of learning; it appears, for instance, in unsupervised learning rules such as Sanger's rule (Sanger, 1989). The evolution of the bulk order parameters is simpler: in the early stages of training, to a first approximation, the bulk components of R_1 and T_1 follow an exponential decay towards 0. The characteristic time of this decay is given by the expectation E g(λ^k)g(λ^ℓ)|_{s=0}, which only depends on Q_1|_{s=0}. In addition, the equations of motion imply that the evolution of Q_1 is entirely determined by the dynamics of the spike modes.

The long-time dynamics of learning: align, then rescale
At long training times, we expect a sigmoidal autoencoder with untied weights to have retrieved the leading PCA subspace, since it achieves the PCA reconstruction error. This motivates an ansatz for the dynamical equations in which the network weights are proportional to the eigenvectors Γ_k of the covariance,

w^k = α_w^k Γ_k,   v^k = α_v^k Γ_k,   (19)

where α_w, α_v ∈ R^K are scaling constants that evolve in time. With this ansatz, the order parameters are diagonal, and we are left with a reduced set of 2K equations describing the dynamics of α_w^k and α_v^k, which are given in eq. (C.8) together with their detailed derivation.
We first verify the validity of the reduced equations (C.8) in describing the long-time dynamics of sigmoidal autoencoders. In fig. 3(a), we show the generalisation dynamics of a sigmoidal autoencoder starting from random initial conditions (blue). We see two phases of learning: after an exponential decay of the pmse up to time t ∼ 10, the pmse decays as a power law. This latter decay is well captured by the evolution of the pmse of a model initialised with weights aligned to the leading PCs. We deduce that during the exponential decay of the error, the weights align to the leading PCA directions, and they stay aligned during the power-law decay of the error.
After having recovered the suitable subspace, the network adjusts the norm of its weights to decrease the error. A look at the dynamics of the scale parameters α_w and α_v during this second phase of learning reveals that the encoder's weights shrink, while the decoder's weights grow, cf. fig. 3(b). Together, these changes lead to the power-law decay of the pmse. Note that we chose a dataset with only one leading eigenvalue so as to guarantee that the AE recovers the leading principal component, rather than a rotation of it. A scaling analysis of the evolution of the scaling constants α_w^k and α_v^k shows that the weights decay, respectively grow, as a power law with time, α_w^k ∝ 1/√ρ (ηt)^{−δ} and α_v^k ∝ 1/√ρ (ηt)^{δ}, see fig. 3(c,d). We can understand this behaviour by recalling that g(x) = erf(x/√2) is linear around the origin, where g(x) ≈ √(2/π) x. By shrinking the encoder's weights, the autoencoder thus performs a linear transformation of its inputs, despite the non-linearity. It recovers the linear performance of a network given by PCA if the decoder's weights grow correspondingly, so that the reconstructed input x̂ has the same norm as the input x.

[Figure 3. (a) pmse (2) of a sigmoidal autoencoder with a single hidden neuron starting from random initial conditions (blue). The error first decays exponentially, then as a power law. During this second phase of learning, the pmse is identical to that of an autoencoder whose weights are proportional to the leading eigenvector Γ of the input covariance at all times: w(t) ∝ α_w(t)Γ, v(t) ∝ α_v(t)Γ, cf. eq. (19). (b) Dynamics of the scaling constants α_w and α_v obtained from simulations of an autoencoder starting from random weights (crosses) and from integration of the reduced set of dynamical equations, eq. (C.8). (c, d) The norms of the encoder (left) and decoder (right) weights shrink, respectively grow, as a power law √ρ(ηt)^{±δ} with exponent δ ≈ 1/6. This allows the sigmoidal network to remain in the linear region of its activation function and to achieve PCA performance. Solid lines are obtained from integration of the reduced set of dynamical equations, eq. (C.8); crosses are from simulations for various combinations of learning rate and leading eigenvalue ρ (different colours).]
We can find a relation between the scaling constants α_v^k and α_w^k at long times by inserting the ansatz eq. (19) into the expression of the pmse, which splits into a sum of mode-wise terms f(α_w^k, α_v^k) plus a second term that is simply the rank-K PCA error.
The first term should be minimised to achieve the PCA error, i.e. f(α_w^k, α_v^{k*}) = 0 for all k. For a linear autoencoder, we find α_v^{k*} = 1/α_w^k: as expected, any rescaling of the encoder's weights needs to be compensated by the decoder. For a sigmoidal autoencoder, the optimal rescaling instead involves the non-linearity; at small α_w^k we recover the linear scaling α_v^{k*} ≈ √(π/2)/α_w^k, since erf(x/√2) ≈ √(2/π) x around the origin.

The importance of untying the weights for sigmoidal autoencoders The need to let the decoder weights grow while the encoder weights shrink to achieve the PCA error makes learning impossible in sigmoidal autoencoders with tied weights, where v^k = w^k. Such an autoencoder evidently cannot perform the rescaling required to enter the linear regime of its activation function, and is therefore not able to achieve the PCA error. Even worse, the learning dynamics shown in fig. 4(a) show that a sigmoidal autoencoder with tied weights (blue) hardly achieves a reconstruction error better than chance, in stark contrast to the same network with untied weights (red).
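The optimal decoder rescaling for the sigmoidal case can be made explicit: minimising the quadratic f in α_v gives α_v* = E[λ g(α_w λ)] / E[g(α_w λ)²], which can be evaluated with the standard closed-form erf averages (a numerical sketch; the function name is ours):

```python
import numpy as np

def alpha_v_star(alpha_w, rho):
    """Optimal decoder scale minimising E[(lam - a_v * erf(a_w * lam / sqrt(2)))^2]
    for lam ~ N(0, rho), using the closed-form Gaussian averages for erf."""
    s = alpha_w ** 2 * rho
    num = np.sqrt(2 / np.pi) * alpha_w * rho / np.sqrt(1 + s)   # E[lam g(a_w lam)]
    den = 2 / np.pi * np.arcsin(s / (1 + s))                    # E[g(a_w lam)^2]
    return num / den
```

For small α_w this reduces to α_v* ≈ √(π/2)/α_w, the linear scaling quoted above.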

The importance of the bias in ReLU autoencoders
Sigmoidal autoencoders achieve the PCA error by exploiting the linear region of their activation function. We can hence expect that in order to successfully train a ReLU AE, which is linear for all positive arguments, it is necessary to add biases at the hidden layer. The error curves for ReLU autoencoders shown in fig. 4 confirm this expectation: networks whose biases are trained develop large, positive biases at the end of training, fig. 4(c). The large biases, however, result in a small residual error that negatively affects the final performance of a ReLU network when compared to the PCA performance, as can be seen from linearising the output of the ReLU autoencoder for large positive biases,

x̂_i ≈ Σ_k v_i^k (λ^k + b^k),

where b^k ∈ R is the bias of the kth neuron.
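The mechanism can be checked directly on the Gaussian local field: the mean squared deviation of a ReLU unit from its linearisation λ + b vanishes rapidly as the bias b grows (a minimal numerical sketch; the constants are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = rng.normal(size=200_000)        # Gaussian local field of a hidden neuron

relu = lambda z: np.maximum(z, 0.0)

# E[(relu(lam + b) - (lam + b))^2]: the unit's deviation from acting linearly.
# It equals E[(lam + b)^2 ; lam < -b], which decays like a Gaussian tail in b.
dev = {b: np.mean((relu(lam + b) - (lam + b)) ** 2) for b in [0.0, 1.0, 4.0]}
```

Already at b = 4 the unit is linear for essentially all inputs, which is how trained ReLU autoencoders approach the PCA solution.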

Breaking the symmetry of SGD yields the exact principal components
Linear autoencoders do not retrieve the exact principal components of the inputs, but only some rotation of them, due to the rotational symmetry of the square loss (Bishop, 2006). While this does not affect performance, it would still be desirable to obtain the leading eigenvectors exactly, in order to identify the important features in the data.
The symmetry between neurons can be broken in a number of ways. In linear autoencoders, previous work considered applying different amounts of regularisation to each neuron (Mianjy et al., 2018; Plaut, 2018; Kunin et al., 2019; Bao et al., 2020) or optimising a modified loss (Oftadeh et al., 2020). In unsupervised learning, Sanger's rule (Sanger, 1989) breaks the symmetry by making the update of the weights of the first neuron independent of all other neurons; the update of the second neuron only depends on the first neuron, etc. This idea was extended to linear autoencoders by Oftadeh et al. (2020) and Bao et al. (2020). We can easily extend this approach to non-linear autoencoders by training the decoder weights following

dv^k = η_V g(λ^k) ∆^k,   ∆_i^k ≡ x_i − Σ_{ℓ≤k} v_i^ℓ g(λ^ℓ),   (23)

where the sum now only runs up to k instead of K. The update for the encoder weights w^k is unchanged from eq. (4), except that the error term ∆ is replaced by ∆^k. We can appreciate the effect of this change by considering a fixed point {W*, V*} of the vanilla SGD updates eq. (4) with dw^{*k} = 0 and dv^{*k} = 0. Multiplying this solution by an orthogonal matrix O ∈ O(K), i.e. {W̃, Ṽ} = {OW*, OV*}, yields another fixed point, since the increments of the rotated decoder weights can be written as in eq. (24), where the underbraced factor OᵀO is the identity, and likewise for the encoder weights. For truncated SGD (23), the underbraced term in eq. (24) is no longer the identity, and the rotated weights {W̃, Ṽ} are no longer a fixed point.
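A minimal sketch of one step of this truncated update (our own transcription; tanh stands in for a generic non-linearity, and the prefactors follow the rescaled learning rates of the setup):

```python
import numpy as np

def truncated_sgd_step(x, W, V, eta=0.1, g=np.tanh, dg=lambda z: 1 - np.tanh(z) ** 2):
    """Sanger-style truncated SGD: neuron k only sees the reconstruction
    built from neurons l <= k, which breaks the rotational symmetry."""
    D = x.shape[0]
    lam = W @ x / np.sqrt(D)
    # partial[k] = sum_{l <= k} v^l g(lambda^l): truncated reconstructions.
    partial = np.cumsum(V * g(lam)[:, None], axis=0)
    delta = x[None, :] - partial                   # truncated errors Delta^k
    newV = V + eta * g(lam)[:, None] * delta
    newW = W + (eta / D) * (dg(lam) * np.einsum('ki,ki->k', V, delta))[:, None] \
               * x[None, :] / np.sqrt(D)
    return newW, newV
```

For K = 1 the truncation does nothing and the step coincides with vanilla SGD; for K > 1, the first neuron's update ignores all the others.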
We show the pmse of a sigmoidal autoencoder trained with vanilla and truncated SGD in fig. 5(a), together with the theoretical prediction from a modified set of dynamical equations. The theory captures both learning curves perfectly, and we can see that using truncated SGD incurs no performance penalty: both autoencoders achieve the PCA reconstruction error. Note that the reconstruction error can increase temporarily while training with the truncated algorithm, since the updates do not minimise the square loss directly. In this experiment, we trained the networks on a synthetic dataset where the eigenvectors are chosen to be sinusoidal (black lines in fig. 5(b)). As desired, the weights of the network trained with truncated SGD converge to the exact principal components, while vanilla SGD only recovers a linear combination of them.
The advantage of recovering the exact principal components is illustrated by the partial reconstructions of a test image from the FashionMNIST database (Xiao et al., 2017). We train autoencoders with K = 64 neurons to reconstruct FashionMNIST images using the vanilla and truncated SGD algorithms until convergence. We show the reconstruction using all the neurons of the networks in the left column of fig. 5(d). Since the networks achieve essentially the same performance, both reconstructions look similar. The partial reconstructions using only the first five neurons of the networks, shown on the right, are very different: while the partial reconstruction from truncated SGD is close to the original image, the reconstruction of the vanilla autoencoder shows traces of pants mixed with those of a sweatshirt.

Representation learning on realistic data under standard conditions
We have derived a series of results on the training of non-linear AEs in the case of synthetic, Gaussian datasets. Most of our simplifying assumptions are not valid on real datasets, which contain only a finite number of samples that are finite-dimensional and not normally distributed. Furthermore, we had to choose certain scalings of the pre-activations (5) and of the learning rates to ensure the existence of a well-defined asymptotic limit D → ∞. Here, we show that the dynamical equations of section 3.1 describe the dynamics of an autoencoder trained on more realistic images rather well, even if we do not perform any explicit rescalings in the simulations.
Realistic data In fig. 6, we show with crosses the pmse of autoencoders of increasing size trained on 10k greyscale images from CIFAR10 (Krizhevsky et al., 2009). The solid lines are obtained from integrating the equations of motion of appendix B, using the empirical covariance of the training images. The accuracy with which the equations describe the learning dynamics at all times implies that the low-dimensional projections λ and ν introduced in eq. (5) are very close to being normally distributed. This is by no means obvious, since the images clearly do not follow a normal distribution, and the weights of the autoencoder are obtained from training on said images, making them correlated with the data. Instead, fig. 6 suggests that, from the point of view of a shallow, sigmoidal autoencoder, the inputs can be replaced by Gaussian vectors with the same mean and covariance without changing the dynamics of the autoencoder. In fig. 8 in the appendix, we show that this behaviour persists for ReLU autoencoders and, as would be expected, for linear autoencoders. Nguyen (2021) observed a similar phenomenon for tied-weight autoencoders with a number of hidden neurons that grows polynomially with the input dimension D. We also verified that several insights developed in the case of Gaussian data carry over to real data. In particular, we demonstrate in fig. 9 that sigmoidal AEs require untied weights to learn, that ReLU networks require adding biases to the encoder, and that the principal components of realistic data are also learnt sequentially.
Relation to the Gaussian equivalence The fact that the learning dynamics of autoencoders trained on CIFAR10 is well captured by an equivalent Gaussian model is an example of the Gaussian equivalence principle, or Gaussian universality, that has received a lot of attention recently in the context of supervised learning with random features (Liao & Couillet, 2018; Seddik et al., 2019; Mei & Montanari, 2021) or one- and two-layer neural networks (Hu & Lu, 2020; Goldt et al., 2020; 2022; Loureiro et al., 2021). These works showed that the performance of these different neural networks is asymptotically well captured by an appropriately chosen Gaussian model for the data. While previous work on Gaussian equivalence in two-layer networks crucially assumes that the number of hidden neurons K is small compared to the input dimension, here we find that the Gaussian equivalence of autoencoders extends to essentially any number of hidden neurons, for example K = D/2, as we show in fig. 8 in the appendix.

Robustness of the dynamical equations to rescalings
The scaling of the local fields in eq. (5), and hence of the order parameters, is necessary to ensure that in the high-dimensional limit D → ∞, all three terms in the pmse are non-vanishing. As we describe in the derivation, the rescaling of the learning rates is then necessary to obtain a set of closed equations. Nevertheless, we verified that the conclusions of the paper are not an artefact of these scalings. To this end, we trained a sigmoidal AE x̂_i = Σ_k v_i^k g(w^k · x) on grayscale CIFAR10 using SGD without rescaling the eigenvalues of the inputs, the pre-activations λ^k = w^k · x, or the learning rates. As we show in fig. 7, we recover the three key phenomena described in the paper: autoencoders with tied weights fail to achieve the rank-1 PCA error (orange); autoencoders with untied weights achieve the rank-K PCA error (blue); and for sigmoidal autoencoders, the encoder weights w shrink while the decoder weights v grow to reach the linear regime (inset). These results underline that our theoretical results describe AEs under "standard conditions".

A. Online learning algorithms for PCA
We briefly review a number of unsupervised learning algorithms for principal component analysis, leading to Sanger's rule which is the inspiration for the truncated SGD algorithm of section 3.5.
Setup We consider inputs x = (x_i), i = 1, . . . , D, with the first two moments E x_i = 0 and E x_i x_j = Ω_ij. We work in the thermodynamic limit D → ∞.

Hebbian learning
The Hebbian learning rule obtains an estimate of the leading principal component by considering a linear model y = λ = Σ_i w_i x_i / √D, i = 1, . . . , D, and the loss function L(λ) = −λ². It updates the weights as

dw_i = −η ∂L/∂w_i ∝ η λ x_i / √D.   (A.2)

In other words, the Hebbian learning rule tries to maximise the second moment of the pre-activation λ. That this update yields an estimate of the first principal component makes intuitive sense; consider the average of the loss over the inputs,

E L(λ) = −E λ² = −(1/D) Σ_{ij} w_i Ω_ij w_j.

If the weight vector w is equal to the (unit-norm) τth eigenvector of Ω, then E L(λ) = −ρ_τ/D, and the loss is minimised by converging to the eigenvector with the largest eigenvalue. Also note that, similar to what we find in the main text, this result suggests that the leading eigenvalue in the large-D limit should scale as ρ_τ ∼ D.

The Hebbian rule has the well-known deficit that it imposes no bound on the growth of the weight vector. A natural remedy is to introduce some form of weight decay, dw_i = η λ x_i / √D − η κ w_i. Oja's rule (Oja, 1982) offers a smart choice for the weight-decay constant κ. Consider the same linear model y = λ trained with a Hebbian rule and weight decay κ = y²:

dw_i = η y (x_i / √D − y w_i).

The purpose of this choice can be appreciated by substituting in the linear model and setting the average update to zero, which yields (dropping the constants)

Ω w = (wᵀ Ω w) w,

which is precisely the eigenvalue equation for Ω.
Oja's update rule can alternatively be derived from Hebb's rule by adding a normalisation to the Hebbian update (A.2), w ← (w + η y x) / ‖w + η y x‖_p, with p integer (in Oja's original paper, p = 2). This learning rule can be seen as a power-iteration update: substituting for y, we see that the numerator corresponds, on average, to repeated multiplication of the weight vector by the covariance matrix. Expanding eq. (A.7) around η = 0 yields Oja's algorithm (while also assuming that the weight vector is normalised to one). One can show that Oja's rule indeed converges to the PC with the largest eigenvalue by allowing the learning rate to vary with time and imposing the mild conditions Σ_t η_t = ∞ and Σ_t η_t² < ∞. Sanger's rule (Sanger, 1989), also known in the literature as generalised Hebbian learning, extends the idea behind Oja's rule to networks with multiple outputs y_k = Σ_i w_i^k x_i. It allows estimation of the first few leading eigenvectors using the update rule Δw_i^k = η y_k (x_i − Σ_{l≤k} y_l w_i^l). Note that the second sum extends only up to k; the dynamics of the kth weight vector hence only depend on the weight vectors w^l with l < k. This dependence is one of the similarities between Sanger's rule and the Gram-Schmidt procedure for orthogonalising a set of vectors (cf. sec. 0.6.4 of Horn & Johnson (2012)). The dynamics of Sanger's rule in the online setting were studied by Biehl & Schlösser (1998) and Schlösser et al. (1999). Sanger's rule reduces to Oja's rule for K = 1.
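A minimal numerical sketch of Sanger's rule (again our own illustration with ad-hoc constants, not taken from the references): with K = 3 output neurons and three planted spikes of decreasing strength, each weight vector converges to the corresponding eigenvector, in order.

```python
import numpy as np

rng = np.random.default_rng(1)

D, K = 100, 3
U, _ = np.linalg.qr(rng.standard_normal((D, K)))           # orthonormal spike directions
Omega = np.eye(D) + U @ np.diag([40.0, 20.0, 8.0]) @ U.T   # three planted spikes
L = np.linalg.cholesky(Omega)

W = rng.standard_normal((K, D)) / np.sqrt(D)   # rows are the weight vectors w^k
eta = 0.002

for _ in range(100_000):
    x = L @ rng.standard_normal(D)
    y = W @ x                                  # y_k = w^k . x
    # Sanger's rule: neuron k deflates the components captured by neurons l <= k,
    # a Gram-Schmidt-like coupling that breaks the symmetry between neurons.
    W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
overlaps = np.abs(np.sum(Wn * U.T, axis=1))    # |w^k . u_k| after normalisation
print(overlaps)                                # each entry close to 1, in order
```

The lower-triangular mask is exactly the l ≤ k restriction in the update rule; replacing it with the full matrix recovers a symmetric subspace method that no longer singles out individual eigenvectors.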

B. Online learning in autoencoders
Here we derive the set of equations tracking the dynamics of shallow non-linear autoencoders trained in the one-pass limit of stochastic gradient descent (SGD). These equations are derived for Gaussian inputs x ∈ R^D drawn from the generative model of eq. (3), but, as we discuss in section 4, they also accurately capture the dynamics of training on real data.

B.1. Statics
The starting point of the analysis is the definition of the test error (2) and the identification of order parameters, i.e. low-dimensional overlaps of high-dimensional vectors.
First, we introduce the local fields corresponding to the encoder's and decoder's weights, λ^k = w^k · x / √D and ν^k = v^k · x / D, as well as the order parameter tracking the overlap between the decoder weights. Note the unusual scaling of ν^k with D; the intuition here is that in shallow autoencoders, the second-layer weights become strongly correlated with the eigenvectors of the input-input covariance, and hence with the inputs, requiring the scaling 1/D instead of 1/√D. This scaling is also the one that yields a set of self-consistent equations in the limit D → ∞. The generalisation error can then be expressed in terms of these fields. Crucially, the local fields λ and ν are jointly Gaussian, since the inputs are Gaussian. Since E λ^k = E ν^k = 0, the pmse can be written in terms of the second moments of these fields only. The full expressions of E g(λ^k)g(λ^l) and E ν^k g(λ^k) in terms of the order parameters are given, for various activation functions, in App. D. Note the different scalings of these overlaps with D; they are a direct consequence of the different scalings of the local fields λ^k and ν^k. In order to derive the equations of motion, it is useful to decompose Ω in its eigenbasis. The eigenvectors Γ^τ are normalised such that Σ_i Γ_i^τ Γ_i^{τ'} = D δ_{ττ'} and Σ_τ Γ_i^τ Γ_j^τ = D δ_{ij}. We further define the rotation of any vector of interest into this eigenbasis; the order parameters are then written in terms of the rotated weights.

B.2. Averages over non-linear functions of weakly correlated variables
We briefly recall expressions for computing averages of non-linear functions of weakly correlated random variables, which were also used by Goldt et al. (2020). These expressions will be used extensively in the following derivation. Consider x, y ∈ R, two jointly Gaussian random variables whose covariance matrix has off-diagonal element ε ≪ 1, which encodes the weak correlation between x and y. Then the expectation of the product of any two real-valued functions f, h : R → R is given, to leading order in ε, by E f(x)h(y) = E f(x) E h(y) + ε E f′(x) E h′(y) + O(ε²). (B.12) Similarly, for three random variables {x_i}, with x_1 and x_2 each weakly correlated with x_3 but not with each other, i.e. Cov(x_1, x_2) = M_12 ∼ O(1), one can compute the expectation of the product of three real-valued functions f, g, h to leading order in ε. (B.13)
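The two-variable expansion (B.12) is easy to verify numerically. In the sketch below (our own illustration with arbitrarily chosen f, h and ε, not tied to the paper's setup), we compare a Monte-Carlo estimate of E f(x)h(y) with the leading-order formula E f E h + ε E f′ E h′.

```python
import numpy as np

rng = np.random.default_rng(2)

eps = 0.05                                     # weak correlation between x and y
cov = np.array([[1.0, eps], [eps, 1.0]])
z = np.linalg.cholesky(cov) @ rng.standard_normal((2, 2_000_000))

f, fp = np.tanh, lambda u: 1 - np.tanh(u)**2   # f and its derivative
h, hp = lambda u: u**3, lambda u: 3 * u**2     # h and its derivative

lhs = np.mean(f(z[0]) * h(z[1]))               # Monte-Carlo estimate of E f(x) h(y)

g = rng.standard_normal(2_000_000)             # the marginals here are standard Gaussians
rhs = np.mean(f(g)) * np.mean(h(g)) + eps * np.mean(fp(g)) * np.mean(hp(g))
print(lhs, rhs)                                # agree up to O(eps^2) and sampling error
```

Since f is odd and h is odd, the zeroth-order term vanishes here and the entire correlation E f(x)h(y) is carried by the O(ε) term, which makes the check sensitive to the expansion.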

B.3. Derivation of dynamical equations
At the µth step of training, the rotated weights follow the SGD update given in eq. (B.14). To keep the notation concise, we drop the weight-decay term in the following analysis. We can now compute the update of the various order parameters by inserting this update into their definitions. The stochastic process described by the resulting equations concentrates, in the D → ∞ limit, on its expectation over the input distribution. By averaging over a fresh sample x, we therefore obtain a closed set of deterministic ODEs tracking the dynamics of training in the high-dimensional limit. We also show in simulations that these equations capture the dynamics well already at finite dimension D ∼ 500.
Update of T_0, the overlap of the decoder's weights Let us start with the update equation for T_0^{kl} = (1/D) Σ_τ v_τ^k v_τ^l, which is easily obtained by using the SGD update (B.14).
By taking the expectation over a fresh sample x and the D → ∞ limit of one-pass SGD, we obtain an ODE in the normalised number of steps s = µ/D, which we interpret as a continuous time variable; the notation (k ↔ l) indicates an additional term identical to the first except for an exchange of the indices k and l. This equation requires us to evaluate I_22 ≡ E g(λ^k)g(λ^l) and I_21 ≡ E ν^l g(λ^k), which are two-dimensional Gaussian integrals with covariance matrices given by the order parameters. We give their expressions in appendix D. We also note that the second-order term ∝ dv^k dv^l is sub-leading in the high-dimensional limit we work in.
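For the sigmoidal activation g(x) = erf(x/√2), such Gaussian integrals have well-known closed forms (used e.g. by Goldt et al. (2020)); the paper's exact expressions are in App. D. The sketch below states a generic version of the two integrals and checks them by Monte Carlo, for an arbitrarily chosen covariance matrix (our own test values, not the paper's).

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(3)

def I22(c11, c12, c22):
    """E[g(u) g(v)] for g(x) = erf(x / sqrt(2)), (u, v) centred Gaussian."""
    return (2 / np.pi) * np.arcsin(c12 / np.sqrt((1 + c11) * (1 + c22)))

def I21(c12, c22):
    """E[u g(v)] for the same g and centred Gaussian (u, v); follows from Stein's lemma."""
    return np.sqrt(2 / np.pi) * c12 / np.sqrt(1 + c22)

C = np.array([[1.3, 0.4], [0.4, 0.8]])          # arbitrary test covariance
z = np.linalg.cholesky(C) @ rng.standard_normal((2, 1_000_000))
g = lambda x: erf(x / np.sqrt(2))

print(np.mean(g(z[0]) * g(z[1])), I22(1.3, 0.4, 0.8))   # Monte Carlo vs closed form
print(np.mean(z[0] * g(z[1])), I21(0.4, 0.8))
```

In the derivation these closed forms are what turns the expectation of the SGD update into an explicit function of the order parameters.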
Update of T_1 The update equation for T_1 is obtained similarly to before, by using the SGD update (B.14) and taking the high-dimensional limit.
To make progress, we have to evaluate E g(λ^k) x_τ. For this purpose, it is crucial to notice that the rotated inputs x_τ are only weakly correlated with the local fields. We can then compute the expectation E g(λ^k) x_τ using the results of appendix B.2, i.e. eq. (B.12) with f(x) = g(x) and h(y) = y. Inserting the resulting expression into the update of T_1 yields an equation featuring the term Σ_τ ρ_τ² v_τ^l w_τ^k / D^{5/2}, which cannot be expressed simply in terms of the order parameters. Similar terms appear in the equations for Q_1 and R_1. To close the equations, we are thus led to introduce order-parameter densities in the next step.

Order parameters as integrals over densities
To proceed, we introduce the densities q(ρ, s), r(ρ, s) and t(ρ, s), which depend on ρ and on the normalised number of steps s = µ/D (B.22), where 1(·) is the indicator function and the limit is taken after the thermodynamic limit. The order parameters are obtained by integrating these densities over the spectrum of the input-input covariance matrix. It follows that by tracking the dynamics of the functions q, r and t, we obtain the dynamics of the overlaps Q_1, R_1 and T_1.
Dynamics of t(s, ρ) The dynamics of t follows straightforwardly from eq. (B.21), where we defined the rescaled eigenvalue ρ̃ ≡ ρ/D. Here again, straightforward algebra shows that the second-order term is sub-leading in the limit of high input dimensions.
Dynamics of q(s, ρ) In a similar way, we compute the dynamical equation for q(s, ρ) by using the encoder weights' update in eq. (B.14) and taking the expectation over the input distribution.
In the above, several new expectations appear. To compute them, we use our results for weakly correlated variables from appendix B.2; inserting the resulting expressions into the equation for q(ρ, s) closes it. Note that, as explained in the main text, in order to obtain a well-defined D → ∞ limit, we rescale the learning rate of the encoder's weights as η_W = η/D.
Dynamics of r(s, ρ) Lastly, we obtain the equation for r(ρ, s) in the same way as the two previous ones: we use eq. (B.14) and evaluate the expectation over a fresh sample x.
Evaluating the expectations as before (appendix B.2), we obtain the equation of motion for r(ρ, s).

B.4. Simplification of the equations for spiked covariance matrices
In the case of the synthetic dataset defined as x = Ac + ξ, the spectrum of the covariance matrix decomposes into a bulk of small, i.e. O(1), eigenvalues with continuous support, and a few outlier eigenvalues of order O(D) (see fig. 1). This allows us to obtain, for each order parameter, M equations controlling the evolution of the spike modes and one equation controlling the evolution of the bulk. Indeed, we can simplify the equations of motion of the bulk modes (i.e. those for which ρ ∼ O(1)) by neglecting terms proportional to ρ/D ≪ 1. We can then define bulk order parameters that collect the contribution of the bulk modes. Note that even though the integrands involve ρ/D ∼ O(1/D), since we integrate over a large number, O(D), of modes, the result is of order one. The equations for these overlaps are thus obtained as the integrated form of eq. (B.31) (B.33). The order parameters can be decomposed into the contributions from the bulk eigenvalues and those of the spikes, e.g. T_1 = T_1^{bulk} + Σ_{i=1}^M t^i, and similarly for Q_1 and R_1. The M spike modes t^i, r^i and q^i obey the full equations (B.24), (B.29) and (B.28).
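The bulk-plus-spikes structure can be read off directly from the population covariance of the generative model. In the sketch below (our own illustration, assuming A with i.i.d. standard Gaussian entries and arbitrary D = 1000, M = 3), Ω = AAᵀ + I has exactly M outlier eigenvalues of order D, while the remaining eigenvalues sit at O(1); for the population covariance with isotropic noise the bulk collapses to a point, whereas the empirical spectrum of fig. 1 has continuous support.

```python
import numpy as np

rng = np.random.default_rng(4)

D, M = 1000, 3
A = rng.standard_normal((D, M))          # generative weights of x = A c + xi
Omega = A @ A.T + np.eye(D)              # population covariance of the inputs

rho = np.linalg.eigvalsh(Omega)[::-1]    # eigenvalues in descending order
print(rho[:M] / D)                       # M spikes of order D
print(rho[M])                            # the rest of the spectrum sits at O(1)
```

This separation of scales is what justifies treating the M spike modes with their full equations while the O(D) bulk modes are summarised by a single integrated equation.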
In particular, it is clear from eq. (B.33) that for any non-zero weight-decay constant κ, the bulk contribution to Q_1 decays to zero in a characteristic time κ^{-1}. Further note that the equations for T^{bulk} and R^{bulk} result in an exponential decay of the bulk modes towards zero.

C. Reduced equations for long-time dynamics of learning
In this section we illustrate how to use the equations of motion to study the long-training-time dynamics of shallow non-linear autoencoders. We have seen that at sufficiently long times, the weights of the network span the leading PC subspace of the covariance matrix. We restrict ourselves to the case in which the eigenvectors are recovered directly; the case in which a rotation of them is found follows straightforwardly. We also restrict ourselves to the matched case K = M.
Consider the ansatz on the weight configuration of eq. (19), in which w^k and v^k are proportional to the kth eigenvector, with dynamical constants α_w^k, α_v^k ∈ R controlling the norms of the weights. Using this ansatz, the order parameters, and hence the pmse, can be written in terms of these constants alone. The first observation is that the generalisation error reaches a minimum when the term in brackets, f^k, is minimised. Minimising it leads to a solution α_v^{k*}(α_w) at which f^k(α_v^{k*}(α_w), α_w) = 0 and the pmse equals the PCA reconstruction error. For a linear activation function, the pmse is a function only of the product α^{k2} ≡ α_w^k α_v^k, and it is minimal, and equal to the PCA reconstruction error, whenever this product takes its optimal value. This is nothing but the intuitive result that, for a linear autoencoder, any rescaling of the encoder's weights can be compensated by the decoder's weights.
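The rescaling invariance of the linear autoencoder is immediate to check numerically: along a single principal direction Γ, the reconstruction x̂ = α_v (α_w Γ·x) Γ depends on α_w and α_v only through their product. A toy check (our own, with arbitrary dimension and constants, and a unit-norm Γ rather than the paper's normalisation):

```python
import numpy as np

rng = np.random.default_rng(5)

D = 50
Gamma = rng.standard_normal(D)
Gamma /= np.linalg.norm(Gamma)           # a single (unit-norm) principal direction
x = rng.standard_normal(D)               # an arbitrary input

def xhat(aw, av):
    """Linear AE along Gamma: encode with aw * Gamma, decode with av * Gamma."""
    return av * (aw * (Gamma @ x)) * Gamma

c = 3.7                                  # any rescaling of the encoder ...
print(np.allclose(xhat(1.0, 2.0), xhat(c, 2.0 / c)))   # ... is undone by the decoder: True
```

For a non-linear activation this invariance is broken, which is precisely why the sigmoidal case below singles out a particular pair (α_w^k, α_v^k).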
For a sigmoidal autoencoder, instead, the expressions of the integrals given in App. D apply. Differentiating f^k with respect to α_w^k and requiring the derivative to vanish, we find an equation for the optimal scaling constants. We point out that this solution is independent of the signs of α_w^k and α_v^k, as recovering the eigenvectors or minus the eigenvectors is equivalent. Substituting this solution for α_w^k into the expression for the pmse, we find f^k(α_v^{k*}(α_w), α_w^k) = 0: the autoencoder reaches the same reconstruction error as the one achieved by PCA. The validity of the equations is verified in fig. 3, where we compare the result of integrating them to simulations. We provide a Mathematica notebook that allows one to derive these equations in the GitHub repository of this paper.
To validate our calculations for autoencoders at finite D, we trained an autoencoder from three distinct initial conditions: a random one; an informed one, in which the weights are initialised as in eq. (19) with α_v^k = α_w^k = 1; and a perfect one, in which, additionally, α_v^k is initialised from the minimal-pmse solution of eq. (C.6). The first observation is that the pmse of the network started from perfect initial conditions remains at the PCA reconstruction error, thus validating eq. (C.6). Since the dataset is noiseless, the PCA error is given by the numerical error; we note, however, that the same conclusion carries over to noisy datasets with finite PCA reconstruction error. In fig. 10, we plot the pmse of the randomly initialised and informed autoencoders (left) as well as the evolution of α_v^k and α_w^k, i.e. of the overlaps between the weights and the eigenvectors during training (right). The network with random initial weights learns in two phases: first, the pmse decays exponentially until it reaches the error of the aligned network. Then its pmse coincides with that of the informed network, and the error decays as a power law. This shows that during the first, exponential phase of learning, the network recovers the leading PC subspace; this occurs early in training. In the second phase, the network adjusts the norms of the weights to reach PCA performance. As discussed in the main text, the sigmoidal network achieves this by shrinking the encoder's weights, so as to recover the linear regime of the activation function, while growing the decoder's weights to keep the norm of the reconstruction comparable to that of the input.

D. Analytical formulas of the integrals
In this subsection we define C = (c_ij), i, j = 1, . . ., k, as the covariance matrix of the variables {x_i}, i = 1, . . ., k, appearing in the expectations; for example, E x_1 x_2 − E x_1 E x_2 = c_12.

Figure 1 .
Figure 1. Shallow autoencoders and their inputs (a) We analyse two-layer autoencoders with a non-linear activation function in the hidden layer. (b) Top: rescaled eigenvalues of the covariance matrix of inputs drawn from the spiked Wishart model in eq. (3), which can be divided into a bulk and a finite number of outliers. Bottom: example inputs drawn from CIFAR10 (Krizhevsky et al., 2009), a benchmark dataset we use for our experiments with realistic data in section 4.

Figure 2 .
Figure 2. The dynamical equations describe the learning dynamics of autoencoders Prediction mean-squared error pmse (2) of an autoencoder with K = 3 hidden neurons during training with stochastic gradient descent on the spiked Wishart model in the one-pass limit. We obtained the pmse directly from a simulation (crosses) and from integration of the dynamical equations describing learning that we derive in section 3.1 (line). Horizontal lines indicate the PCA reconstruction error with increasing numbers of principal components. Parameters: M = 3, η = 1, D = 1000.

Figure 3 .
Figure 3. Two phases of learning in sigmoidal autoencoders (a) Prediction error pmse (2) of a sigmoidal autoencoder with a single hidden neuron starting from random initial conditions (blue). The error first decays exponentially, then as a power law. During this second phase of learning, the pmse is identical to that of an autoencoder whose weights are proportional to the leading eigenvector Γ of the input covariance at all times: w(t) ∝ αw(t)Γ, v(t) ∝ αv(t)Γ, cf. eq. (19). (b) Dynamics of the scaling constants αw and αv obtained through simulations of an autoencoder starting from random weights (crosses) and from integration of a reduced set of dynamical equations, eq. (C.8). (c, d) The norms of the encoder (left) and decoder (right) weights. The weights shrink, respectively grow, as a power law √ρ (ηt)^{±δ} with exponent δ ≈ 1/6. This allows the sigmoidal network to remain in the linear region of its activation function.

Figure 4 .
Figure 4. Shallow autoencoders require untied weights, and ReLU autoencoders also need biases (a) Prediction error pmse (2) of sigmoidal autoencoders with untied weights (red) and with tied weights v^k = w^k (blue). Horizontal lines indicate PCA errors for rank r. (b) Same plot for ReLU autoencoders with three different architectures: tied weights, no bias (blue); untied weights without bias (red); and untied weights with bias (green). Only the latter architecture achieves close to PCA error. (c) Distribution of biases in a ReLU network before (top) and after training (bottom). The biases increase throughout training, hence pushing the network into the linear region of its activation function. Parameters: D = 1000, K = 5, η = 1.

Figure 5 .
Figure 5. Breaking the symmetry between neurons yields the exact principal components of the data (a) Prediction mean-squared error (2) of an autoencoder trained using vanilla SGD (eq. 4, blue) and truncated SGD (eq. 23, orange). Grey horizontal lines indicate PCA reconstruction errors. (b) The weight vector of the first neuron of the autoencoder before and after training (top vs bottom) with vanilla and truncated SGD (left and right, resp.). In contrast to vanilla SGD, truncated SGD recovers the true leading principal component of the inputs, a sinusoidal wave (black). (c) Example image taken from Fashion MNIST. (d) The left column shows reconstructions of the image from (c) using all K = 64 neurons of an autoencoder trained with vanilla (top) and truncated SGD (bottom) on the full Fashion MNIST database. The right column shows reconstructions using only the first 5 neurons of the same autoencoders. Parameters: η = 1, K = 4 ((a) and (b)), K = 64 ((c) and (d)), bs = 1, P = 60000.

Figure 8.
Figure 8. Gaussian equivalence for autoencoders Train and test mean-squared error for linear, sigmoidal and ReLU autoencoders with different numbers of hidden neurons (K = 4, 64, 128 in blue, orange and green, resp.). The autoencoders were trained on CIFAR10 grayscale images (solid lines) or, starting from the same initial conditions, on Gaussian inputs with the same covariance (dashed lines). The agreement is essentially perfect. Parameters: 10k training samples, D = 1024, mini-batch size = 1, η = 1.

Figure 10 .
Figure 10. Learning in non-linear shallow AEs occurs in two phases (a) The pmse of an autoencoder trained from random initial conditions first decays exponentially and then as a power law in time. Dynamics of the scaling constants αv and αw: the reduced set of 2K ODEs matches the results of simulations. The dynamics of an AE initialised with random initial conditions become indistinguishable from those of an AE whose weights are initialised proportional to the PCs. This shows that the network first recovers the leading PC subspace and then adjusts the norms of the weights to recover the linear regime. Parameters: η = 1, K = 64, bs = 1.