Redundant representations help generalization in wide neural networks

Deep neural networks (DNNs) defy the classical bias-variance trade-off: adding parameters to a DNN that interpolates its training data will typically improve its generalization performance. Explaining the mechanism behind this "benign overfitting" in deep networks remains an outstanding challenge. Here, we study the last hidden layer representations of various state-of-the-art convolutional neural networks and find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information, and differ from each other only by statistically independent noise. The number of such groups increases linearly with the width of the layer, but only if the width is above a critical value. We show that redundant neurons appear only when the training process reaches interpolation and the training error is zero.

In this paper, we describe a phenomenon in wide DNNs that could be a possible mechanism for benign overfitting when the networks are trained with regularization. We illustrate this mechanism in Fig. 1 for a family of increasingly wide DenseNet40s [24] trained on CIFAR10 [25] following common practice, in particular using weight decay (see Sec. 2.1). For simplicity, we refer to the width W of the last hidden representation as the width of the network. The blue line in Fig. 1-b shows that the average classification error (error) approaches the performance of a large ensemble of networks (error∞) [21] as we increase the network width W. In agreement with [26], we find that the performance of these DenseNets improves continuously with width. For widths greater than 350, the networks are wide enough to reach zero training error (see Appendix, Sec. B, Fig. S2-c) and, as we will see, the last hidden representation contains more and more groups of neurons carrying redundant information ("clones") as the network grows wider, as we illustrate at the bottom of Fig. 1. The accuracy of these wide networks then improves with their width because the network implicitly averages over an increasing number of clones in its representations to make its prediction.

Figure 1: The redundancy of representations in wide neural networks. a: We analyze the final representations of deep neural networks (DNN), namely the activities of the last hidden layer of neurons (light blue). We focus on the performance and the statistical properties of randomly chosen subsets of w_c neurons which we call "chunks". In the chunked network shown here, w_c = 5 out of 9 neurons are kept and used to predict the output. b: As we increase the size of the chunk w_c that we keep in a state-of-the-art DNN, here a DenseNet40, the test error of the chunk (orange line) becomes similar to the test error of a full network of width W = w_c (blue line). In this regime, which is reached when w_c is larger than a threshold w*_c (shaded area), the error approaches its asymptotic value error∞ as a power law w_c^{-1/2}.
This paper provides a quantitative analysis of this phenomenon on various data sets and architectures. Our main findings can be summarized as follows:
1. A chunk of w_c random neurons of the last hidden representation of a wide neural network predicts the output with an error that decays as w_c^{-1/2} if the layer is wide enough and w_c is large enough. In this regime, we call the chunk a "clone".
2. Clones fit the training set with zero error and can be linearly mapped one to another, or to the full representation, with an error that can be described as uncorrelated random noise.
3. Clones appear if the model is trained with weight decay and the training set is fitted with zero error. If training is stopped too early or if the training is performed without regularization, 1. and 2. do not take place, even if the last representation is very wide.

Neural network architectures
We report experimental results obtained with several architectures (fully connected networks, Wide-ResNet-28, DenseNet40, ResNet50) and data sets (CIFAR10/100 [25], ImageNet [27]). We train all the networks using SGD with momentum and, importantly, weight decay. The amount of weight decay is found with a small grid search, while the other relevant hyperparameters are set following standard practice. We give detailed information on our training setups in Sec. A of the Appendix. All our experiments are run on Volta V100 GPUs. In the following paragraphs, we describe how we vary the width W of the models.
Fully-connected networks on MNIST. We train a fully-connected network to classify the parity of the MNIST digits [28] (pMNIST) following the protocol of Geiger et al. [21]. MNIST digits are projected onto the first ten principal components, which are then used as inputs of a five-layer fully-connected network (FC5). The four hidden representations have the same width W, and the output is a real number whose sign is the predictor of the parity of the input digit.
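A minimal PyTorch sketch of this setup follows; the random stand-in data, the ReLU activations, and all names are our assumptions, as the text does not specify them.

```python
import numpy as np
import torch
from torch import nn
from sklearn.decomposition import PCA

X_train = np.random.rand(1000, 784)  # stand-in for flattened MNIST digits

# Project the digits onto the first ten principal components.
pca = PCA(n_components=10)
Z_train = torch.tensor(pca.fit_transform(X_train), dtype=torch.float32)

W = 512  # common width of the four hidden representations
fc5 = nn.Sequential(
    nn.Linear(10, W), nn.ReLU(),
    nn.Linear(W, W), nn.ReLU(),
    nn.Linear(W, W), nn.ReLU(),
    nn.Linear(W, W), nn.ReLU(),
    nn.Linear(W, 1),  # scalar output; its sign predicts the parity
)
out = fc5(Z_train)  # a hinge loss on this scalar output would follow
```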
Wide-ResNet-28 and DenseNet40 on CIFAR10/100. We train a family of Wide-ResNet-28 [26] (WR28) networks on CIFAR10 and CIFAR100. The number W of last hidden neurons in a WR28-n is 64·n, obtained after average pooling the last 64·n channels of the network. In our experiments, we also analyze two narrow versions of the standard WR28-1 which are not typically used in the literature. We name them WR28-0.25 and WR28-0.5 since they have 1/4 and 1/2 of the number of channels of WR28-1. Our implementation of DenseNet40 follows the DenseNet40-BC variant [24]. We vary the number of input channels c in {16, 32, 64, 128, 256}, which is twice the growth rate k of the network [24]. The number W of last hidden features of this architecture is 5.5·c.
ResNet50 on ImageNet. We modify the ResNet50 architecture [29] by multiplying the number of channels of all the layers after the input stem by a constant factor c ∈ {0.25, 0.5, 1, 2, 4}. When c = 2 our networks differ from the standard Wide-ResNet50-2 [26], since we double the channels of all the layers and not just those of the bottleneck of the ResNet blocks. As a consequence, in our implementation the number of features after the last pooling layer is W = 2048·c, while in [26] W is fixed to 2048.
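To keep the three width parameterizations straight, the following small sketch (names are ours) computes the last-layer width W for each family from the formulas above:

```python
def last_hidden_width(family: str, factor: float) -> float:
    """Width W of the last hidden representation, per the formulas above."""
    formulas = {
        "WR28": lambda n: 64 * n,          # WR28-n
        "DenseNet40": lambda c: 5.5 * c,   # c input channels, with c = 2k
        "ResNet50": lambda c: 2048 * c,    # channel multiplier c
    }
    return formulas[family](factor)

assert last_hidden_width("WR28", 8) == 512        # W used for CIFAR10
assert last_hidden_width("WR28", 16) == 1024      # W used for CIFAR100
assert last_hidden_width("ResNet50", 4) == 8192   # widest ImageNet model
```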

Analytical methods
Reconstructing the wide representation from a smaller chunk. To determine how well a subset of w neurons can reconstruct the full representation of size W, we search for the W × w linear map A that minimizes the squared difference (x^(W) − x̂^(W))² between the W activations of the full layer representation, x^(W), and the activations predicted from the chunk of size w, x̂^(W):

x̂^(W) = A x^(w).   (1)

This least-squares problem is solved with ridge regression [30] with regularization set to 10^{-8}, and we use the R² coefficient of the fit to measure the predictive power of a given chunk size. The R² value is computed as an average of the single-activation R² values corresponding to the W output coordinates of the fit, weighted by the variance of each coordinate. We further compute the W × W covariance matrix C_ij of the residuals of this fit, and from C_ij we obtain the correlation matrix as ρ_ij = C_ij / (√(C_ii C_jj) + ε), with a small regularization ε in the denominator to avoid instabilities when the standard deviation of the residuals falls below machine precision. To quantify how much the errors of the fit are correlated, we average the absolute values of the non-diagonal entries of the correlation matrix ρ_ij. For short, we refer to this quantity as a 'mean correlation'.
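The following is a minimal numpy/scikit-learn sketch of this analysis, assuming the last-layer activations are stored as an (n_samples × W) array; the function name, the ε value, and the array layout are our assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def chunk_reconstruction_stats(X_full, chunk_idx, alpha=1e-8, eps=1e-12):
    """Reconstruct the full representation from a chunk of neurons.

    X_full: (n_samples, W) activations of the last hidden layer.
    chunk_idx: indices of the w neurons forming the chunk.
    Returns the variance-weighted R^2 of the ridge fit and the 'mean
    correlation' (mean |off-diagonal| entry of the residual correlations).
    """
    X_chunk = X_full[:, chunk_idx]                  # (n_samples, w)
    fit = Ridge(alpha=alpha).fit(X_chunk, X_full)   # the W x w map A
    residuals = X_full - fit.predict(X_chunk)

    var = X_full.var(axis=0)
    r2_per_neuron = 1.0 - residuals.var(axis=0) / (var + eps)
    r2 = np.average(r2_per_neuron, weights=var)     # variance-weighted R^2

    C = np.cov(residuals, rowvar=False)             # W x W residual covariance
    std = np.sqrt(np.diag(C))
    rho = C / (np.outer(std, std) + eps)            # regularized correlations
    off_diag = ~np.eye(rho.shape[0], dtype=bool)
    mean_corr = np.abs(rho[off_diag]).mean()
    return r2, mean_corr
```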
Reproducibility. We provide code to reproduce our experiments and our analysis online at https://github.com/diegodoimo/redundant_representation.

The mechanism we propose is inspired by the following experiment: we compute the test accuracy of models obtained by selecting a random subset of w_c neurons from the final hidden representation of a wide neural network. We select w_c neurons at random and we compute the test accuracy of a network in which we set to zero the activations of all the other W − w_c neurons of the final layer. Importantly, we do not fine-tune the weights after selecting the w_c neurons: all the remaining parameters of the previous layers are left unchanged, and only the activations of the "killed" neurons of the last hidden representation are not used to compute the logits. We take 500 random samples of neurons for each chunk width w_c. We consider three different data sets: a fully connected network trained on pMNIST, and convolutional networks trained on CIFAR10 and CIFAR100. The width W of the network is 512 for pMNIST and CIFAR10, and W = 1024 for CIFAR100 (see Sec. 2.1). In all these cases, W is large enough to be firmly in the regime where the accuracy of the networks scales (approximately) as W^{-1/2} (see Fig. 2).
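As a hedged illustration, the sketch below implements this neuron-killing step in PyTorch; the split of the model into a `features` part (returning the (batch × W) last hidden representation) and a `classifier` head, as well as all names, are our assumptions.

```python
import torch

@torch.no_grad()
def chunk_test_error(model, loader, W, w_c, device="cuda"):
    """Test error of a 'chunked' model: keep w_c random neurons of the
    last hidden representation, zero out the other W - w_c, and leave
    every weight of the network untouched (no fine-tuning)."""
    keep = torch.randperm(W, device=device)[:w_c]
    mask = torch.zeros(W, device=device)
    mask[keep] = 1.0

    wrong, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        h = model.features(x)                # (batch, W) last hidden layer
        logits = model.classifier(h * mask)  # killed neurons contribute 0
        wrong += (logits.argmax(dim=1) != y).sum().item()
        total += y.numel()
    return wrong / total
```

Averaging this quantity over 500 random choices of the kept neurons gives, for each chunk width, the orange curves of Fig. 3.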

Results
In Fig. 3 we plot the test error of the "chunked models" as a function of w_c (orange lines). The behavior is similar in all three networks: the test error decays as w_c^{-1/2} for chunks that are larger than a critical value w*_c, which depends on the data set and architecture used. This decay follows the same law observed for full networks of the same width (Fig. 2). This implies that a model obtained by selecting a random chunk of w_c > w*_c neurons from a wide final representation behaves similarly to a full network of width W = w_c. Furthermore, a decay with rate −1/2 suggests that the final representation of the wide networks can be thought of as a collection of statistically independent estimates of a finite set of data features relevant for classification. Adding neurons to the chunk hence reduces the prediction error in the same way an additional measurement reduces the measurement uncertainty, leading to the −1/2 decay.
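This scaling can be checked directly with a least-squares fit in log-log space; a sketch (ours), assuming the chunk errors and the ensemble error error∞ have already been measured:

```python
import numpy as np

def power_law_exponent(w_c, err, err_inf):
    """Fit err - err_inf ≈ a * w_c**gamma; gamma ≈ -1/2 in the clone regime."""
    x = np.log(np.asarray(w_c, dtype=float))
    y = np.log(np.asarray(err, dtype=float) - err_inf)
    gamma, log_a = np.polyfit(x, y, 1)
    return gamma, np.exp(log_a)
```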
At w_c smaller than w*_c instead, the test error of the chunked models decays faster than w_c^{-1/2} in all the cases we considered, including the DenseNet architecture trained on CIFAR10 shown in Fig. 1-b. In this regime, adding neurons to the final representation improves the quality of the model significantly more quickly than it would in independently trained models of the same width (see Fig. 1-c for a pictorial representation of this process). We call chunks of neurons of size w_c ≥ w*_c clones. In the following, we characterize more precisely the properties of the clones. If distinct clones provide independent measures of the same salient features of the data, the test error decays approximately as n^{-1/2}, where n is the number of clones, or equivalently W^{-1/2}, since n grows linearly with W. In the following, we will indeed see that distinct clones differ from each other by uncorrelated random noise.
Clones reconstruct the full representation almost perfectly. From a geometrical perspective, the important features of the final representation correspond to directions in which the data landscape shows large variations [31]. A clone is a chunk that is wide enough to encode almost exactly these directions (since its training error is almost zero), but using far fewer neurons than the full final representation. We analyze this aspect by performing a linear reconstruction of the W activations of the last hidden representation of the widest network starting from a random subset of w_c activations, using ridge regression with a small regularization penalty according to Eq. (1). The blue profiles in Fig. 4-(d,e,f) show the R² coefficient of the fit as a function of the chunk size w_c for pMNIST (left), CIFAR10 (center), and CIFAR100 (right). When w_c is very small, say below 6 for pMNIST, 20 for CIFAR10 and 60 for CIFAR100, the R² coefficient grows almost linearly with w_c. In this regime, adding a randomly chosen activation from the full representation to the chunk increases R² substantially. When w_c becomes larger, R² reaches almost one. This transition happens when w_c is still much smaller than W and corresponds approximately to the regime in which the test error starts scaling with the inverse square root of w_c (see Fig. 3). The almost perfect reconstruction of the original data landscape with few neurons is a consequence of the low intrinsic dimension (ID) of the representation [32]. The ID of the widest representations gives a lower bound on the number of coordinates required to describe the data manifold, and hence on the neurons that a chunk needs in order to have the same classification accuracy as the whole representation. The ID of the last hidden representation is 2 in pMNIST, 12 in CIFAR10, and 14 in CIFAR100, numbers which are much lower than w*_c, the width at which a chunk can be considered a clone.
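For concreteness, here is a sketch of one widely used ID estimator, the two-nearest-neighbors (TwoNN) method; we show it only to illustrate the kind of computation involved, without claiming it is the exact estimator of [32].

```python
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(X):
    """TwoNN intrinsic-dimension estimate (maximum-likelihood form):
    ID = N / sum_i log(r2_i / r1_i), with r1, r2 the distances of point i
    to its first and second nearest neighbors."""
    dist, _ = cKDTree(X).query(X, k=3)     # columns: self, 1st, 2nd neighbor
    mu = dist[:, 2] / dist[:, 1]
    mu = mu[np.isfinite(mu) & (mu > 1.0)]  # guard against duplicate points
    return len(mu) / np.log(mu).sum()
```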
Clones differ from each other by uncorrelated random noise. When w_c > w*_c, the small residual difference between the chunked representation and the full representation can be approximately described as statistically independent random noise. The green profiles of the bottom panels of Fig. 4 show the mean correlation of the residuals of the linear fit (see Sec. 2.2). Below w*_c, the residuals are not only large but also significantly correlated, since they are related to relevant features of the data that are not covered by the neurons of the chunk. As the chunk width increases above w*_c, the correlation between residuals drops basically to zero. Therefore, in networks wider than w*_c, any two chunks of equal size w_c > w*_c can be effectively considered as equivalent copies, or clones, of the same representation (that of the full layer), differing only by a small and uncorrelated noise, consistently with the scaling law of the error shown in Fig. 3.
The dynamics of training. In the previous paragraphs, we set forth evidence in support of the hypothesis that large chunks of the final representation of wide DNNs behave approximately like an ensemble of independent measures of the full feature space. This allowed us to interpret the decay of the test error of the full networks with the network width observed empirically in Fig. 2. The three conditions that a chunked model satisfies in the regime in which its test error decays as w_c^{-1/2} are represented in Fig. 4: (i) the training error of the chunked model is close to zero; (ii) the chunked model can be used to reconstruct the full final representation with an R² ∼ 1; and (iii) the residuals of this reconstruction can be modeled as independent random noise. These three conditions are all observed at the end of training. We now analyze the training dynamics. We will see that for the clones to arise, models not only need to be wide enough but also, crucially, they need to be trained to maximize their performance.
Clones are formed in two stages, which occur at different times during training. The first phase begins as soon as training starts: the network gradually adjusts the chunk representations in order to produce independent copies of the data manifold. This can be clearly observed in Fig. 5-a, which depicts the mean correlation between the residuals of the linear fit from the chunked to the full final representations of the network, the same quantity that we analyze in Fig. 4-(d,e,f, green profiles): the correlation decreases steadily during this phase. The second phase begins once the full network has reached zero training error, at around epoch 160: between epochs 160 and 180, the chunks themselves progressively reach zero training error (Fig. 5-b). Importantly, both phases improve the generalization properties of the network. This can be seen in Fig. 5-c, which reports the training and test error of the network, with the two phases highlighted. The figure shows that both phases lead to a reduction in the test error, although the first phase leads by far to the greatest reduction, consistent with the fact that the greatest improvements in accuracy typically arise during the first epochs of training. The formation of clones can be considered finished around epoch 180, when all the clones have reached almost zero error on the training set. After epoch 180, we also observe that the test error stops improving. In the Appendix (Sec. B) we report the same analysis on CIFAR100 (see Fig. S1) and on a DenseNet40 trained on CIFAR10 (see Fig. S2-(d,e,f)).
Clones appear only in regularized networks. So far in this work, we have shown only examples of regularized networks and data sets in which representations are redundant. However, if the network is not regularized, some of the signatures described above do not appear even if the width of the final representation is much larger than W* (the minimum interpolating width). Figure 6 shows the case of the Wide-ResNet28-8 analyzed in Fig. 5 trained on exactly the same data set (CIFAR10) but without weight decay. As shown in Fig. 6-a, in the network trained without regularization (blue line) the error does not scale as w_c^{-1/2}. This, as we have seen, indicates that the last hidden representation cannot be split into clones equivalent to the full layer. Indeed, the mean correlation of the residuals of the linear map from the chunks to the full representation remains approximately constant during training (Fig. 6-b), and is always much higher than what we observed for the same architecture and data set when training is performed with weight decay. We performed a similar analysis on the DenseNet40 (see Fig. S3), observing an analogous trend.
Clones appear only if a network interpolates the training set: the case of ImageNet. We saw that a chunk of neurons can be considered a clone if it fully captures the relevant features of the data, achieving almost zero training error (see Fig. 4). This condition is not satisfied for most of the networks trained on ImageNet [16]; therefore, we do not expect to see redundant representations in this important case. We verified this hypothesis by training a family of ResNet50s where we multiply all the channels of the layers after the input stem by a constant factor c ∈ {0.25, 0.5, 1, 2, 4}. In this manner, the widest final representation we consider consists of 8192 neurons, which is four times wider than both the standard ResNet50 [29] and its wider version [26] (see Sec. 2.1). We trained all the networks following the standard protocols and achieved test errors comparable to or slightly lower than those reported in the literature (see Appendix, Sec. A). We find that even in the case of the largest ResNet50, the top-1 error on the training set is ∼8% (see Fig. 7-a) and the network does not achieve interpolation, as discussed also in [16].
In this setting, none of the elements associated with the development of independent clones can be observed. The scaling of the test error of the chunks is steeper than w_c^{-1/2} (see Fig. 7-b), suggesting that chunks remain significantly correlated with each other. Figure 7-c shows that the mean correlation of the residuals does not decrease during training, as it does for the networks we trained on CIFAR10 and CIFAR100. We conclude that in a ResNet50, a representation with 8192 neurons is too narrow to encode all the relevant features redundantly on ImageNet, and a chunk as large as 4096 activations is not able to reconstruct all the relevant variations of the data as it does in the cases analyzed in Sec. 3.

Discussion
This work is an attempt to explain the paradoxical observation that over-parameterization boosts the performance of DNNs. This "paradox" is not a peculiarity of DNNs: if one trains a prediction model with n parameters on the same training set, but starting from independent initial weights and receiving samples in an independent order, one can obtain, say, m models which, in suitable conditions, provide predictions of the same quantity with independent noise due to initialization, SGD schedule, etc. If one estimates the target quantity by an ensemble average, the statistical error will (ideally) scale as m^{-1/2}, and therefore as N^{-1/2}, where N = n·m is the total number of parameters of the combined model. This will happen even if N is much larger than the number of data.
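A quick numerical illustration of this m^{-1/2} argument (a toy simulation with made-up noise levels, not data from the paper): averaging m independent unbiased predictors shrinks the error roughly as m^{-1/2}.

```python
import numpy as np

rng = np.random.default_rng(0)
target = 1.0
for m in [1, 4, 16, 64]:
    # 10,000 trials of an ensemble of m independent noisy predictors
    preds = target + 0.5 * rng.standard_normal((10_000, m))
    err = np.abs(preds.mean(axis=1) - target).mean()
    print(f"m = {m:2d}   mean |error| ≈ {err:.3f}")  # ~ 0.4 / sqrt(m)
```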
What is less trivial is that a DNN can accomplish this scaling within a single model, in which all the parameters are optimized collectively via the minimization of a single loss function. Our work describes a possible mechanism at the basis of this phenomenon in the special case of neural networks in which the last layer is very wide and the model is regularized. We observe that if the layer is wide enough, random subsets of its neurons can be viewed as approximately independent representations of the same data manifold (or clones). This implies a scaling of the error with the width of the layer as W^{-1/2}, which is qualitatively consistent with our observations.

The impact of network architecture. The capability of a network to produce statistically independent clones is a genuine effect of the over-parametrization of the whole network, as we find that redundancies appear even if the last layer width is kept constant and the width of all intermediate layers is increased (see Appendix, Sec. B, Fig. S4-a). At the same time, we also verified that if the network is too narrow to interpolate the training set, increasing the width of only the final representation is not sufficient to make the last layer redundant. We give an example of this effect in Fig. S4-b, where we show that the test error of a WR28-1 on CIFAR10 does not decrease if only the width of the final representation is increased, while the rest of the architecture is kept at a constant width.
The impact of training. The mechanism we described is robust to different training objectives, since we trained the convolutional networks with the cross-entropy loss and the fully connected networks with the hinge loss. However, even for wide enough architectures, clones appear only if the training is continued until the training error reaches zero. In our examples, when stopping the training too early, for example when the training error is similar to the test error, the chunks of the last representation do not become entirely independent from one another, and therefore they cannot be considered clones.
Neural scaling laws. Capturing the asymptotic performance of neural networks via scaling laws is an active research area. Hestness et al. [33] gave an experimental analysis of scaling laws with respect to the training data set size in a variety of domains. Rosenfeld et al. [34] and Kaplan et al. [35] experimentally explored the scaling of the generalization error of deep networks with the number of parameters/data points across architectures and application domains for supervised learning, while Henighan et al. [36] identified empirical scaling laws in generative models. Bahri et al. [37] showed the existence of four scaling regimes and described them theoretically in the NTK or lazy regime [38-40], where the network weights stay close to their initial values throughout training. None of these works proposes a mechanism that would explain these scalings through properties of the representation. Geiger et al. found that the generalization error can be related to the fluctuations of the output induced by initialization, and showed that it scales as W^{-1} in networks trained without weight decay, both in the NTK [21] and in the mean-field [41] regimes. We instead consider the feature-learning regime and train our networks with weight decay, which is indispensable for obtaining models with state-of-the-art performance. This might explain the difference in the scaling law that we observe empirically. Previous theoretical work did not study the impact of weight decay on scaling laws, so we hope that our results can spark further studies on the role of this essential regularizer.
Relation to theoretical results in the mean-field regime. Our empirical results also agree with recent theoretical results obtained for two-layer neural networks [42-47]. These works characterize the optimal solutions of two-layer networks trained on synthetic data sets with some controlled features. In the limit of infinite training data, these optimal solutions correspond to networks where neurons in the hidden layer duplicate the key features of the data. These "denoising solutions" or "distributional fixed points" were found for networks with wide hidden layers [42-45] and wide input dimension [46, 47]. Another point of connection with the theoretical literature is the concept of dropout stability. A network is said to be ε-dropout stable if its training loss changes by less than ε when half the neurons are removed at random from each of its layers [48]. Dropout stability has been rigorously linked to several phenomena in neural networks, such as the connectedness of the minima of their training landscape [49, 50].
Bias-variance trade-off and implicit ensembling. The success of various deep learning architectures and techniques has been linked to some form of ensembling. The successful dropout regularization technique [51, 52] samples from an exponential number of "thinned" networks during training to prevent co-adaptation of hidden units. While this can be seen as a form of (implicit) ensembling, here we observe that co-adaptation of hidden units in the form of clones occurs without dropout, and is crucial for the improvement of performance with width. Recent theoretical work on random features showed that ensembling and over-parameterization are two sides of the same coin, and that both mitigate the increase in the variance of the network that classically leads to worse performance with over-parameterization due to the bias-variance trade-off [18-20]. The plots of bias and variance in Fig. S5 for the architectures trained on the CIFAR10 and CIFAR100 data sets show that the clone size in these cases is slightly above the peak of the variance and almost coincides with the interpolation width of the full networks of the same size.
Impact for applications. The framework introduced in this work makes it possible to verify whether a neural network is sufficiently expressive to encode multiple statistically independent representations of the same ground truth, which, we believe, is a fair proxy of model quality and robustness. In particular, we find that reaching interpolation on the training set is not necessarily detrimental for generalization, and is instead a necessary condition for developing redundancies which, in turn, reduce the test error.

A Hyperparameters used and training procedures
Fully-connected networks on MNIST. We train the fully-connected networks for 5000 epochs with the following hyperparameters: batch size = 256, momentum = 0.9, learning rate = 10^{-3}, weight decay = 10^{-2}. We optimize these networks using Adam.
Wide-ResNet-28 and DenseNet40-BC on CIFAR10/100. All the models are trained for 200 epochs with stochastic gradient descent with batch size = 128, momentum = 0.9, and a cosine-annealing scheduler starting with a learning rate of 0.1. The training set is augmented with horizontal flips with 50% probability and with random crops of the images padded with four pixels on each side. For WR28 on CIFAR10 we select a weight decay equal to 5·10^{-4} and label-smoothing magnitudes equal to 0.1 for WR28-{0.25, 0.5, 1, 2} and equal to 0 for WR28-{4, 8}. For DenseNet40-BC on CIFAR10 we set a weight decay equal to 5·10^{-4} and a label-smoothing magnitude equal to 0.05 for all the networks. For WR28 on CIFAR100 we set weight decays equal to {10, 7, 5, 5, 5}·10^{-4} and label-smoothing magnitudes equal to {0.1, 0.07, 0.05, 0, 0} for WR28-{1, 2, 4, 8, 16}, respectively. All the hyperparameters were selected with a small grid search.
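A minimal PyTorch sketch of this training setup, assuming a WR28 implementation is available (we use a torchvision model as a stand-in; the names are ours):

```python
import torch
from torch import nn, optim
from torchvision import models, transforms

train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # pad 4 px per side, random crop
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

model = models.resnet18(num_classes=10)     # stand-in for a WR28 on CIFAR10
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
sched = optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)  # 200 epochs
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
```

Calling `sched.step()` after each epoch anneals the learning rate along the cosine schedule.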
ResNet50 on ImageNet. We train all the ResNet50s with mixed precision [53] for 120 epochs with a weight decay of 4·10^{-5} and a label-smoothing rate of 0.1 [54]. The input size is 224 × 224 and the training set is augmented with random crops and horizontal flips with 50% probability. The per-GPU batch size is set to 128 and is halved for the widest networks to fit in the GPU memory. The networks are trained on 8 or 16 Volta V100 GPUs so as to keep the batch size B equal to 1024. The learning rate is increased linearly from 0 to 0.1·B/256 [55] for the first five epochs and then annealed to zero with a cosine schedule.
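The warmup-plus-cosine schedule can be written compactly; a sketch (ours) of the learning rate as a function of the epoch:

```python
import math

def lr_at(epoch, batch_size=1024, warmup_epochs=5, total_epochs=120):
    """Linear warmup from 0 to 0.1 * B / 256 over the first five epochs,
    then cosine annealing to zero, as described above."""
    peak = 0.1 * batch_size / 256
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak * (1.0 + math.cos(math.pi * t))
```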

Figure S5: Bias-variance profiles on CIFAR10 and CIFAR100. We compute the bias and variance profiles for the convolutional architectures analyzed in the paper: Wide-ResNets and DenseNets trained on CIFAR10, and Wide-ResNets trained on CIFAR100. Since we trained the models using the cross-entropy loss, the standard bias-variance decomposition, which assumes the square loss, does not apply. Instead, we used the method recently proposed by Yang et al. [56] to estimate the bias and the variance of networks trained with the cross-entropy loss. The average over the data distribution is approximated by splitting the CIFAR training sets into five disjoint subsets containing 10 000 images each and training the networks from scratch on each of them. We use the same regularization for all the networks, namely that of the largest architectures, with weight decay equal to 5·10^{-4} and label smoothing equal to 0. We repeat the procedure 4 times, for a total of 20 training runs for each network width, as described in Ref. [56]. We show the test-loss curves as well as the squared bias and the variance. As expected, the bias of the models decreases as we add parameters and make the model more flexible. The variance of the models initially grows with width, reaching its peak at W_peak = 32 and 64 for Wide-ResNet28 trained on CIFAR10 and CIFAR100 (a, b), and at W_peak = 88 for DenseNet40 trained on CIFAR10 (c). As we increase the width further, the variance decreases, allowing the model to generalize better and better and defying the classical bias-variance trade-off. The clone sizes w*_c for these architectures are slightly above the widths at which the variance peaks: w*_c = 64 and 128 for Wide-ResNet28 trained on CIFAR10 and CIFAR100 (compare Figs. 3 and 4), and w*_c = 170/250 for DenseNet40 (Fig. S2). In all cases, the onset of the clones occurs at a width that is approximately two times larger than W_peak, similar to the width at which an architecture of size w*_c interpolates the training set.
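A sketch of the cross-entropy bias-variance estimate in the spirit of Yang et al. [56]; the decomposition uses the normalized geometric mean of the predictive distributions, and all names and the array layout are our assumptions.

```python
import numpy as np

def bias_variance_ce(log_p, y):
    """Cross-entropy bias-variance decomposition à la Yang et al. [56].

    log_p: (runs, samples, classes) log-probabilities predicted by
           networks trained on independent data splits / seeds.
    y:     (samples,) integer labels.
    """
    # pi_bar: normalized geometric mean of the predictive distributions
    log_pbar = log_p.mean(axis=0)
    log_pbar -= np.log(np.exp(log_pbar).sum(axis=-1, keepdims=True))

    n = np.arange(len(y))
    bias = -log_pbar[n, y].mean()  # cross-entropy of pi_bar vs. the labels

    # variance = E_runs[ KL(pi_bar || pi_hat) ], averaged over samples
    kl = (np.exp(log_pbar)[None] * (log_pbar[None] - log_p)).sum(axis=-1)
    return bias, kl.mean()
```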

Figure 2: Scaling of the test error with width for various DNNs. The average test error of neural networks with various architectures approaches the test error of an ensemble of such networks as the network width increases. The network size shown here is the width of the final representation. For large width, we find a power-law behavior error − error∞ ∝ W^{-1/2} across data sets and architectures. Full experimental details in Sec. 2.1.

Figure 3: Scaling of the test error of chunks of neurons extracted from the final hidden representation of wide NNs. We plot how the test error of chunked networks approaches error∞, the error of an ensemble of 20 networks of the widest size (e.g. W = 1024 for CIFAR100), as the chunk size w_c increases. Chunks are formed by selecting w_c neurons at random from the final hidden representation of the widest networks: an FC5 on pMNIST (width W = 512), and Wide-ResNet-28s for CIFAR10 (W = 512) and CIFAR100 (W = 1024). The shaded regions indicate where the error of the chunks with w_c neurons decays as w_c^{-1/2}.

Figure 4: The three signatures of representation redundancy. (i) The training errors of the full networks (blue) and of the chunks taken from the widest network (orange) approach zero beyond a critical width/chunk size, respectively (panels a-c). (ii) The final representation of the widest network can be reconstructed from a chunk using linear regression (1) with an explained variance R² close to 1 (blue lines in panels d-f). (iii) The residuals of the linear map can be modeled as independent noise: we show this by plotting the mean correlation of these residuals (green lines, panels d-f), averaged over 100 reconstructions starting from different chunks. A low correlation at high R² indicates that the chunk contains the information of the full representation with some statistically independent noise. Experimental setup: FC5 on pMNIST, Wide-ResNet-28 on CIFAR10/100. Full details in Methods, Sec. 2.1.

Figure 5: The onset of clones during training. a: As in Fig. 4, we show the mean correlation of the residuals of the linear reconstruction of the final representation from chunks, but this time as a function of training epochs. A small correlation indicates that the reconstruction error in going from chunks to the final representation can be modeled as independent noise. Data obtained from the same WR28-8 trained on CIFAR10 as in Fig. 4. b: Training error during training for chunks of different sizes. After the network has reached zero training error at ∼160 epochs, continuing to train improves the training accuracy of the chunks. c: Test and training error during training for the full network. Between epochs 160 and 180, the clones of the full network progressively achieve zero training error. In the same epochs, one observes a small improvement in the test error.

Figure 6: A network trained without weight decay on CIFAR10. a: The test error of chunks of a Wide-ResNet28-8 trained without weight decay (blue) and with weight decay (orange, taken from Fig. 3-b). b: Mean correlation between residuals of the linear reconstruction of the full representation from chunks of different sizes for two networks: one trained without weight decay (thick lines) and one trained with weight decay (thin lines, same data as in Fig. 5-a).

Figure 7: ResNet50 trained on ImageNet. a: ImageNet training error as a function of the ResNet50 width. b: Decay of the test error as a function of the network width (blue) and of the chunk size for chunks of the widest ResNet50 (orange), towards the error of an ensemble of ResNet50-4s. The ensemble consists of four networks. c: Mean correlation (see Sec. 2.2) of the residuals of the linear map from a chunk of the last hidden representation to the full representation. The network analyzed is a ResNet50-4.

Figure S1: Training dynamics on CIFAR100. a: As in Fig. 4, we show the mean correlation of the residuals of the linear reconstruction of the final representation of a Wide-ResNet28-8 from chunks, but this time as a function of training epochs. A small correlation indicates that the reconstruction error in going from chunks to the final representation can be modeled as independent noise. b: Training error of chunks of a Wide-ResNet28-8 and of its full-layer representation. From epoch 150 to epoch 185, the training error of the chunks with size 128/256 decreases below 0.5%, while for smaller chunk sizes it remains above 5%. Random chunks with sizes larger than 128/256 can fit the training set, thus having the same representational power as the whole network on the training data. For W > 128/256, the test accuracy decays approximately with the same law as that of independent networks with the same width (see Fig. 3). This picture suggests that for CIFAR100 the size of a clone is 128/256, slightly larger than the size of the clones in CIFAR10. c: Training and test error dynamics for the same Wide-ResNet28-8. After epoch 150, the training error of the full network remains consistently smaller than 0.1% (orange profile), while the test error continues to decrease until epoch 185, from 0.194 to 0.1765 (blue profile). In the same range of epochs (150-185), the training error of the smaller chunks decreases appreciably (see panel b).

Figure S2: A DenseNet40 architecture. a: Decay of the test error of independent networks (blue) and of chunks of the widest network (orange) towards the error of an ensemble average of ten of the widest networks (DenseNet40-BC, k = 128). b: Blue profile: R² coefficient of the ridge regression from a chunk of w_c neurons (x-axis) to the full-layer representation. Green profile: mean correlation of the residuals of the mapping, as described in Sec. 2.2. c: Training error of various DenseNet40s of increasing width (blue) and of chunks of the widest architecture (orange). d: Mean correlation of the residuals of the linear reconstruction of the final representation from chunks of a given size for a DenseNet40-BC (k = 128) during training. e: Training error dynamics of chunks of a DenseNet40-BC (k = 128). f: Training and test error dynamics of a DenseNet40-BC (k = 128).

Figure S3: A DenseNet40 without regularization. A DenseNet40-BC (k = 128) trained on CIFAR10 without weight decay. This experiment reproduces on a DenseNet the analysis shown for a Wide-ResNet28 in Sec. 3. It shows that a: also in a DenseNet architecture that is not well regularized, error − error∞ decays faster than w_c^{-1/2}, and b: the mean correlation of the residuals does not decrease during training. The thin profiles of panel b are the same as those shown in Fig. S2-d.

Figure S4: Impact of the width of the intermediate layers. We study how the scaling of the test error is affected (a) by increasing the width of the intermediate representations while keeping the width of the last layer constant, or (b) by increasing the last-layer width while keeping the width of the rest of the network constant. In S4-a we trained DenseNet40s on CIFAR10 with an additional 1 × 1 convolution to keep the number of output channels fixed at 128. Figure S4-a shows that increasing the width of the intermediate layers makes the test accuracy of the full network decay approximately as w_c^{-1/2}, even when the width of the final representation is fixed. A bottleneck of 128 channels makes the clones much smaller: the orange profile shows that a strong deviation from the w_c^{-1/2} law can be seen for chunk sizes smaller than 16 (vs. 350 in Fig. 1-b of the main paper). We also verified that 16 random neurons are sufficient to interpolate the training set (error < 5·10^{-3}) and that the R² coefficient of the fit to the full layer is 0.912 (0.98 for chunk size = 32). The phenomenology described in the manuscript therefore applies also when a bottleneck of 128 channels is added at the end of the network. In a second experiment, we trained a WR28-1 increasing only the number of channels in the last layer. We modified the number of output channels of the last block of conv4 and analyzed the representation after average pooling, as we did in the other experiments. The network was trained for 200 epochs using the same hyperparameters and protocol described in Sec. 2. Figure S4-b shows that the test error of the modified WR28-1 is approximately constant (blue profile). On the contrary, when we increase the width of the whole network, the test error decays to the asymptotic test error with an approximate scaling of w^{-1/2} (orange profile).

Table 1: Test accuracy (average over four runs).