The twin peaks of learning neural networks

Recent works demonstrated the existence of a double-descent phenomenon for the generalization error of neural networks, where highly overparameterized models escape overfitting and achieve good test performance, at odds with the standard bias-variance trade-off described by statistical learning theory. In the present work, we explore a link between this phenomenon and the increase of complexity and sensitivity of the function represented by neural networks. In particular, we study the Boolean mean dimension (BMD), a metric developed in the context of Boolean function analysis. Focusing on a simple teacher-student setting for the random feature model, we derive a theoretical analysis based on the replica method that yields an interpretable expression for the BMD, in the high dimensional regime where the number of data points, the number of features, and the input size grow to infinity. We find that, as the degree of overparameterization of the network is increased, the BMD reaches an evident peak at the interpolation threshold, in correspondence with the generalization error peak, and then slowly approaches a low asymptotic value. The same phenomenology is then traced in numerical experiments with different model classes and training setups. Moreover, we find empirically that adversarially initialized models tend to show higher BMD values, and that models that are more robust to adversarial attacks exhibit a lower BMD.


Introduction
The evergrowing scale of modern neural networks often prevents a detailed understanding of how predictions relate back to the model inputs [Sejnowski, 2020].While this lack of interpretability can hinder adoption in sectors with a high impact on society [Rudin, 2019], the impressive performance of neural network-based models in fields like natural language processing [Vaswani et al., 2017, OpenAI, 2023, Touvron et al., 2023], computational biology [Jumper et al., 2021] and computer vision and image generation [Ramesh et al., 2022, Rombach et al., 2022] have made them the de-facto standard for many real-world applications.This tension has motivated a large interest in the field of explainable AI (XAI) [Montavon et al., 2018, Guidotti et al., 2018, Vilone and Longo, 2020].
Deep learning models, which by now can feature hundreds of billions of parameters [Brown et al., 2020], seemingly defy the notion that increasing model complexity should decrease generalization performance.Counter to what one would expect from statistical learning theory [Vapnik, 1999], the observation has been that larger -heavily overparameterized-models often perform better [Neyshabur et al., 2017].This has led to the question how complex the function represented by an overparameterized neural network is after training.Many lines of research suggest that neural network models are biased towards implementing simple functions, despite their large parameter count, and that this implicit bias is crucial for their good generalization performance [Valle-Perez et al., 2018].The general problem of measuring the complexity of deep neural networks has given rise to several complexity metrics [Novak et al., 2018] and studies on how they relate to generalization [Jiang et al., 2019].
Connected to this, recent studies [Geiger et al., 2019, Belkin et al., 2019] on the effect of overparameterization in neural networks led to the rediscovery of the "double descent" phenomenon, first observed in the statistical physics literature [Opper, 1995], which is the observation that when increasing the capacity of a neural network (measured, for example, by the number of parameters) the generalization error shows a sudden peak around the interpolation point (where approximately zero training error is achieved), but then a second decrease towards a low asymptotic value is observed at higher overparameterization.
In the present work, we study the double descent phenomenon under a notion of function sensitivity based on the mean dimension [Hahn et al., 2022, Hoyt andOwen, 2021].The mean dimension yields a measure of the mean interaction order between input variables in a function, and can also be proved to be related to the variance of the function under local perturbations of the input features.While this notion originated in the field statistics [Liu and Owen, 2006], several computational techniques have been proposed for its estimation in the context of neural networks.One of the main obstacles, however, comes from trying to characterize the sensitivity of the function over an input distribution that is strongly structured and not fully known.
In this paper, we propose to focus on the study of the Boolean mean dimension [O'Donnell, 2014] (BMD), which involves a simple i.i.d.binary input distribution.We show how the BMD can be estimated efficiently, and provide analytical and numerical evidence of the correlation of this metric with several phenomena observed on the data used for training and testing the model.

Overparameterization and Double Descent
Several studies [Baity-Jesi et al., 2018, Geiger et al., 2019, Advani et al., 2020] confirmed the robustness of the double descent phenomenology for a large variety of architectures, datasets, and learning paradigms.An analytical study of double descent in the context of the random feature model [Rahimi et al., 2007] was conducted rigorously for the square loss in [Mei and Montanari, 2019] and for generic loss by [Gerace et al., 2020] using the replica method [Mézard et al., 1987].Double descent has then later found also in the context of one layer model learning a Gaussian mixture dataset [Mignacco et al., 2020]; similarly to the random feature model, the peak in the generalization can be avoided by optimally regularizing the network.In this context in [Baldassi et al., 2020] it was also shown that choosing the optimal regularization corresponds to maximize a flatness-based measure of the loss minimizer.A range of later studies further explored this phenomenology in related settings [d'Ascoli et al., 2020, Gerace et al., 2022].
Different scenarios have also been shown to give rise to a similar phenomenology, such as the epoch-wise double descent and sample non-monotonicity [Nakkiran et al., 2021] and the triple descent that can appear with noisy labels and can be regularized by the non-linearity of the activation function [d'Ascoli et al., 2020].
In this work, we connect the usual double descent of the generalization error with the behavior of the mean dimension, which is a complexity metric that can be evaluated without requiring task-specific data.

Mean Dimension and Boolean Mean Dimension
The mean dimension (MD), based on the analysis of variance (ANOVA) expansion [Efron andStein, 1981, Owen, 2003], can be intuitively understood as a marker of the complexity of a function due to the presence of interactions between a large set of input variables.
The mean dimension has been used as a tool to analyze and compare for example neural networks [Hoyt andOwen, 2021, Hahn et al., 2022] and, with a slightly different definition, also generative models of protein sequences [Feinauer and Borgonovo, 2022].The MD has the advantage that it can be calculated for a black-box function, without regard to the internal mechanism for calculating the input-output relation.One major drawback, however, is the intense computational cost associated with its direct estimation.This computational limitation has led to the proposal of several approximation strategies [Hoyt andOwen, 2021, Hahn et al., 2022].In some special cases, the mean dimension can be explicitly expressed as a function of the coefficients of a Fourier expansion, as seen from the relationship between the Boolean Mean Dimension (BMD) and the total influence [O'Donnell, 2014] defined in the analysis of Boolean functions (see below), and its generalization [Feinauer and Borgonovo, 2022] for functions with categorical variables.

Mean Dimension
In the next paragraphs, we first provide a general mathematical definition of the mean dimension for a square-integrable function with real-valued input distribution.We then specialize to the case of a binary input distribution and define the Boolean Mean Dimension (BMD), which will be the main quantity investigated throughout this paper.Finally, we will discuss how to efficiently estimate the MD and the BMD through a simple Monte Carlo procedure.

Mathematical Definition
To give a proper mathematical definition of the mean dimension, for a real-valued function f (x) of n variables f : R n → R, it is convenient to introduce some notation that will be used in the rest of the paper.We will denote the set of indexes {1, . . .n} by [n].We define x u the set of input variables x i , with i ∈ u ⊆ [n] and by x \u the set of variables for which i / ∈ u.We will also assume that x is drawn from a distribution p(x).The basic idea of the mean dimension [Hahn et al., 2022] is to derive a complexity measure for f from an expansion of the type where the "components" f u (x u ) can be computed from the following recursion relation It can be shown that coefficients of the expansion have zero average if u is non empty where we have denoted by p u (x u ) the marginal probability distribution over the set u.Moreover, they satisfy orthogonality relations, namely Using those relations we can write the variance of the function as a decomposition of 2 n − 1 terms where The mean dimension M f is then defined as [Hahn et al., 2022] i.e. a weighted sum over possible interactions, with each subset of inputs contributing based on how much they influence the variance.

Pseudo-Boolean Functions and Fourier coefficients
We now derive an explicit expression for the mean dimension of n-dimensional pseudo-Boolean functions taking values on the real domain, f : {−1, 1} n → R under the assumption of input features that are i.i.d.from {−1, 1}.
Denoting by s ∈ {−1, 1} n the n-dimensional binary input of f , such a function can be uniquely written as a Fourier expansion [O'Donnell, 2014] in terms of a finite set of Fourier coefficients fu , u ⊆ [n] as where represent the Fourier basis of the decomposition that are orthonormal ⟨χ u (s)χ v (s)⟩ = δ u,v with respect to the uniform distribution over {−1, 1} n , where we use the notation The Fourier coefficients fu can give information about the moments of the function f with respect to the uniform distribution (10) over s; for example the first moment is whereas the variance can be obtained as We can quantify the contribution c k of interaction of order k to the variance of f (s) as the ratio Notice that k c k = 1, so that c k can be interpreted as a (discrete) probability measure over interactions.The mean dimension of f can then be written as the mean interaction degree when weighted according to it contribution to the variance, i.e. as a weighted sum of feature influences divided by the total variance of the function, so This expression is equivalent to Eq. ( 7) for pseudo-Boolean functions under the assumptions that all features are i.i.d from {−1, 1}.The expression connects the notion of simplicity in terms of variance contributions to the same notion in terms of explicit expansion coefficients.Intuitively, a large mean dimension is indicating that the function fluctuates due to a large contribution of high-degree interactions.

Estimating the Mean Dimension through Monte Carlo
The expression of the mean dimension in (7) involves a sum over all the set of subsets of n variables, and its numerical evaluation through a brute-force approach would be intractable in high dimension.However, it can be shown that a more efficient evaluation scheme of equation ( 7), can be achieved through a Monte Carlo approach [Liu and Owen, 2006].First, the MD can be rewritten as a sum over the n input components: where the influence of the i-th input component τ i is defined as: and where we have denoted by x ⊕i a vector x with a resampled i th coordinate.We show an original proof of this identity in Appendix A.
Note that the definition of the MD for a generic input distribution in Eq. ( 16), entails a resampling procedure that presumes knowledge of the conditional distribution of a pixel given the rest of the pixel values.In the general case, this pixel is to be resampled multiple times from this conditional distribution, to compute the variance of the function under this variation of the input.This conditional distribution, however, is not a known quantity for a real dataset.For this reason, for example, some authors have proposed an "exchange" procedure, where one randomply samples a different pixel value observed in the same dataset [Hahn et al., 2022], however this approximation neglects the within sample correlations.
Expression ( 16) can be specialized to the case of binary i.i.d.inputs, where one can identify the influence functions τ 2 i with the discrete derivatives: where D i f (s) denotes the i th (discrete) derivative of f (s), i.e.
and measures the average sensitivity of the function to a flip of the i th variable.The sum of the influences i (D i f (s)) 2 is known in the field of the analysis of pseudo-Boolean functions as total influence of f [O'Donnell, 2014].In terms of the Fourier expansion, we have Therefore computing the mean dimension for pseudo-Boolean functions boils down to querying the function f on uniformly sampled binary sequences of length n − 1.

Boolean Mean Dimension
In the general case, the underlying input distribution of the training dataset is not known and estimating the MD on this distribution becomes unfeasible.In the present work, we propose employing the estimation procedure presented in the last section, based on binary sequences, as an easily computable proxy of the sensitivity of the neural network function.In order to distinguish this proxy from the mean dimension over the dataset distribution, we call the resulting quantity the Boolean Mean Dimension (BMD).We show in the results below that the BMD can in some cases be computed analytically, and that it is qualitatively related to the generalization phenomenology in neural networks.

Analytical results
We now derive an analytic expression for the mean dimension in the special case of the random feature model [Rahimi et al., 2007, Goldt et al., 2019, Loureiro et al., 2021, Baldassi et al., 2022], focusing on the same high dimensional regime where the double descent phenomenon can be detected.In the next sections, we will define the model, the learning task and the high dimensional limit precisely, and we will sketch the analytical derivation of the expression for the Boolean Mean Dimension.

Model definition and learning task
The random feature model (RFM) is a two-layer neural network with random and fixed first-layer weights (also called features) and trainable second-layer weights.Given a D-dimensional input, x ∈ R D , and denoting by F ∈ R D×N the D × N frozen feature matrix, the pre-activation of the RFM is given by: where w is an N -dimensional weight vector and σ is a (usually non-linear) function.The parameter N indicates the number of features in the RFM and can be varied to change the degree of over-parametrization of the model.As in [Baldassi et al., 2022], we will hereafter focus on the case of i.i.d.standard normal distributed feature components F ki ∼ N (0, 1), although the formalism allows for a simple extension to a generic fixed feature map, under a simple weak correlation requirement (see [Gerace et al., 2020, Loureiro et al., 2021] for additional details).We consider a classification task defined by a training dataset of size P , denoted as D = {x µ , y µ } P µ=1 .The inputs are assumed to be i.i.d. with first and second moments fixed respectively to Ex i = 0 and Ex 2 i = 1.Note that, for example, both binary input components x i ∈ {−1, 1} and Gaussian components x i ∼ N (0, 1) satisfy the above assumption.The binary labels y µ ∈ {−1, 1} are assumed to be produced by a "teacher" linear model w T ∈ R D , with normalized weights on the D-sphere ∥w T ∥ 2 2 = D, according to: The learning task is then framed as an optimization problem with generic loss function ℓ and ridge regularization where λ is a positive external parameter controlling the regularization strength.In the following we will consider the two most common convex loss functions, namely the mean squared error (MSE) and the cross-entropy (CE) losses, defined as We analyze the learning problem in the high-dimensional limit where the number of features, input components and training-set size diverge N , D , P → ∞ at constant rates α ≡ P/N = O(1) and α D ≡ D/N = O(1).In this limit, strong concentration properties allow for a deterministic characterization of the above-defined learning problem in terms of a finite set of scalar quantities called order parameters.In the next sections, and in detail in the appendices, we will sketch the derivation of this reduced description.

Rephrasing the problem in terms of the Boltzmann measure
The learning task in ( 22) can be characterized within a statistical physics framework.One can introduce a probability measure over the weights w in terms of the Boltzmann distribution where β is the inverse temperature, the loss function in ( 22) plays the role of an energy, and the partition function Z β is a normalization factor that reads The distribution p β (w; D) can be interpreted in a Bayesian setting as the posterior distribution over the weights w given a dataset D, and ( 24) corresponds to Bayes theorem, where the term , corresponds to the likelihood and e − βλ 2 ∥w∥ 2 2 is the prior distribution over the weights.
In the zero-temperature limit, when β → ∞, the probability measure p β (w; D) concentrates on the solutions to the optimization problem in (22).To characterize the typical (i.e. the most probable) properties of these solutions, one needs to perform an average over the possible realizations of the training set D and of the features F , computing the free-energy of the system The computation of this "quenched" average can be achieved via the replica method [Mézard et al., 1987] from spin-glass theory, which reduces the characterization of the solutions of ( 22) to the determination of a finite set of scalar quantities called order parameters [Engel andVan den Broeck, 2001, Malatesta, 2023].
In appendix B.1, we sketch the replica calculation for the free energy, first presented in [Gerace et al., 2020], in the simplifying case of an odd non-linear activation σ.

Analytical determination of the BMD in the RFM
We now derive an analytic expression for the Boolean Mean Dimension (BMD) which can be efficiently evaluated for a trained RFM.The definition (15) reads where ⟨•⟩ and x ⊕k , defined in ( 10) and ( 16), entail an expectation over i.i.d.uniform binary inputs.In appendix B.2, we perform the annealed averages appearing in the numerator and the denominator separately, obtaining the expression: where we defined and the coefficients κ are defined as expectations of derivatives of the activation function over a standard Gaussian measure Dz = e −z 2 /2 √ 2π dz: As we show in the appendix, the above expression ( 28) is universal: evaluating the MD with respect to a different i.i.d.input distributions with matching first and second moments would give exactly the same result.Moreover, note that the evaluation of expression ( 28) no longer involves a Monte-Carlo over the input distribution, with a major gain in computational cost.In appendix B.2, we show the agreement of this compact formula with the computationally more expensive Monte Carlo estimation of the BMD. Figure 1: Generalization error (top panels) and BMD (bottom panels) as a function of the overparameterization degree 1/α = N/P , for fixed α T = P/D = 3 and with σ = tanh.The left and right panels represents respectively the case of the MSE and of the CE loss.Several value of the regularization λ are displayed, together with the optimal one (which was found by minimizing the generalization error for each value of α, see red dashed line).As it can be seen in both plots, for small regularization λ, the location of the peak in the generalization error exactly coincides with the one in the BMD (vertical dashed lines).As one increases the regularization the peak in the both the generalization and the BMD is milded.
The mean dimension therefore explicitly depends on the model parameters w.The evaluation of the typical BMD of a trained RFM can thus be computed by taking an expectation over the zero-temperature Boltzmann measure for the weights derived in the replica computation, is thus used to indicate an average over the posterior distribution in equation ( 24), in the large β limit.
In the case of the replica computation for an odd activation function, that we reported in appendix B.1, one can simplify further expression (28) by recognizing that κ1 = 0 and that Ω ii = 1 when the feature components have second moment equal to 1.In this case, the numerator and the denominator can be directly expressed in terms of the order parameters of the model: where The order parameters q d , p d can be computed by solving saddle point equations as shown in Appendix B.1.Notice that in the case of a linear activation function the BMD is always 1 since a flip in the inputs will induce always the same response.
In Fig. 1 we show the plot of the generalization error and the corresponding BMD of the RFM at a fixed α T , as a function of 1/α for the MSE (left panels) and CE loss (right panels).As shown in [Gerace et al., 2020], for small regularization λ the generalization error develops a peak approximately where the model starts to fit all training data.In the case of the MSE loss, this threshold is often called interpolation threshold and it is located at N = P .When using the CE loss, this happens when the projected data become linearly separable and the exact location of the threshold strongly depends on the input statistics and features.Exactly in the correspondence of the generalization error peak the BMD displays its own peak, meaning that the function implemented by the network is more sensitive to perturbation of the inputs.
An interesting insight can be deduced from the behavior of the BMD at the optimal value of regularization for the RFM (dashed red curves in Fig. 1).While the generalization error becomes monotonic as the over-parametrization is increased, the BMD still reaches a peak at first and then descends to 1 only in the kernel limit N/P → ∞.This might be surprising since the ground-truth linear model, the teacher, has BMD equal to 1 and one would expect the best generalizing RFM to achieve the best possible approximation of this function and therefore to match its BMD.However, blind minimization of the BMD is not compatible with good generalization, as seen from the performance of the RFM with very large regularization λ.The explanation of this comes from the architectural mismatch between the linear teacher and the RFM: according to the GET the RFM learning problem is equivalent to a linear problem with an additional noise with an intensity regulated by the degree of non-linearity of the activation function [d'Ascoli et al., 2020].This noise initially forces the under-parameterized RFM to overstretch its parameters to fit the data, causing an increased sensitivity to input perturbations.As the over-parameterization is increased, the RFM becomes equivalent to an optimally regularized linear model [Gerace et al., 2020] and the BMD slowly drops to 1 in this limit.
Note that in the large dataset limit, when α, α T → ∞ with α D = O(1), a secondary peak for the BMD of the RFM emerges around α D = 1, i.e. when the number of parameters of the RFM is the same as the number of input features.This peak is caused by the insurgence of singular values in the spectrum of the covariance matrix Ω and is more accentuated at lower values of the regularization.Since modern deep networks operate in a completely different regime from the large dataset limit specified above, we expect this secondary peak not to be visible in realistic settings.For example, in the above plots in the low regularization regime, this peak is overshadowed by the main BMD peak.We analyze this phenomenology in detail in appendix C.

Numerical results
In the following subsections, we explore numerically the robustness of the BMD phenomenology analyzed in the RFM, considering different types of data distribution, model architecture and learning task.
Furthermore, we show that adversarially initialized models also display higher BMD, and that the increased sensitivity associated with a large BMD can hinder the robustness of the model against random perturbations of the training inputs.
Finally, we show that the location of the BMD peak is robust to the choice of input statistics used for its measurement, even in non-i.i.d.settings.

Experimental setup
In the following subsections, each panel displays the performance of a large number of different model architectures with varying degree of over-parameterization, trained on different datasets.Except where specified otherwise, all model are initialized with the common Xavier method [Glorot and Bengio, 2010] and use the Adam optimizer [Kingma and Ba, 2014], with batch size 128 and learning rate 10 −4 .No specific early stopping criterion is implemented.As in other works analysing the double descent, we experiment with different levels of uniformly random label noise during training (which is introduced by corrupting a random fraction of labels), which tends to make the double descent peak more pronounced [Nakkiran et al., 2021].We discuss the effect of label noise below.

Model architectures
We consider different types of model architectures: • Random feature model (RFM), described above, where the number of hidden neurons in the first (fixed) layer controls the degree of over-parameterization.
• Two-layer fully-connected network (MLP) with tanh activation, where the number of hidden neurons in the first layer controls the degree of over-parameterization.
• ResNet-18: a family of minimal ResNet [He et al., 2016] architectures based on the implementation of [Nakkiran et al., 2021].The structure is finalized with fully connected and softmax layers.As in [Nakkiran et al., 2021], we control the over-parameterization of the model by changing the number of channels in the convolutional layers.Namely, the 4 ResNet blocks contain convolutional layers of widths [k, 2k, 4k, 8k], with k varying from 1 to 20.
Both RFM and two-layer fully connected networks in our experiments use hyperbolic tangent activation functions and have weights initialized from a Gaussian distribution and bias terms initialized with zeros.The loss function optimized during training is the cross-entropy loss with L 2 regularization (the intensity of the regularization is set to zero if not specified otherwise).

Data preprocessing
In the following experiments, we use continuous inputs during the training of the models, normalizing the input features to lie within the [−1, 1] interval.While such normalizations are common in preprocessing pipelines, here this procedure has also the benefit of matching the range of variability of the training inputs with that of the randomly i.i.d.sampled binary sequences used to estimate the BMD.We explore the effect of different normalization ranges in Appendix Sec.E.

MD and generalization peaks as a function of overparametrization
In Fig. 2 we show train and test error, and the BMD for an RFM trained with and without label noise on binary MNIST (even vs odd digits) as a function of the hidden layer width.In Fig. 3, we instead consider a two-layer MLP trained on 10-digits MNIST (varying width) and a ResNet-18 trained on CIFAR10 (varying number of channels), both with label noise.In the multi-label case, we are defining the BMD of the network as the average of the BMDs over the classes, where the output of the network is a vector of predicted log-probabilities for each class (i.e.there is a log-softmax activation in the last layer).

Position of the BMD peak
The BMD displays a peak around the point where the number of parameters of the model allows it to reach zero training error, in close correspondence with the generalization error peak.We find this phenomenology to be robust with respect to the model class, the dataset, and the over-parameterization procedure.Notice however, that standard optimizers based on SGD are able to implicitly regularize the trained models and can strongly reduce the peaking behavior, as already observed in the context of double descent.In the presented figures we introduced label-noise, which ensures the presence of over-fitting and is thus able to restore both peaks.An important observation is that, in order to see this phenomenology, it is not necessary to account for the training input distribution for the evaluation of the MD, which would not be possible in the case of real data.In fact, in the over-fitting regime, it is possible to detect an increased sensitivity of the neural network function for multiple input distributions, including the i.i.d.binary inputs entailed in the BMD evaluation.This is explored further in sub-section 5.6.
Asymptotic behavior of the BMD When the degree of parametrization of the model is further increased, the BMD decreases and settles on an asymptotic value.The decrease of the BMD in the number of parameters is faster with lower label noise, see Fig. 2 (left panel vs. right panel).The asymptotic value, reached in the limit of an infinite number of parameters, is taskand model-dependent.For example, in Fig. 2, the functions learned by the RFMs no longer approximate a linear model (BMD equal to 1), and are instead bound to higher values of the BMD.

Visibility of the BMD peak and Label Noise
The double-descent generalization peak can be a very subtle phenomenon when the learning task is too coherent and the noise level in the data is too weak.With this type of data, the phenomenon can be made more evident [Nakkiran et al., 2021] by adding label noise to the training data.This strategy naturally reduces the signal-to-noise ratio and increases the over-fitting potential during training.The BMD peak, however, seems to be easily identifiable even with zero label noise, (see left panel of Fig. 2) where the generalization peak is less pronounced.Note that the BMD does not require any data (neither training nor test) in order to be estimated, so it can be used as a black-box test for assessing the proximity to the separability threshold and therefore as a signal of over-fitting.

Impact of regularization
It has been shown that regularizing the model weakens the doubledescent peak and that, at the optimal value of the regularization intensity, the generalization error smoothly decreases with the degree of over-parameterization.Similarly, the BMD peak can be dampened by adding stronger regularization, as shown in Fig. 4.

BMD and Training Set Size
In this section, we investigate the effect of varying the number of training samples for a fixed model capacity and training procedure.By increasing the number of training samples, starting from a low number, the same model can switch from being over-to under-parameterized.Therefore increasing the number of training samples has two effects on the test error curve: on the one hand, increasing the number of training samples decreases the test error, shifting the test error curve mostly downwards.On the other hand, increasing the number of training samples increases the capacity at which the double descent peak occurs since a higher capacity is needed until the training set is effectively memorized.This shifts the test error curve (and the BMD curve) to the right.This effect can be seen in Figure 5.

BMD and Adversarial Initialization
In this section, we analyze the BMD of two-layer fully connected networks under adversarial initialization [Liu et al., 2020] on the MNIST dataset.This initialization scheme can be used to artificially hinder the generalization performance of the model, forcing it to converge on a bad minimum of the loss.We here aim to show that the initialization has also an effect on the BMD of the model, increasing the sensitivity of the network.
The adversarial initialization protocol works as follows.We train a two-layer fully connected network in two different phases: in the first phase, we push the network towards an adversarial initialization by pretraining the model with 100% label noise for a fixed amount of epochs; in the second phase, we train the model on the original dataset, with no label noise, for 200 epochs.The resulting plot, in Fig. 6 (left panel), represents an average over 15 different realizations of the experiment and shows the effect of the length of the pretraining phase on both generalization performance and BMD of the network.In agreement with our analysis, we observe a simultaneous increase of the two metrics when the adversarial initialization phase is longer and the network is driven towards worse generalization., and test error (violet points) estimated for a two-layer fully connected network of width 10 3 , trained according to the adversarial initialization protocol described in section 5.3 on 20K samples of the MNIST dataset.On the horizontal axis we vary the number of pretraining epochs, and plot the corresponding increase in the generalization error and the BMD of the model after the second learning stage.The points represent an average of 15 different realizations of the experiment.(Right): BMD (orange line) and counts (turquoise line) estimated for a two-layer fully connected network trained on the MNIST dataset using 20K train samples with 20% label noise and tested on 5K samples.Counts represent the average amount of sign flips of random pixels of a correctly predicted test image that are necessary to fool the model to a wrong class label.The amount is averaged over all the correctly predicted test data samples.The resulting plot represents an average of 40 different realizations of the experiment.We observe that higher values of the BMD correspond to lower robustness of the model and vice versa.

BMD and Robustness Against Adversarial Attacks
In this section, we analyze the connection between BMD of a model and its robustness to adversarial attacks.We consider a two-layer fully-connected network trained on MNIST with 10 classes.We define as our robustness measure the average count of sign flips of randomly chosen pixels, needed to change the model prediction on a test sample that was previously classified correctly.The lower the counts, the lower the robustness of the model.Varying the capacity of the model by varying the width of the hidden layer, we plot this robustness measure against the BMD of the model in Fig. 6 (right panel).We observe that BMD and robustness strongly anti-correlate, with the peak in BMD coinciding with a minimum of robustness.

Pixel-Wise Contributions to BMD
The MD as expressed in eq ( 15) is proportional to a sum of contributions τ 2 i of single features indexed by i.Similar to [Hahn et al., 2022], we plot these contributions in Fig. 7 as a heatmap, where the bright spots indicate features that contribute strongly to the MD.We show four heatmaps, corresponding to different capacities and at different distances from the BMD peak, for a two-layer fully connected network trained on MNIST.
Note that the colors are normalized to the [0, 1] range, so that very bright spots correspond to pixels that contribute to the BMD the most.It can be seen that for under-parametrized networks few pixels give the largest contribution to the BMD.Near the BMD peak, a large fraction of the pixels in the center of the image dominate the BMD, and for even larger capacities we again have fewer pixels with maximal values.This can be interpreted as the classifier losing "focus" at the interpolation point and paying attention to fewer patterns in the over-parametrized regime.
Figure 7: Heatmaps of the pixel contributions (τ 2 i for 1 ≤ i ≤ 784) estimated on the two-layer fully connected network trained on 20K samples of the MNIST dataset with 10 classes, 20% label noise and normalized to lie within [0, 1] interval.The (rescaled) participation ratio is defined as After the rescaling, a participation ratio of 1 indicates a uniform distribution of pixel contributions, while a value of n indicates a distribution concentrated over a single pixel.The heatmaps correspond to the contributions estimated with respect to label 0 for the models of different capacities (hidden layer dimensions) and represent only one seed, while the resulting curves on the plot represent an average over 20 different runs of the experiment.

Different Distributions for Estimating BMD
In BMD estimates for the previous experiments, Eq. ( 16), we focused on the case of i.i.d.binary input features.In the RFM, however, we have shown analytically that there exists a universality for the MD when one considers separable input distributions with the same first and second moments.In the numerical experiments, we have also shown evidence that the BMD peak can still provide insights into the behavior of the neural network function on the training and test data, which follow very different input statistics.To explore in detail the role of the input statistics, and of the presence of correlations in the input features, we measure the MD by resampling the inputs from different distributions: in Fig. 8 we plot the normalized MD curves for features sampled from: • a uniform binary distribution (BMD).
• a standard normal (Gaussian) distribution N (0, 1).As one can see in Fig. 8, the MD curves estimated with binary and Gaussian i.i.d.inputs, with matching moments, are identical.With the uniform distribution, the second moment is 1/3 and this results in a slightly rescaled MD curve.Introducing correlations in the inputs, in the MD estimated over the training data distribution, the curve still shows a similar behavior, and importantly the peak is found at the same value.The choice of the distribution does not affect the location of the peak.Moreover, distributions, which first two moments coincide (e.g.binary uniform Unif{−1, 1} and Gaussian N (0, 1)) yield the same MD pattern.The resulting plot represents an average over 20 different runs of the experiment.

Discussion
In this work, we analyzed the Boolean Mean Dimension as a tool for assessing the sensitivity of neural network functions.In the treatable setting of the Random Feature Model, we derived an exact characterization of the behavior of this metric as a function of the degree of overparameterization of the model.Notably, we found a strong correlation between the sharp increase of the BMD and the increase of the generalization error around the interpolation threshold.This finding indicates that as the neural network starts to overfit the noise in the data, the learned function becomes more sensitive to small perturbations of the input features.Importantly, while the double descent curve requires test data to be observed, the BMD can signal this type of failure mode by using information from the neural network alone.The same phenomenology appears in more realistic scenarios with different architectures and datasets, where factors influencing double descent, like regularization and label noise, are also found to affect the BMD in similar fashion.Furthermore, we demonstrated that the BMD is informative about the vulnerability of trained models to adversarial attacks, despite assuming an input distribution that is very different from that of the training dataset.
Our study raises intriguing questions regarding the potential applications of BMD for regularization purposes.Another interesting future direction could be to investigate how comparing the BMDs achieved by a highly parametrized neural network trained on different datasets can help assess the effective dimensionality of the training data and the complexity of the discriminative tasks.Finally, it could be interesting to extend the study of the BMD in the RFM in the framework of a polynomial teacher model, recently analyzed in [Aguirre-López et al., 2024].
Proof.To show that we consider: The term in the squared brackets can be computed.We need to distinguish two cases, i.e. i ∈ v, and i / ∈ v.In the first case we get u⊇v,u∋i where δ i,j is the Kronecker delta function.
i.e. we get a non-zero result only if In the following we will denote by x ⊕i a vector x with a resampled i th coordinate.We are now ready to prove the following theorem: Theorem A.3.The mean dimension can be written as where Proof.We can write the numerator of the mean dimension as In the first equality we divided the summation over the sets into a double summation over the size k of the set and a summation over the sets of fixed size k.In the second equality we have used the fact that the summation over k can be interpreted as a summation over the indices i of the variable x; the inner sum can be therefore written as a summation over the sets (of any possible size) that contain the variable i itself.Reminding that and using Lemma A.2 and orthogonality of the coefficients of the ANOVA expansion, we have

B Replica computation of the Boolean Mean Dimension in the random feature model B.1 Free entropy
We review here the replica calculations of the free entropy of the model, defined as This in turn will give the information necessary to compute the BMD.The average over the dataset in ( 47) can be performed using the replica trick In the following we will consider n as an integer and we will denote by a, b as replica indices running from 1 to n.

Gaussian equivalence theorem
In order to compute the average over the input patterns of Z n β , we will apply a central limit theorem [Pennington and Worah, 2017], valid in the thermodynamic limit where N , D, P go to infinity with fixed α ≡ P N and α D ≡ D N .In the statistical physics literature this central limit is often called Gaussian equivalence theorem (GET) [Goldt et al., 2019, Loureiro et al., 2021].It can indeed be shown that the model is equivalent to a Gaussian covariate model [Mei and Montanari, 2019] and the following identification can be made where η i is Gaussian noise with zero mean and unit variance and we have defined the following coefficients with Dz ≡ e −z 2 /2 √ 2π dz.In the following we will considered for simplicity σ(•) to be an odd activation function, so that in this case κ 0 = 0.
Average over the dataset Using GET, we therefore get where the correlation matrix of the n + 1-dimensional multivariate Gaussian N is Here ρ = 1 D D k=1 (w T k ) 2 = 1, since the teacher has fixed norm.In the previous equation we have also denoted by Q ab and M a the following quantities where we have defined the projected student weights in the space of the teacher s a k as Average over Gaussian Features We can enforce the definition of the projected weights in (54) using delta functions and their integral representation.It then becomes easy to perform the average over random Gaussian features.We get a terms of the following form which only depends on the q ab defined in (53b).

Saddle point method
We can now impose the definitions of the order parameters Denoting by ⟨•⟩ D,F the average over both patterns and random features, the final result reads where and M a , Q ab are defined in terms of q ab , p ab , r a in (53).Notice that the integrals inside the "entropic" G S and the G SE tersm can be solved analytically by using multivariate Gaussian integrals identities.
We have also denoted by δQ = κ 2 ⋆ δq + κ 2 1 δp.Again the order parameters δq, δ q, q d , δ Q, δp, δ p, p d , δ P , r, r must be found self-consistently by solving the saddle point equations obtained by differentiating f .

Physical observables of interest
As shown in multiple papers, see e.g.[Gerace et al., 2020, Baldassi et al., 2022] the generalization error can be obtained by computing The training loss can be computed by a derivative with respect to β The test loss can be computed as follows [Baldassi et al., 2022] where ξ ⋆ represents a new extracted pattern, y ⋆ its corresponding label and ŷ⋆ the prediction of the model.

B.2 Analytical determination of the BMD
In the following subsections we will show the derivation of the annealed averages on the inputs of the numerator and denominator of the BMD.

B.2.1 Annealed average of the denominator of the BMD
The denominator in the definition of the BMD is easy to analyze.Applying GET, we obtain ŷ2 (w The average squared is instead simply given by ⟨ŷ(w; Therefore the denominator in the BMD reads ŷ2 (w; x) − ⟨ŷ(w; where Q d is the overlap obtained by the RS saddle point equations sketched in the previous section.

B.2.2 Annealed average of the numerator of the BMD
More work has to be done to compute the numerator of the BMD.We can write it as Taylor expanding the second term we have We then apply the GET and we take the average over the inputs getting where we have denoted for simplicity by κ0 , κ1 , κ2 , κ⋆ the coefficients in equation ( 50) for σ ′ .
Notice that in all the steps that we have done to arrive performing the average over the inputs we have not used anywhere the binary nature of the inputs.It is therefore easy to see that we would have obtained the same result if we had chosen a probability distribution on the inputs with the same first two moments (e.g., a standard normal distribution).Inserting this back into the previous equation, we get where we have introduced ) Therefore the mean dimension for a generic non-linearity σ is

B.2.3 BMD for an odd non-linearity σ
If we assume the activation function to be odd we have κ1 = 0, and κ0 = κ 1 .Therefore we can write the BMD in terms of the order parameters only  Notice that equation ( 79) can be used as an alternative (very efficient) way of computing the BMD, without using Monte Carlo method.We show in Fig. 9 how the difference between the BMD estimated using Monte Carlo and the one using equation ( 79) showing good agreement.
In the limit of 1/α → 0, at fixed α T = P/D (i.e.N → 0), the solution to the saddle point equations show that p d → q d ; in this limit the BMD goes to For example for σ(x) = tanh(x) activation we get M f = 1.1778 as it is displayed in Fig. 1.

C Double peak behavior of the BMD
As can be seen in Fig. 11, if α T is sufficiently large the BMD can display a peak in addition to the one located at the interpolation threshold (N = P for the MSE loss).This secondary peak is located at α = α T , i.e. when the number of parameters is equal to the input dimension N = D.This peak is not present in the generalization.We remark here that this behavior observed in the BMD is reminescent of the triple descent behaviour observed in [d'Ascoli et al., 2020], but is nonetheless different in nature.Indeed, in [d'Ascoli et al., 2020] the triple descent was observed in the test loss, when fixing N and D (i.e.α D = D/N ) and changing P 1 ; the authors observe a peak in the test loss when P = D in addition to the "classical" double descent peak when P = N .This "secondary" peak can be observed only if the activation function is linear σ(x) = x or if the labels are corrupted by Gaussian noise ζ µ ∼ N (0, 1) where ∆ is a non-negative parameter modulating the noise intensity.It is easy to show that the only term to change in the free energy (64) because of the noise is the energetic term which is modified as 1 this is different from our setting where we fix P and D i.e. αT = P/D and change N .We show in Fig. 11 that even if the test loss has a secondary peak when ∆ > 0 at P = D, this peak is not present in the BMD.
In Fig. 12, we show how the regularization λ can not only attenuate the "primary" peak at P = N , but also make the secondary peak disappear at N = D.

C.1 Explanation the secondary peak of the BMD around N = D
The root of this phenomenon can be found in the behavior of the spectrum of the covariance matrix Ω = F T F/D.To see this, we first consider a simplified setting where the phenomenon can be easily analytically traced.Consider a linear regression in the RFM with linear teacher.Calling X p ∈ R P ×N = σ(XF/ √ D) the projected inputs of the RFM, with X ∈ R P ×D , and F ∈ R D×N , the Gaussian Equivalence implies that the learning problem is equivalent to a linear regression with data: with Z ∈ R P ×N and Z ij ∼ N (0, 1).The labels Y ∈ R P are given by a linear teacher w T ∈ R D : The ordinary least square (OLS) estimator gives a closed form solution for the trained weights: By squaring this expression we can get the norm q d N = ∥w∥ 2 of the OLS estimator.We would like to understand the behavior of this quantity when P/N → ∞ and when D/N = α D = O(1) is varied.We can thus perform an annealed average over the dataset, averaging out X, Z, and w T .Since we are going to square the expression, we can take advantage of: and defining Ω = F T F/D we can simplify the expression as: If we now move to the eigenbasis of Ω = V ΛX T , where we also have that F/ √ D = U √ ΛV T , we can write this expression as a trace over the eigenvalues in Λ: Similarly, one can get an expression for the average overlap: So if we now focus on the ratio q d /Q d , which determines the fluctuations of the MD above M D = 1, we can see the impact of the spectrum of Ω, which follows a Marchenko-Pastur law with parameter 1/α D .When α D > 1 the spectrum is continuous and strictly positive, with a minimum eigenvalue ρ − = (1 − 1/α D ) 2 .At α D = 1 the spectrum touches the origin, and then at smaller values of α D (in the overparameterized regime of the RFM) the distribution splits into a delta in 0 with weight 1 − α D and a continuous component with increasing left extremum ρ − = (1 − 1/α D ) 2 and weight α D .Because of the additional ρ i in the numerator of expression (90), when the eigenvalues of Ω approach zero they have a larger effect in q d , therefore the MD reaches a peak.The same relationship between the parameters holds also at finite α and for a generic loss.The corresponding saddle-point equations read: (( m2 + q)κ 2 1 ρ i + qκ 2 ⋆ ) δ q(κ 2 1 ρ i + κ 2 ⋆ ) + λ 2 (91) where the loss function and the number of constraints determine the value of the conjugate parameters m, q, δ q, but the presence of an additional (κ 2 1 ρ i + κ 2 ⋆ ) in the numerator of Q d induces the same behavior of the MD around α D = D/N = 1, independent of the specific setting.
In Fig. 13 we show the plot of the overlaps q d , p d and Q d that confirms the intuitions showed above.

D Self-averaging property of the MD
The MD of the RFM at initialization demonstrates the self-averaging properties, e.g. for the case of i.i.d.gaussian weights that were projected on the space orthogonal to the I D vector.

D.1 Self-averaging of the MD for the trained models
In the

E BMD and Data Normalization
In the Fig. 15, we repeat the BMD calculations for RFMs trained with different ranges used for normalization.As can be seen in the figure the choice of the input data normalization does not affect the BMD pattern, but only its absolute values in this setting.

F Effect of the label corruption on the train error
In the Fig. 16 we demonstrate the effect of the label corruption of the train data on the train error, which is allowing to explain the difference between the test and train errors for smaller model widths.

Figure 2 :
Figure 2: Train error (turquoise), test error (violet) and BMD (orange) curves of the random feature model trained on the MNIST dataset with binary labels on 5K train samples with 0% label noise (left) and 10% label noise (right), tested on 5K samples.The resulting plots represent an average (and standard deviations) obtained repeating 20 different times the experiment.

Figure 3 :
Figure 3: (Left) Train error (turquoise), test error (violet) and BMD (orange) of the two-layer fully-connected network trained on the MNIST dataset with 10 labels on 20K train samples with 20% label noise, tested on the 5K samples.The resulting plot represents an average (and standard deviations) obtained repeating 20 different times the experiment.(Right) Train error, test error and BMD of ResNet-18 trained on the CIFAR-10 with 15% label noise in the train set.The resulting plot represents an average (and standard deviations) obtained repeating 5 different times the experiment.

Figure 4 :
Figure 4: Impact of training with L2 regularization of the random feature model using the MNIST dataset with 10 labels, 200 train samples with no label noise and evaluating on 5K test samples.The loss used is cross-entropy.In all plots, the curves are colored by the strength of the regularization weight λ. (Left) Regularization effect on the train error.(Center) Regularization effect on the test error.(Right) Regularization effect on the BMD.The generalization error smoothly decreases with the degree of over-parameterization.Similarly, the BMD peak can be dampened by adding stronger regularization.

Figure 5 :
Figure 5: Effect of changing the training set size on the test error (left panel) and the BMD (right panel) of the random feature model using the MNIST dataset with 10 labels, with no label noise and evaluating on 200 test samples.The resulting plot represents an average (and standard deviations) obtained repeating 15 different times the experiment.
Figure6: (Left): BMD (orange points), and test error (violet points) estimated for a two-layer fully connected network of width 10 3 , trained according to the adversarial initialization protocol described in section 5.3 on 20K samples of the MNIST dataset.On the horizontal axis we vary the number of pretraining epochs, and plot the corresponding increase in the generalization error and the BMD of the model after the second learning stage.The points represent an average of 15 different realizations of the experiment.(Right): BMD (orange line) and counts (turquoise line) estimated for a two-layer fully connected network trained on the MNIST dataset using 20K train samples with 20% label noise and tested on 5K samples.Counts represent the average amount of sign flips of random pixels of a correctly predicted test image that are necessary to fool the model to a wrong class label.The amount is averaged over all the correctly predicted test data samples.The resulting plot represents an average of 40 different realizations of the experiment.We observe that higher values of the BMD correspond to lower robustness of the model and vice versa.
• a uniform distribution in the range [−1, 1] .•empirical distribution of the training data with random uniform resampling in the range [−1, 1].

Figure 8 :
Figure 8: Mean dimensions estimated using Monte Carlo (eq.16) w.r.t.different distributions of the two-layer fully connected network trained on the 20K of MNIST samples with 20% label noise and tested on 5K samples.The MD values are normalized to lie within [0,1] interval.The choice of the distribution does not affect the location of the peak.Moreover, distributions, which first two moments coincide (e.g.binary uniform Unif{−1, 1} and Gaussian N (0, 1)) yield the same MD pattern.The resulting plot represents an average over 20 different runs of the experiment.

Figure 9 :
Figure 9: Difference between the BMD estimated using Monte Carlo method and using equation (79).

Figure 10 :
Figure 10: Generalization error (left) and BMD (right) as a function of 1/α for λ = 10 −4 and σ = tanh, for several values of α T = P/D.The loss used is the MSE.For low values of α T the BMD displays a double descent behavior as showed in the main text.Increasing α T , contrary to generalization, the BMD shows a triple descent behavior: a secondary peak in the BMD appears at α = α T , i.e.N = D (dashed vertical lines).

Figure 11 :
Figure 11: Test loss (left) and BMD (right) as a function of 1/α for fixed α D = 0.1, λ = 10 −4 , σ = tanh and for 2 values of the noise in the labels ∆ = 0, 5.The loss used is the MSE.Even if the test loss displays a secondary peak at P = D, the BMD does not.

Figure 12 :
Figure 12: Generalization error (left) and BMD (right) as a function of 1/α for α T = 30 and σ = tanh for several values of the regularization λ.The loss used is the MSE.Increasing the regularization the peak corresponding to P = N disappears before the one located at N = D.

Figure 14 :
Figure 14: Correspondence of the empirical evidence of the self-averaging of the BMD computed for the RFM initialized with the gaussian weights (projected orthogonally to the I D vector) and theoretical prediction of the asymptotic BMD value in the limit of the input size D and the hidden layer size N .(Left) RFM with leaky ReLU activation, (Right) RFM with tanh activation.The resulting plot represents an average over 70 different runs of the experiment.

Table 1 :
Table1below we can observe the phenomenon of the concentration of the MD in the trained deep learning models as compared to the MD of the models at initialization.MD empirical variances for randomly initialized and trained on cifar-10 dataset deep models estimated over 30 seeds