
Relative stability toward diffeomorphisms indicates performance in deep nets*


Published 24 November 2022 © 2022 The Author(s). Published on behalf of SISSA Medialab srl by IOP Publishing Ltd
Machine Learning 2022. Citation: Leonardo Petrini et al J. Stat. Mech. (2022) 114013. DOI: 10.1088/1742-5468/ac98ac


Abstract

Understanding why deep nets can classify data in large dimensions remains a challenge. It has been proposed that they do so by becoming stable to diffeomorphisms, yet existing empirical measurements suggest that this is often not the case. We revisit this question by defining a maximum-entropy distribution on diffeomorphisms, which allows us to study typical diffeomorphisms of a given norm. We confirm that stability toward diffeomorphisms does not strongly correlate with performance on benchmark data sets of images. By contrast, we find that the stability toward diffeomorphisms relative to that of generic transformations, Rf, correlates remarkably with the test error ε_t. It is of order unity at initialization but decreases by several decades during training for state-of-the-art architectures. For CIFAR10 and 15 known architectures we find ${{\epsilon}}_{\text{t}}\approx 0.2\sqrt{{R}_{f}}$, suggesting that obtaining a small Rf is important to achieve good performance. We study how Rf depends on the size of the training set and compare it to a simple model of invariant learning.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Deep learning algorithms (LeCun et al 2015) are now remarkably successful at a wide range of tasks (Amodei et al 2016, Huval et al 2015, Mnih et al 2013, Shi et al 2016, Silver et al 2017). Yet, understanding how they can classify data in large dimensions remains a challenge. In particular, the curse of dimensionality associated with the geometry of space in large dimension prohibits learning in a generic setting (von Luxburg and Bousquet 2004). If high-dimensional data can be learnt, then they must be highly structured.

A popular idea is that during training, hidden layers of neurons learn a representation (Le 2013) that is insensitive to aspects of the data unrelated to the task, effectively reducing the input dimension and making the problem tractable (Ansuini et al 2019, Recanatesi et al 2019, Shwartz-Ziv and Tishby 2017). Several quantities have been introduced to study this effect empirically. These include (i) the mutual information between the hidden and visible layers of neurons (Saxe et al 2019, Shwartz-Ziv and Tishby 2017), (ii) the intrinsic dimension of the neural representation of the data (Ansuini et al 2019, Recanatesi et al 2019) and (iii) the projection of the label of the data on the main features of the network (Kopitkov and Indelman 2020, Oymak et al 2019, Paccolat et al 2021a), the latter being defined from the top eigenvectors of the Gram matrix of the neural tangent kernel (Jacot et al 2018). All these measures support that the neuronal representation of the data indeed becomes well-suited to the task. Yet, they are agnostic to the nature of what varies in the data that need not be represented by hidden neurons, and thus do not specify what it is.

Recently, there has been a considerable effort to understand the benefits of learning features for one-hidden-layer fully connected (FC) nets. Learning features can occur and improve performance when the true function is highly anisotropic, in the sense that it depends only on a linear subspace of the input space (Bach 2017, Chizat and Bach 2020, Ghorbani et al 2019, Ghorbani et al 2020, Paccolat et al 2021a, Refinetti et al 2021, Yehudai and Shamir 2019). For image classification, such an anisotropy would occur for example if pixels on the edge of the image are unrelated to the task. Yet, fully-connected nets (unlike CNNs) acting on images tend to perform best in training regimes where features are not learnt (Geiger et al 2021, Geiger et al 2020, Lee et al 2020), suggesting that such a linear invariance in the data is not central to the success of deep nets.

Instead, it has been proposed that images can be classified in high dimensions because classes are invariant to smooth deformations or diffeomorphisms of small magnitude (Bruna and Mallat 2013, Mallat 2016). Specifically, Mallat and Bruna could handcraft convolution networks, the scattering transforms, that perform well and are stable to smooth transformations, in the sense that ||f(x) − f(τx)|| is small if the norm of the diffeomorphism τ is small too. They hypothesized that during training deep nets learn to become stable and thus less sensitive to these deformations, thus improving performance. More recent works generalize this approach to more common CNNs and discuss stability at initialization (Bietti and Mairal 2019a, Bietti and Mairal 2019b). Interestingly, enforcing such a stability can improve performance (Kayhan and van Gemert 2020).

Answering whether deep nets become more stable to smooth deformations when trained, and quantifying how this affects performance, remains a challenge. Recent empirical results revealed that small shifts of images can change the output significantly (Azulay and Weiss 2018, Dieleman et al 2016, Zhang 2019), in apparent contradiction with that hypothesis. Yet in these works, image transformations (i) led to images whose statistics were very different from those of the training set or (ii) cropped the image, and thus were not diffeomorphisms. In Ruderman et al (2018), a class of diffeomorphisms (low-pass filtered in spatial frequencies) was introduced to show that stability toward them can improve during training, especially in architectures where pooling layers are absent. Yet, these studies do not address how stability affects performance, nor how it depends on the size of the training set. To quantify these properties and to find robust empirical behaviors across architectures, we will argue that the evolution of stability toward smooth deformations needs to be compared to that of generic deformations, which turns out to vary significantly during training.

Note that in the context of adversarial robustness, attacks that are geometric transformations of small norm changing the label have been studied (Alaifari et al 2018, Alcorn et al 2019, Athalye et al 2018, Engstrom et al 2019, Fawzi and Frossard 2015, Kanbak et al 2018, Xiao et al 2018). These works differ from the literature above and from our study below in that they consider worst-case perturbations instead of typical ones.

1.1. Our contributions

  • We introduce a maximum-entropy distribution of diffeomorphisms that allows us to generate typical diffeomorphisms of controlled norm. Their amplitude is governed by a 'temperature' parameter T.
  • We define the relative stability to diffeomorphisms index Rf that characterizes the square magnitude of the variation of the output function f with respect to the input when it is transformed along a diffeomorphism, relatively to that of a random transformation of the same amplitude. It is averaged on the test set as well as on the ensemble of diffeomorphisms considered.
  • We find that at initialization, Rf is close to unity for various data sets and architectures, indicating that initially the output is as sensitive to smooth deformations as it is to random perturbations of the image.
  • Our central result is that after training, Rf correlates very strongly with the test error ε_t: during training, Rf is reduced by several decades in current state-of-the-art (SOTA) architectures on four benchmark datasets, including MNIST (Lecun et al 1998), FashionMNIST (Xiao et al 2017), CIFAR-10 (Krizhevsky 2009) and ImageNet (Deng et al 2009). For more primitive architectures (whose test error is higher) such as FC nets or simple CNNs, Rf remains of order unity. For CIFAR10 we study 15 known architectures and find empirically that ${{\epsilon}}_{\text{t}}\approx 0.2\sqrt{{R}_{f}}$.
  • Rf decreases with the size of the training set P. We compare it to an inverse power 1/P expected in simple models of invariant learning (Paccolat et al 2021a).

The library implementing diffeomorphisms on images is available online at github.com/pcsl-epfl/diffeomorphism.

The code for training neural nets can be found at github.com/leonardopetrini/diffeo-sota and the corresponding pre-trained models at https://doi.org/10.5281/zenodo.5589870.

2. Maximum-entropy model of diffeomorphisms

2.1. Definition of maximum entropy model

We consider the case where the input vector x is an image. It can be thought as a function x(s) describing intensity in position s = (u, v) ∈ [0, 1]2, where u and v are the horizontal and vertical coordinates. To simplify notations we consider a single channel, in which case x(s) is a scalar (but our analysis holds for colored images as well).

We denote by τx the image deformed by τ, i.e. [τx](s) = x(s − τ(s)).

τ(s) is a vector field of components (τu (s), τv (s)). The deformation amplitude is measured by the norm

Equation (1): ${\Vert}\nabla \tau {{\Vert}}^{2}={\int }_{{[0,1]}^{2}}\left(\vert \nabla {\tau }_{u}(s){\vert }^{2}+\vert \nabla {\tau }_{v}(s){\vert }^{2}\right)\mathrm{d}s$

To test the stability of deep nets toward diffeomorphisms, we seek to build typical diffeomorphisms of controlled norm ||∇τ||. We thus consider the distribution over diffeomorphisms that maximizes the entropy under a norm constraint. The problem can be solved by introducing a Lagrange multiplier T and by decomposing the fields on their Fourier components, see e.g. Kardar (2007) or appendix A. In this canonical ensemble, one finds that τu and τv are independent with identical statistics. For the picture frame not to be deformed, we impose fixed boundary conditions: τ = 0 if u = 0, 1 or v = 0, 1. One then obtains:

Equation (2): ${\tau }_{u}={\sum }_{i,j > 0}{C}_{ij}\,\mathrm{sin}(i\pi u)\,\mathrm{sin}(j\pi v)$ (and similarly for ${\tau }_{v}$, with independent coefficients)

where the Cij are Gaussian variables of zero mean and variance $\langle {C}_{ij}^{2}\rangle =T/({i}^{2}+{j}^{2})$. If the picture is made of n × n pixels, the result is identical except that the sum runs over 0 < i, j ⩽ n. For large n, the norm then reads ||∇τ||² = (π²/2) n² T, and is dominated by high spatial-frequency modes. It is useful to add another parameter c to cut off the effect of high spatial frequencies, which can be done by restricting the sum in equation (2) to i² + j² ⩽ c²; one then has ||∇τ||² = (π³/8) c² T.
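The field of equation (2) with this cut-off can be sampled in a few lines of NumPy. The sketch below is a minimal illustration, not the authors' library (github.com/pcsl-epfl/diffeomorphism); the helper name `sample_tau` and the pixel-centre grid convention are our assumptions:

```python
import numpy as np

def sample_tau(n, T, c, seed=None):
    """Sample one component of the max-entropy displacement field on an n x n
    grid: tau(u, v) = sum_{i,j>0} C_ij sin(i*pi*u) sin(j*pi*v), with
    C_ij ~ N(0, T/(i^2+j^2)) and the sum cut off at i^2 + j^2 <= c^2."""
    rng = np.random.default_rng(seed)
    u = (np.arange(n) + 0.5) / n            # pixel-centre coordinates in [0, 1]
    i = np.arange(1, n + 1)                 # spatial frequencies 0 < i, j <= n
    k2 = i[:, None] ** 2 + i[None, :] ** 2  # i^2 + j^2 for every mode
    C = rng.normal(size=(n, n)) * np.sqrt(T / k2)
    C[k2 > c ** 2] = 0.0                    # high-frequency cut-off
    S = np.sin(np.pi * np.outer(i, u))      # sine basis, (mode, position)
    return S.T @ C @ S                      # tau evaluated on the pixel grid
```

Sampling `tau_u` and `tau_v` independently with two calls gives the full vector field, with identical statistics for the two components as stated above.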

Once τ is generated, pixels are displaced according to it. A new pixelated image can then be obtained using standard interpolation methods. We use two interpolations, Gaussian and bi-linear, as described in appendix C. As we shall see below, this choice does not affect our results as long as the diffeomorphism induces a displacement of the order of the pixel size, or larger. Examples are shown in figure 1 as a function of T and c.
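Given a displacement field, the deformation [τx](s) = x(s − τ(s)) can be applied with an off-the-shelf interpolator. A minimal sketch using SciPy is shown below; the function name `deform` is hypothetical, and only the bi-linear variant is illustrated (the Gaussian interpolation of appendix C is omitted):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def deform(img, tau_u, tau_v):
    """Apply [tau x](s) = x(s - tau(s)) to a single-channel n x n image,
    interpolating the displaced pixels bi-linearly (order=1)."""
    n = img.shape[0]
    rows, cols = np.indices((n, n)).astype(float)
    # displacement fields are in [0,1]^2 units; convert them to pixels
    src = np.array([rows - n * tau_v, cols - n * tau_u])
    return map_coordinates(img, src, order=1, mode='reflect')
```

For a colored image the same map is applied channel by channel.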


Figure 1. Samples of max-entropy diffeomorphisms for different temperatures T and high-frequency cut-offs c for an ImageNet data-point of resolution 320 × 320. The green region corresponds to well-behaved diffeomorphisms (see section 2.2). The dashed line corresponds to δ = 1. The colored points on the line are those on which we focus in section 3.


2.2. Phase diagram of acceptable diffeomorphisms

Diffeomorphisms are bijective, which is not the case for our transformations if T is too large. When this condition breaks down, a single domain of the picture can break into several pieces, as is apparent in figure 1. It can be expressed as a condition on ∇τ that must be satisfied at every point in space (Lowe 2004), as recalled in appendix B. This is satisfied locally with high probability if ||∇τ||² ≪ 1, corresponding to T ≪ (8/π³)/c². In appendix B, we extract empirically a curve of similar form in the (T, c) plane at which a diffeomorphism is obtained with probability at least 1/2. For much smaller T, diffeomorphisms are obtained almost surely.

Finally, for diffeomorphisms to have noticeable consequences, their associated displacement must be of the order of magnitude of the pixel size. Defining δ² as the average square norm of the pixel displacement at the center of the image, in units of the pixel size, it is straightforward to obtain from equation (2) that asymptotically for large c (cf appendix B for the derivation),

Equation (3): ${\delta }^{2}\simeq \frac{\pi }{4}{n}^{2}T\,\mathrm{log}\,c$

The line δ = 1/2 is indicated in figure 1, using empirical measurements that add pre-asymptotic terms to equation (3). Overall, the green region corresponds to transformations that (i) are diffeomorphisms with high probability and (ii) produce significant displacements at least of the order of the pixel size.
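As a numerical sanity check of the asymptotic scaling δ² ≃ (π/4) n² T log c, the exact mode sum for δ² can be compared with the asymptotic form. The helpers below carry illustrative names, under the observation that only odd modes contribute at the image centre, where sin²(iπ/2) = 1:

```python
import numpy as np

def delta2_exact(n, T, c):
    """delta^2 from the mode sum: each odd mode (i, j) with i^2 + j^2 <= c^2
    contributes T/(i^2+j^2) per component at the image centre."""
    i = np.arange(1, c + 1, 2)              # odd spatial frequencies only
    ii, jj = np.meshgrid(i, i, indexing='ij')
    k2 = ii ** 2 + jj ** 2
    return 2 * n ** 2 * T * np.sum(1.0 / k2[k2 <= c ** 2])

def delta2_asymptotic(n, T, c):
    """Large-c behavior of equation (3): delta^2 ~ (pi/4) n^2 T log c."""
    return (np.pi / 4) * n ** 2 * T * np.log(c)
```

At moderate c the pre-asymptotic offset from the low modes is visible, consistent with the empirical corrections mentioned above.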

3. Measuring the relative stability to diffeomorphisms

Relative stability to diffeomorphisms. To quantify how a deep net f learns to become less sensitive to diffeomorphisms than to generic data transformations, we define the relative stability to diffeomorphisms Rf as:

Equation (4): ${R}_{f}=\frac{{\left\langle {\Vert}f(\tau x)-f(x){{\Vert}}^{2}\right\rangle }_{x,\tau }}{{\left\langle {\Vert}f(x+\eta )-f(x){{\Vert}}^{2}\right\rangle }_{x,\eta }}$

where the notation ⟨·⟩_y can indicate either the mean or the median with respect to the distribution of y. In the numerator, this operation is taken over the test set and over the ensemble of diffeomorphisms of parameters (T, c) (on which Rf implicitly depends). In the denominator, the average is over the test set and over vectors η sampled uniformly on the sphere of radius ||η|| = ⟨||τx − x||⟩_{x,τ}. An illustration of what Rf captures is shown in figure 2. In the main text, we consider median quantities, as they better reflect the typical values of a distribution. In appendix E.3 we show our results for mean quantities, to which our conclusions also apply.
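This median-based definition can be estimated directly from a batch of test points and their deformed versions. The NumPy sketch below is ours, not the authors' evaluation code: the function name is hypothetical, flat (B, dim) inputs are assumed rather than image tensors, and a single noise draw per test point stands in for the average over the noise ensemble:

```python
import numpy as np

def relative_stability(f, xs, diffeo_xs, seed=None):
    """Median estimate of R_f (equation (4)): squared output change under
    diffeomorphisms divided by that under isotropic perturbations of matched
    average norm. xs, diffeo_xs: (B, dim) arrays; f maps a batch to a batch."""
    rng = np.random.default_rng(seed)
    fx = f(xs)
    num = np.sum((f(diffeo_xs) - fx) ** 2, axis=-1)   # ||f(tau x) - f(x)||^2
    d = (diffeo_xs - xs).reshape(len(xs), -1)
    r = np.linalg.norm(d, axis=1).mean()              # ||eta|| = <||tau x - x||>
    eta = rng.normal(size=d.shape)
    eta = (r * eta / np.linalg.norm(eta, axis=1, keepdims=True)).reshape(xs.shape)
    den = np.sum((f(xs + eta) - fx) ** 2, axis=-1)    # ||f(x + eta) - f(x)||^2
    return np.median(num) / np.median(den)
```

For an output that is equally sensitive in all directions the estimator returns a value near one, matching the behavior reported at initialization.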


Figure 2. Illustrative drawing of the data-space ${\mathbb{R}}^{n\times n}$ around a data-point x (black point). We focus here on perturbations of fixed magnitude—i.e. on the sphere of radius r centered at x. The intersection between the images of x transformed via typical diffeomorphisms and the sphere is represented in dashed green. By contrast, the red point is an example of a random transformation; for large n, this is equivalent to adding i.i.d. Gaussian noise to all the pixel values of x. The figures on the right illustrate these transformations; the color of the dot labelling each corresponds to that of the left illustration. The relative stability to diffeomorphisms Rf characterizes how a net f varies in the green directions, normalized by the random ones.


Dependence of Rf on the diffeomorphism magnitude. Ideally, Rf could be defined for infinitesimal transformations, as it would then characterize the magnitude of the gradient of f along smooth deformations of the images, normalized by the magnitude of the gradient in random directions. However, infinitesimal diffeomorphisms move the image much less than the pixel size, and their definition thus depends significantly on the interpolation method used. This is illustrated in the left panels of figure 3, showing the dependence of Rf on the diffeomorphism magnitude (here characterized by the mean displacement magnitude at the center of the image, δ) for several interpolation methods. We see that Rf becomes independent of the interpolation when δ becomes of order unity. In what follows we thus focus on Rf (δ = 1), which we simply denote Rf.


Figure 3. Relative stability to diffeomorphisms Rf for SOTA architectures. (Left panels) Rf vs diffeomorphism displacement magnitude δ at initialization (dashed lines) and after training (full lines) on the full data set of CIFAR10 (P = 50k) for several cut-off parameters c and two interpolation methods, as indicated in the legend. ResNet is shown on the top and EfficientNet on the bottom. (Central panels) Rf (δ = 1) for four different data-sets (x-axis) and two different architectures at initialization (shaded histograms) and after training (full histograms). The values of c (in different colors) are (3, 5, 15) and (3, 10, 30) for the first three data-sets and ImageNet, respectively. ResNet18 and EfficientNetB0 are employed for MNIST, F-MNIST and CIFAR10, ResNet101 and EfficientNetB2 for ImageNet. (Right panels) Rf (δ = 1) vs training set size P at c = 3 for ResNet18 (top) and EfficientNetB0 (bottom) trained on CIFAR10. The value of ${R}_{{f}_{0}}$ at initialization is indicated with dashed lines. The triangles indicate the predicted slope Rf ∼ P−1 in a simple model of invariant learning, see section 6. Statistics: each point in the graphs is obtained by training 16 differently initialized networks on 16 different subsets of the data-sets; each network is then probed with 500 test samples in order to measure stability to diffeomorphisms and Gaussian noise. The resulting Rf is obtained by log-averaging the results from single realizations.


SOTA architectures become relatively stable to diffeomorphisms during training, but are not at initialization. The central panels of figure 3 show Rf at initialization (shaded), and after training (full) for two SOTA architectures on four benchmark data sets. The first key result is that, at initialization, these architectures are as sensitive to diffeomorphisms as they are to random transformations. Relative stability to diffeomorphisms at initialization (guaranteed theoretically in some cases (Bietti and Mairal 2019a, Bietti and Mairal 2019b)) thus does not appear to be indicative of successful architectures.

By contrast, for these SOTA architectures, relative stability toward diffeomorphisms builds up during training on all the data sets probed. It is a significant effect, with values of Rf after training generally found in the range Rf ∈ [10−2, 10−1].

Standard data augmentation techniques (translations, crops, and horizontal flips) are employed for training. However, the results we find only mildly depend on using such techniques, see figure 12 in the appendix.

Learning relative stability to diffeos requires large training sets. How much data is needed to learn relative stability toward diffeomorphisms? To answer this question, newly initialized networks are trained on different training sets of size P. Rf is then measured for CIFAR10, as indicated in the right panels of figure 3. Neural nets need a certain number of training points (P ∼ 10³) in order to become relatively stable toward smooth deformations. Past that point, Rf monotonically decreases with P. In a range of P, this decrease is approximately compatible with the inverse behavior Rf ∼ 1/P found in the simple model of section 6. Additional results for MNIST and FashionMNIST can be found in figure 13, appendix E.3.

Simple architectures do not become relatively stable to diffeomorphisms. To test the universality of these results, we focus on two simple architectures: (i) a four-hidden-layer FC network (FullConn-L4) where each hidden layer has 64 neurons and (ii) LeNet (LeCun et al 1989) that consists of two convolutional layers followed by local max-pooling and three fully-connected layers.

Measurements of Rf for these networks are shown in figure 4. For the FC net, Rf ≈ 1 at initialization (as observed for SOTA nets) but grows after training on the full data set, showing that FC nets do not learn to become relatively stable to smooth deformations. This is consistent with the modest evolution of Rf (P) with P, suggesting that huge training sets would be required to obtain Rf < 1. The situation is similar for the primitive CNN LeNet, which only becomes slightly insensitive (Rf ≈ 0.6) on a single data set (CIFAR10), and otherwise remains larger than unity.


Figure 4. Relative stability to diffeomorphisms Rf in primitive architectures. (Top panels) Rf at initialization (shaded) or for trained nets (full) for a FC net (left) or a primitive CNN (right) at P = 50k. (Bottom panels) Rf (P) for c = 3 and different data sets as indicated in legend. Statistics: see caption in the previous figure.


Layers' relative stability monotonically increases with depth. Up to this point, we measured the relative stability of the output function for any given architecture. We now study how relative stability builds up as the input data propagate through the hidden layers. In figure 14 of appendix E.3, we report Rf as a function of depth for both simple and deep nets. We observe ${R}_{{f}_{0}}\approx 1$ independently of depth at initialization, while after training Rf decreases monotonically with depth. Overall, the gain in relative stability appears to be well spread through the net, as is also found for stability alone (Ruderman et al 2018).

4. Relative stability to diffeomorphisms indicates performance

Thus, SOTA architectures appear to become relatively stable to diffeomorphisms after training, unlike primitive architectures. This observation suggests that high performance requires such a relative stability to build up. To test this hypothesis further, we select a set of architectures that have been relevant in the SOTA progress over the past decade; we systematically train them in order to compare Rf to their test error ε_t. Apart from FC nets, we consider the already cited LeNet (5 layers and $\approx 60k$ parameters); then AlexNet (Krizhevsky et al 2012) and VGG (Simonyan and Zisserman 2015), deeper (8–19 layers) and highly over-parametrized (10–20 million parameters) successors. We introduce batch-normalization with VGGs and skip connections with ResNets. Finally, we move to EfficientNets, which incorporate the advancements introduced in previous models and achieve SOTA performance with a relatively small number of parameters (<10M); this is accomplished by designing an efficient small network and properly scaling it up. Further details about these architectures can be found in table 1, appendix E.2.

The results are shown in figure 5. The correlation between Rf and ε_t is remarkably high (corr. coeff.: 0.97), suggesting that a low relative sensitivity to diffeomorphisms Rf is important to obtain good performance. In appendix E.3 we also report how changing the training set size P affects the position of a network in the (ε_t, Rf ) plane, for the four architectures considered in the previous section (figure 18). We also show that our results are robust to changes of δ, c (figure 21) and data sets (figure 20).


Figure 5. Test error ε_t vs relative stability to diffeomorphisms Rf computed at δ = 1 and c = 3 for common architectures when trained on the full ten-class CIFAR10 dataset (P = 50k) with SGD and the cross-entropy loss; the EfficientNets achieving the best performance are trained by transfer learning from ImageNet (⋆)—more details on the training procedures can be found in appendix E.1. The color scale indicates depth, and the symbols the presence of batch-norm (◊) and skip connections (†). Dashed grey line: power-law fit ${{\epsilon}}_{\text{t}}\approx 0.2\sqrt{{R}_{f}}$. Rf strongly correlates with ε_t, much less so with depth or the presence of skip connections. Statistics: each point is obtained by training five differently initialized networks; each network is then probed with 500 test samples in order to measure Rf . The results are obtained by log-averaging over single realizations. Error bars—omitted here—are shown in figure 19, appendix E.3.


What architectures enable a low Rf value? A low Rf can be obtained with or without skip connections, and for quite different depths, as indicated in figure 5. Also, the same architecture (EfficientNetB0) trained by transfer learning from ImageNet—instead of directly on CIFAR10—shows a large improvement both in performance and in invariance to diffeomorphisms. Clearly, Rf is much better predicted by ε_t than by the specific features of the architecture indicated in figure 5.

5. Stability toward diffeomorphisms vs noise

The relative stability to diffeomorphisms Rf can be written as Rf = Df /Gf where Gf characterizes the stability with respect to additive noise and Df the stability toward diffeomorphisms:

Equation (5): ${D}_{f}=\frac{{\left\langle {\Vert}f(\tau x)-f(x){{\Vert}}^{2}\right\rangle }_{x,\tau }}{{\left\langle {\Vert}f(x)-f(z){{\Vert}}^{2}\right\rangle }_{x,z}},\qquad {G}_{f}=\frac{{\left\langle {\Vert}f(x+\eta )-f(x){{\Vert}}^{2}\right\rangle }_{x,\eta }}{{\left\langle {\Vert}f(x)-f(z){{\Vert}}^{2}\right\rangle }_{x,z}}$

Here, we chose to normalize these stabilities with the variation of f over the test set (to which both x and z belong), and η is a random noise whose magnitude is prescribed as above. Stability toward additive noise has been studied previously in FC architectures (Novak et al 2018) and for CNNs as a function of spatial frequency in Tsuzuku and Sato (2019), Yin et al (2019).
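The decomposition Rf = Df /Gf can be estimated along the same lines as equation (4). The sketch below assumes flat (B, dim) inputs; the function name `stabilities` and the use of a random permutation of the test set to form the pairs (x, z) are our own choices, not taken from the authors' code:

```python
import numpy as np

def stabilities(f, xs, diffeo_xs, seed=None):
    """D_f and G_f in the spirit of equation (5): output variation under
    diffeomorphisms (D_f) or matched-norm noise (G_f), both normalized by the
    variation of f between pairs of test points, so that R_f = D_f / G_f."""
    rng = np.random.default_rng(seed)
    fx = f(xs)
    fz = f(xs[rng.permutation(len(xs))])          # random test pairs (x, z)
    norm = np.median(np.sum((fx - fz) ** 2, axis=-1))
    D = np.median(np.sum((f(diffeo_xs) - fx) ** 2, axis=-1)) / norm
    d = (diffeo_xs - xs).reshape(len(xs), -1)
    r = np.linalg.norm(d, axis=1).mean()          # matched perturbation norm
    eta = rng.normal(size=d.shape)
    eta = (r * eta / np.linalg.norm(eta, axis=1, keepdims=True)).reshape(xs.shape)
    G = np.median(np.sum((f(xs + eta) - fx) ** 2, axis=-1)) / norm
    return D, G
```

The shared denominator cancels in the ratio D/G, which reduces to equation (4).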

The decrease of Rf with growing training set size P could thus be due to an increase in the stability toward diffeomorphisms (i.e. Df decreasing with P) or to a decrease of stability toward noise (Gf increasing with P). To test these possibilities, we show in figure 6 Gf (P), Df (P) and Rf (P) for MNIST, FashionMNIST and CIFAR10 for two SOTA architectures. The central results are that (i) stability toward noise is always reduced for larger training sets. This observation is natural: when more data need to be fitted, the function becomes rougher. (ii) Stability toward diffeomorphisms does not behave universally: it can increase or decrease with P depending on the architecture and the training set. Additionally, Gf and Df alone show a much smaller correlation with performance than Rf—see figures 15–17 in appendix E.3.


Figure 6. Stability toward Gaussian noise (Gf ) and diffeomorphisms (Df ) alone, and the relative stability Rf . Columns correspond to different data-sets (MNIST, FashionMNIST and CIFAR10) and rows to architectures (ResNet18 and EfficientNetB0). Each panel reports Gf (blue), Df (orange) and Rf (green) as a function of P and for different cut-off values c, as indicated in the legend. Statistics: cf caption in figure 3. Error bars—omitted here—are shown in figure 22, appendix E.3.


6. A minimal model for learning invariants

In this section, we discuss the simplest model of invariance in data where stability to transformation builds up, that can be compared with our observations of Rf above. Specifically, we consider the 'stripe' model (Paccolat et al 2021b), corresponding to a binary classification task for Gaussian-distributed data points x = (x||, x) where the label function depends only on one direction in data space, namely y(x) = y(x||). Layers of y = +1 and y = −1 regions alternate along the direction x||, separated by parallel planes. Hence, the data present d − 1 invariant directions in input-space denoted by x as illustrated in figure 7 (left).
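Such data are straightforward to generate. The sketch below is a minimal illustration; the helper name and the boundary positions at ±0.5 are our assumptions, not taken from Paccolat et al (2021b):

```python
import numpy as np

def stripe_data(P, d, boundaries=(-0.5, 0.5), seed=None):
    """Sample the stripe model: Gaussian x in R^d with label y(x) = y(x_par)
    depending only on the first coordinate; +1/-1 stripes alternate between
    parallel planes placed at `boundaries`."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(P, d))
    # number of boundary planes below x_par; its parity fixes the label
    k = np.sum(x[:, :1] > np.asarray(boundaries), axis=1)
    y = np.where(k % 2 == 1, 1, -1)
    return x, y
```

By construction, moving a point along any of the d − 1 orthogonal directions never changes its label.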


Figure 7. (Left) Example of the stripe model. Dots are data-points, the vertical lines represent the decision boundary and the color the class label. (Right) Relative stability Rf for the stripe model in d = 30. The slope of the curve is −1, as predicted.


When this model is learnt by a one-hidden-layer FC net, the first layer of weights can be shown to align with the informative direction (Paccolat et al 2021a). The projection of these weights on the orthogonal space vanishes with the training set size P as $1/\sqrt{P}$, an effect induced by the sampling noise associated to finite training sets.

In this model, Rf can be defined as:

Equation (6): ${R}_{f}=\frac{{\left\langle {\Vert}f({x}_{{\Vert}},{x}_{\perp }+\nu )-f({x}_{{\Vert}},{x}_{\perp }){{\Vert}}^{2}\right\rangle }_{x,\nu }}{{\left\langle {\Vert}f(x+\eta )-f(x){{\Vert}}^{2}\right\rangle }_{x,\eta }}$

where we made explicit the dependence of f on the two linear subspaces. Here, the isotropic noise ν is added only in the invariant directions. Again, we impose ||η|| = ||ν||. Rf (P) is shown in figure 7 (right). We observe that Rf (P) ∼ P−1, as expected from the weight alignment mentioned above.

Interestingly, figure 3 for CIFAR10 and SOTA architectures supports that the 1/P behavior is compatible with the observations over some range of P. In appendix E.3, figure 13, we show analogous results for MNIST and FashionMNIST. We observe the 1/P power-law scaling for ResNets. It suggests that for these architectures, learning to become invariant to diffeomorphisms may also be limited by a naive measure of sampling noise. By contrast, for EfficientNets, in which the decrease in Rf is more limited, a 1/P behavior cannot be identified.

7. Discussion

A common belief is that stability to random noise (small Gf ) and to diffeomorphisms (small Df ) are desirable properties of neural nets. The underlying assumption is that the true data label depends only mildly on such transformations when they are small. Our observations suggest an alternative view:

  • (a)  
    Figures 6 and 16: better predictors are more sensitive to small perturbations in input space.
  • (b)  
    As a consequence, the notion that predictors are especially insensitive to diffeomorphisms is not captured by stability alone, but rather by the relative stability Rf = Df /Gf .
  • (c)  
    We propose the following interpretation of figure 5: to perform well, the predictor must build large gradients in input space near the decision boundary—leading to a large Gf overall. Networks that are relatively insensitive to diffeomorphisms (small Rf ) can discover with less data that strong gradients must be there and generalize them to larger regions of input space, improving performance and increasing Gf .

This last point can be illustrated in the simple model of section 6, see the left panel of figure 7. Imagine two data points of different labels falling close to, e.g., the left true decision boundary. These two points can be far from each other if their orthogonal coordinates differ. Yet, if Rf = 0 (now defined in equation (6)), then the output does not depend on the orthogonal coordinates, and it will need to build a strong gradient—in input space—along the parallel coordinate to fit these two data points. This strong gradient will exist throughout that entire decision boundary, improving performance but also increasing Gf . Instead, if Rf = 1, fitting these two data points will not lead to a strong gradient, since they can be far from each other in input space. Beyond this intuition, in this model decreasing Rf can be quantitatively shown to increase performance, see Paccolat et al (2021b).

8. Conclusion

We have introduced a novel empirical framework to characterize how deep nets become invariant to diffeomorphisms. It is jointly based on a maximum-entropy distribution of diffeomorphisms, and on the realization that it is the stability to these transformations relative to generic ones, Rf, that strongly correlates with performance, rather than the diffeomorphism stability alone considered in the past.

The ensemble of smooth deformations we introduced may have interesting applications. It could serve as a complement to traditional data-augmentation techniques (whose effect on relative stability is discussed in figure 12 of the appendix). A similar idea is present in Hauberg et al (2016), Shen et al (2020) but our deformations have the advantage of being easier to sample and data agnostic. Moreover, the ensemble could be used to build adversarial attacks along smooth transformations, in the spirit of Alaifari et al (2018), Engstrom et al (2019), Kanbak et al (2018). It would be interesting to test if networks robust to such attacks are more stable in relative terms, and how such robustness affects their performance.

Finally, the tight correlation between relative stability Rf and test error ε_t suggests that if a predictor displays a given Rf , its performance may be bounded from below. The relationship we observe, ε_t(Rf ), may then be indicative of this bound, which would be a fundamental property of a given data set. Can it be predicted in terms of simpler properties of the data? Introducing simplified models of data with controlled stability to diffeomorphisms, beyond the toy model of section 6, would be useful to investigate this key question.

Acknowledgments

We thank Alberto Bietti, Joan Bruna, Francesco Cagnetta, Pascal Frossard, Jonas Paccolat, Antonio Sclocchi and Umberto M Tomasini for helpful discussions. This work was supported by a grant from the Simons Foundation (#454953 Matthieu Wyart).

Appendix A.: Maximum entropy calculation

Under the constraint on the borders, τu and τv can be expressed in a real Fourier basis as in equation (2). Substituting this form into ||∇τ||2, we obtain:

Equation (7)

where Cij and Dij are the Fourier coefficients of τu and τv, respectively. We aim at computing the probability distributions that maximize the entropy while keeping the expectation value of ||∇τ||2 fixed. Since ||∇τ||2 is a sum of quadratic random variables, the equipartition theorem (Beale 1996) applies: the distributions are normal and every quadratic term contributes on average equally to ||∇τ||2. Thus, the variance of the coefficients is $\frac{T}{{i}^{2}+{j}^{2}}$, where the parameter T sets the magnitude of the diffeomorphism.
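The sampling this distribution prescribes can be sketched in a few lines. This is a hedged illustration, not the authors' code: the sine basis and the grid convention follow equation (2), and the function name and defaults are our own assumptions.

```python
import numpy as np

def sample_diffeo_field(n=32, T=1e-3, c=3, seed=0):
    """Sample one component of a max-entropy displacement field on an n x n grid:
    tau(u, v) = sum_{i,j <= c} C_ij sin(i pi u) sin(j pi v),
    with C_ij ~ N(0, T / (i^2 + j^2)) as dictated by equipartition.
    The sine basis enforces tau = 0 on the image borders."""
    rng = np.random.default_rng(seed)
    i = np.arange(1, c + 1)
    # coefficients with variance T / (i^2 + j^2)
    C = rng.normal(size=(c, c)) * np.sqrt(T / (i[:, None] ** 2 + i[None, :] ** 2))
    u = np.linspace(0, 1, n)
    S = np.sin(np.pi * np.outer(i, u))  # S[k, a] = sin((k+1) pi u_a)
    return S.T @ C @ S                  # tau[a, b] = sum_ij C_ij sin(i pi u_a) sin(j pi u_b)

tau_u = sample_diffeo_field()  # one component; tau_v is sampled independently
```

The τv component is drawn in the same way with independent coefficients, and the cut-off c and temperature T play the roles discussed below.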

Appendix B.: Boundaries of studied diffeomorphisms

Average pixel displacement magnitude δ . We derive here the large-c asymptotic behavior of δ (equation (3)). This is defined as the average square norm of the displacement field, in pixel units:

where we approximated the sum with an integral in the third step. The asymptotic relations for ||∇τ|| reported in the main text are computed in a similar fashion. In figure 8, we check the agreement between the asymptotic prediction and empirical measurements. If δ ≪ 1, our results strongly depend on the choice of interpolation method. To avoid this dependence, we only consider conditions for which δ ⩾ 1/2, leading to

Equation (8)
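Since the displayed derivation did not survive extraction here, we sketch the computation it refers to, using ⟨Cij2⟩ = T/(i2 + j2) from appendix A (a reconstruction consistent with the asymptotic law of figure 8; prefactors should be checked against the original):

```latex
\delta^2 \;=\; n^2\,\langle \Vert\tau\Vert^2\rangle
       \;=\; 2n^2 \sum_{i,j} \langle C_{ij}^2\rangle\,
             \overline{\sin^2(i\pi u)\sin^2(j\pi v)}
       \;=\; \frac{n^2 T}{2}\sum_{i,j \leq c}\frac{1}{i^2+j^2}
       \;\simeq\; \frac{n^2 T}{2}\int \frac{\mathrm{d}i\,\mathrm{d}j}{i^2+j^2}
       \;=\; \frac{\pi}{4}\,n^2 T \log c,
```

where the average of sin2 sin2 over the unit square gives the factor 1/4 and the factor 2 counts the two components τu and τv. Requiring δ ⩾ 1/2 then gives T ⩾ (π n2 log c)−1 up to the stated prefactors.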

Figure 8. (Left) The characteristic displacement δ(c, T) is observed to follow ${\delta }^{2}\simeq \frac{\pi }{4}{n}^{2}T\,\mathrm{log}\,c$. (Right) Measurement of maxs Ξ supporting equation (13).


Condition for diffeomorphism in the (T, c) plane. For a given value of c, there exists a temperature scale beyond which the transformation is no longer injective, affecting the topology of the image and creating spurious boundaries, see figures 9(a)–(c) for an illustration. Specifically, consider a curve passing through the point s in the deformed image, with tangent direction u at s. When going back to the original image (s' = s − τ(s)), the curve gets deformed and its tangent becomes

Equation (9)

A smooth deformation is bijective iff all deformed curves remain curves, which is equivalent to having non-zero tangents everywhere:

Equation (10)

Imposing ||u'|| ≠ 0 does not give us any simple constraint on τ. Therefore, we constrain τ a bit more and allow only displacement fields such that u ⋅ u' > 0, which is a sufficient condition for equation (10) to be satisfied—cf figure 9(d). By extremizing over u, this condition translates into

Equation (11)

or, equivalently,

Equation (12)

where we denote by Ξ the lhs of the inequality. We find that the median of the maximum of Ξ over the whole image, maxs Ξ(s), can be approximated by (see figure 8(b)):

Equation (13)

The resulting constraint on T reads

Equation (14)

Figure 9. (a) Idealized image at T = 0. (b) Diffeomorphism of the image. (c) Deformation of the image at large T: colors get mixed together, shapes are no longer preserved. (d) Allowed region for vector transformations under τ. For any point s in the image and any direction u, only displacement fields for which the deformed direction u' is non-zero generate diffeomorphisms. The bound in equation (12) (u' ⋅ u > 0) corresponds to the green region. The gray disc corresponds to the bound ||∇τ|| < 1.


Appendix C.: Interpolation methods

When a deformation is applied to an image x, each of its pixels gets mapped from the original pixel grid to a new position, generally outside the grid itself—cf figures 9(a) and (b). A procedure (interpolation method) needs to be defined to project the deformed image back onto the original grid.

For simplicity of notation, we describe interpolation methods considering the square [0, 1]2 as the region in between four pixels—see an illustration in figure 10(a). We propose here two different ways to interpolate between pixels and then check that our measurements do not depend on the specific method considered.

Figure 10. (a) We consider the region between four pixels as the square [0, 1]2 where, after the application of a deformation τ, the pixel (0, 0) is mapped into (u, v). (b) Bi-linear interpolation: the value of x at (u, v) is computed by two steps of linear interpolation. First, we compute x at the red crosses, by averaging values along the vertical axis. Then, a line interpolates horizontally between the values at the red crosses to give the result. (c) Gaussian interpolation: we denote by si the pixel positions in the original grid. The interpolated value of x at any point of the image is given by a weighted sum of n × n Gaussians centered at each si—in red.


Bi-linear interpolation. The bi-linear interpolation consists, as the name suggests, of two steps of linear interpolation, one along the horizontal and one along the vertical direction—figure 10(b). If we look at the square [0, 1]2 and apply a deformation τ such that (0, 0) ↦ (u, v), we have

Equation (15)
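Since equation (15) is not reproduced above, the standard two-step construction can be sketched as follows; the function name and argument convention (xab is the pixel value at corner (a, b)) are our own.

```python
def bilinear(x00, x10, x01, x11, u, v):
    """Bi-linear interpolation inside the unit square [0, 1]^2 at point (u, v).
    Step 1: interpolate vertically (along v) at u = 0 and u = 1.
    Step 2: interpolate horizontally (along u) between the two results."""
    left = x00 * (1 - v) + x01 * v    # column u = 0
    right = x10 * (1 - v) + x11 * v   # column u = 1
    return left * (1 - u) + right * u
```

Expanding the two steps gives the familiar weighted sum of the four corner pixels, x(u, v) = (1 − u)(1 − v) x00 + u(1 − v) x10 + (1 − u) v x01 + u v x11.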

Gaussian interpolation. In this case, a Gaussian function4 is placed on top of each point in the grid—cf figure 10. The pixel intensity x can be evaluated at any point outside the grid by computing

Equation (16)

In order to fix the standard deviation σ of G, we introduce the participation ratio n. Given Ψi = G(s, si )|s=(0.5,0.5), we define

Equation (17)

The participation ratio is a measure of how many pixels contribute to the value of a new, interpolated pixel. We fix σ in such a way that the participation ratio for the Gaussian interpolation matches the one for the bi-linear interpolation (n = 4) when the new pixel is equidistant from the four surrounding pixels. This gives σ = 0.4715.
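Since equation (17) is not reproduced above, the following check assumes the standard participation-ratio definition n = (Σi Ψi)2 / Σi Ψi2; under that assumption, the four equal bi-linear weights at the center of the square indeed give n = 4, the value the Gaussian σ is tuned to match.

```python
def participation_ratio(weights):
    """n = (sum_i w_i)^2 / sum_i w_i^2: the effective number of weights
    contributing appreciably to the interpolated value (assumed definition)."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

# bi-linear interpolation at the square center weighs the four pixels equally
n_bilinear = participation_ratio([0.25, 0.25, 0.25, 0.25])  # 4.0
```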

Notice that this interpolation method applies a Gaussian smoothing to the image even if τ is the identity. Consequently, when computing observables for f with the Gaussian interpolation, we always compare f(τx) to $f(\tilde{x})$, where $\tilde{x}$ is the smoothed version of x, in such a way that $f({\tau }^{[T=0]}x)=f(\tilde{x})$.

Empirical results dependence on interpolation. Finally, we checked to what extent our results are affected by the specific choice of interpolation method. In particular, blue and red colors in figures 3 and 13 correspond to bi-linear and Gaussian interpolation, respectively. The interpolation method only affects the results in the small-displacement limit (δ → 0).

Note. Throughout the paper, if not specified otherwise, bi-linear interpolation is employed.

Appendix D.: Stability to additive noise vs noise magnitude

We introduced in section 5 the stability toward additive noise:

Equation (18)

We study here the dependence of Gf on the noise magnitude ||η||. In the η → 0 limit, we expect the network function to behave as its first-order Taylor expansion, leading to Gf ∝ ||η||2. Hence, for small noise, Gf gives an estimate of the average squared gradient of f along a random direction η.
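This small-noise scaling is easy to verify numerically. The sketch below uses a toy one-hidden-layer tanh network standing in for a trained f, and an un-normalized proxy for Gf; all names and sizes are illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
W1 = rng.normal(size=(50, d)) / np.sqrt(d)
W2 = rng.normal(size=50) / np.sqrt(50)

def f(x):
    # smooth stand-in for a network function
    return W2 @ np.tanh(W1 @ x)

x = rng.normal(size=d)

def G(eta_norm, n_samples=200):
    """Mean squared output variation under isotropic noise of magnitude eta_norm
    (an un-normalized proxy for G_f, enough to read off the scaling)."""
    etas = rng.normal(size=(n_samples, d))
    etas *= eta_norm / np.linalg.norm(etas, axis=1, keepdims=True)
    return np.mean([(f(x + e) - f(x)) ** 2 for e in etas])

# in the linear regime G grows as ||eta||^2:
# a factor 10 in ||eta|| should give a factor ~100 in G
ratio = G(1e-3) / G(1e-4)
```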

Empirical results. Measurements of Gf on SOTA nets trained on benchmark data sets are shown in figure 11. We observe that the effect of non-linearities starts to be significant around ||η|| = 1. For large values of the noise—i.e. far away from data points—the average gradient of f does not change with training.

Figure 11. Stability to isotropic noise Gf as a function of the noise magnitude ||η|| for CIFAR10 (left) and ImageNet (right). The color corresponds to two different classes of SOTA architecture: ResNet and EfficientNet. The slope 2 at small ||η|| identifies the linear regime. For larger noise magnitudes, non-linearities appear.


Appendix E.: Numerical experiments

In this appendix, we provide details on the training procedure, on the different architectures employed and some additional experimental results.

Figure 12. Effect of data augmentation on Rf. Relative stability to diffeomorphisms Rf after training with different data augmentations: 'none' (1st group of bars in each plot) for no data augmentation, 'translation' (2nd bars) for training on randomly translated (by four pixels) and cropped inputs, and 'diffeo' (3rd bars) for training on randomly deformed images with max-entropy diffeomorphisms (T = 10−2, c = 1). Results are averaged over five trainings of ResNet18 on MNIST (left), FashionMNIST (center) and CIFAR10 (right). Colors indicate different cut-off values when probing the trained networks. Different augmentations have a small quantitative effect and no qualitative effect on the results. As expected, augmenting the input images with smooth deformations makes the net more invariant to such transformations.


E.1. Image classification training set-up

  • Trainings are performed in PyTorch; the code can be found at github.com/leonardopetrini/diffeo-sota.
  • Loss function: cross-entropy.
  • Batch size: 128.
  • Dynamics:
    • FC nets: ADAM with learning rate = 0.1 and no scheduling.
    • Transfer learning: SGD with learning rate = 10−2 for the last layer and 10−3 for the rest of the network, momentum = 0.9 and weight decay = 10−3. Both learning rates decay exponentially during training with a factor γ = 0.975.
    • All the other networks are trained with SGD with learning rate = 0.1, momentum = 0.9 and weight decay = 5 × 10−4. The learning rate follows a cosine annealing scheduling (Loshchilov and Hutter 2016).
  • Early-stopping is performed, i.e. results shown are computed with the network obtaining the best validation accuracy out of 250 training epochs.
  • For the experiments involving training on a subset of the training data of size P < Pmax, the total number of epochs is rescaled accordingly in order to keep the total number of optimizer steps constant.
  • Standard data augmentation is employed: different random translations and horizontal flips of the input images are generated at each epoch. As a safety check, we verify that the invariance learnt by the nets is not purely due to such augmentation (figure 12).
  • Experiments are run on 16 NVIDIA V100 GPUs. Individual trainings run in $\sim 1$ hour of wall time. We estimate a total of a few thousand hours of computing time for running the preliminary and actual experiments presented in this work.
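For reference, the cosine annealing schedule of Loshchilov and Hutter (2016) used in the SGD runs above can be written in closed form; this is a sketch (in practice a library implementation such as PyTorch's CosineAnnealingLR would be used), with illustrative defaults matching the set-up above.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=0.1, lr_min=0.0):
    """Learning rate after `epoch` of `total_epochs`: decays from lr_max
    to lr_min following half a cosine period."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))
```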

The stripe model is trained with an approximation of gradient flow introduced in Geiger et al (2020), see Paccolat et al (2021a) for details.

A note on computing stabilities at init. in presence of batch-norm. We recall that batch-norm (BN) can work in either of two modes: training and evaluation. During training, BN computes the mean and variance on the current batch and uses them to normalize the output of a given layer. At the same time, it keeps track of the running statistics over such batches, which are used for the normalization steps at inference time (evaluation mode). When probing a network at initialization to compute stabilities, we put the network in evaluation mode, except for BN, which operates in train mode. This is because the BN running mean and variance are initialized to 0 and 1, so that its evaluation mode at initialization would correspond to not having BN at all, compromising the input signal propagation in deep architectures.
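The two modes can be made concrete with a minimal batch-norm sketch (numpy, with gamma = 1 and beta = 0; names and sizes are ours): train mode uses the current-batch statistics, while eval mode with the initial running statistics (mean 0, variance 1) reduces to essentially the identity, i.e. no BN at all.

```python
import numpy as np

def batchnorm(x, running_mean, running_var, train, eps=1e-5):
    """Minimal batch-norm over the batch dimension (gamma = 1, beta = 0)."""
    if train:
        mean, var = x.mean(axis=0), x.var(axis=0)   # current-batch statistics
    else:
        mean, var = running_mean, running_var       # stored running statistics
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(128, 10))
out_train = batchnorm(x, running_mean=0.0, running_var=1.0, train=True)
# eval mode at initialization: running stats are (0, 1), so the layer
# barely changes its input -- as if BN were absent
out_eval = batchnorm(x, running_mean=0.0, running_var=1.0, train=False)
```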

E.2. Networks architectures

All networks implementations can be found at github.com/leonardopetrini/diffeo-sota/tree/main/models. In table 1, we report salient features of the network architectures considered.

Table 1. Network architectures, main characteristics. We list here (columns) the classes of net architectures used throughout the paper specifying some salient features (depth, number of parameters, etc) for each of them.

Features         | FullConn | LeNet (LeCun et al 1989) | AlexNet (Krizhevsky et al 2012)
Depth            | 2, 4, 6  | 5                        | 8
Num. parameters  | 200k     | 62k                      | 23M
FC layers        | 2, 4, 6  | 3                        | 3
Activation       | ReLU     | ReLU                     | ReLU
Pooling          | /        | max                      | max
Dropout          | /        | /                        | Yes
Batch norm       | /        | /                        | /
Skip connections | /        | /                        | /

Features         | VGG (Simonyan and Zisserman 2015) | ResNet (He et al 2016) | EfficientNetB0-2 (Tan and Le 2019)
Depth            | 11, 16, 19                        | 18, 34, 50             | 18, 25
Num. parameters  | 9–20M                             | 11–24M                 | 5, 9M
FC layers        | 1                                 | 1                      | 1
Activation       | ReLU                              | ReLU                   | Swish
Pooling          | max                               | Avg. (last layer only) | Avg. (last layer only)
Dropout          | /                                 | /                      | Yes + dropconnect
Batch norm       | If 'bn' in name                   | Yes                    | Yes
Skip connections | /                                 | Yes                    | Yes (inv. residuals)

Table 2. Test error vs stability: correlation coefficients for different data sets.

Data-set      | Df   | Gf    | Rf
MNIST         | 0.71 | −0.43 | 0.75
SVHN          | 0.87 | −0.28 | 0.81
FashionMNIST  | 0.72 | −0.68 | 0.94
Tiny ImageNet | 0.69 | −0.66 | 0.74

E.3. Additional figures

We present here:

  • Figure 13. Rf as a function of P for MNIST and FashionMNIST with the corresponding predicted slope, omitted in the main text.
  • Figure 14. Relative diffeomorphisms stability Rf as a function of depth for simple and deep nets.
  • Figures 15 and 16. Diffeomorphisms and inverse of the Gaussian stability Df and 1/Gf vs test error for CIFAR10 and the set of architectures considered in section 4.
  • Figure 17. Df , 1/Gf and Rf when using the mean in place of the median for computing averages ⟨⋅⟩.
  • Figure 18. Curves in the (epsilont, Rf ) plane when varying the training set size P for FullyConnL4, LeNet, ResNet18 and EfficientNetB0.
  • Figures 19 and 22. Error estimates for the main quantities of interest—often omitted in the main text for the sake of figures' clarity.

Figure 13. Relative stability to diffeomorphisms Rf(P) at δ = 1. Analogous to figure 3 (right), but with MNIST (a) and (b) and FashionMNIST (c) and (d) in place of CIFAR10. Stability monotonically decreases with P. The triangles give a reference for the predicted slope in the stripe model—i.e. Rf ∝ P−1—see section 6. The slopes in the case of ResNets are compatible with the prediction. For EfficientNets, the second panel of figure 3 suggests that stability to diffeomorphisms is less important; here, we also see that it builds up more slowly when increasing the training set size. Finally, blue and red colors indicate the different interpolation methods used for generating image deformations, as discussed in appendix C. Results are not affected by this choice.

Figure 14. Relative stability to diffeomorphisms as a function of depth. Rf as a function of the layers' relative depth (i.e. $\frac{\text{current}\;\text{layer}\;\text{depth}}{\text{total}\;\text{depth}}$), where '0' identifies the output of the 1st layer and '1' the last. The relative stability is measured at the output of layers (or blocks of layers) inside the nets for simple architectures (1st column) and deep ones (2nd column), at initialization (dashed) and after training (full lines). All nets are trained on the full CIFAR10 dataset. ${R}_{{f}_{0}}\approx 1$ independently of depth at initialization, while Rf decreases monotonically as a function of depth after training. Statistics: each point is obtained by training five differently initialized networks; each network is then probed with 500 test samples in order to measure Rf. The results are obtained by log-averaging over single realizations.

Figure 15. Test error epsilont vs stability to diffeomorphisms Df for common architectures when trained on the full ten-classes CIFAR10 dataset (P = 50k) with SGD and the cross-entropy loss; the EfficientNets achieving the best performance are trained by transfer learning from ImageNet (⋆)—more details on the training procedures can be found in appendix E.1. The color scale indicates depth, and the symbols the presence of BN (◊) and skip connections (†). Df correlation with epsilont (corr. coeff.: 0.62) is much smaller than the one measured for Rf — see figure 3. Statistics: each point is obtained by training five differently initialized networks; each network is then probed with 500 test samples in order to measure Df . The results are obtained by log-averaging over single realizations.

Figure 16. Test error epsilont vs inverse of stability to noise 1/Gf for common architectures when trained on the full ten-classes CIFAR10 dataset (P = 50k) with SGD and the cross-entropy loss; the EfficientNets achieving the best performance are trained by transfer learning from ImageNet (⋆)—more details on the training procedures can be found in appendix E.1. The color scale indicates depth, and the symbols the presence of BN (◊) and skip connections (†). Gf correlation with epsilont (corr. coeff.: 0.85) is less important than the one measured for Rf —see figure 3. Statistics: each point is obtained by training five differently initialized networks; each network is then probed with 500 test samples in order to measure Gf . The results are obtained by log-averaging over single realizations.

Figure 17. Test error epsilont vs Df, 1/Gf and Rf where ⟨⋅⟩ is the mean. Analogous to figures 15, 16 and 19, we use here the mean instead of the median to compute averages over samples and transformations.

Figure 18. Test error epsilont vs relative stability to diffeomorphisms Rf for different training set sizes P. Same data as figure 5; we report here curves corresponding to training on different set sizes for four architectures. The other architectures considered, together with the power-law fit, are left in the background. For a small training set, CNNs behave similarly. Statistics: each point is obtained by training five differently initialized networks; each network is then probed with 500 test samples in order to measure Rf. The results are obtained by log-averaging over single realizations.

Figure 19. Test error epsilont vs relative stability to diffeomorphisms Rf with error estimates. Same data as figure 5, we report error bars here. Statistics: each point is obtained by training five differently initialized networks; each network is then probed with 500 test samples in order to measure Rf . The results are obtained by log-averaging over single realizations.

Figure 20. Test error epsilont vs Df , Gf and Rf (on the columns) for different data sets (on the rows). The corresponding correlation coefficients are shown in table 2. Lines 1–2: MNIST and SVHN both contain images of digits and show a similar epsilont(Rf ). Line 3: FashionMNIST results are comparable to the CIFAR10 ones shown in the main text. Line 4: Tiny ImageNet32 is a re-scaled (32 × 32 pixels) version of ImageNet with 200 classes and 100 000 training points. The task is harder than the other data sets and is such that we could not train simple networks (FC, LeNet) on it—i.e. the loss stays $\mathcal{O}(1)$ throughout training—so these are not reported here.

Figure 21. Test error epsilont vs Rf for CIFAR10, varying δ and the cut-off c. Titles report the values of the varying parameters together with the correlation coefficients. Parameters corresponding to allowed diffeomorphisms are indicated by the green background. Red and blue colors correspond to different interpolation methods. Overall, results are robust when varying these parameters.

Figure 22. Stability toward Gaussian noise (Gf ) and diffeomorphisms (Df ) alone, and the relative stability Rf with the relative errors. Analogous to figure 6 in which error estimates are omitted to favour clarity. Here we fix the cut-off to c = 3 and show error estimates instead. Columns correspond to different data-sets (MNIST, FashionMNIST and CIFAR10) and rows to architectures (ResNet18 and EfficientNetB0). Each panel reports Gf (blue), Df (orange) and Rf (green) as a function of P and for different cut-off values c, as indicated in the legend. Statistics: each point in the graphs is obtained by training 16 differently initialized networks on 16 different subsets of the data-sets; each network is then probed with 500 test samples in order to measure stability to diffeomorphisms and Gaussian noise. The resulting Rf is obtained by log-averaging the results from single realizations. As we are plotting quantities in log scale, we report the relative error (shaded).


Footnotes

  • This article is an updated version of: Petrini L, Favero A, Geiger M and Wyart M 2021 Relative stability toward diffeomorphisms indicates performance in deep nets Advances in Neural Information Processing Systems vol 34 ed M Ranzato, A Beygelzimer, Y Dauphin, P S Liang and J Wortman Vaughan (New York: Curran Associates) pp 8727–39.

  • Throughout the paper, if not specified otherwise, bi-linear interpolation is employed.

  • With the only exception of the ImageNet results (central panel) in which only one trained network is considered.

  • Correlation coefficient: $\frac{\mathrm{Cov}(\mathrm{log}\,{{\epsilon}}_{\text{t}},\mathrm{log}\,{R}_{f})}{\sqrt{\mathrm{Var}(\mathrm{log}\,{{\epsilon}}_{\text{t}})\mathrm{Var}(\mathrm{log}\,{R}_{f})}}$.

  • $G(s)={(2\pi {\sigma }^{2})}^{-1/2}\,{\text{e}}^{-{s}^{2}/2{\sigma }^{2}}$.
