
The reusability prior: comparing deep learning models without training

Aydın Göze Polat and Ferda Nur Alpaslan

Published 20 April 2023 © 2023 The Author(s). Published by IOP Publishing Ltd
Citation: Aydın Göze Polat and Ferda Nur Alpaslan 2023 Mach. Learn.: Sci. Technol. 4 025011. DOI: 10.1088/2632-2153/acc713


Abstract

Various choices can affect the performance of deep learning models. We conjecture that differences in the number of contexts for model components during training are critical. We generalize this notion by defining the reusability prior as follows: model components are forced to function in diverse contexts not only due to the training data, augmentation, and regularization choices, but also due to the model design itself. We focus on the design aspect and introduce a graph-based methodology to estimate the number of contexts for each learnable parameter. This allows a comparison of models without requiring any training. We provide supporting evidence with experiments using cross-layer parameter sharing on CIFAR-10, CIFAR-100, and Imagenet-1K benchmarks. We give examples of models that share parameters outperforming baselines that have at least 60% more parameters. The graph-analysis-based quantities we introduced for the reusability prior align well with the results, including at least two important edge cases. We conclude that the reusability prior provides a viable research direction for model analysis based on a very simple idea: counting the number of contexts for model parameters.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Artificial neural networks with larger numbers of parameters often outperform their smaller counterparts. For instance, larger models achieve state-of-the-art results in various benchmarks in computer vision [1–12] and natural language processing [13–17]. Yet there are exceptions to this pattern, where some models have significantly fewer learnable parameters with comparable or better performance (i.e. higher parameter efficiency). More specifically, additional unlabeled data or curation of larger datasets reduces error in a predictable manner [18]. Hoffmann et al show that it is possible to outperform previous state-of-the-art models with significantly more compact ones by training with more data. They provide a systematic analysis revealing that increasing the number of parameters and increasing the size of the training set are equally important for optimal usage of compute resources when training efficient models [19].

Model performance is affected not only by the number of parameters and training samples, but also by design, augmentation, and regularization choices. We conjecture that, in essence, these choices impact the overall number of concrete contexts for model parameters.

1.1. What is a context?

Figure 1 gives an informal description of a context. We define the set of all contexts for a given parameter as follows: Definition 1.

Let $f_{iw}$ be a function node that takes at least a given parameter node w and an associated input $h_i$; let $\mathcal{O}_k$ be any node from the set of all output or leaf nodes $\mathcal{L}$ in graph $\mathcal{G}$, and let p be a path that connects $f_{iw}$ to $\mathcal{O}_k$. The set of all contexts for w, $\mathcal{C}_{w}$, is given by

$\mathcal{C}_{w} = \bigcup_{i} \left\{\, p \mid p \text{ connects } f_{iw} \text{ to } \mathcal{O}_k,\ \mathcal{O}_k \in \mathcal{L} \,\right\} \qquad (1)$


Figure 1. A context is a path from a parameter associated with an input to an output. (a) $C_1$ is the bold path from $w_1$ associated with $i_1$ through functions $f_1, f_2, \ldots, f_m$ contributing to the output $O_1$. Note that $w_1$ can contribute to $O_1$ through multiple contexts (e.g. $C_1$ and $C_2$). (b) A concrete analogy for $C_1$ and $C_2$ given on a small architecture.


Note that $f_{iw}$ can be any function node within the model graph $\mathcal{G}$ that at least takes parameter w and some hidden or input feature $h_i$. Parameter w can be used in multiple places within the model with separate inputs leading to different contexts (i.e. $f_{jw}(h_j, w, \ldots)$). In other words, $\mathcal{C}_w$ is the union of all contexts associated with w, regardless of where w is used within the model or whether w is explicitly shared.

An important assumption we make is that $f_{iw}$ is mostly nonlinear, similar to mainstream deep learning models. If all function nodes were instead equivalent to a given linear function, the model could be collapsed into a single layer where each input is associated with a single learnable parameter. Such a model cannot learn functions more complicated than a linear map (i.e. a single matrix multiplication).

Overall, each parameter has its own set of contexts, which is the union of all paths contributing to an output node, where $h_i$ is a part of the computational graph rather than an instance of an actual output feature from a previous layer. We use this definition of context for analyzing model architectures and defining other relevant concepts. In figure 1, we demonstrate some of the possible contexts that can arise inside a model graph. Figure 1(b) describes the computational graph of a model with three layers. How can model design change the number of possible contexts? We provide an example in figure 2, comparing two configurations that yield different numbers of contexts for the same number of parameters.
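For illustration, the following minimal Python sketch (illustrative only, not part of our released code) enumerates the contexts of a single parameter in a small hypothetical DAG by listing every path from the function node that applies the parameter to an output node:

# Toy computational DAG: each key maps a node to the nodes it feeds into.
# "f1_w1" stands for the function node that applies parameter w1; "O1" and "O2"
# are output (leaf) nodes.
toy_graph = {
    "f1_w1": ["f2", "f3"],  # the output of f1_w1 is reused by two downstream functions
    "f2": ["O1"],
    "f3": ["O1", "O2"],
    "O1": [],
    "O2": [],
}

def contexts(graph, node):
    # Return every path from `node` to an output node (definition 1).
    targets = graph[node]
    if not targets:                      # reached an output node
        return [[node]]
    paths = []
    for target in targets:
        for path in contexts(graph, target):
            paths.append([node] + path)  # prepend the current node to each path
    return paths

for path in contexts(toy_graph, "f1_w1"):
    print(" -> ".join(path))
# Three paths are printed, i.e. |C_{w1}| = 3 for this toy graph.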


Figure 2. Given the same number of parameters, model design can change parameter efficiency: (a) $w_1$ contributes to a single context $C_1$. (b) $w_1$ contributes to an additional context $C_2$. The reusability prior suggests that $w_1$ is likely to be forced to function for both contexts and hence will become more reusable. Furthermore, by relying on the repetition of $w_1$ in both contexts, more complex functions can be succinctly described with the same number of parameters.


1.2. Parameter sharing

Small models often hit an upper bound early on in terms of performance on large-scale benchmarks, yet parameter sharing is ubiquitous in deep learning. For instance, convolutional kernels [20, 21] repeat spatially (i.e. the sliding window), while recurrent modules [22–26] (or any autoregressive model such as [27–29]) repeat temporally. Domain-specific benefits and being able to work with different input or output sizes are fair motivations for parameter sharing; ultimately, model capacity is recovered by scaling to larger models at the cost of more floating point operations (FLOPs).

An explicit form of parameter sharing is cross-layer parameter sharing, which ties the weights of architecturally repeating layers together. The literature shows that, at the cost of more FLOPs, cross-layer parameter sharing can sometimes improve parameter efficiency. Once a model with cross-layer parameter sharing is scaled back to a reasonable capacity, it has a reasonable chance to outperform the original baseline model [30–35]. In section 5.6 we confirm a similar outcome with our examples on some of the EfficientNetv2 models [36].

1.3. Improving parameter efficiency

We claim that all mainstream DL models already share parameters directly or indirectly to various degrees, and that this affects their parameter efficiency. In section 4.1, we introduce a simple horizontal unrolling approach that can make parameter sharing explicit for any directed acyclic graph (DAG).

In section 4, we propose the generalized notion of the reusability prior to disentangle the intrinsic design and training choices that affect a model's performance. A consequence of the reusability prior is that for models of similar size and capacity, designs that maximize the expected number of contexts are more likely to improve parameter efficiency. Our experiments on EfficientNetv2, as well as the literature on cross-layer parameter sharing, provide supporting evidence for the reusability prior.

We introduce a simple counting approach for model comparison. By treating the relative frequencies derived from the number of all possible contexts per learnable parameter as a probability distribution, we define quantities for the comparison of model graphs, including entropy, expected spread, and total surprisal. We give a formal proof that when the total number of contexts is held constant, increasing the expected spread reduces the entropy. We also introduce approaches to estimate performance for directly comparing models without training. We compare analysis and experiment results in tables 3 and 4. We scope our work by focusing only on the model design aspect described in section 2.2.3. The data and training aspects can be taken into account as well when considering the number of possible contexts. We leave this for future work.

In summary, our major contributions are twofold:

  • We introduce the reusability prior (section 4.2), and provide a methodology based on graph analysis (section 4.3). To our knowledge, we are the first to introduce a generalized notion of reusability that ties training, design, and data aspects together. For the design aspect, we introduce graph analysis based quantities derived from counting the number of contexts for each learnable parameter 1 . Overall, the quantities we introduce allow us to compare arbitrary DAGs or model architectures without relying on any training. Our graph analysis based quantities aligned well with the empirical results, including at least two important edge cases in section 5 in tables 3 and 4.
  • We empirically confirm that it is possible to achieve higher parameter efficiency by aggressively sharing parameters in EfficientNetv2 models. In our experiments, even though the original baseline EfficientNetv2-B0 models have at least 60% more parameters, EfficientNetv2-S models with aggressive parameter sharing consistently outperformed these baselines on the CIFAR-10 [37], CIFAR-100 [38], and Imagenet-1K [39] image classification benchmarks.

2. Background

Fundamentally, convolutional and recurrent neural networks (RNNs) rely on architectural priors based on reusing model components through space and time. Implicitly, RNNs share recurrent modules through time [40] and convolutional neural networks (CNNs) share convolutional kernels through space. Moreover, it is possible to share 3D convolution kernels spatiotemporally [41]. Reusing components in various ways is an important underlying pattern as larger and larger architectures are adopted [33, 35, 42–48] or searched [48–56], skip connections are utilized [57–60], domain-specific symmetries are captured [61–66], model size is reduced or manipulated [30, 31, 67–72], and parameters are either globally shared or transferred from different domains or models [48, 73–81].

DenseNet models are claimed to be compact due to the concatenation of previous features at each subsequent layer [58]. Another strategy for compactness is used by Xception, or 'Extreme Inception' [82]. It minimizes the parameters from $3\times3$ convolutions by using depthwise $3\times3$ convolution layers, i.e. one independent convolution kernel per channel. Then it relies on $1\times1$ convolutions which allow reusing the depthwise convolution outputs by taking the linear combinations of the outputs. $1\times1$ convolutions are an important form of parameter sharing as they are equivalent to fully connected layers shared in the height and width dimensions. MobileNet models [83, 84] also focus on maximizing the $1\times1$ convolution operations. Achieving compactness via soft sharing [85], student-teacher architectures [86, 87], pruning and quantization [88] are other relevant approaches.

There is also research on analyzing model architectures, as well as entropy-based approaches for analyzing how training data is utilized. For instance, Peer et al introduce batch entropy regularization that allows training very deep models without skip connections [89]. Wickstrøm et al analyze different training phases of deep models with information plane theory using Rényi's entropy [90]. Levine et al theoretically predict that there is an 'optimal depth-to-width allocation for a given self-attention network size'. They recommend significantly wider models as the model size increases and passes a certain threshold [91]. Bu et al investigate the topological entropy of neural networks with ReLU activations, providing an upper bound of $O(d \log{w})$ where d is the depth and w is the width or number of neurons at each layer [92]. Our work diverges from Bu et al as we use Shannon entropy [93] and provide a methodology that can work with any DAG without requiring constant width or a specific model. Furthermore, we focus on parameter efficiency, thus we opt for a probability distribution that is based on the relative frequencies derived from the number of contexts for each parameter.

2.1. Cross-layer parameter sharing

SharesNet [30] demonstrates that for wide residual networks (WRNs) [94], it is possible to surpass the original model's performance with parameter sharing and then scaling (i.e. increasing depth and/or width). Similar to lookup-based CNNs [95], Savarese and Maire train a model that learns to take a linear combination of a shared pool of kernels [31]. The coefficients for the linear combination of kernels are also used for constructing a similarity matrix so that similar layers can be shared. This approach outperforms the original WRN baseline on Imagenet-1K and CIFAR-10 with a similar or smaller number of parameters. Atom-coefficient decomposed convolution [32], inspired by [96], first decomposes convolutional kernels into bases and coefficients, and then shares the coefficients across layers. This leads to improved parameter efficiency for very deep convolutional nets [42], ResNet, and WRN baselines. Shapeshifter networks used for neural parameter allocation search do not make architectural assumptions such as repeated layers but, instead, learn to reuse parameters from a limited pool by transforming them into weights for any architecture [69].

Outside the computer vision domain, A Lite BERT (ALBERT) [33] uses cross-layer parameter sharing to first reduce the number of parameters and then scale up to a capacity comparable to the original BERT, bidirectional encoder representations from transformers, model [97]. This surpasses the original model's performance on language understanding tasks. Other relevant examples share attention weights [34], speed up training by parameter sharing and then unsharing [98], and explore sandwich-style weight sharing for generative transformers [35].

The usage of implicit models can be considered a continuous form of cross-layer parameter sharing. Well known examples are neural ordinary differential equation solvers [99, 100], variants of deep equilibrium models (DEQ), and multiscale DEQs [101, 102].

2.2. Maximizing the number of contexts

The deep learning literature has various strategies for improving performance. Intrinsically, they often seem to maximize the number of contexts for model components. In section 1.1 and figure 1, we provide more details about what is meant by context. Essentially, the number of contexts is affected by data, augmentation, training, regularization, and model design choices.

2.2.1. Diversity in input affecting the number of contexts

Aside from increasing the number of samples, a performance gain can be observed with data augmentation techniques and related approaches that rely on diversifying individual training samples (e.g. cutout [103]), or combining multiple samples (e.g. mixup [104], cutmix [105] etc).

2.2.2. Diversifying the roles of model components during training

By diversifying the hidden representations, regularization techniques such as dropout [106], stochastic depth [107], and block drop [60] increase the total number of contexts that can arise from the same training data.

2.2.3. Design choices impacting the role and scope of each model component

Decisions such as network depth and width, having CNN layers [20, 108], recurrent modules [40], residual connections [57], attention layers [109], training input size, and cross-layer parameter sharing [30] impact the expected number of contexts within a model's computational graph.

Overall, the literature hints at an important pattern that can improve the parameter efficiency of deep learning models. We generalize this notion as the reusability prior in section 4.2.

3. Overview

We propose a generalized notion of reusability and the reusability prior that encompasses training, data, and model design aspects in deep learning in section 4. We then focus on the model design aspect to analyze and compare models with graph analysis. In addition to the supporting evidence from the literature discussed in section 2, we conduct our own experiments by applying aggressive cross-layer parameter sharing to some EfficientNetv2 [36] models. We share the results from this strategy on image classification benchmarks in section 5. We analyze the computational graphs of EfficientNetv2 models and compare the quantities based on the reusability prior with the empirical results in sections 5.7 and 5.8.

4. The notion of reusability

We claim that maximizing the number of contexts for learnable parameters is critical. For instance, all mainstream DL models directly and/or indirectly share parameters, because reusing the output of a component in more than one place within a deep architecture is an indirect way of parameter sharing. Intrinsically, any deep learning model can be converted to a functionally equivalent form where parameter sharing is made explicit.

4.1. Horizontal unrolling

It is possible to horizontally duplicate parameters to remove multiple output edges in a given model graph so that each node has a single output edge. This would reproduce all node outputs from scratch, eliminating indirect parameter sharing (i.e. replacing feature sharing with direct parameter sharing). The resulting unrolled models would be functionally equivalent. We illustrate our point in figure 3. The nodes in earlier layers are duplicated more as they play more diverse roles. We introduce a simple recursive algorithm for horizontal unrolling in appendix A.


Figure 3. Horizontal unrolling reproduces all shared node outputs from scratch, eliminating indirect parameter sharing or feature reuse and instead replacing it with direct parameter sharing with duplicated parameters. (a) $w_1$ in the original graph has contexts $C_1$ and $C_2$. (b) $w_1$ is duplicated as many times as the number of its contexts in the horizontally unrolled graph (i.e. twice in this example). (c) A uniform graph has a single context for every parameter. It can be considered a non-shared generalization of any model graph, as there is neither feature nor parameter reuse.


Horizontal unrolling can convert any mainstream DL model into a form where parameter sharing is always direct, and each parameter is duplicated as many times as its number of contexts that can arise due to the computational graph. To our knowledge, no mainstream DL model grows exponentially in the number of parameters with depth. Hence their unrolled graphs share parameters. This manner of parameter sharing in the unrolled graph uses exponentially fewer parameters than a more general graph, exemplified in figure 3(c), that does not share any parameters. We call such a graph a uniform graph and formally define it as follows: Definition 2.

Let $\mathcal{G}$ be a model graph. $\mathcal{G}$ is uniform if and only if $|\mathcal{C}_{w_i}| = 1 \,\,\, \forall{w_i} \in \mathcal{G}$, where $|\mathcal{C}_{w_i}|$ is the cardinality of the set of all contexts for parameter $w_i$.

Note that for the horizontal unrolling to work, any cyclic graph would need to be vertically unrolled into a DAG first. This already happens in practice during training. Additionally, for differentiable approaches that transform a set of learnable parameters to weights, the parameter transformation and the originally shared parameters would need to be included in the graph (i.e. every wi would be replaced by the DAG that generates it).

There is a connection between parameter efficiency and parameter sharing. By estimating what the number of contexts in the unrolled form would be for each parameter, we quantify expected spread, total surprisal, and entropy for computational graphs of models.

4.2. The reusability prior

We conjecture that the expected number of contexts for model components is the major reason for differences in model performance. We introduce the reusability prior as follows:

Model components are forced to function in diverse contexts not only due to the training, data, augmentation, and regularization choices but also due to the model design itself. These aspects explicitly or implicitly impact the expected number of contexts for model components. Until model capacity is reached, maximizing this number improves parameter efficiency for models of similar size and capacity. By relying on the repetition of reusable components, a model can learn to describe an approximation of the desired function more efficiently with fewer parameters.

We provide justifications from the literature in section 2, definitions in sections 1.1 and 4.3, and finally supporting evidence from our experiments in section 5.

4.3. Quantities for model comparison

Based on the reusability prior, and the notion of context, we define the expected spread, entropy, and total surprisal. Then we provide two different ways of estimating model performance based on total surprisal and expected spread. Definition 3.

For a given model graph $\mathcal{G}$, the expected spread is given by

$E[\![\log_{2}|\mathcal{C}|]\!] = \sum\limits_{w_{i} \in \mathcal{G}} p(w_{i})\, \log_{2}|\mathcal{C}_{w_{i}}| \qquad (2)$

where $|\mathcal{C}_{w_i}|$ is the cardinality of the set of all contexts for $w_i$, $N_\mathcal{G}$ is the number of learnable parameters, and $p(w_{i})$ is the relative frequency:

$p(w_{i}) = \frac{|\mathcal{C}_{w_{i}}|}{N_\mathcal{C}} \qquad (3)$

where $N_\mathcal{C} = \sum\limits_{w_{j} \in \mathcal{G}} |\mathcal{C}_{w_{j}}|$ is the total number of contexts in $\mathcal{G}$.

In section 4.4.1 we prove that the expected spread is equal to the relative entropy or Kullback–Leibler divergence [110] of P(W) from the discrete uniform distribution.

Overall, this quantity prescribes a design maximizing the expected number of contexts 2 . Note that $N_\mathcal{C}$ and $N_\mathcal{G}$ grow differently. In (2), similar to $N_\mathcal{C}$, $|\mathcal{C}_{w_i}|$ grows exponentially with depth for the mainstream deep architectures. Definition 4.

Let $w_i$ be a parameter in graph $\mathcal{G}$. The entropy of the discrete probability distribution for the parameters of $\mathcal{G}$ based on the number of contexts is given by

$H(W) = -\sum\limits_{w_{i} \in \mathcal{G}} p(w_{i}) \log_{2} p(w_{i}) \qquad (4)$

where $p(w_{i}) = |\mathcal{C}_{w_{i}}|/ N_\mathcal{C}$ is the relative frequency.

Note that $p(w_{i})$ is also used in the calculation of the expected spread in definition 3. Definition 5.

Let $w_i$ be a parameter with relative frequency $p(w_{i})$ in graph $\mathcal{G}$. The total surprisal is given by

$\mathcal{S}_\mathcal{G} = -\sum\limits_{w_{i} \in \mathcal{G}} \log_{2} p(w_{i}) \qquad (5)$

Since the multiplicative $p(w_{i})$ term is removed, this is no longer the expected surprisal, i.e. the entropy in (4). Consequently, this quantity gives the surprisal of each parameter equal weight 3 .
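To make definitions 3–5 concrete, the following minimal Python sketch (illustrative only, not our released analysis code) computes the expected spread, entropy, and total surprisal from a list of per-parameter context counts; the counts used here are those of figure 3(a), worked out in appendix B:

import math

def quantities(context_counts):
    # Expected spread (2), entropy (4), and total surprisal (5) from the
    # per-parameter context counts |C_{w_i}|.
    n_c = sum(context_counts)                  # total number of contexts N_C
    probs = [c / n_c for c in context_counts]  # relative frequencies (3)
    spread = sum(p * math.log2(c) for p, c in zip(probs, context_counts))
    entropy = -sum(p * math.log2(p) for p in probs)
    surprisal = -sum(math.log2(p) for p in probs)
    return spread, entropy, surprisal

# Context counts of figure 3(a): w1..w4 have two contexts, w5..w8 have one.
spread, entropy, surprisal = quantities([2, 2, 2, 2, 1, 1, 1, 1])
print(round(spread, 2), round(entropy, 2), round(surprisal, 2))  # 0.67 2.92 24.68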

4.4. Maximizing the expected spread minimizes the entropy

If we consider parameters that affect smaller portions of the model (i.e. with a smaller spread) more specific or surprising, then for optimal encoding of the horizontally unrolled graph, expected spread would encourage assigning shorter bit lengths for more repeated components. For instance, parameters of the mainstream deep learning models have an exponential distribution where parameters in the earlier layers have exponentially larger numbers of contexts. For large models with the same number of parameters, this results in a much lower entropy compared to a uniform distribution which has the maximum possible entropy $\log_{2}{N_\mathcal{G}}$.

Overall, when other conditions such as model size, capacity, training data etc are similar, the reusability prior encourages increasing the expected spread. This may lead to an improvement in parameter efficiency as the entropy, i.e. the expected bit length, is reduced. In other words, unrolled graphs have smaller description lengths when there is a lot of repetition in the earlier layers: to reuse is to simplify.

4.4.1. Proofs for the connections between the quantities for model comparison

Lemma 1.

Let P(W) be the discrete probability distribution of parameters based on their number of contexts in graph $\mathcal{G}$. The Kullback–Leibler (KL) divergence $D_{\mathrm{KL}}(P(W)||P_{U}(W))$ from the discrete uniform distribution $P_{U}(W)$ is equivalent to the expected spread.

Proof.

Let the relative frequency of $w_i$ be $p(w_i) = c_i / N_\mathcal{C}$ where $c_i = |\mathcal{C}_{w_{i}}|$ and $N_\mathcal{C} = \sum\limits_{w_{j} \in \mathcal{G}} |\mathcal{C}_{w_{j}}|$. Then the KL divergence of the probability distribution of parameters based on the number of contexts P(W) from the uniform distribution $P_U(W)$, where $p_U(w_i) = 1/N_{\mathcal{C}}$, is given by

$D_{\mathrm{KL}}(P(W)\,||\,P_{U}(W)) = \sum\limits_{w_{i} \in \mathcal{G}} p(w_i) \log_{2}\frac{p(w_i)}{p_U(w_i)} = \sum\limits_{w_{i} \in \mathcal{G}} \frac{c_i}{N_\mathcal{C}} \log_{2}\frac{c_i/N_\mathcal{C}}{1/N_\mathcal{C}} = \sum\limits_{w_{i} \in \mathcal{G}} p(w_i) \log_{2} c_i = E[\![\log_{2}|\mathcal{C}|]\!].$

Note that, if P(W) is derived from the graph in figure 3(a) by directly counting the frequencies from its unrolled version in figure 3(b), $P_U(W)$ can correspond to the uniform graph in figure 3(c). In general, the cases where $p(w_j) = 0$ and $p_{U}(w_j) = 1/N_\mathcal{C}$ do not change the summation. Theorem 1.

Let $N_\mathcal{C} = \sum\limits_{w_{j} \in \mathcal{G}} |\mathcal{C}_{w_{j}}|$ be the total number of all contexts. For all graphs $\mathcal{G}_i$, when $N_{\mathcal{C}}$ is held constant, maximizing the expected spread minimizes the entropy.

Note that $N_{\mathcal{C}}$ is equivalent to the number of parameters in $\mathcal{G}$'s horizontally unrolled uniform graph, as in figures 3(c) and 2. For graphs with identical unrolled uniform graphs, maximizing the expected spread is equivalent to parameter sharing 4 . For different architectures, as long as $N_{\mathcal{C}}$ is held constant, the theorem still holds. The formal proof is as follows: Proof.

For any discrete probability distribution P(W) over $N_\mathcal{C}$ outcomes, it is already known that:

$D_{\mathrm{KL}}(P(W)\,||\,P_{U}(W)) = \log_{2} N_\mathcal{C} - H(W).$

Therefore, from lemma 1, the KL divergence, i.e. the expected spread, is always:

$E[\![\log_{2}|\mathcal{C}|]\!] = \log_{2} N_\mathcal{C} - H(W).$

Thus, we can write:

$H(W) = \log_{2} N_\mathcal{C} - E[\![\log_{2}|\mathcal{C}|]\!].$

Hence, when $N_\mathcal{C}$ is constant, reducing the entropy would increase the expected spread and vice versa.
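As a numerical sanity check of this relationship (an illustrative sketch, not part of our experiments), the entropy and the expected spread always sum to $\log_{2} N_\mathcal{C}$ for any set of context counts:

import math, random

random.seed(0)
counts = [random.randint(1, 50) for _ in range(100)]  # arbitrary context counts
n_c = sum(counts)
probs = [c / n_c for c in counts]
entropy = -sum(p * math.log2(p) for p in probs)
spread = sum(p * math.log2(c) for p, c in zip(probs, counts))
# H(W) + E[[log2|C|]] = log2(N_C), so for a fixed N_C one decreases as the other grows.
assert abs(entropy + spread - math.log2(n_c)) < 1e-9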

4.5. Estimating model performance

Model performance is largely associated with the number of parameters and the size of the training data. For CNNs the training image size is relevant as well. Yet it is often unclear why some model designs consistently outperform others when the number of parameters, the training data, and the training conditions are similar. A consequence of the reusability prior is that by estimating the relative frequencies of each parameter, it is possible to quantify how the model design itself impacts the entropy, expected spread, and total surprisal. For the scenario where the training data and strategies such as regularization are unchanged, we propose using total surprisal to predict model performance, with the assumption that when other conditions are similar, a model with a higher descriptive ability would perform better. We normalize this quantity for different model sizes and multiply it by the input size as follows: Definition 6.

Let $\mathcal{S}_\mathcal{G}$ be the total surprisal of graph $\mathcal{G}$, $N_I$ the total number of input nodes, and $|\mathcal{G}|$ the summation of the total number of input, output, and weight nodes. The estimated performance is given by

$P_G = \log_{2}\!\left(\frac{\mathcal{S}_\mathcal{G}\, N_I}{|\mathcal{G}|}\right) \qquad (6)$

Note that the number of weight nodes can be larger than the number of parameters $N_\mathcal{G}$ when explicitly sharing parameters.

4.5.1. An alternative estimation of model performance

As a potential alternative to the total surprisal in (6), expected spread multiplied by the number of learnable parameters $N_\mathcal{G}$ can be used (i.e. replacing $\mathcal{S}_\mathcal{G}$ with $N_\mathcal{G} E[\![\log_{2}|\mathcal{C}| + 1]\!] $). This alternative estimation is given by:

$P^{\prime}_G = \log_{2}\!\left(\frac{N_\mathcal{G}\,\left(E[\![\log_{2}|\mathcal{C}|]\!] + 1\right) N_I}{|\mathcal{G}|}\right) \qquad (7)$

Our main justification for using expected spread is given in section 4.2. For models with a comparable number of parameters and model size, models with a higher expected spread (i.e. with the ability to approximate more complicated functions) would likely perform better after training until convergence. We observe this in table 2 for multiple experiments and datasets.
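For reference, the two estimators in (6) and (7) translate directly into code as follows (a minimal sketch; the variable names are ours):

import math

def estimated_performance(total_surprisal, n_inputs, graph_size):
    # Equation (6): estimate from the total surprisal S_G.
    return math.log2(total_surprisal * n_inputs / graph_size)

def estimated_performance_alt(expected_spread, n_params, n_inputs, graph_size):
    # Equation (7): alternative estimate from the expected spread.
    return math.log2(n_params * (expected_spread + 1) * n_inputs / graph_size)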

Table 1. Details on CIFAR and Imagenet datasets.

Dataset | Train | Validation | Classes
CIFAR-10 [38] | 50 000 | 10 000 | 10
CIFAR-100 [38] | 50 000 | 10 000 | 100
Imagenet-1K [115] | 1.28M | 50 000 | 1000

Table 2. EfficientNetV2 [36] trained from scratch on CIFAR-10, CIFAR-100, and Imagenet-1K. At the cost of more FLOPs, V2-S-shared models that aggressively apply cross-layer parameter sharing (bold) achieve better top-1 accuracy scores than V2-B0 models that have at least 60% more parameters.

Model | Dataset | Params | Score | Augm. | Batch | Epoch | FLOPs
V2-B0-shared | CIFAR-10 | 1.7 M | 95.3 | None | 512 | 300 | 0.7B
V2-B0-original | CIFAR-10 | 5.9 M | 95.6 | None | 512 | 300 | 0.7B
V2-S-shared | CIFAR-10 | 3.1 M | 96.0 | None | 512 | 300 | 8.8B
V2-B0-shared | CIFAR-100 | 1.8 M | 78.8 | AutoAug [116] | 512 | 300 | 0.7B
V2-B0-original | CIFAR-100 | 5.9 M | 80.6 | AutoAug | 512 | 300 | 0.7B
V2-S-shared | CIFAR-100 | 3.2 M | 81.5 | AutoAug | 510 | 300 | 8.8B
V2-B0-shared | Imagenet-1K | 3.0 M | 73.7 | RandAug [117] | 400 | 352 | 0.7B
V2-B0-original | Imagenet-1K | 7.1 M | 76.9 | RandAug | 400 | 352 | 0.7B
V2-S-shared | Imagenet-1K | 4.4 M | 78.3 | RandAug | 400 | 352 | 8.8B

Overall, by relying on our graph-based analysis and counting approach, we proposed two crude ways to estimate the performance of models without any training 5 . In practice, both have different strengths and weaknesses that we discuss in section 6.

Table 3. Analysis without training vs our experiment results from Imagenet-1K. Note that a naive prediction would correlate the given top-1 accuracy scores directly with the number of parameters. Our performance estimations based on total surprisal and expected spread (PG and $P^{^{\prime}}_G$ respectively) instead correctly favor V2-S-shared (bold) compared to V2-B0.

Model | Score | $P_G$ | $P^{\prime}_G$ | Exp. spread | Entropy | Params | Img. size
V2-B0-shared | 73.7 | 25.00 | 25.44 | 746.608+ | 10.2 | 3.0 M | 224
V2-B0 | 76.9 | 26.28 | 26.72 | 746.608 | 10.2 | 7.1 M | 224
V2-S-shared | 78.3 | 26.33 | 26.97 | 1505.547 | 9.8 | 4.4 M | 384
V2-S | 83.2 | 28.72 | 29.28 | 1505.546 | 9.8 | 21.6 M | 384

Table 4. Analysis without training vs some of the top-1 accuracy scores for larger models [36] for Imagenet-1K. Note that a naive prediction would correlate the given top-1 accuracy scores directly with the number of parameters. Our graph analysis based estimations instead assign estimated performances based on total surprisal and expected spread (PG and $P^{^{\prime}}_G$ respectively) to ResNet-50 (bold) lower than V2-B3 and V2-S models which both have a smaller number of parameters.

Model | Score | $P_G$ | $P^{\prime}_G$ | Exp. spread | Entropy | Params | Img. size
V2-B1 | 79.80 | 26.75 | 27.20 | 913.4 | 10.2 | 8.2 M | 240
ResNet-50 | 80.30 | 27.25 | 27.60 | 465.4 | 13.3 | 25.6 M | 380
V2-B3 | 82.10 | 27.71 | 28.20 | 1167.2 | 10.6 | 14.5 M | 300
V2-S | 83.6+ | 28.72 | 29.28 | 1505.5 | 9.8 | 21.6 M | 384
V2-M | 85.10 | 29.9+ | 30.4+ | 2171.1 | 9.8 | 54.4 M | 480
V2-L | 85.70 | 30.5+ | 30.9+ | 3095.4 | 10.3 | 119.0 M | 480

5. Methodology and experiments

We analyzed the computational graphs of EfficientNetv2 [36] models, without training, to estimate the quantities described in section 4.3. For training models from scratch in our experiments, we adopted the same EfficientNetv2 models to explore the effects of aggressive parameter sharing. We then compared the empirical results with the results from our graph analysis based quantities. We used Tensorflow [111] for the experiments and our own Python [112] library for the graph-based analysis.

5.1. Analyzing computational graphs

We quantify the predictions from the reusability prior for model graphs as follows:

  • (i)  
    We estimate the relative frequencies of the learnable parameters in the horizontally unrolled computational graph of a given model (e.g. for the original graph in figure 3(a) it is possible to estimate the relative frequencies from its unrolled form in figure 3(b). Please see appendix B for the full example.).
  • (ii)  
    The resulting probability distribution allows deriving model-level quantities that we described in section 4.3.
  • (iii)  
    We use total surprisal and expected spread for estimating model performance as described in section 4.5.

In practice, since the EfficientNetv2 and ResNet-50 models have convolutional layers, we took into account the full computational graph, which depends on aspects such as image size, strided convolutions, batch normalization, and pooling layers. Thus we included all learnable parameters from convolutional kernels, batch normalization layers, and the final fully connected classification layers, which have bias weights. The Python implementation of our graph analysis relies on imitating the computational graph of the EfficientNetv2 and ResNet-50 models with layers that, instead of inference, count the aggregated number of contexts. This does not require creating a horizontally unrolled graph, which would have been exponentially more expensive. Yet this approach still allows gathering the total number of contexts for each learnable parameter in a precise manner and calculating model-level quantities for comparison. Our full code release to reproduce the analysis results is available at [113].
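A simplified sketch of this counting idea on a generic toy DAG is given below (illustrative only; this is not our EfficientNetv2 analysis code, and the node names are made up). The number of contexts of a function node equals the number of paths from that node to any output, which a memoized traversal computes without ever building the unrolled graph:

from functools import lru_cache

toy_graph = {
    "f1_w1": ["f2_w2", "f3_w3"],
    "f2_w2": ["O1"],
    "f3_w3": ["O1", "O2"],
    "O1": [],
    "O2": [],
}

@lru_cache(maxsize=None)
def num_contexts(node):
    # Number of paths from `node` to any output node, computed in time
    # linear in the number of edges thanks to memoization.
    targets = toy_graph[node]
    if not targets:          # an output node terminates exactly one path
        return 1
    return sum(num_contexts(t) for t in targets)

counts = {node: num_contexts(node) for node in toy_graph if node.startswith("f")}
print(counts)  # {'f1_w1': 3, 'f2_w2': 1, 'f3_w3': 2}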

5.2. EfficientNetv2

EfficientNetv2 models have relatively high parameter efficiency, and they have competitive performance in image classification benchmarks. Would applying aggressive parameter sharing to an already compact model still improve parameter efficiency? To answer this question, we conducted multiple experiments on EfficientNetv2 models. We focused on comparing EfficientNetv2-B0 and EfficientNetv2-S models. For cross-layer parameter sharing, our experiments revealed results in agreement with the literature discussed in section 2.1. To make it easier to minimize confounding variables due to data and training strategies, we limited the models we trained from scratch to only EfficientNetv2. We conducted our training in a controlled setting, spanning three different benchmarks. This helped us eliminate differences due to hardware and hyperparameters for the selected models.

5.3. Aggressive parameter sharing

For the EfficientNetv2 experiments, we modified the official Tensorflow implementation [114] so that we could optionally apply aggressive parameter sharing. This strategy roughly amounts to treating weight matrices of the same shape as the same matrix. One exception is that we did not share the batch normalization weights, to help training stay stable.

During the initialization of models with parameter sharing, we keep a global scope dictionary. Convolutions of the same scope are only created once. Convolutional layers are represented as four-dimensional tensors whose shape is defined by the channel and kernel sizes. More precisely, we map each shared convolution to a scope name that is constructed by combining the number of input channels, the number of output channels, the kernel size, and the strides.
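A minimal sketch of this scope-dictionary idea is shown below (simplified and not identical to our modified implementation; the helper get_shared_conv is an illustrative stand-in):

import tensorflow as tf

_SHARED_CONVS = {}  # global scope dictionary: scope name -> shared Conv2D layer

def get_shared_conv(in_channels, out_channels, kernel_size, strides):
    # The scope name combines input channels, output channels, kernel size, and
    # strides; convolutions of the same scope are only created once.
    scope = f"conv_{in_channels}_{out_channels}_{kernel_size}_{strides}"
    if scope not in _SHARED_CONVS:
        _SHARED_CONVS[scope] = tf.keras.layers.Conv2D(
            out_channels, kernel_size, strides=strides, padding="same",
            use_bias=False, name=scope)
    return _SHARED_CONVS[scope]

# Two call sites with identical shapes reuse the same kernel weights, while the
# batch normalization layers are intentionally kept separate.
x = tf.random.normal([1, 32, 32, 24])
y1 = tf.keras.layers.BatchNormalization()(get_shared_conv(24, 24, 3, 1)(x))
y2 = tf.keras.layers.BatchNormalization()(get_shared_conv(24, 24, 3, 1)(y1))
assert get_shared_conv(24, 24, 3, 1) is get_shared_conv(24, 24, 3, 1)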

5.4. Hyperparameters

For cross-layer parameter sharing, we used a new hyperparameter which aggressively shares convolutions but uses separate batch normalization layers: model.weight_sharing = all_but_bn. The only change we made to the hyperparameters when comparing an original model with one that shares convolutional layers is that we used model.weight_sharing = None for the original one. We closely followed the default hyperparameters for EfficientNetv2 given by Tan and Le [36], except for the following changes:

  • For all models trained with CIFAR-10, instead of transfer learning, we trained from scratch. We used no augmentation and trained for 300 epochs without any training stages. We used the following hyperparameters: train_epoch = 300, batch_size = 512, data.ibase = 32, train.lr_warmup_epoch = 5, train.lr_sched = exponential, data.mixup_alpha = 0, data.cutmix_alpha = 0, train.lr_base = 0.016, model.bn_momentum = 0.99.
  • For all models trained with CIFAR-100, we followed the same scenario and hyperparameters, except for the data augmentation: train_epoch = 300, batch_size = 512, data.ibase = 32, train.lr_sched = exponential, train.lr_warmup_epoch = 5, data.augname = autoaug, train.lr_base = 0.016, model.bn_momentum = 0.99.
  • For all models trained with Imagenet-1K, we used the following hyperparameters: train_epoch = 352, batch_size = 400, train.stages = 4, train.lr_sched = exponential, model.dropout_rate = 0.075, train.lr_warmup_epoch = 5, data.ram = 2, train.ema_decay = 0.9999, train.lr_base = 0.025, model.bn_momentum = 0.99, data.augname = randaug.

The changes in the batch size and learning rate were necessary to be able to train the models on smaller GPUs (i.e. we reduced the batch size and learning rate by the same ratio). Note that, in the described setting, when trained from scratch, we observed lower performance for both the EfficientNetv2-B0 and EfficientNetv2-S models compared to [36]. Larger batch sizes and longer training can sometimes improve performance, but we observed a substantial gap in performance between the V2-S-shared and V2-B0-original models regardless. In our previous experiments with Imagenet-1K, we tested different hyperparameters such as gclip and batch size, as well as a larger number of training epochs. Moreover, for CIFAR-100 we tested using no augmentation. None of these changes in the hyperparameters changed the ranking of the models in terms of top-1 accuracy given in table 2.

5.5. Datasets

In our experiments, we trained our models from scratch using the well known object classification benchmarks for visual recognition given in table 1, which consist of CIFAR-10, CIFAR-100, and the Imagenet Large Scale Visual Recognition Challenge (ILSVRC2012, also referred to as Imagenet-1K).

5.6. Experiment results from aggressive parameter sharing

When the EfficientNetv2 models were trained from scratch, we observed a consistent increase in parameter efficiency for the V2-S models with aggressive parameter sharing compared to V2-B0, which has significantly more parameters, as given in table 2.

Overall, at the cost of additional FLOPs, we observed improved parameter efficiency for all three benchmarks. Alongside the research regarding parameter sharing in the literature discussed in section 2.1, and combined with our graph analysis results in table 3, these results provide supporting evidence for the reusability prior.

5.7. Predictions from graph analysis vs experiment results

In our experiments with EfficientNetv2 models, V2-S-shared models consistently performed better than the V2-B0 models, despite having a significantly lower number of parameters. In table 3, we compared the results from our Imagenet-1K experiments with the results from our analysis of the computational graphs of V2-B0 and V2-S models, with and without parameter sharing. Unlike a naive prediction which would directly correlate the number of parameters with performance, our graph analysis assigned a higher score for the V2-S-shared model.

5.8. Graph analysis results for larger models

We compared our graph analysis based quantities for larger models that were trained by Tan and Le in [36]. 6 In table 4 the estimated performance and expected spread of ResNet-50 are lower, while its entropy and number of parameters are significantly higher than those of the V2-B3 and V2-S models. Coincidentally, for the given models, the performance estimations were roughly one third of the actual top-1 accuracies, but unlike the top-1 accuracy, our performance estimation scores are not bounded (e.g. $P_G$ can be negative).

6. Limitations and future work

The reusability prior we defined encompasses design, training, and data aspects in deep learning. In this work, however, we mainly focused on the design aspect. We ignored contexts that can arise due to the concrete samples from the training data itself. When counting the number of contexts, we only considered what can arise due to the computational graph of the model as well as the input size. Generalizing the idea of context to take into account data and training techniques such as regularization, or comparing different definitions of context, is a future research direction. Another direction is to investigate potential analogies between microstates from statistical mechanics and our definition of context.

For larger models, we omitted additional experiments with cross-layer parameter sharing. From the perspective of the reusability prior, all models we analyzed already explicitly (e.g. $1\times1$ convolutions) or implicitly (e.g. skip connections) share parameters. Therefore, we instead provided a comparison of existing scores from [36] combined with predictions from our graph-based analysis in table 4.

An important research direction is to test the reusability prior with neural architecture search. In practice, additional experiments will likely be necessary to first derive a utility function that aligns well with major practical concerns that may be task-specific. For instance, the performance estimation approaches we introduced do not necessarily optimize for compute resources. For the same number of parameters, both approaches are biased towards deeper and narrower models, since our counting approach does not use any discount factor for the path length of a given context. Each context is counted as one, regardless of the distance between a given parameter and a target node. During graph analysis, incrementing the total number of contexts by a fractional number instead of incrementing by one may be an important improvement. Furthermore, total surprisal based performance estimation would penalize very deep models with cross-layer parameter sharing for not being descriptive enough for their large size. For shallower but exponentially larger models, the expected spread based estimation would severely penalize them for not reusing parameters enough. Overall, concerns including compute limitations, model size, FLOPs, and latency may be relevant for finding the most appropriate utility function. If such a function can be created by relying on the quantities or the graph analysis based framework we introduced, then searching viable model designs by relying solely on graph analysis, without any training, can be an interesting direction.

7. Conclusion

Not all performance can be explained directly by the number of parameters and the training data. We introduced the reusability prior to point towards a deeper reason. We first demonstrated that, either explicitly or implicitly, all mainstream deep learning models reuse parameters. We then introduced a generalized notion of reusability that encompasses aspects such as training, data, and model design that affect the number of contexts in which model components have to function. Focusing on the model design aspect, for model comparison, we defined quantities, namely entropy, expected spread, and total surprisal, which rely on analyzing the computational graph of a model. We gave a formal proof that maximizing the expected number of contexts for model components minimizes the entropy when the total number of available contexts is the same.

To test our approach in practice, we proposed two crude performance estimation approaches based on total surprisal and expected spread. We then compared EfficientNetv2 models by training them from scratch with and without cross-layer parameter sharing. A naive approach would have correlated models with a lower number of parameters with lower performance. We demonstrated a counter-example where EfficientNetv2-S models with parameter sharing outperformed the baseline EfficientNetv2-B0 models, which have at least 60% more parameters. We gave another edge case with ResNet-50, which, despite having significantly more parameters, underperformed compared to the EfficientNetv2-B3 and V2-S models. Our graph-based estimations of performance gave appropriate scores for both cases, correctly ranking ResNet-50 below these models in terms of performance.

In the model analysis based experiments, our counting approach allowed calculating model-level quantities for comparison, consequently correctly predicting the rank of all models in terms of top-1 accuracy. In contrast, the naive approach of relying on the number of learnable parameters failed to correctly rank the models of varying parameter efficiency (i.e. tables 3 and 4). In practice, as discussed in section 6, the reusability prior and our proposed framework may lead to new approaches for neural architecture search, or help researchers improve their model design in the right direction before any training is done.

Future work with further experiments and analysis will reveal whether our graph-based approach for estimating performance is generalizable to more models. Yet, for the models we investigated, our approach was able to correctly delineate at least two important edge cases in tables 3 and 4. Furthermore, the quantities we introduced based on the reusability prior aligned well with the experiment results as well as the existing results from the literature for larger models. For estimating these quantities we only relied on the model graphs without any training.

We conclude that the reusability prior provides a viable research direction for connecting different aspects of deep learning under the same framework that is majorly based on a very simple idea: counting the number of contexts for model parameters. This may lead to further research and important predictions on how deep learning models may be affected by different design, augmentation, and training choices.

Acknowledgments

The numerical calculations reported in this paper were partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources). We appreciate the GPU resources that were allocated for a portion of this research.

We thank Ugur Halıcı, Emre Akbas, Selim Temizer, Kasım Öztoprak, the anonymous reviewers for their constructive input, and Georgina Romo Olivares for her support.

Data availability statement

The data that support the findings of this study are available upon reasonable request from the authors.

Conflicts of interest

The authors declare no conflict of interest.

Appendix A: Horizontal unrolling algorithm

For illustrative purposes, we provide an algorithm for the horizontal unrolling approach we described. Algorithm 1 takes the original DAG input, creates a new DAG with a root node, and calls unrollNode for each leaf node. In unrollNode, for each source node (i.e. each node that has an output to the current node), it recursively unrolls until the root node is reached while adding a duplicate of the current node for each source node. This results in a horizontally unrolled graph where the relative frequencies can be directly calculated by counting the duplicates for each learnable parameter. The relative frequencies are treated as a discrete probability distribution. Based on this probability distribution, we derive the model-level quantities for comparison.

Algorithm 1. Horizontal Unrolling.
Function unrollNode(node, root) is
  duplicate := Node(node.name)
  for source in node.sources do
    if isRoot(source) then
      root.addTarget(duplicate)
      continue
    end
    unrollNode(source, root).addTarget(duplicate)
  end
  return duplicate
end
Function horizontalUnroll(graph) is
  unrolled := DAG('unrolled')
  leafNodes := getLeafNodes(graph)
  for node in leafNodes do
    unrollNode(node, unrolled)
  end
  return unrolled
end

Due to the simplicity of algorithm 1, similar approaches may already exist for other domains; therefore despite not being able to find a relevant study, we suspect that we reinvented horizontal unrolling for a new use case. Our main use of the algorithm is to illustrate our point that all DAGs can be converted to a functionally equivalent form where all parameter sharing is made explicit, and in this form, the number of contexts can be directly counted.

For modern deep learning models, the horizontal unrolling algorithm would leave an unrolled graph with an exponentially large number of duplicated components. In practice, for our graph analysis, we do not apply horizontal unrolling to graphs at all due to the complexity. For instance, for MLPs the time and space complexity would both be in the order of $O(\mathrm{width}^\mathrm{depth})$. We instead use an optimized counting approach. Please see our Github repo for additional details: https://github.com/gozepolat/priors/tree/main/reusability.
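For completeness, a small runnable Python translation of algorithm 1 is given below, intended for tiny graphs only because of this exponential blow-up (the Node class is an illustrative stand-in, not a class from our library):

from collections import Counter

class Node:
    def __init__(self, name):
        self.name, self.sources, self.targets = name, [], []
    def add_target(self, node):
        self.targets.append(node)
        node.sources.append(self)

def unroll_node(node, root):
    # Duplicate `node` once per path that reaches it, recursing towards the root.
    duplicate = Node(node.name)
    for source in node.sources:
        if not source.sources:           # the source is the root node
            root.add_target(duplicate)
            continue
        unroll_node(source, root).add_target(duplicate)
    return duplicate

def horizontal_unroll(leaf_nodes):
    unrolled = Node("unrolled")
    for node in leaf_nodes:
        unroll_node(node, unrolled)
    return unrolled

# Toy graph: root -> w1 -> {w2, w3} -> outputs; w1 ends up duplicated twice.
root, w1, w2, w3, o1, o2 = (Node(n) for n in ["root", "w1", "w2", "w3", "o1", "o2"])
root.add_target(w1); w1.add_target(w2); w1.add_target(w3)
w2.add_target(o1); w3.add_target(o2)

def collect_frequencies(node, bag):
    for target in node.targets:
        bag[target.name] += 1
        collect_frequencies(target, bag)

frequencies = Counter()
collect_frequencies(horizontal_unroll([o1, o2]), frequencies)
print(frequencies)  # w1 appears twice; every other node appears once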

Appendix B: An illustration of how we estimate model performances

Step by step, we analyze the graph depicted in figure 3(a) as an example. For large models, we use an optimized algorithm but for the sake of simplicity, we rely on algorithm 1 in this example.

  • (a)  
    Horizontally unroll the graph using algorithm 1. This results in the graph in figure 3(b).
  • (b)  
    For each learnable parameter directly count the repetitions in the unrolled graph, i.e. collect the frequencies: $w_1 = 2, w_2 = 2, w_3 = 2, w_4 = 2, w_5 = 1, w_6 = 1, w_7 = 1, w_8 = 1$
  • (c)  
    Estimate the probabilities as relative frequencies: $p(w_1) = 2/12$, $p(w_2) = 2/12$, $p(w_3) = 2/12$, $p(w_4) = 2/12$, $p(w_5) = 1/12$, $p(w_6) = 1/12$, $p(w_7) = 1/12$, $p(w_8) = 1/12$. We use these probabilities for the calculations of the total surprisal, entropy, and expected spread.
  • (d)  
    Total surprisal: $-\mathrm{log}_2(2/12)-\mathrm{log}_2(2/12)-\mathrm{log}_2(2/12)-\mathrm{log}_2(2/12)-\mathrm{log}_2(1/12)-\mathrm{log}_2(1/12)-\mathrm{log}_2(1/12)-\mathrm{log}_2(1/12) = -4 \times (\mathrm{log}_2(2/12)+\mathrm{log}_2(1/12)) = 24.68$
  • (e)  
    Entropy: $-4 \times (2/12 \times \mathrm{log}_2(2/12)+1/12 \times \mathrm{log}_2(1/12)) = 2.92$
  • (f)  
    Expected spread: $4 \times (2/12 \times \mathrm{log}_2(2) + 1/12 \times \mathrm{log}_2(1)) = 0.67$
  • (g)  
    Total surprisal based performance estimation: input nodes $N_{I} = 2$ and model size $|G| = 2+1+8 = 11$ and total surprisal = 24.68, so $P_G = \log_2(24.68 \times 2/11) = 2.17$.
  • (h)  
    Expected spread based performance estimation: there are 8 learnable parameters and the expected spread = 0.67, so $P^{\prime}_G = \log_2((0.67 + 1) \times 8 \times 2/11) = 1.28$.

In our Github repository, we share multiple examples of how we derive the model-level quantities including the graphs from figures 2(a), (b) and 3(a), (c). The examples are available here: https://github.com/gozepolat/priors/tree/main/reusability.
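As a cross-check, the following short sketch (illustrative only, not part of the repository) reproduces the numbers in steps (d)–(h) above:

import math

counts = [2, 2, 2, 2, 1, 1, 1, 1]          # context counts of w1..w8 from step (b)
n_c = sum(counts)                           # total number of contexts: 12
probs = [c / n_c for c in counts]           # relative frequencies from step (c)

total_surprisal = -sum(math.log2(p) for p in probs)                     # (d) 24.68
entropy = -sum(p * math.log2(p) for p in probs)                         # (e) 2.92
expected_spread = sum(p * math.log2(c) for p, c in zip(probs, counts))  # (f) 0.67

n_inputs, graph_size, n_params = 2, 11, 8   # N_I, |G|, and N_G for figure 3(a)
p_g = math.log2(total_surprisal * n_inputs / graph_size)                       # (g) 2.17
p_g_alt = math.log2((expected_spread + 1) * n_params * n_inputs / graph_size)  # (h) 1.28
print(round(p_g, 2), round(p_g_alt, 2))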

Footnotes

  • To our knowledge, the majority of our definitions, lemmas, theorems, and proofs are original, except for what we borrowed from information theory, namely entropy, surprisal, expected value, and an existing identity regarding Kullback–Leibler divergence used in section 4.4.1. Despite not being able to find them in other relevant studies, we suspect that the horizontal unrolling approach and the idea of a uniform graph we introduced in section 4.1 likely already exist in other disciplines, and we reinvented them for our use case.

  • When all other conditions are fixed, expected spread can be considered a measure of descriptive reusability i.e. parameter efficiency. When comparing models with a similar number of learnable parameters and graph size, models with a higher expected spread can describe more complicated functions.

  • Total surprisal can be considered a measure of descriptive ability. That is, models that have higher total surprisal can describe more complicated functions.

  • If one takes into account the full scope of the reusability prior, e.g. the literature in section 2.2, then diversifying input and diversifying the role of model components with regularization would also increase the expected spread. We leave this for future work.

  • We share both estimation results in tables 3 and 4 as '$P_G$' and '$P^{\prime}_G$'.

  • Note that, compared to our experiments, the V2-S model has a higher top-1 accuracy of 83.6–83.9 in their experiments. This does not change our comparison.
