Neural networks and quantum field theory

We propose a theoretical understanding of neural networks in terms of Wilsonian effective field theory. The correspondence relies on the fact that many asymptotic neural networks are drawn from Gaussian processes (GPs), the analog of non-interacting field theories. Moving away from the asymptotic limit yields a non-Gaussian process (NGP) and corresponds to turning on particle interactions, allowing for the computation of correlation functions of neural network outputs with Feynman diagrams. Minimal NGP likelihoods are determined by the most relevant non-Gaussian terms, according to the flow in their coefficients induced by the Wilsonian renormalization group. This yields a direct connection between overparameterization and simplicity of neural network likelihoods. Whether the coefficients are constants or functions may be understood in terms of GP limit symmetries, as expected from ’t Hooft’s technical naturalness. General theoretical calculations are matched to neural network experiments in the simplest class of models allowing the correspondence. Our formalism is valid for any of the many architectures that become a GP in an asymptotic limit, a property preserved under certain types of training.


Introduction
The relationship between asymptotic neural networks and Gaussian processes provides a strong hint towards a theoretical understanding of deep learning. Rather than considering a neural network to be determined by draws from a parameter space distribution, this perspective considers neural networks themselves as draws from a function space distribution. The essential idea is that a family of neural network architectures

f_{θ,N} : R^{d_in} → R^{d_out}    (1.1)

indexed by parameters θ and a discrete hyperparameter N admits a limit N → ∞ in which networks are drawn from a Gaussian process (GP), i.e. a Gaussian distribution on function space,

P(f) ∝ exp[ -\frac{1}{2} \int d^{d_in}x\, d^{d_in}x'\, f(x)\, Ξ(x, x')\, f(x') ],

where the functional / operator inverse of Ξ(x, x') is the GP kernel. Of course, in practice one usually studies networks with large-but-finite N. These should be drawn from a distribution that receives 1/N corrections relative to the Gaussian distribution, i.e., a non-Gaussian process (NGP).
Learning is then a data-induced flow of the function space distribution; recent literature and the above argument together suggest that the distribution remains an NGP during training. The idea that neural networks are drawn from a non-Gaussian (but close-to-Gaussian) distribution on function space is immediately suggestive to physicists: such mathematics provides the backbone of perturbative quantum field theory (QFT), the framework that underlies numerous physical systems, from superconductors to the Standard Model of particle physics. From the perspective of Feynman's path integral, non-interacting ("free") quanta or particles are described by appropriate Gaussian distributions on function (field) space, where details depend on symmetries and particle properties such as Lorentz or rotational invariance and spin. When the fields do not have any vacuum expectation value the corresponding GP has mean zero and the 2n-point correlation functions are entirely determined by the 2-point statistics. Physically, this corresponds to n particles propagating past one another without interacting. Such quantum field theories are exactly solvable due to their Gaussian nature. Interactions between particles arise precisely due to non-Gaussian corrections to the log likelihood, known as the action in physics.
In this paper we develop a sharp correspondence between neural networks and quantum field theory. We will introduce a framework known as Wilsonian effective field theory (EFT) for studying neural networks, utilizing it to determine minimal NGP likelihoods associated with neural network architectures, making modifications from the usual physics contexts when necessary. For brevity, we will refer to the general idea as a NN-QFT correspondence.
Related work. An EFT approach to neural networks is possible whenever a family of architectures admits a GP limit, which is the case for many modern architectures. Though the original NN-GP correspondence [1] was in the context of infinitely wide single-layer fully-connected networks, which admit computable kernels [2], recent work has shown that infinitely wide deep fully-connected networks [3,4] are drawn from GPs, as are deep convolutional networks in the infinite channel limit [5,6]. In [7,8,9], Yang developed a language for understanding which architectures admit GP limits, which was utilized to demonstrate that any standard architecture admits a GP limit, i.e. any architecture that is a composition of multilayer perceptrons, recurrent neural networks, skip connections [10,11], convolutions [12,13,14,15,16] or graph convolutions [17,18,19,20,21,22], pooling [15,16], batch [23] or layer [24] normalization, and / or attention [25,26]. Furthermore, though these results apply to randomly initialized neural networks, appropriately trained networks are also drawn from GPs [27,28]. NGPs have been used to model finite neural networks in [29,30,31], with some key differences from our work. For these reasons, we believe that an EFT approach to neural networks is possible under a wide variety of circumstances.
While working on this project two papers appeared that also utilize Feynman diagrams in neural networks, and we would therefore like to differentiate our work. In [32] diagrams were associated to a different class of Gaussian integrals associated with neural network parameters drawn from a Gaussian distribution; they were used to put bounds on correlation functions in infinite width neural networks. In contrast, our Gaussian (and non-Gaussian) integrals are over the function space associated with the GP or NGP, crucially relying on the central limit theorem. [31] does consider the Gaussian and non-Gaussian distributions on function space, but focuses on preactivation distribution flow associated with perturbative corrections in the width parameter. Both of these interesting papers differ from our approach, which uses Feynman diagrams in our application of Wilsonian effective field theory to determine NGP likelihoods. It is applicable to any family of architectures admitting a GP limit, and knowledge of the GP limit and EFT together allow for the prediction of NGP correlation functions. Couplings may be extracted from experiments and matched to theory predictions for their flow under the renormalization group.
Our contributions. In this work we alternate between a general development of the theory behind a NN-QFT correspondence and its experimental verification. We do so in the simplest class of architectures admitting such a description, single-layer fully-connected networks. By construction, the point is precisely not to take the infinite width limit N → ∞; instead, we treat the neural networks as perturbations of the associated GP, modeled by perturbing the GP to an NGP with corrections added according to the rules of Wilsonian EFT. A summary of our results: • Higher-point GP statistics. We review how n-pt correlation functions in GPs, n > 2, may be computed using non-interacting Feynman diagrams and demonstrate the experimental falloff to GP predictions as N → ∞ in three classes of single-layer networks.
• Neural Net and NGP Likelihoods from Wilsonian EFT. We develop rules for treating the NGPs associated with finite N networks. Non-Gaussian terms correspond to particle interactions and have coefficients ("couplings") corresponding to interaction strengths. We compute the 4-pt and 6-pt functions using interacting Feynman diagrams. In the single-layer networks, 4-pt couplings are extracted from experiments. Two of the network classes yield couplings with small but non-zero variance; in the third they are constants. These couplings are used to make predictions for the 6-pt function, which we verify experimentally.
• Couplings, GP Limit Symmetries, and Technical Naturalness. Whether the coefficient of a non-Gaussian term in the NGP likelihood is a constant or a function is of practical importance. We demonstrate that this may be understood in terms of 't Hooft's notion ("technical naturalness") that the couplings may be small when setting them to zero recovers a symmetry. In our experiments, the coefficient is a constant if and only if the GP limit is translation invariant, i.e., if setting its variance to zero recovers a symmetry.
• Minimal NGP Likelihoods and Renormalization Group (RG) Flow. Our approach introduces a parameter Λ known as the cutoff, so technically there is an infinite class of NGPs designed to describe one set of experiments. If all are equally effective, the predictions must not depend on Λ, which yields a differential equation known as a β-function governing the couplings at different values of the cutoff. The induced flow in coupling space is known as Wilsonian RG flow, and the couplings are said to "run". We compute the β-functions for 4-pt couplings in our examples and experimentally verify their running. As a limit in Λ is taken, the flow renders some coefficients negligible, yielding a minimal NGP likelihood.
These elements, particularly EFT and RG flow, constitute the essentials of an NN-QFT correspondence, as they constitute the essentials of a modern approach to QFT. For our ML readers: this approach immediately connects overparameterization with neural net likelihood simplicity. That is, neural networks with increasingly large numbers of parameters are drawn from increasingly simple distributions. This was already implicit in the connection between asymptotic neural networks and GPs, and also in the fact that non-Gaussian corrections to GPs should be 1/N suppressed. However, the Wilsonian renormalization group adds another layer to the story: at any fixed N there is a family of NGPs indexed by a continuous parameter Λ, and by taking limits in Λ many of the non-Gaussian terms in the NGP likelihood become negligible, leading to simpler NGP likelihoods. The importance of this gained simplicity is particularly acute in our experiments: to excellent approximation, a single number is all that it takes to correct GP correlation functions to NGP (neural network output) correlation functions, despite the fact that in moving away from the GP limit an infinite number of neural network parameters are lost.
We also recognize that our experimental focus on randomly initialized networks, rather than trained networks, means that we have left an understanding of learning to future work. This is intentional, as our primary goal in this work is to develop EFT techniques for treating the NGPs associated with neural networks. We emphasize that though we have not treated trained networks directly, our techniques apply whenever a trained distribution is an NGP. Based on [8,27,28], we expect that this is very often the case.
For our physicist readers: in applying EFT techniques to the non-Gaussian processes from which neural networks are drawn, there are a number of important changes from the usual cases in QFT. Some of them include: • Tree-level divergences. In the absence of δ-functions that collapse integrals associated with internal points, divergences may arise in tree-level diagrams. Of course, proper regularization and renormalization renders them a feature, not a bug.
• Lack of derivatives. Gaussian process kernels are functions on input space, but usually do not contain differential operators. Accordingly, they are akin to space-dependent mass terms, rather than kinetic terms. We plan to include derivatives in future work.
• Coupling functions. Most notably, couplings in NGPs are not necessarily constants, and their spatial (input) variance can be understood in terms of technical naturalness. When couplings are functions, we find they still obey appropriate renormalization group equations.
Physicist readers may also notice that we have followed a school of thought in QFT due to Coleman, which implores the student to first understand the basics of perturbation theory and renormalization in the case of spin-0 particles, since the introduction of higher spin particles does not change the basic conceptual framework. Coleman's approach is even more natural in a NN-QFT correspondence, since the scalar outputs associated with neural networks mean that (as functions) they should be understood as scalar fields, which correspond to spin-0 particles. Of course this could change with further developments in neural networks, but it is appropriate for now. Another element of Coleman's school is that perturbation theory in QFT is "just Gaussian integrals" (with some additional decorations), which we review in detail in Appendix A.
This paper is organized as follows. In Section 2 we review Gaussian processes and explain how they correspond to free field theories. We demonstrate that higher-point correlation functions may be computed from two-point statistics using non-interacting Feynman diagrams, and experimentally demonstrate the falloff to GP predictions in the N → ∞ limit. In Section 3 we introduce a treatment of the NGPs associated with finite N networks in terms of Wilsonian EFT, including the introduction of interacting Feynman diagrams for the computation of correlation functions. We extract 4-pt couplings and use them to make 6-pt predictions, which are verified experimentally. Technical naturalness is used to explain when and why couplings should be constants versus functions. In Section 4 we introduce Wilsonian RG as applied to neural networks, computing β-functions that govern the running of couplings and verifying them experimentally. We argue that 6-pt couplings are negligible due to being irrelevant, in the Wilsonian sense.

Asymptotic Neural Networks, Gaussian Processes, and Free Field Theory
In this section we wish to draw a sharp analogy between Gaussian processes (GPs), neural networks, and techniques from free field theory. In this analogy we will facilitate computations of correlation functions in the GP limit with Feynman diagrams, representing correlation functions in terms of the kernel K(x, x') that determines the GP. We will then specify to a concrete class of neural networks -single layer feedforward networks in their infinite width limit -that exemplify the general idea, demonstrating that theoretical calculations with Feynman diagrams in the GP and associated combinatorics agree with experiments, up through the 6-pt functions.
The essential idea connecting Gaussian processes and free field theory is simple to state. Some classes of neural network architectures admit a limit N → ∞ where a randomly initialized neural network in that class is equivalent to a draw from a Gaussian process, i.e. the neural network outputs evaluated on fixed inputs are described by draws from a multivariate Gaussian distribution. Meanwhile, a field configuration f (x) in a free field theory is also drawn from a multivariate Gaussian distribution, and it is precisely the Gaussian nature of the associated path integral that makes the theory solvable, i.e. all correlation functions of the fields f (x) may be computed exactly. The free field theory is a GP, with the kernel describing the dynamical propagation of field quanta. Some aspects of this section are stated in the literature, see e.g. [29,30,31], but we wish to fully develop the formalism in order to facilitate new results obtained in later sections. The essentials of the GP-Free QFT correspondence are presented in Table 1.

    GP / asymptotic NN              Free QFT
    inputs (x_1, ..., x_k)          external space or spacetime points
    kernel K(x_1, x_2)              Feynman propagator
    asymptotic NN f(x)              free field
    log-likelihood                  free action S_GP

Table 1: Correspondence between quantities in the GP / asymptotic neural network and free QFT.

General Theory
Let us develop the connection between neural networks, Gaussian processes, and free field theory. For simplicity we assume that the mean µ of the GP is zero, corresponding to zero vacuum expectation value (VEV) in the field theory, an assumption which we will relax in subsequent work.
Consider a family of neural network architectures with learnable parameters θ and a discrete hyperparameter N, where at initialization the learnable parameters are drawn as θ ∼ P(θ). The parameter distribution P(θ) and the network architecture together induce an implicit distribution on function space from which the neural network is drawn, P(f). We will often drop the explicit subscripts θ, N for brevity. For many architectures there is a limit N → ∞ in which the distribution on functions becomes a Gaussian process, which means that the neural network outputs {f(x_1), ..., f(x_k)} evaluated on any fixed set of k inputs {x_1, ..., x_k} are drawn from a multivariate Gaussian distribution, which by assumption in this paper has µ = 0. The inverse covariance matrix Ξ is determined by the kernel function K(x, x') as (Ξ^{-1})_{ij} = K_{ij} := K(x_i, x_j). Since µ = 0, the GP is entirely determined by its covariance, which in turn is entirely determined by the kernel. Correlation functions between n outputs can be expressed as

G^{(n)}(x_1, ..., x_n) = \frac{1}{Z} \int df\, f(x_1) \cdots f(x_n)\, e^{-S}

and are called n-pt functions, where the partition function Z = \int df\, e^{-S} and S = \frac{1}{2} f_i Ξ_{ij} f_j is the (negative) log-likelihood, or "action" in physics language. Einstein summation is implicit, and f_i := f(x_i) is a vector of outputs on a fixed set of inputs {x_i}, each input having dimension d_in = d.
Of course, the Gaussian process is defined for any {x_1, ..., x_k} and any k, and therefore it is natural to take the continuum limit, in which case the correlation functions become

G^{(n)}(x_1, ..., x_n) = \frac{1}{Z} \int df\, f(x_1) \cdots f(x_n)\, e^{-S},

where the log-likelihood is now

S = \frac{1}{2} \int d^{d_in}x\, d^{d_in}x'\, f(x)\, Ξ(x, x')\, f(x').

Both the discrete and continuum versions of the GP n-pt functions may be computed exactly using standard Gaussian integral techniques reviewed in Appendix A. This would not be the case if the action contained beyond-quadratic terms, though a perturbative expansion may be available for suitably small coefficients of beyond-quadratic terms; see Section 3.
A direct physics analog of a Gaussian process is a free field theory, which describes a quantum field φ(x) without any interaction terms. The field φ(x) depends on the d-dimensional coordinates of space x, and the associated field theory is defined by the path integral

Z = \int Dφ\, e^{-S[φ]}    (2.7)

in terms of an action S[φ], which is quadratic in the case of a free field theory. A famous example is free scalar field theory, which has

S[φ] = \frac{1}{2} \int d^d x\, φ(x)\, (□ + m^2)\, φ(x),

with □ := ∂_µ ∂^µ and m the mass of the bosonic particle associated to φ. The functional inverse of (□ + m^2) is known as the propagator, the 2-pt correlation function in the free field theory, and is the analog of the GP kernel. By virtue of being Gaussian, correlation functions in the free theory may be computed exactly.
A central point of this work is that taking an infinite-width neural network to a large-but-finite width neural network corresponds to moving away from the GP to a non-Gaussian process (NGP), which in field theory corresponds to turning on interactions. We will do this in Section 3.

Neural Network Correlation Functions with Feynman Diagrams
Anticipating their utility when we move away from the GP, in this section we derive Feynman rules for the diagrammatic computation of correlation functions in the GP. The ability to represent such computations diagrammatically, in both quantum field theories and Gaussian processes, follows directly from basic properties of Gaussian integrals reviewed in Appendix A.
The partition function of the Gaussian process is

Z_GP[J] = \int df\, e^{-S_GP + \int d^{d_in}x\, J(x) f(x)},

where we have included a term involving the source J(x), and Z_{GP,0} := \int df\, e^{-S_GP}. The associated action, or (negative) log-likelihood, is

S_GP = \frac{1}{2} \int d^{d_in}x\, d^{d_in}x'\, f(x)\, Ξ(x, x')\, f(x').

The n-pt correlation functions are defined by

G^{(n)}_GP(x_1, ..., x_n) = \frac{1}{Z_{GP,0}} \left. \frac{δ^n Z_GP[J]}{δJ(x_1) \cdots δJ(x_n)} \right|_{J=0}.

We will consistently label GP quantities with a GP subscript, since in moving away from the GP limit in neural network architectures we will use effective field theory to determine deviations from the GP quantities using the NGP effective actions S = S_GP + ∆S.
Since all of the f-dependent terms in Z_GP are quadratic or below, it can be evaluated exactly by completing the square and performing the Gaussian integral, yielding

Z_GP[J] = Z_{GP,0}\, \exp\!\left[ \frac{1}{2} \int d^{d_in}x\, d^{d_in}x'\, J(x)\, K(x, x')\, J(x') \right],

where the Z_{GP,0} factor cancels in correlation functions, which may therefore be written as

G^{(n)}_GP(x_1, ..., x_n) = \left. \frac{δ^n}{δJ(x_1) \cdots δJ(x_n)} \exp\!\left[ \frac{1}{2} \int d^{d_in}x\, d^{d_in}x'\, J(x)\, K(x, x')\, J(x') \right] \right|_{J=0}.    (2.13)

The basic pattern of the computation of (2.13) emerges from the fact that taking these functional J-derivatives either pulls down factors from the exponential or hits previously-pulled-down J-factors, using δJ(x)/δJ(y) = δ^{d_in}(x − y) repeatedly (details can be found in Appendix A). The δ-functions make the kernels depend on external points, and depending on n there will be many terms, each with n/2 kernel factors. Those terms contain the information of which external points appear in kernels together, motivating the definition of Wick(x_1, ..., x_n), the set of all ways of partitioning {x_1, ..., x_n} into pairs.
With this definition, the procedure of computing G^{(n)}_GP reduces to a sum over Wick pairings,

G^{(n)}_GP(x_1, ..., x_n) = \sum_{p ∈ Wick(x_1, ..., x_n)} K(a_1, b_1) \cdots K(a_{n/2}, b_{n/2}),    (2.15)

where we write each element p ∈ Wick(x_1, ..., x_n) as p = {(a_1, b_1), ..., (a_{n/2}, b_{n/2})}. This simple expression may be read "sum over all ways of pairing up elements in {x_1, ..., x_n}, and in each term write a kernel factor K(a_i, b_i) for each of the pairs (a_i, b_i)," giving a simple rule for writing down the answer for G^{(n)}_GP. Equivalently, it may be read "sum over all diagrams obtained by connecting the n external points in pairs with lines." Both of these simple colloquial expressions yield correct ways to represent (2.15), one in terms of a sum of kernel factors, and the other in terms of a sum of diagrams. The diagram-to-analytic map between the two is clearly that for each line between a_i and b_i in a diagram (these are, of course, the Feynman diagrams of physics, and the rules are known as Feynman rules), write a kernel factor K(a_i, b_i). For instance, in the case n = 2, Wick(x_1, x_2) = {{(x_1, x_2)}} and we have

G^{(2)}_GP(x_1, x_2) = K(x_1, x_2).

In connecting points in pairs for any odd n, there is always a leftover point, which corresponds to a factor of J in every term in the analytic expression. Since J is set to zero after taking functional J-derivatives in (2.13), we conclude that G^{(n)}_GP(x_1, ..., x_n) = 0 for any odd n.
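To make the Wick-pairing rule concrete, here is a minimal numpy sketch (our own illustration, not code from the paper) that evaluates G^{(n)}_GP by summing over pairings for an arbitrary kernel and checks the result against a Monte Carlo estimate from multivariate Gaussian draws; the kernel, input values, and helper names (wick_pairings, gp_npt) are illustrative assumptions.

```python
import numpy as np

def wick_pairings(indices):
    """All ways to partition an even-length index list into unordered pairs."""
    if not indices:
        return [[]]
    first, rest = indices[0], indices[1:]
    pairings = []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for sub in wick_pairings(remaining):
            pairings.append([(first, partner)] + sub)
    return pairings

def gp_npt(kernel, points):
    """G^(n)_GP(x_1,...,x_n): sum over Wick pairings of products of kernel factors."""
    n = len(points)
    if n % 2 == 1:
        return 0.0  # odd-point GP correlation functions vanish
    return sum(
        np.prod([kernel(points[a], points[b]) for a, b in pairing])
        for pairing in wick_pairings(list(range(n)))
    )

# Sanity checks: (6-1)!! = 15 pairings, and the 4-pt Wick sum matches Monte Carlo.
assert len(wick_pairings(list(range(6)))) == 15
kernel = lambda x, y: np.exp(-0.5 * (x - y) ** 2) + 0.1   # illustrative kernel only
xs = [0.0, 0.3, -0.7, 1.1]
cov = np.array([[kernel(a, b) for b in xs] for a in xs])
draws = np.random.default_rng(0).multivariate_normal(np.zeros(4), cov, size=200_000)
print(gp_npt(kernel, xs), np.prod(draws, axis=1).mean())  # agree up to MC error
```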
In the analogy to free QFT, quantities in the GP map to associated quantities in the QFT, summarized in Table 1. Remembering that the GP can sometimes be realized by asymptotic neural networks, the neural network inputs are the points in space in the QFT, and the kernel is the Feynman propagator, which represents the probability or amplitude of propagation of a particle from one point to another. Notably, due to the Gaussian nature of Z GP , all diagrams in the diagrammatic expressions for G (n) GP (x 1 , . . . , x n ) are simple connections of pairs of points in space, flying past one another without interacting. When Z GP is corrected by non-Gaussian terms, "interactions" arise in a way that we will make concrete in Section 3. Thus, the GP / asymptotic neural networks correspond to free (non-interacting) field theories, and moving away from the asymptotic limit corresponds to turning on interactions.

Examples: Infinite Width Single-Layer Networks
To experimentally realize the theoretical ideas of this paper, such as utilizing effective field theory and Wilsonian renormalization group flow for understanding neural networks, we must introduce concrete architectures that will be used in experiments and their corresponding GPs. Specifically, we now introduce the three single-layer architectures we study in this paper and also review the correspondence between infinite width single-layer networks and Gaussian processes.
Consider a fully-connected neural network with one hidden layer of width N and elementwise nonlinearity σ, defined by

f(x) = W_1\, σ(W_0 x + b_0) + b_1,    (2.20)

where the weights and biases, W_i and b_i, characterize the affine transformations for each layer. Including the space associated with the hidden layer, the network is a map R^{d_in} → R^{N} → R^{d_out}. The weight and bias parameters, collectively labeled as θ, are i.i.d. and drawn from a Gaussian distribution. Specifically, the biases are drawn from N(µ_b, σ_b^2) and the weights in each layer W_0, W_1 are drawn from N(µ_W, σ_W^2/d_in) and N(µ_W, σ_W^2/N), respectively, so they are normalized with respect to the input dimension of the associated layer. The first linear layer takes the input x to a preactivation z_0 which is then acted on by the elementwise nonlinearity σ, giving a postactivation x_1^j = σ(z_0^j), that is acted on by the final linear layer, yielding the output of the neural network. By the Central Limit Theorem (CLT), in the infinite width limit the network outputs are drawn from a Gaussian distribution on function space [1]; i.e. the network outputs are drawn from a GP. This arises because the output layer of the network defined in (2.20) is the sum of N i.i.d. terms. In the infinite width limit N → ∞ the sum remains finite, since the elements of W_1 are normalized by 1/N. Thus by the CLT we have a neural network output that is selected from a Gaussian distribution, i.e., the neural network evaluated on any finite collection of inputs is drawn from a multivariate Gaussian distribution. This is precisely what defines a GP.
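As an illustration of this construction, the following sketch (ours, not the paper's code) draws an ensemble of such single-layer networks with the stated parameter distributions, using the error-function activation of Erf-net below for concreteness, and estimates a 2-pt function from the ensemble; the widths, input values, scipy dependency, and function names are illustrative assumptions.

```python
import numpy as np
from scipy.special import erf

def sample_outputs(X, n_nets, N, sigma_w=1.0, sigma_b=1.0, seed=0):
    """Draw n_nets networks f(x) = W1 sigma(W0 x + b0) + b1 with d_out = 1 and sigma = Erf,
    parameters ~ N(0, sigma_b^2), N(0, sigma_w^2/d_in), N(0, sigma_w^2/N) as in Section 2.3,
    and evaluate every network on every row of X (shape (k, d_in))."""
    k, d_in = X.shape
    rng = np.random.default_rng(seed)
    W0 = rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(n_nets, N, d_in))
    b0 = rng.normal(0.0, sigma_b, size=(n_nets, N, 1))
    W1 = rng.normal(0.0, sigma_w / np.sqrt(N), size=(n_nets, 1, N))
    b1 = rng.normal(0.0, sigma_b, size=(n_nets, 1, 1))
    post = erf(W0 @ X.T + b0)                 # post-activations, shape (n_nets, N, k)
    return (W1 @ post + b1)[:, 0, :]          # outputs f_alpha(x_i), shape (n_nets, k)

# Empirical 2-pt function at two inputs; at large N it should approach the GP kernel.
X = np.array([[0.2], [-0.5]])
f = sample_outputs(X, n_nets=50_000, N=100)
print((f[:, 0] * f[:, 1]).mean())
```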
One must derive the kernel associated to a given infinite width architecture in order to compute correlation functions in the associated GP. Kernel derivations are done in detail in Appendix B via two methods: the first method described in [2] computes the exact 2-pt function for all widths, and a second method in [33] applies only in the GP limit. In the architectures that we study, the methods happen to agree for all widths N , though in general it is only required that the former method recover the result of the latter in the infinite width limit. We also build upon the work of [33] in creating a network with a translation invariant Gaussian kernel.
The networks differ only in their activation functions σ. We now introduce the networks that we study via their activation functions and the associated GP kernels.

Erf-net
The network architecture that we call Erf-net is defined by an error function activation,

σ(z) = Erf(z).

The associated GP kernel (2.22) is derived via both the methods of [33] and [2] in (B.7).

ReLU-net
The network architecture that we call ReLU-net is defined by the ReLU activation function,

σ(z) = max(0, z).

The associated GP kernel is derived via both the methods of [33] and [2] in Appendix B.

Gauss-net
While the two previous examples are well-studied in the literature, such as in [2] and [34], in this section we introduce a new activation function in order to obtain a translation invariant GP kernel. This architecture is obtained by adding a normalization layer after the usual exponential activation, normalizing by \sqrt{K_exp(x, x)}, where K_exp(x, x) is the 2-pt function of the intermediate exponential activation layer. The resulting activation is

σ(x) = \frac{e^{W_0 x + b_0}}{\sqrt{K_exp(x, x)}},    (2.26)

which unlike usual activations depends on both the preactivation and the input; we have written it entirely in terms of the input by writing z = W_0 x + b_0. The associated GP kernel is derived via both the methods of [33] and [2] in (B.22). The kernel is particularly simple, a Gaussian in the Euclidean distance between the two inputs. Accordingly, the kernel is invariant under the translation map x → x + c, x' → x' + c for any constant vector c. The normalization in σ is crucial for translation invariance.
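As a quick illustration of the distinction that will matter later for technical naturalness (Section 3.3), one can check translation invariance of a candidate kernel numerically; the two stand-in kernels below are illustrative only and are not the exact Gauss-net and ReLU-net expressions of Appendix B.

```python
import numpy as np

def is_translation_invariant(kernel, d_in=1, trials=100, tol=1e-9, seed=0):
    """Numerically test K(x + c, x' + c) == K(x, x') for random inputs and shifts."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, xp, c = (rng.normal(size=d_in) for _ in range(3))
        if abs(kernel(x + c, xp + c) - kernel(x, xp)) > tol:
            return False
    return True

# Stand-in kernels: one depends only on x - x', the other on the inner product x . x'.
gauss_like = lambda x, xp: float(np.exp(-0.5 * np.sum((x - xp) ** 2)))
relu_like = lambda x, xp: 1.0 + float(np.dot(x, xp))
print(is_translation_invariant(gauss_like), is_translation_invariant(relu_like))  # True False
```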

Experiments: Falloff to GP Feynman Diagrams at Large Width
We now wish to demonstrate that in the infinite width limit the experimental results for the n-pt functions G^{(n)} converge to those of the Gaussian process, G^{(n)}_GP. For simplicity we will consider the case of a single-dimensional output, d_out = 1. Accordingly, we define the deviation in the n-pt function

∆G^{(n)}(x_1, ..., x_n) := G^{(n)}(x_1, ..., x_n) − G^{(n)}_GP(x_1, ..., x_n).

For measuring the size of the deviation with respect to the experimental results, it is convenient to define the normalized deviation m_n(x_1, ..., x_n) = ∆G^{(n)}(x_1, ..., x_n)/G^{(n)}_GP(x_1, ..., x_n). The 2-pt deviation, given below, is the difference between the experimental 2-pt function and the kernel of the corresponding Gaussian process,

∆G^{(2)}(x_1, x_2) = \frac{1}{n_nets} \sum_{α=1}^{n_nets} f_α(x_1) f_α(x_2) − K(x_1, x_2).
f_α(x_i) denotes the output of the α-th network for the input x_i. The 4-pt and 6-pt deviations are expressed using Wick contractions of products of the kernels evaluated at Wick pairs p = (x_i, x_j). The 4-pt deviation is given by

∆G^{(4)}(x_1, ..., x_4) = G^{(4)}(x_1, ..., x_4) − [ K(x_1, x_2) K(x_3, x_4) + K(x_1, x_3) K(x_2, x_4) + K(x_1, x_4) K(x_2, x_3) ],

where the second term is G^{(4)}_GP, following from the Wick expansion (2.15).
The 6-pt deviation is obtained similarly using products of three kernels, where we will abbreviate K(x_i, x_j) =: K_{ij} going forward. It is given by

∆G^{(6)}(x_1, ..., x_6) = G^{(6)}(x_1, ..., x_6) − [ K_{12}K_{34}K_{56} + K_{12}K_{35}K_{46} + K_{12}K_{36}K_{45} + K_{13}K_{24}K_{56} + K_{13}K_{25}K_{46} + K_{13}K_{26}K_{45} + K_{14}K_{23}K_{56} + K_{14}K_{25}K_{36} + K_{14}K_{26}K_{35} + K_{15}K_{23}K_{46} + K_{15}K_{24}K_{36} + K_{15}K_{26}K_{34} + K_{16}K_{23}K_{45} + K_{16}K_{24}K_{35} + K_{16}K_{25}K_{34} ].    (2.31)

We remind the reader that in Section 2.2 we discussed a simple check of the combinatorics: the Wick contractions for G^{(n)}_GP (n even) involve the number of ways of connecting n points in pairs, which is (n − 1)!! = (n − 1)(n − 3) ... 1. For the n = 6 case this predicts 15 terms, which we see explicitly in (2.31).
We now verify experimentally that m_n falls off to zero as N → ∞ for all three architectures. To do this, for each architecture we calculate the n-pt correlation functions of outputs f_α(x_i) from ten experimental runs with 10^6 networks each, with weights and biases θ drawn as described in Section 2.3 with µ_W = µ_b = 0 and σ_b = σ_W = 1 for Erf-net and Gauss-net. We choose ReLU-net to have a bias of 0 for scaling reasons explained in Section 4.2, so σ_b = 0 for this case. The deviations ∆G^{(n)} are easily computed given the measured n-pt functions and the theoretically computed G^{(n)}_GP. To study the falloff to GP predictions G^{(n)}_GP as N → ∞, the n-pt deviations ∆G^{(n)} are normalized by the GP prediction to obtain a measure m_n = ∆G^{(n)}/G^{(n)}_GP, with input dependence implicit. These are plotted in Figure 1 in comparison to a background defined to be the average elementwise standard deviation (across the 10 experiments) of the experimental m_n, as a function of width N, where for each width the solid blue line is the mean, and error bars are the 95% confidence interval of all the entries in m_n. A nontrivial experimental ∆G^{(n)} is above the background. Figure 1 shows that 2-pt deviations ∆G^{(2)} are below the background level and therefore consistent with zero for all three architectures, indicating that the kernel is an exact measure of the 2-pt correlation function even away from the GP. This is expected, since in Appendix B it is shown that the GP kernels associated to the architectures we study are the exact two-point functions at all widths; from here on we therefore use kernel and 2-pt function interchangeably. The 4-pt and 6-pt signals are linearly decreasing (on a log-log scale) with increasing width in the region above the background level, falling below the background level at higher widths. The slope of the line in the region above the background is −1, indicating in our experiments that ∆G^{(n)} ∝ 1/N for n = 4, 6. This explicitly demonstrates the falloff to GP as the width increases. At large width, therefore, we see that these networks are drawn from GPs, with their statistics entirely determined by Wick contractions of the appropriate kernels, yielding G^{(n)}_GP. At small width, the GP prediction no longer correctly predicts the experimental n-pt function; the neural networks are not drawn from a GP, but instead an NGP. Our goal is to develop a method for capturing non-Gaussian corrections to the log-likelihood.
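The deviation measurements just described can be organized as in the following sketch (our own helper code, not the paper's): given ensemble outputs f_α(x_i) and a kernel matrix, it forms the experimental n-pt function, the GP prediction, and a summary of the normalized 4-pt deviation m_4. For a self-contained check it is run on exact GP draws, where the deviation should be consistent with zero; for a real architecture one would instead pass in network outputs and the corresponding GP kernel from Appendix B, and the deviation should fall off roughly as 1/N as in Figure 1.

```python
import itertools
import numpy as np

def experimental_npt(f, idx):
    """Monte Carlo n-pt function: average over networks of the product of outputs at idx."""
    return np.prod(f[:, list(idx)], axis=1).mean()

def gp_4pt(Kmat, i, j, k, l):
    """GP prediction K_ij K_kl + K_ik K_jl + K_il K_jk from a precomputed kernel matrix."""
    return Kmat[i, j] * Kmat[k, l] + Kmat[i, k] * Kmat[j, l] + Kmat[i, l] * Kmat[j, k]

def m4_summary(f, Kmat):
    """Average |m_4| over all 4-tuples of input indices."""
    n = Kmat.shape[0]
    vals = []
    for idx in itertools.combinations_with_replacement(range(n), 4):
        gp = gp_4pt(Kmat, *idx)
        vals.append((experimental_npt(f, idx) - gp) / gp)
    return float(np.mean(np.abs(vals)))

# Self-check on exact GP draws: the normalized 4-pt deviation is consistent with zero.
rng = np.random.default_rng(1)
xs = np.array([0.0, 0.4, -0.6])
Kmat = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2) + 0.1   # illustrative kernel
f = rng.multivariate_normal(np.zeros(len(xs)), Kmat, size=500_000)
print(m4_summary(f, Kmat))   # small number, shrinking with more draws
```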

Neural Networks and Non-Gaussian Processes with Effective Field Theory
In this section we propose using an approach to quantum field theory known as Wilsonian effective field theory (EFT) to understand and analyze neural networks away from their GP limits. From an ML point of view, it can be thought of as a useful way to determine minimal log-likelihoods of NGPs by determining the most relevant non-Gaussian corrections to the GP.
The essential idea is that the GP action S_GP does not suffice to determine correlation functions G^{(n)} away from the GP limit, as demonstrated experimentally in Section 2.4, but the principles of EFT allow for the determination of corrections ∆S to the GP log-likelihood, yielding an effective log-likelihood S = S_GP + ∆S for the NGP associated to the finite-width neural networks. The organizing principles of EFT are symmetries and scales, which together allow for the determination of an appropriate ∆S that may have its parameters fixed by experiments. It may then be used to make effective (correct) predictions for other experiments. For the ML reader, let us briefly tour these ideas in physics, where effective field theories correctly describe a vast array of systems, from superfluids and superconductors in condensed matter physics to beta decay and elementary particle interactions in high energy physics. As a concrete example, consider beta decay, a process by which a neutron decays into a proton, electron, and anti-neutrino,

n → p + e^- + ν̄_e.    (3.2)
From the perspective of the Standard Model, we have a deep knowledge of this process: the neutrons and protons are made up of quarks and the decay process is mediated by an intermediate W-boson. However, this detailed knowledge did not keep Fermi from arriving at an EFT description of the process in 1933 [35], long before quarks and W-bosons were even theorized, let alone discovered over 35 years later. Though symmetries permeate both Fermi's theory and the Standard Model, the crucial insight is one of energy scale: Fermi's interaction that effectively describes beta decay is obtained from the Standard Model by going to lower and lower scattering energies, at which point the contribution from the intermediate W-boson is negligible and an effective four-fermion interaction suffices to describe the process.
In this spirit, we propose understanding neural networks away from GP limits in terms of effective field theory. Specifically, EFT allows for the determination of an NGP effective action, with its parameters fixed by experiments, which is then used to make verifiable predictions.
Rather than organizing according to length or momentum scale, the NN-QFT correspondence that we are developing replaces fields as a function of space with neural networks as a function of input, suggesting organization of the problem according to input scale. The interpretation of this in ML depends on the problem, but an example of "input scale" in images would be brightness.
As in physics, we utilize the quadratic terms to determine classical scaling dimensions, which here are measured in powers of the input x; the scaling of an infinite-width neural net (GP draw) f follows from the GP action. Specifically, S_GP must be dimensionless; it scales as x^0, and we write [S_GP] = 0. Since S_GP is quadratic in f and built from the functional inverse of the kernel, this fixes [f] = [K]/2. This in turn may be used to determine the dimensions of the coefficients of operators that might appear in ∆S. For instance, consider operators

O_k = \int d^{d_in}x\, g_k\, f(x)^k,

for which [O_k] = 0 requires [g_k] = −d_in − k[K]/2. Unlike in many physical cases, for [K] ≥ 0 the couplings g_k have dimensions of the same sign ∀ k.
In Section 4 we will use arguments from Wilson's picture of the renormalization group to argue that operators O k can be safely ignored for sufficiently large k.
How does one construct the effective action of an NGP associated to a neural network architecture that admits a known GP limit? Wilsonian EFT dictates the following rules: • Determine the symmetries (or desired symmetries) respected by the system of interest.
• Fix an upper bound k on the dimension of any operator appearing in ∆S.
• Define ∆S to contain all operators of dimension ≤ k that respect the symmetries.
Since the GP limit is known, the NGP is defined by S = S GP + ∆S. As we will see in a moment, this allows for the determination of correlation functions. By experimentally measuring them, one may fix coefficients of terms in ∆S and make subsequent predictions.
    NGP / finite-width NN           QFT
    finite-width NN f(x)            interacting field
    log probability                 effective action S

Table 3: Correspondence between quantities in the NGP / finite-width neural network and QFT. See text for discussion on whether the kernel is the free or exact propagator.
The desired symmetries may be determined by architecture considerations or, e.g., by demanding that the NGP respects the same symmetries as its GP limit, in which case the NGP symmetries are the symmetries of the GP kernel K(x_1, x_2). The choice of a relevant value of k is sometimes dictated by the system of study. For instance, Fermi knew that his theory must have only spin-1/2 particles (the W-boson mediator had not been observed yet) and that four of them must interact to describe beta decay. In QFT this requires a term schematically of the form ψψψψ, where ψ is a field associated with a spin-1/2 particle, which sets the minimal operator content of his ∆S. In the examples we study in this paper, we will see that it is crucial to have k ≥ 4.
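As a toy illustration of these rules (not code from the paper), the snippet below enumerates candidate monomial operators f^m up to a maximum power, imposes the f → −f symmetry that will be relevant for our architectures, and reports the coupling dimensions [g_m] = −d_in − m[K]/2 from the dimensional analysis sketched above; the class and function names, and the sample values of d_in and [K], are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Operator:
    power: int            # O_m = integral over inputs of g_m f(x)^m
    coupling_dim: float   # [g_m], measured in powers of the input x

def candidate_operators(k_max, d_in, dim_K, require_even=True):
    """Monomial operators f^m with 3 <= m <= k_max (m = 1, 2 only shift the Gaussian part),
    keeping only even powers if an f -> -f symmetry is imposed."""
    ops = []
    for m in range(3, k_max + 1):
        if require_even and m % 2 == 1:
            continue  # odd powers break the f -> -f symmetry
        ops.append(Operator(power=m, coupling_dim=-d_in - m * dim_K / 2))
    return ops

# Example: d_in = 1 and a dimensionless kernel leave quartic and sextic candidates.
for op in candidate_operators(k_max=6, d_in=1, dim_K=0.0):
    print(f"f^{op.power}: [g_{op.power}] = {op.coupling_dim}")
```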

Correlation Functions in NGPs with Interacting Feynman Diagrams
Having introduced EFT rules that allow for the determination of ∆S, we must introduce a method for computing NGP correlation functions. Before doing this explicitly in the cases of interest, we must briefly introduce the basics of cutoffs and perturbation theory.
First, perturbation theory: consider an NGP associated with a finite-width neural network architecture, with associated effective action S = S_GP + ∆S. The GP correlation functions G^{(n)}_GP were exactly computable precisely because the action S_GP was Gaussian, so non-Gaussian corrections in ∆S prevent the NGP n-pt correlation functions from being computed exactly. However, if the coefficients of operators in ∆S are appropriately small, approximating the n-pt functions using perturbation theory is possible. In QFT, this corresponds to the existence of non-trivial interactions, where the interaction strength being small yields a correction to the leading process. As an example, consider corrections to the n-point function from a single operator, ∆S = \int d^{d_in}x\, g_k f(x)^k, so that

G^{(n)}(x_1, ..., x_n) = \frac{\int df\, f(x_1) \cdots f(x_n)\, e^{-S_GP - \int d^{d_in}x\, g_k f(x)^k}}{\int df\, e^{-S_GP - \int d^{d_in}x\, g_k f(x)^k}}.

Multiplying the numerator and denominator by 1/Z_{GP,0} and expanding under the assumption of small g_k, we have

G^{(n)}(x_1, ..., x_n) = \frac{\langle f(x_1) \cdots f(x_n) \rangle_GP − \int d^{d_in}x\, g_k \langle f(x_1) \cdots f(x_n)\, f(x)^k \rangle_GP + \dots}{1 − \int d^{d_in}x\, g_k \langle f(x)^k \rangle_GP + \dots},

where ⟨ · ⟩_GP denotes the GP expectation. Truncating at a desired order in g_k (here just the leading correction), one obtains an approximation for the n-pt function where the numerator and denominator may both be computed via
Wick's theorem, and both may be represented diagrammatically. The O(g_k^0) contribution in the numerator is precisely the GP n-pt function G^{(n)}_GP. The O(g_k^1) term in the numerator may be computed via Wick's theorem, with one of the associated diagrams being a tree-level (no-loops) diagram by which the n external points {x_1, ..., x_n} connect to the point x that is integrated over; the latter is known as the interaction vertex, and is referred to as an internal point. These calculations, and the cancellation of vacuum bubbles (diagrams disconnected from all external points, which cancel against the denominator), are reviewed in Appendix A.
We now introduce cutoffs. Computing NGP correlation functions via perturbation theory leads to integrals over input space that naively yield divergences. A simple way to treat the divergences is to cut them off by the replacement S → S_Λ, where S_Λ differs from S only in the fact that all integrals over input space are bounded from below by −Λ and above by Λ, where Λ is positive and known as the cutoff. In QFT it is either integrals over space or momenta that are cut off, with the justification that theories have finite regimes of validity; e.g., scattering experiments done at one momentum scale should not be assumed valid up to arbitrarily high momenta. Accordingly, low energy experiments with momenta |p| ≪ Λ should be insensitive to the choice of Λ. This requirement imposes that the coefficients of operators in S must obey differential equations known as the Wilsonian renormalization group equations (RGEs). We will discuss this at great length in Section 4.2, and demonstrate that, indeed, the NGPs associated to finite-width neural networks satisfy appropriate Wilsonian RGEs.
Our discussion of perturbation theory focused on a particular term in ∆S; however, it is of course possible to have many terms in ∆S, each with its own coefficient to expand in, if it is small. This gives a general prescription for approximating correlation functions in neural network NGPs, which may be written as Feynman diagrams via the development of appropriate Feynman rules.
Of course, we are interested in doing neural net experiments that validate theoretical predictions, and so we again focus on the finite width single-layer networks introduced in Section 2.3. We have a single output, d_out = 1, and therefore one might consider non-Gaussian terms of the form

∆S = \int d^{d_in}x \left[ g\, f(x)^3 + λ\, f(x)^4 + α\, f(x)^5 + κ\, f(x)^6 + \dots \right].

However, all odd-point functions in our experiments must be zero, since the means of the weights and biases are zero. This motivates g = α = 0, since when expanded to linear order those terms yield non-trivial contributions to the 3-pt and 5-pt functions, respectively. An intuitive way to see this is that in our randomly initialized neural nets f and −f should be on equal footing, and thus S must have an f → −f symmetry (i.e., be invariant under this transformation) that would be broken by either g ≠ 0 or α ≠ 0. Furthermore, in Section 4 we will explain why for sufficiently large Λ, κ must be negligible, or irrelevant in the sense of Wilsonian RG; however, we will consider both until Section 4. By these considerations, the effective action we will utilize for the remainder of the paper is

S = S_GP + \int d^{d_in}x \left[ λ\, f(x)^4 + κ\, f(x)^6 \right].    (3.12)

In our experiments we will also see that κ is negligible, and therefore a single quartic correction will be sufficient to explain our finite-width neural net experiments. With this effective action for the NGP, one may compute correlation functions in perturbation theory. Equivalently, one may represent the correlation functions diagrammatically by the following Feynman rules. They may be stated in different ways according to the goals at hand. Here we state them in a way relevant for computing the O(λ^l κ^m) correction to G^{(n)}(x_1, ..., x_n), where the "interaction vertices" are y_j (j = 1, ..., l) and z_k (k = 1, ..., m).

1) For each of the n external points x_i, draw x_i with a loose end.
2) For each interaction vertex y_j, draw y_j with four loose ends; for each z_k, draw z_k with six loose ends.
3) Determine all ways to pair up the loose ends associated to x i 's, y j 's, and z k 's. This will yield some number of topologically distinct diagrams. Draw them with solid lines.
4) Write a sum over the diagrams with an appropriate combinatoric factor out front, which is the number of ways to form that diagram. Each diagram corresponds to an analytic term in the sum.
5) For each internal point y_j, include a factor of −λ(y_j) and integrate over y_j; for each z_k, include a factor of −κ(z_k) and integrate over z_k, with all integrals over input space cut off at Λ.
6) For each line between points a and b in a diagram, write a kernel factor K(a, b).
7) Throw away any terms containing vacuum bubbles; they cancel against the denominator, as reviewed in Appendix A.
As a non-trivial check, after step 2) there are 6m z k loose ends, 4l y j loose ends, and n x i loose ends, and there are (n + 4l + 6m − 1)!! ways of connecting these in pairs, which must be the sum of the coefficients of the topologically distinct diagrams, including vacuum bubbles.
We strongly emphasize that the cases relevant for our experiments require an important modification to what we have presented thus far, which is correct when S = S_GP + ∆S and ∆S is comprised of only non-Gaussian corrections, i.e. S_GP is the only Gaussian part of the action. In that case the two-point function is

G^{(2)}(x_1, x_2) = K(x_1, x_2) + O(λ, κ).    (3.13)

In particular, K(x_1, x_2) is the analog of the free-theory propagator in QFT, and it is only the leading piece in the 2-pt function, which receives corrections from interactions. In the architectures used in our experiments, however, we remind the reader that the GP kernel is the exact 2-pt function for the NGP as well, as experimentally demonstrated in Section 2.4 and theoretically derived in Appendix B. In particular this means that S ≠ S_GP + ∆S, but instead S = S_G + ∆S for some other Gaussian action S_G, defined to be the action such that

G^{(2)}(x_1, x_2) = K(x_1, x_2)    (3.14)

holds exactly. One can in principle compute S_G, but a simpler modification that we employ is to modify the Feynman rules for computing the O(λ^l κ^m) correction to G^{(n)}(x_1, ..., x_n). The complete new Feynman rules in this case are 1) For each of the n external points x_i, draw x_i with a loose end.
2) For each interaction vertex y_j, draw y_j with four loose ends; for each z_k, draw z_k with six loose ends.
3) Determine all ways to pair up the loose ends associated to x i 's, y j 's, and z k 's. This will yield some number of topologically distinct diagrams. Draw them with dashed lines.
4) Write a sum over the diagrams with an appropriate combinatoric factor out front, which is the number of ways to form that diagram. Each diagram corresponds to an analytic term in the sum.

4.5) Throw away any diagram that has a component with a λ or κ correction to the 2-pt function.
5) For each internal point y_j, include a factor of −λ(y_j) and integrate over y_j; for each z_k, include a factor of −κ(z_k) and integrate over z_k, with all integrals over input space cut off at Λ.
6) For each line between points a and b in a diagram, write a kernel factor K(a, b).
7) Throw away any terms containing vacuum bubbles.
Despite having presented the complete list, note that the only changes are the dashed lines in steps 3 and 7, and the new step 4.5.
Diagrams thrown out at that step should not be included in this case because in writing K(x_1, x_2) for the line connecting x_1 and x_2 in Step 6, the λ and κ corrections to the 2-pt function are already included, since the exact two-point function is the GP kernel. An example of such a discarded diagram is one in which two external points are connected through a 2-pt subdiagram containing a λ vertex; it is implicitly included in the corresponding single-kernel term when the exact 2-pt function is used. It is easy to identify similar diagrams that must be thrown away as they arise. Of course, this type of modification arises for any NGP in which (3.14) holds. The (n + 4l + 6m − 1)!! combinatorics may still be used as a non-trivial check prior to discarding diagrams in steps 4.5) and 7). Finally, if we draw a solid (undashed) line in a case where (3.14) holds, it denotes the propagator / kernel of S_G.

Four-point and Six-point NGP Correlation Functions
We now turn to the computation of the 4-pt and 6-pt functions, since they will be studied in our experiments. We emphasize, however, that the calculations here are valid for the stated orders in λ and κ for broader classes of NGPs associated with neural network architectures.
Since in our experiments the NGP two-point function G^{(2)} is exactly the kernel in the GP limit, i.e. (3.14) holds, we are in the case where S = S_G + ∆S. Accordingly we use the second set of Feynman rules, in which a solid line represents the propagator of S_G and a dashed line represents the exact two-point function. We may represent the two-point function in perturbation theory as the S_G propagator plus its λ and κ corrections; via the second set of Feynman rules this collapses to a single dashed line, i.e. G^{(2)}(x_1, x_2) = K(x_1, x_2), critically relying on step 4.5).
The 4-pt and 6-pt functions may be computed similarly. To simplify the Feynman diagrams we add the additional rule that we do not label external points, which is to be interpreted as summing over all combinations of external points. The 4-pt function is

G^{(4)}(x_1, ..., x_4) = K_{12}K_{34} + K_{13}K_{24} + K_{14}K_{23} − 24 \int_{-Λ}^{Λ} d^{d_in}y\, λ(y)\, K_{1y}K_{2y}K_{3y}K_{4y} + \dots,    (3.18)

and the 6-pt function is

G^{(6)}(x_1, ..., x_6) = \sum_{p ∈ Wick(x_1, ..., x_6)} K_{a_1 b_1}K_{a_2 b_2}K_{a_3 b_3} − 24 \int_{-Λ}^{Λ} d^{d_in}y\, λ(y) \Big[ K_{1y}K_{2y}K_{3y}K_{4y}K_{56} + K_{1y}K_{2y}K_{3y}K_{5y}K_{46} + K_{1y}K_{2y}K_{4y}K_{5y}K_{36} + K_{1y}K_{3y}K_{4y}K_{5y}K_{26} + K_{2y}K_{3y}K_{4y}K_{5y}K_{16} + K_{1y}K_{2y}K_{3y}K_{6y}K_{45} + K_{1y}K_{2y}K_{4y}K_{6y}K_{35} + K_{1y}K_{3y}K_{4y}K_{6y}K_{25} + K_{2y}K_{3y}K_{4y}K_{6y}K_{15} + K_{1y}K_{2y}K_{5y}K_{6y}K_{34} + K_{1y}K_{3y}K_{5y}K_{6y}K_{24} + K_{2y}K_{3y}K_{5y}K_{6y}K_{14} + K_{1y}K_{4y}K_{5y}K_{6y}K_{23} + K_{2y}K_{4y}K_{5y}K_{6y}K_{13} + K_{3y}K_{4y}K_{5y}K_{6y}K_{12} \Big] + \dots,    (3.19)

where the ellipses denote higher orders in λ and the κ contributions, and where the analytic expressions correspond to the second set of Feynman diagrams according to the Feynman rules. Note the simplicity of the diagrammatic representation relative to its corresponding analytic expression.
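For intuition about the size of the leading correction in (3.18), the following sketch evaluates the GP piece plus the tree-level λ term by direct numerical integration over the cutoff interval for d_in = 1 and a constant coupling; the kernel, input points, coupling value, and cutoff are illustrative placeholders rather than the paper's exact kernels or measured couplings.

```python
import numpy as np

def g4_gp(kernel, xs):
    """GP piece K12 K34 + K13 K24 + K14 K23."""
    x1, x2, x3, x4 = xs
    return (kernel(x1, x2) * kernel(x3, x4)
            + kernel(x1, x3) * kernel(x2, x4)
            + kernel(x1, x4) * kernel(x2, x3))

def g4_lambda_correction(kernel, xs, lam, cutoff, n_grid=20_001):
    """Tree-level quartic correction for d_in = 1, constant coupling:
    -24 * lam * integral_{-cutoff}^{cutoff} dy K(x1,y) K(x2,y) K(x3,y) K(x4,y)."""
    y = np.linspace(-cutoff, cutoff, n_grid)
    integrand = np.ones_like(y)
    for x in xs:
        integrand = integrand * kernel(x, y)
    return -24.0 * lam * integrand.sum() * (y[1] - y[0])   # simple Riemann sum

# Illustrative numbers only.
kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
xs = [0.1, -0.4, 0.7, 1.2]
print(g4_gp(kernel, xs) + g4_lambda_correction(kernel, xs, lam=1e-3, cutoff=100.0))
```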

Neural Network Coupling Constants and GP Symmetries
We have argued that non-Gaussian corrections are crucial in neural networks and they correspond to turning on interactions from the QFT point of view. Such corrections arise as terms in ∆S, each with a coefficient encoding the interaction strength, such as λ or κ in (3.12).
For convenience we have thus far ignored important questions about any interaction coefficient appearing in ∆S: are they constants? In QFT language such quantities are called coupling constants, but this is simply an artifact of the translation invariant QFTs that physicists most often study. For instance, in the Standard Model of particle physics a parameter analogous to λ quantifies the self-interaction of the Higgs boson, and all such coupling constants have historically proven to be invariant under spatial translations on Earth; they are independent of whether the scattering experiment is done at CERN in Switzerland or Fermilab in Illinois. This follows from the translation invariance of the entire Standard Model action, which therefore also predicts that the same couplings would be measured in proton scattering experiments near Alpha Centauri. That is, couplings are constants in spacetime.
But in NGPs associated with neural network architectures, do we expect the coefficients such as λ and κ to truly be constants? (In our Feynman rules, we have left the coefficients λ and κ inside the integrals in anticipation of this question.) What role should translation invariance play in such considerations? To this end we employ a powerful principle due to 't Hooft: Technical Naturalness: a coupling g appearing in ∆S may be small relative to Λ if a symmetry is restored when g is set to zero.
A concept underlying technical naturalness is that g itself may receive large corrections in varying the cutoff Λ (that is, due to Wilsonian renormalization group flow, as we will exemplify in Section 4), i.e. in varying Λ → Λ' the coupling shifts as g → g + ∆g. However, in some cases it is possible to show that ∆g = 0 if S has a symmetry. If g itself breaks the symmetry, but the symmetry is restored when g → 0, then as g → 0 we must have ∆g → 0 as well. In some concrete cases, this is enforced by ∆g = g α, where α is some function mild enough that ∆g → 0 as g → 0.
If g is small, this ensures that corrections to it are also small, so that g is small for all values of the cutoff. In such a case, g is said to be technically natural. A simple example is the electron mass m e in quantum electrodynamics: m e is small relative to the electroweak scale and corrections to the mass are ∆m e ∝ m e , ensuring that it remains small as the cutoff is varied. As m e → 0, a symmetry of electrons called the "chiral symmetry" is restored.
The small electron mass m_e is technically natural. Technical naturalness is directly applicable to our discussion of coupling constants versus coupling functions in NGPs associated with neural networks. For instance, is λ a constant or a function of the input space? To examine this question in light of technical naturalness, consider a simple case where λ is the only non-Gaussian coefficient and is decomposed as

λ(x) = λ̄ + δλ(x),

with a constant piece λ̄ plus some non-constant piece δλ(x). The variance of λ is determined by δλ, which in general is not invariant under translations T : x → x + c. When δλ = 0, λ is a
constant, and when it is small, λ is effectively a constant. But should δλ be small? This is where technical naturalness is useful. It states that it is reasonable to expect δλ to be small if there is a symmetry in the δλ → 0 limit. Since δλ breaks T, a relevant question is whether T is restored when δλ → 0. This occurs when S_GP, and specifically its kernel K(x, y), is T-invariant. In examples with multiple couplings, the relevant question is whether T is restored when all couplings go to zero, i.e. again whether or not K(x, y) is T-invariant. Technical naturalness therefore leads to the following concrete conjecture: Conjecture: couplings in NGPs associated to neural network architectures are constants (or nearly constants) if the kernel K(x, y) associated with their GP limit is translationally invariant. This is a conjecture rather than a theorem because technical naturalness is not proven in general, but is instead a guiding principle in physics. In our experiments, however, the conjecture can be tested: while the kernels associated with Erf-net and ReLU-net are not T-invariant, the Gauss-net kernel is T-invariant. We will verify the conjecture in these examples by demonstrating that couplings are effectively constants in the Gauss-net case, but not the other cases (though we will see that aspects of Wilsonian RG flow persist even when couplings are functions).

Experiments: Correlations in Single-Layer Networks with EFT
Having explained how to utilize effective field theory to analyze neural network architectures and their associated NGPs, in this section we verify the ideas experimentally. The logic is simple, as effective field theory techniques are supposed to accomplish three things: • Give a candidate ∆S for the NGP.
• Fix coefficients of operators in ∆S with experiments.
• Once fixed, make predictions for other experiments and verify them.
In this section, we will use experimental measurements of G (4) at fixed width to determine λ, and then use the determined λ to make predictions for G (6) at the same width and verify it against experiments.
The coupling λ may be extracted from experimental measurements of G^{(4)}. When κ is negligible (we will verify this experimentally in the large cutoff limit, and theoretically in Section 4.2) and λ is a constant, (3.18) gives

λ = \frac{G^{(4)}_GP(x_1, ..., x_4) − G^{(4)}(x_1, ..., x_4)}{24 \int_{-Λ}^{Λ} d^{d_in}y\, K_{1y}K_{2y}K_{3y}K_{4y}}.    (3.23)
Therefore, by measuring G^{(4)}(x_1, x_2, x_3, x_4) in experiments and performing the theoretical computations for the rest of the expression, λ may be experimentally measured. Slight complications arise when, more generally, the coupling is a function

λ(y) = λ̄ + δλ(y),    (3.24)

with λ̄ a constant. In that case (3.18) gives

G^{(4)}_GP(x_1, ..., x_4) − G^{(4)}(x_1, ..., x_4) = 24\, λ̄ \int_{-Λ}^{Λ} d^{d_in}y\, ∆_{1234y} + 24 \int_{-Λ}^{Λ} d^{d_in}y\, δλ(y)\, ∆_{1234y} + \dots,    (3.25)

where K_{ij} abbreviates K(x_i, x_j) and ∆_{1234y} = K_{1y}K_{2y}K_{3y}K_{4y}. If δλ is small then the first term dominates, which may be measured experimentally by measuring G^{(4)} and then using the known theoretical expressions of the GP kernels to compute

λ_m(x_1, ..., x_4) := \frac{G^{(4)}_GP(x_1, ..., x_4) − G^{(4)}(x_1, ..., x_4)}{24 \int_{-Λ}^{Λ} d^{d_in}y\, ∆_{1234y}},

which becomes a rank four tensor on a fixed set of inputs. Since λ̄ is a constant and (3.25) holds, variance in λ_m correlates with variance in δλ, and when the variance is small λ_m ≈ λ̄. The approximation is exact in the limit of constant λ, i.e. δλ → 0. Therefore, by measuring λ_m we approximately measure λ as mean(λ_m(x_1, x_2, x_3, x_4)), which we refer to as the measured value of λ. Given the measured value of λ, we would like to use it to make theoretical predictions for G^{(6)} that may be experimentally verified. To this end, we define the 6-pt deviation δ(x_1, ..., x_6) as the difference between the experimental G^{(6)}(x_1, ..., x_6) and the theoretical prediction of (3.19) (the GP piece plus the λ̄ correction), normalized by the experimental G^{(6)} to obtain a sense of the size of the deviation. For perturbation theory to hold, |δ(x_1, ..., x_6)| ≪ 1 is required. Since κ is negligible in the limit of large cutoff Λ, we expect |δ(x_1, ..., x_6)| to converge to 0 in that limit. For each NN architecture, |δ(x_1, ..., x_6)| is given by the difference between the orange line and 1.0 in Figures 2, 3 and 4.
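The extraction just described can be implemented directly, as in the sketch below (our own code): given a measured 4-pt tensor on a fixed input grid and the GP kernel, it returns the rank-4 tensor λ_m of (3.23)-(3.25) for d_in = 1. The argument names and the numerical integration scheme are assumptions, and G4_exp stands for the experimentally measured 4-pt function.

```python
import itertools
import numpy as np

def lambda_m(G4_exp, kernel, xs, cutoff, n_grid=20_001):
    """Rank-4 tensor lambda_m[i,j,k,l] = (G4_GP - G4_exp) / (24 * int dy K_iy K_jy K_ky K_ly),
    for inputs xs with d_in = 1, following the constant-coupling extraction of (3.23)."""
    y = np.linspace(-cutoff, cutoff, n_grid)
    dy = y[1] - y[0]
    Ky = np.array([kernel(x, y) for x in xs])                    # shape (len(xs), n_grid)
    Kx = np.array([[kernel(a, b) for b in xs] for a in xs])      # kernel on external points
    n = len(xs)
    lam = np.zeros((n, n, n, n))
    for i, j, k, l in itertools.product(range(n), repeat=4):
        g4_gp = Kx[i, j] * Kx[k, l] + Kx[i, k] * Kx[j, l] + Kx[i, l] * Kx[j, k]
        integral = np.sum(Ky[i] * Ky[j] * Ky[k] * Ky[l]) * dy
        lam[i, j, k, l] = (g4_gp - G4_exp[i, j, k, l]) / (24.0 * integral)
    return lam

# Usage sketch: lam = lambda_m(G4_exp, kernel, xs, cutoff=1e5); the measured coupling is
# lam.mean(), and lam.std() diagnoses whether the coupling is (nearly) constant.
```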

Gauss-net
In Figure 2 we show the results of Gauss-net experiments, which were carried out with the same set of parameters introduced in Section 2.3. All experiments in this section were performed with n_nets = 10^6. The fixed-cutoff experiments are done at fixed cutoff Λ = 10^5. In the left plot we present measurements for the rank-4 tensor λ_m at fixed cutoff. Despite being a rank-4 tensor, we see that there is very little variance across the elements, indicating that λ_m is effectively constant; we will discuss this in a moment. In the right plot we present the GP contribution to G^{(6)}, normalized by G^{(6)} so that a correct theoretical prediction would be 1; we see G^{(6)}_GP falls short of this. However, we also demonstrate excellent match to experiment when the perturbative correction in λ is added, implying that δ is effectively zero for all values of the cutoff; the EFT has correctly predicted the experimental measurements of the 6-pt function.
As discussed in Section 3.3, by technical naturalness a parameter is allowed to be small when setting it to zero restores a symmetry. We observe from the (lack of) variance in the left plot in Figure 2 that δλ/λ is indeed small, i.e. λ is effectively a constant. This is expected because λ is technically natural for Gauss-net: translation invariance is recovered when δλ vanishes, as the Gauss-net GP kernel is translation invariant. In fact, this was by design. We started with a translation-invariant kernel since we desire an example with coupling constants rather than coupling functions, which we expected would be enforced by technical naturalness.

ReLU-net
Next, we read off the λ_m tensor and its components for ReLU-net in Figure 3. The fixed-cutoff experiments have the same parameters as Gauss-net, but the fixed-width λ_m experiments and G^{(6)} predictions are done at width N = 50. The widths were chosen so the 4-pt and 6-pt functions were those of NGPs and above the background level, as seen in Figure 1. As with Gauss-net, we show in the left plot all elements of the rank-4 tensor λ_m at various widths and fixed cutoff Λ = 10^5. The central line represents λ̄, which is the mean of the tensor elements of λ_m, and the shaded region represents the 95% confidence interval. The variance across the tensor elements is nontrivial but small, indicating a coupling with a small space-dependent piece. Using the measured value of λ, we show the results of the corresponding G^{(6)} prediction in the right plot. We see similarly that the GP prediction G^{(6)}_GP does not account for the 6-pt function G^{(6)}, and the λ̄ contribution gives an additive correction.
It is not expected that any symmetry is restored as δλ/λ̄ vanishes. Specifically, the ReLU GP kernel is not translation invariant, and therefore δλ/λ̄ cannot be argued to be small on the grounds of technical naturalness with respect to T. We experimentally see that δλ ≠ 0, i.e., λ has a piece that varies with input, though δλ ≪ λ̄ in this case.

Figure 4: (left) Measured λ_m tensor elements at different widths for Erf-net with fixed cutoff. (right) GP prediction alone, and GP prediction plus λ̄ correction, of the 6-pt function G^{(6)} for Erf-net at width N = 10, normalized with respect to G^{(6)}.

Erf-net
We obtain the λ_m tensor and its independent components from the experimental results for the third and final architecture, Erf-net, in Figure 4. As with ReLU-net, the fixed-cutoff experiments have the same parameters as Gauss-net, but the fixed-width experiments were performed at width N = 10. At this width we again see that the 4-pt and 6-pt functions are those of NGPs and that the signal is above the background level in Figure 4.
As with the previous two networks, we show all elements of the tensor λ_m in the left plot at the given widths, where the central line is again the mean λ̄ and the shaded region represents the 95% confidence interval. The variance across the tensor elements is larger than for the other networks, but still small compared to the mean value λ̄. The right plot shows the GP prediction G^(6)_GP being corrected by the λ̄ term; together they give a very good approximation of the 6-pt function G^(6). As in the ReLU case, technical naturalness with respect to T does not apply here, which we see borne out experimentally by the fact that δλ ≠ 0.

Minimal Non-Gaussian Process Likelihoods with Wilsonian Renormalization
In Section 3 we demonstrated that we can use a Wilsonian EFT approach to neural networks to make experimentally verifiable predictions for correlation functions of the associated NGPs. This procedure can be carried out for any fixed cutoff Λ that yields a perturbative NGP. In doing so we claimed that we could ignore higher-than-quartic terms in the NGP effective action, and promised to justify the assumption. We do so now, and will also explain and experimentally demonstrate the relationship between the EFTs at different values of Λ.
The correct physics framework for these ideas is the Wilsonian renormalization group. This seminal idea in quantum field theory has led to some of the deepest results in both condensed matter and high energy physics, including multiple Nobel prizes, and is presented at length in any modern QFT textbook. Here we give a streamlined perspective on it, focused on what is relevant for understanding applications to neural networks.
Our explanation of the renormalization group applied to neural network NGPs will be grounded in experiments, in particular in the sort of neural network experiments that we have carried out. Consider any experiment that draws an ensemble of neural networks f_α and computes their outputs

f_α(x_i),   x_i ∈ S_in,   (4.1)

on a fixed set of inputs S_in. Given these measured outputs, the experimental correlation functions are measured as ensemble averages of products of outputs,

G^(n)(x_1, ..., x_n) = (1/n_nets) Σ_α f_α(x_1) ⋯ f_α(x_n),   (4.2)

which contain essential information about the distribution of neural networks induced by the chosen architecture and parameter distribution, as well as by data if the networks have been trained. These results are concrete experimental facts, and the goal of theory is to explain them.
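For concreteness, the following is a minimal sketch of this experimental pipeline: it draws an ensemble of random single-hidden-layer networks, evaluates them on fixed inputs, and estimates correlation functions as ensemble averages of products of outputs. The widths, variances, inputs, ensemble size, and the tanh activation are illustrative choices rather than the settings used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, N, n_nets = 1, 1000, 10_000            # illustrative sizes
sigma_W, sigma_b = 1.0, 1.0
x = np.array([[0.2], [0.4], [0.6], [0.8]])   # fixed inputs x_1, ..., x_4

def draw_outputs(x):
    """One draw f_alpha from the parameter prior, evaluated on all inputs."""
    W0 = rng.normal(0, sigma_W / np.sqrt(d_in), (N, d_in))
    b0 = rng.normal(0, sigma_b, N)
    W1 = rng.normal(0, sigma_W / np.sqrt(N), N)
    b1 = rng.normal(0, sigma_b)
    return np.tanh(x @ W0.T + b0) @ W1 + b1  # tanh as a stand-in activation

outs = np.stack([draw_outputs(x) for _ in range(n_nets)])   # shape (n_nets, 4)

# Experimental correlators as ensemble averages of products of outputs.
G2 = outs.T @ outs / n_nets                                  # G^(2)(x_i, x_j)
G4_1234 = np.mean(outs[:, 0] * outs[:, 1] * outs[:, 2] * outs[:, 3])
print(G2)
print(G4_1234)
```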
To that end, noting that wide varieties [33] of both trained and untrained networks admit a limit in which they are drawn from Gaussian processes, we propose constructing theories of the neural networks away from the GP limit by perturbing it to an NGP, encoded in

ΔS = ∫ d^{d_in}x Σ_{O_l} g_{O_l} O_l(f(x)),   (4.3)

where the sum is over all operators O_l of scaling dimension l ≤ k for some fixed k, and the operators are themselves functions of the neural network outputs; in our cases they are monomials. This yields an action for the NGP, S_NGP = S_GP + ΔS, with distribution

P[f] ∝ e^{−S_NGP} = e^{−S_GP − ΔS}.   (4.4)

Using standard techniques reviewed in Appendix A and carried out in Section 3, the correlation functions G^(n)(x_1, ..., x_n) may be computed perturbatively in the coefficients g_{O_l} and matched to experiments. The GP kernel plays a crucial role, though the GP predictions for n-pt functions are modified by various "interaction" terms of strength g_{O_l}.
However, in carrying out such computations one regularly encounters divergences arising from integrals over the space of inputs to f, which are neural network inputs in our framework and usually position or momentum space in QFT. Needless to say, these divergences prevent the theoretical correlation functions from matching the finite ones measured in experiments. In QFT, these are the infinities that in the 1960s prompted complaints of being "swept under the rug," but that viewpoint is archaic and incorrect from a modern perspective: the issue is resolved by a proper understanding of renormalization, particularly the one developed by Wilson.
Since the divergences arise from the boundaries of the ΔS integral being ±∞, it is natural to attempt to get rid of them by cutting the integral off, as was physically motivated in Section 3.1. That is, replace (4.3) by

ΔS_Λ = ∫_{|x| ≤ Λ} d^{d_in}x Σ_{O_l} g_{O_l}(Λ) O_l(f(x)).   (4.5)

Doing this for some fixed value of Λ such that |x_i| ≪ Λ for all x_i ∈ S_in, one may use the experimental results for G^(n)(x_1, ..., x_n) to extract the values of g_{O_l}(Λ) at that value of Λ, which may then be used to make other predictions that may be experimentally verified; this is what we did in Section 3.4. Since the operators are truncated at dimension l ≤ k, there are a finite number of couplings g_{O_l}(Λ), yet an infinite number of n-point functions receive corrections from them; i.e., a finite number of measured quantities makes an infinite number of predictions.
We now arrive at the essence of the Wilsonian renormalization group. In passing from (4.3) to (4.5), we passed from a single ΔS to a one-parameter family ΔS_Λ with parameter Λ. Yet there is one set of experimental n-pt functions and a continuous infinity of theories associated with ΔS_Λ, each of which, by the above process, may make correct predictions for the experiments. It is clear, therefore, that

d G^(n)(x_1, ..., x_n) / dΛ = 0,   (4.6)

i.e., the physical correlation functions cannot depend on the arbitrary choice of cutoff. Applying this to the theoretically computed n-pt functions yields differential equations that describe how the couplings g_{O_l}(Λ) vary as Λ is varied. This is to be expected: if we have an EFT making correct predictions at some value of Λ with |x_i| ≪ Λ, we should be able to do the same for an infinitesimally shifted cutoff Λ + δΛ, which will lead to slightly different couplings g_{O_l}(Λ + δΛ). The differential equations are referred to as renormalization group equations (RGEs); they include the β-function associated to each coupling, defined as

β(g_{O_l}) := ∂ g_{O_l} / ∂ log Λ,   (4.7)

which may be extracted from (4.6). The RGEs give rise to a flow in the couplings induced by varying Λ, known as renormalization group (RG) flow. Now consider an RG flow in the space of couplings that begins at a point where they are all non-trivial. Given this initial condition, the β-functions determine a trajectory through coupling space, and the opposite directions along the trajectory correspond to raising and lowering Λ. Couplings that decrease, increase, or stay the same along one direction of the flow are said to be irrelevant, relevant, and marginal, respectively; their corresponding operators inherit the same names. The names are apt: go far enough along an RG flow and the irrelevant couplings go to zero and may be ignored. Upon switching the direction of the flow the names are reversed, though in physics one usually considers unidirectional flows to low energies (long distances), since at higher energies the flows may be altered by the existence of a yet-undiscovered particle.
Consider the effective action (3.12) in light of this discussion of irrelevant, relevant, and marginal operators. It contains 4-pt and 6-pt couplings only, λ and κ. From (3.7), the associated couplings have input-space dimensions [λ] = −d_in − 2[K] and [κ] = −d_in − 3[K]. In the limit of large cutoff, the n-pt functions receive negligible corrections from κ in comparison to the corrections due to λ, and one may effectively ignore κ. Given that λ is irrelevant, one may wonder why it cannot also be ignored. The proper prescription is that the leading operators necessary to account for a given phenomenon must be included. If we ignored all couplings g_k because they are irrelevant, we would be back in the GP limit and unable to account for the finite-width experiments. To explain the experiments, one must keep the most relevant coupling; g_3 = 0 since the 3-pt function is zero, and therefore λ = g_4 is the leading coupling, despite being termed irrelevant.
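The following small sketch illustrates the dimensional-analysis version of this statement: under the schematic flow β(g) = ∂g/∂ log Λ = [g] g, a coupling scales as a pure power of Λ, so the coupling with the more negative dimension (a stand-in for κ) becomes negligible relative to the less negative one (a stand-in for λ) at large cutoff. The numerical dimensions and initial conditions below are placeholders, not the measured values.

```python
import numpy as np

# Illustrative only: beta(g) = [g] * g integrates to g(Lambda) ~ Lambda^{[g]}.
# The dimensions are hypothetical placeholders standing in for [lambda] and
# [kappa]; the point is that the more negative dimension decays faster.
d_in = 1
dim_lambda, dim_kappa = -(d_in + 2), -(d_in + 3)   # placeholder dimensions

Lam = np.logspace(1, 5, 5)
lam0, kap0 = 1e-2, 1e-2                            # equal initial conditions
lam = lam0 * (Lam / Lam[0]) ** dim_lambda
kap = kap0 * (Lam / Lam[0]) ** dim_kappa

for L, l, k in zip(Lam, lam, kap):
    print(f"Lambda={L:9.1f}  lambda={l:.3e}  kappa={k:.3e}  kappa/lambda={k/l:.3e}")
```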
As an example of this phenomenon in physics, consider an effective field theory of the blue sky, which describes the scattering of light off of neutral spin-0 particles in the atmosphere. The lowest dimension operator that can describe this scattering process is proportional to φ†φ F_{μν} F^{μν}, where the φ field and its conjugate represent the spin-0 particles and F is the gauge-invariant field strength tensor of electromagnetism associated with the photon. Following the particle physics convention, the coupling is written as a dimensionless constant c times the appropriate inverse power of the cutoff Λ. This operator is irrelevant, but since it is the lowest dimension operator that gives rise to gauge-invariant scattering of light off of neutral atoms, it must be included in the EFT. In fact, it reproduces the Rayleigh cross-section that explains why blue light scatters more strongly than red.

Neural Network Non-Gaussian Process Flows with β-functions
Having introduced the central ideas of the renormalization group, we now address more concretely how these β-functions may be extracted from cutoff-dependent correlation functions, organizing the calculations according to model (architecture) independent and dependent pieces. This split arises naturally for some architectures. For instance, some terms in the kernels are shared amongst all single layer fully-connected networks, while others depend on the specific architecture, as determined by the activation function.
The kernels associated to a class of neural network architectures can be expressed in terms of a model independent (within the architecture class) term α and a model dependent term ς,

K(x, x') = α + ς(x, x'),   (4.10)

where we assume that the first and second terms are input independent and input dependent, respectively, since this is true for the networks we study and, more generally, for deep fully-connected networks; it is straightforward to study the case where the first term is input dependent as well. For instance, the kernels in Section 2.3 allow for explicit comparison. Substituting this form of the kernel into the 4-pt function in (3.18) lets us rewrite G^(4)(x_1, x_2, x_3, x_4) in terms of model independent terms γ_{4,i} and their model dependent counterparts. The subscripts 4 and i refer to the order of the correlation function and to the O(i) corrections to the GP expression, respectively.
Similarly, (4.10) can be substituted into the 6-pt function given in (3.19) to express it in terms of model independent terms γ_{6,i} and their model dependent counterparts. Irrespective of the NN structure, the terms γ_{n,0} and their model dependent counterparts are independent of the integration variable x, as they are tree level corrections to the n-pt functions, independent of any interaction vertices in the respective Feynman diagrams. The RG equations (4.13) and (4.14) are obtained by taking derivatives of the 4-pt and 6-pt functions with respect to the log of the cutoff scale Λ. In the limit of large Λ, κ is negligible and the last terms in (4.13) and (4.14) vanish; in that case they simplify, and solving the resulting equations gives the RG equation of λ, i.e. its β-function β(λ).

Experiments: Flows in Single-Layer Networks
Erf-net
In the large cutoff limit where κ becomes negligible, as in our experiments, the RG equation for the coupling λ of Erf-net can be read off from the variation of G^(4) with respect to log(Λ). Here we derive the Erf-net RG equation for λ and, in doing so, show that the scaling dimension of the Erf-net kernel is nonzero, so that the 6-pt coupling is less relevant than the 4-pt coupling for effectively computing all correlation functions of the NGP. Recall that the Erf-net kernel is given by

K_Erf-net(x, x') = σ_b² + σ_W² (2/π) sin⁻¹( 2K(x, x') / √[(1 + 2K(x, x))(1 + 2K(x', x'))] ),   (4.15)

where K(x, x), K(x, x'), and K(x', x') are the kernels of the linear layer defined in (B.7). Comparing this with (4.10) lets us identify the model independent and model dependent terms of Erf-net as

α = σ_b²,   ς(x, x') = σ_W² (2/π) sin⁻¹( 2K(x, x') / √[(1 + 2K(x, x))(1 + 2K(x', x'))] ).   (4.16)
We can define a family of functions ς_{x_i}(x), parameterized by an external vertex x_i and an internal vertex x, as ς_{x_i}(x) := ς(x_i, x). Note that for our experimental values σ_b = σ_W = 1 and d_in = 1, 2, 3, and for our choice of external vertices x_i, the |ς_{x_i}(x)| are bounded. The scaling dimension of the kernel is [K] ≈ 0, but a careful analysis shows that in the regimes we are studying [K] > 0 is a very small computable number. The scaling of the couplings can be obtained from (4.8) as [λ] = −d_in − 2[K] and [κ] = −d_in − 3[K], showing that κ becomes irrelevant faster than λ as the cutoff scale Λ increases.
In the limit of large cutoff Λ, we ignore the κ-dependent terms, and the 4-pt function given in (3.18) can be re-expressed in terms of the kernel's constant term α and its model dependent term ς_{x_i}(x), as in (4.10). The resulting integrals define quantities ξ_i that satisfy bounds, for appropriate choices of constants c_1, c_2, c_3, c_4, following from the bounds on |ς_{x_i}(x)|; this indicates that the ξ_i scale with Λ^{d_in} or smaller powers of Λ. In the limit of large Λ, the RG equation of λ is obtained by taking the derivative of the 4-pt function at O(λ) with respect to log(Λ). If ξ_i scales as Λ^{d_in}, we have, to this order in perturbation theory, ∂λ/∂ log Λ = −d_in λ. Integrating, we obtain

log λ(Λ) = −d_in log Λ + p_1,   (4.23)

where p_1 is the constant of integration, a form useful for comparing to experiment. Next, we verify the above prediction against the variation of the experimental values of λ_m at different cutoff scales Λ, using the parameters from Table 2 for d_in = 1 and the inputs from Table 4 for d_in = 2, 3. For this, (3.26) is used to obtain the rank-4 tensor λ_m for each choice of cutoff scale Λ, given in (3.31). We plot all independent tensor elements of λ_m against the respective cutoff scales Λ in a log-log plot for various values of d_in; this shows both the input dependence of λ_m and its variation with respect to the cutoff. The best fit lines in Fig. 5, and specifically their slopes, are in excellent agreement with the theoretical prediction (4.23).
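The slope extraction behind Fig. 5 amounts to a linear fit in log-log space. The sketch below illustrates it on synthetic data generated with a known slope; in the actual experiments the input would be the measured λ_m(Λ), and −d_in is used here as the approximate predicted exponent, up to the small [K] correction discussed above.

```python
import numpy as np

# Synthetic stand-in for lambda_m(Lambda): a power law with noise.  The slope,
# noise level, and cutoff grid are arbitrary illustrative choices.
rng = np.random.default_rng(0)
d_in = 2
predicted_slope = -d_in                 # approximate large-cutoff prediction

Lam = np.logspace(2, 5, 10)
log_lam_m = predicted_slope * np.log10(Lam) + 1.3 \
            + 0.02 * rng.standard_normal(Lam.size)

# Least-squares fit of log(lambda_m) vs log(Lambda); the slope estimates the
# RG exponent and is compared against the prediction.
slope, intercept = np.polyfit(np.log10(Lam), log_lam_m, 1)
print(f"fitted slope = {slope:.3f}, predicted = {predicted_slope}")
```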

Gauss-net
In a similar way, in the limit of large cutoff scales, where κ becomes negligible, the variations in 4-pt functions of Gauss-net with respect to log(Λ) can be used to obtain the RG equation of λ for Gauss-net.
Recall the Gauss-net kernel from Section 2.3; it is translation invariant. Comparing it with (4.10) lets us identify the model independent term α and the model dependent term ς(x, x'). Note that |ς(x, x')| < 1 for all x ≠ x', and |ς(x, x')| = 1 when x = x'. The scaling dimension of the kernel is [K] ≈ 0, but a careful analysis shows that in the regimes we are studying [K] < 0 is very small in magnitude.
The scaling of the coupling constants can be obtained from (4.8) in the same way as for Erf-net, giving [λ] = −d_in − 2[K] and [κ] = −d_in − 3[K]. Since here [K] < 0, this suggests that λ scales to zero slightly faster than κ, but we see in our experiments that the λ contributions nevertheless dominate over the κ contributions; for instance, the 6-pt predictions for Gauss-net needed only the λ contributions. This will be reinforced below, when we see that the experimentally correct RGEs may be computed using only the λ contributions. The origin of this apparent small discrepancy could be the initial conditions for κ and λ prior to scaling, an unseen symmetry, or a breakdown of the naive scaling arguments; we find the explicit RGE calculations below, and their match to experiments, the most compelling. As before, we define a family of functions ς_{x_i}(x) parameterized by an external vertex x_i and an internal vertex x. In the limit of large cutoff Λ, the 4-pt function given in (3.18) can be re-expressed in terms of α and ς_{x_i}(x), with the integrals collected into quantities ξ_i. The form of ς ensures that ξ_1 is bounded and does not grow with Λ for any Λ above some O(1) number, and since this holds and |ς_{x_i}(x)| < 1, it can further be shown that the remaining ξ_i behave similarly. At large Λ, the RG equation of λ is obtained by taking the derivative of the 4-pt function at O(λ) with respect to log(Λ). We have shown that ξ_i(Λ) and ∂ξ_i(Λ)/∂ log(Λ) are negligible in comparison to the other terms for each i, so they can be ignored at high Λ, and the RG equation is ∂λ/∂ log Λ = −d_in λ. Integrating, we have

log λ(Λ) = −d_in log Λ + p_2,   (4.31)

where p_2 is the constant of integration; this form is useful for comparing to experiment. We verify the above RG equation against the log-log variation of the experimental values of λ_m at different cutoff scales Λ. For this, (3.26) is used to obtain the rank-4 tensor λ_m at each choice of cutoff scale, given in (3.31). All independent tensor elements of λ_m are plotted against the respective Λ, in a log-log format, for different values of d_in. Fig. 6 shows that λ is a true coupling constant, in agreement with the translation invariance of the Gauss-net kernel. Further, the best fit lines in Fig. 6 are in excellent agreement with the theoretical EFT prediction (4.31); the slope is −d_in to excellent approximation.

ReLU-net
Next we obtain the RG equation for the ReLU-net architecture in the limit where κ becomes negligible. Recall the ReLU kernel from Section 2.3. Comparing it with (4.10), the model dependent terms can be separated further into a bounded θ-dependent component h_2 and an unbounded component h_1.
The 4-pt function receives a large contribution from the λ-corrections when the internal vertex x is large in magnitude and x' is one of the inputs in our experiments; in this limit h_1(x, x') can be approximated as in (4.33), and the 4-pt function becomes an integral over the cutoff region. The integrals are carried out by rotating input space to spherical coordinates, where the |x| and θ coordinates become orthogonal. The θ-integrals can then be carried out independently, yielding constants c_1, c_2, c_3, and c_4 with scaling dimension 0 in input space; this shows that h_2(θ) also has scaling dimension 0 in input space.
Combining this with (4.33), the scaling dimension of the ReLU kernel is [K] = 2 in the regime where the internal vertex is large in magnitude and x' is one of the inputs in our experiments. Following (4.8), the 4-pt and 6-pt couplings scale as [λ] = −d_in − 4 and [κ] = −d_in − 6, respectively. This implies that κ becomes increasingly negligible with respect to λ as Λ becomes larger. In particular, the RG equation for λ can be obtained entirely from the O(λ) corrections to the 4-pt function; i.e., the κ corrections may be ignored.
Evaluating the full 4-pt function at O(λ) in the large cutoff limit, we obtain an expression in terms of quantities j_1, j_2, j_3, j_4, j_5, defined by redefining the products of external vertices with other constants; a trivial check shows that they are independent of the Λ scaling as well. Taking derivatives with respect to log(Λ) gives the RG equation, and solving it yields

log λ(Λ) = −(d_in + 4) log Λ + const.   (4.38)

The best fit lines in Fig. 7 are in excellent agreement with the theoretical prediction (4.38); indeed, the slopes are ∼ −(d_in + 4) to excellent approximation. Here λ appears to be a constant, but there is a small amount of variance, of the same order as displayed in Figure 3, that is not visible due to the scale of the y-axis.

Conclusions
In this paper we have developed a correspondence between neural networks and quantum field theory, providing a theoretical understanding for a function-space approach to neural networks.
The central idea is to treat neural network architectures f_{θ,N} that become Gaussian processes (GPs) in the N → ∞ limit using techniques from Wilsonian effective field theory (EFT). Specifically, at finite N the distribution on function space from which the neural networks are drawn is no longer Gaussian; the GP distribution is corrected to an associated non-Gaussian process (NGP) distribution that may be treated using EFT. The corrections have coefficients, known in physics as couplings, that encode interaction strengths. They may be functions or constants, a distinction we understood in terms of 't Hooft's notion of technical naturalness.
Overall, we focused our developments on the two central ideas in EFT, applied in the context of neural networks:
• EFT is effective. A finite number of measured couplings yields many verifiable predictions.
• EFT yields minimal NGP likelihoods. Couplings vary with cutoff according to differential equations that govern flow on coupling space. Along the flow, some couplings become negligible, yielding a likelihood where all but the most important pieces can be safely ignored.
These were verified in concrete experiments in the simplest class of models admitting a GP limit: single layer fully-connected networks. We strongly emphasize, however, that our techniques are applicable to any architecture admitting a GP limit. Since we provided a concrete summary of our results in the introduction, we omit such a summary here.
Instead, we would like to conclude with some general comments and outlook. Physicist readers may not be familiar with thinking of neural networks from a function space point of view, despite the fact that they think this way in QFT. However, many have trained a randomly initialized neural network, and then repeated it a number (say, M ) of times to verify stability of the results. In doing so one is drawing M untrained neural networks from the prior induced by the parameter distributions and network architecture, and training the networks turns them into draws from some other distribution, the trained distribution. From this perspective, the process of supervised learning is simply learning the one-point function of the trained distribution; given an ensemble of trained nets, one should generate predictions from the ensemble expectation value of the outputs, the one-point function. Due to the prevalence of GP limits for many different architectures, in both randomly initialized and appropriately trained networks, the desired distributions on function space are nearly Gaussian, and therefore amenable to QFT techniques.
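The following is a minimal numpy sketch of this ensemble view: M single-hidden-layer networks are drawn from the prior, each is trained briefly by gradient descent on a toy regression task, and predictions are formed from the ensemble mean of the outputs, i.e. the one-point function of the trained distribution. All sizes, learning rates, and the target function are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, steps, lr = 20, 64, 500, 0.05
x_train = np.linspace(-1, 1, 16)[:, None]
y_train = np.sin(np.pi * x_train)           # toy target function
x_test = np.linspace(-1, 1, 101)[:, None]

def forward(params, x):
    W0, b0, W1, b1 = params
    return np.tanh(x @ W0 + b0) @ W1 + b1

ensemble_preds = []
for _ in range(M):
    # Draw one network from the (illustrative) prior over parameters.
    params = [rng.normal(0, 1, (1, N)), rng.normal(0, 1, N),
              rng.normal(0, 1 / np.sqrt(N), (N, 1)), rng.normal(0, 1, 1)]
    for _ in range(steps):
        W0, b0, W1, b1 = params
        h = np.tanh(x_train @ W0 + b0)       # hidden activations
        err = (h @ W1 + b1) - y_train        # residual on the training set
        # Hand-written backpropagation for the mean-squared-error loss.
        gW1 = h.T @ err / len(x_train)
        gb1 = err.mean(0)
        dh = (err @ W1.T) * (1 - h ** 2)
        gW0 = x_train.T @ dh / len(x_train)
        gb0 = dh.mean(0)
        params = [W0 - lr * gW0, b0 - lr * gb0, W1 - lr * gW1, b1 - lr * gb1]
    ensemble_preds.append(forward(params, x_test))

one_pt = np.mean(ensemble_preds, axis=0)     # ensemble 1-pt function
print("ensemble prediction at x = 0:", one_pt[len(x_test) // 2])
```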
In the ML literature, it was (borrowing the inspirational language) "dreamed" in [31] that flows in distributions of preactivations are akin to renormalization group flows; we would like to comment on this in light of our work. We did not emphasize it, so as not to introduce too many ideas, but our techniques apply not only to GPs associated with network outputs, but also to those that may arise at intermediate layers. For instance, in deep fully-connected networks the central limit theorem may be applied [3] not just to the network outputs, but also to the preactivations of the hidden layers, leading to an EFT description of the NGP associated with each layer's preactivations. However, the details of the NGP effective action associated to each layer depend on both the previous layers' activation functions and the width of that layer, i.e., the number of preactivations, which is the dimension of the GP distribution, or the number of quantum fields in the correspondence. This gives rise to structural differences in the effective actions that go beyond a simple flow in the couplings, so in general the relationship between preactivation flows and RG flows is (again borrowing language) a "spiritual resemblance," not an exact correspondence. Choosing all hidden layers to have the same non-linear activation function and width may make the analogy more direct.
Continuing down this direction, at a given layer of width N one could write the NGP effective action (log-likelihood) such that the N preactivations depend on the preactivations of a previous layer of width M, rather than on the d_in inputs to the overall network. In this case it would naively appear that the EFT for the N preactivations lives in M-dimensional space, rather than in d_in-dimensional space. These would be two EFT descriptions of the same preactivation NGP, but in different dimensions. What seems like a puzzle here (the arbitrariness of the dimensions) is likely resolved by a careful consideration of dimensionality, namely the first preactivation distribution having d_in-dimensional support in the M-dimensional space.
Equipped with a sharp NN-QFT correspondence, it is natural to try to import other ideas from QFT into the study of neural networks from a function space perspective. Perhaps those, together with the ideas implemented in this paper, can lead to a richer theoretical understanding of the many empirical successes in deep learning over the last decade.

Single and Multivariable Cases
All Gaussian integrals can be evaluated exactly, starting from

∫_{−∞}^{∞} dx e^{−a x²/2} = √(2π/a).   (A.1)

The associated 2n-pt function is the expectation value of the single variable operator x^{2n}; it can be obtained by repeatedly differentiating (A.1) by −2(d/da), which after normalization gives

⟨x^{2n}⟩ = (2n − 1)!! / a^n.   (A.2)

The (2n − 1)!! factor in (A.2) can be thought of as the number of ways to connect 2n points in pairs, which is a version of what is known as Wick contraction in the physics literature. Using this, ⟨x^{2n}⟩ can be computed graphically in terms of Feynman diagrams. Let us compute the 4-pt function: four vertices can be Wick contracted into two pairs in three distinct ways, and each connected pair contributes a factor of 1/a to the Gaussian integral in (A.2). The corresponding Feynman diagram is a sum of three distinct diagrams, each corresponding to a unique Wick contraction and contributing a factor of 1/a² from its two pairs. The total of the three diagrams, 3/a², matches (A.2) for n = 2.
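This counting can be checked numerically; the short sketch below compares the moments of the normalized Gaussian distribution, computed by quadrature, with (2n − 1)!!/a^n for the first few n. The value of a is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.integrate import quad

a = 1.7  # arbitrary illustrative value

def moment(p):
    """<x^p> in the normalized distribution exp(-a x^2 / 2)."""
    num, _ = quad(lambda x: x**p * np.exp(-a * x**2 / 2), -np.inf, np.inf)
    den, _ = quad(lambda x: np.exp(-a * x**2 / 2), -np.inf, np.inf)
    return num / den

def double_factorial(m):
    return 1 if m <= 0 else m * double_factorial(m - 2)

for n in range(1, 4):
    wick = double_factorial(2 * n - 1) / a**n
    print(f"n={n}: numeric={moment(2*n):.6f}  (2n-1)!!/a^n={wick:.6f}")
```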
More generally, a Gaussian integral including a source term is

Z[J] = ∫_{−∞}^{∞} dx e^{−a x²/2 + Jx},   (A.3)

which evaluates to

Z[J] = √(2π/a) e^{J²/2a}.   (A.4)

The 2n-pt function ⟨x^{2n}⟩ can alternatively be calculated by acting with the derivative δ/δJ on (A.3) 2n times and then setting the source J = 0, which retrieves the Gaussian integral.
We can generalize (A.4) to a multivariate Gaussian integral by promoting a to a real symmetric N × N matrix A_ij and the variable x to an (N × 1)-dimensional vector x_i,

Z[J] = ∫ d^N x e^{−½ xᵀA x + Jᵀx}   (A.5)
     = √((2π)^N / det A) e^{½ JᵀA⁻¹J}.   (A.6)

The general expression for the n-pt function can be obtained by applying Wick contraction to (A.6): odd moments vanish, and even moments are sums over pairings of products of (A⁻¹)_{ij}. For n = 1 and n = 2 these simplify into

⟨x_i x_j⟩ = (A⁻¹)_{ij},   ⟨x_i x_j x_k x_l⟩ = (A⁻¹)_{ij}(A⁻¹)_{kl} + (A⁻¹)_{ik}(A⁻¹)_{jl} + (A⁻¹)_{il}(A⁻¹)_{jk}.

Continuous Number of Variables
When the variables describing a Gaussian process are promoted to be continuous, the entire process can be defined in terms of a partition function,

Z_GP[J] = ∫ df e^{−S_GP + ∫ d^{d_in}x J(x) f(x)},   (A.11)

where S_GP = ½ ∫ d^{d_in}x d^{d_in}y f(x) Ξ(x, y) f(y), and Z_GP,0 := Z_GP[J = 0] = ∫ df e^{−S_GP} is the normalization constant.
Ξ(x, y), a real symmetric 2-tensor, is the continuous version of the inverse of the covariance matrix. We could evaluate (A.11) by a basis transformation such that Ξ(x, y) is diagonal; here, however, we choose a more direct evaluation of Z_GP by rewriting it in the form (A.12). The terms in the last exponential of (A.12) can be expanded, together with a similar identity for J(x)f(x), and the remaining term in the last exponential of (A.12) simplifies via the δ^(d)(w − y) resulting from the x-integral, followed by renaming the integration variables to x and y. Using these, the integral over output space in (A.12) can be simplified; further, by diagonalization, Z_GP,0 = [2π/det(Ξ)]^{1/2}. Putting it all together, we have the integrated form of the partition function

Z_GP[J] = Z_GP,0 exp( ½ ∫ d^{d_in}x d^{d_in}y J(x) K(x, y) J(y) ),   (A.16)

where K(x, y) is the functional / operator inverse of Ξ(x, y), i.e. ∫ dy K(x, y) Ξ(y, z) = δ(x − z). The n-pt function of the Gaussian process over continuous variables is ⟨f(x_1) ⋯ f(x_n)⟩; it may be computed in a canonical way by taking functional J-derivatives, which can be evaluated directly from (A.16).
For the sake of thoroughness we demonstrate the calculation of some of the n-pt functions. The 1-pt function vanishes,

G^(1)(x_1) = 0,

because the functional derivative leaves behind a factor of J, which is then set to zero. The 2-pt function is

G^(2)(x_1, x_2) = K(x_1, x_2),

where the last step follows from the symmetry of the covariance K(x_1, x_2). The general expression for the n-pt function can be obtained by a similar calculation: it vanishes for odd n and is a sum over Wick pairings of kernels for even n (A.21). From (A.21) we deduce that the 3-pt function of a Gaussian process is identically zero, and the 4-pt function is

G^(4)(x_1, x_2, x_3, x_4) = K(x_1, x_2)K(x_3, x_4) + K(x_1, x_3)K(x_2, x_4) + K(x_1, x_4)K(x_2, x_3),

which may be expressed in terms of Feynman diagrams, as done in (2.17).
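The Wick structure of the GP 4-pt function can also be checked by sampling: drawing the function values at four fixed inputs from a multivariate Gaussian with covariance K and averaging the product reproduces the sum over the three kernel pairings. The RBF-style kernel and the inputs below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.1, 0.4, 0.7, 1.0])           # four fixed inputs

def K(a, b):
    return np.exp(-(a - b) ** 2 / 2.0)       # illustrative kernel choice

Kmat = K(x[:, None], x[None, :])

# Wick prediction: sum over the three pairings of {1, 2, 3, 4}.
wick = (Kmat[0, 1] * Kmat[2, 3]
        + Kmat[0, 2] * Kmat[1, 3]
        + Kmat[0, 3] * Kmat[1, 2])

# Monte Carlo estimate: sample GP values at the inputs and average the product.
samples = rng.multivariate_normal(np.zeros(4), Kmat, size=1_000_000)
mc = np.mean(samples.prod(axis=1))

print(f"Wick pairing prediction: {wick:.4f}")
print(f"Monte Carlo estimate:    {mc:.4f}")
```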

Non-Gaussian Integrals via Perturbation Theory
Small perturbations away from the Gaussian process can still be understood in terms of order-by-order perturbative corrections to the Gaussian integral. Let us assume that the small perturbations can be described by correction terms g_n f(x)^n added to S_GP. The associated new partition function is Z = ∫ df e^{−S_GP − ΔS}, and the n-pt functions (A.25) are ratios of f-integrals weighted by e^{−S_GP − ΔS}. The numerator and the denominator in (A.25) can be separately computed using Wick contractions and Feynman diagrams; they both contain some diagrams in which none of the internal vertices are connected to any of the external vertices. Such diagrams are called "vacuum bubbles", and any diagram containing a vacuum bubble cancels exactly between the numerator and denominator upon series expansion, and thus does not contribute to the n-pt functions. We take an example in which the non-Gaussian process is described by small perturbations away from the GP with ΔS = ∫ d^d x [g f(x)³ + λ f(x)⁴]. The denominator of (A.25) is expanded below, in terms of Feynman diagrams, up to terms quadratic in the coupling constants g and λ.
[Equation (A.26): the diagrammatic expansion of the denominator, consisting of 1, minus the one-vertex λ vacuum bubble (symmetry factor 3), plus two-vertex g² vacuum bubbles (factors 9 and 6), plus two-vertex λ² vacuum bubbles (factors 9, 72, and 24).] The three-point vertices correspond to the coupling constant g and occur at internal points labeled by w, and the four-point vertices correspond to λ and occur at internal points labeled by y.
The numerator of the 1-point function, up to the same orders in the correction terms, is given by the analogous diagrammatic expansion. Similarly, the numerator of the 2-pt function consists of the free propagator between x_1 and x_2, corrected by one-vertex λ diagrams (symmetry factors 3 and 12) and two-vertex g² diagrams (factors 9 and 6), among others. Each of the higher order n-pt functions contains different topologically distinct Feynman diagrams, with various permutations of external vertices; each of these diagrams should be treated separately when the external vertices are fixed. For illustration, however, we present such topologically distinct Feynman diagrams ignoring all external labels, and take care of the combinatorics of the external vertices by multiplying by the total number of copies.
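The vacuum-bubble cancellation and the leading-order corrections can be verified directly in the single-variable analog. For an action S = a x²/2 + λ x⁴ (omitting the cubic term here for simplicity), expanding numerator and denominator to first order and cancelling the bubbles gives ⟨x²⟩ ≈ 1/a − 12λ/a³; the sketch below compares this with the exact numerical integral at small λ. The values of a and λ are illustrative.

```python
import numpy as np
from scipy.integrate import quad

a, lam = 1.0, 0.01  # illustrative values; lam must be small for O(lam) accuracy

def expectation_x2(lam):
    """Exact <x^2> in the non-Gaussian weight exp(-a x^2/2 - lam x^4)."""
    weight = lambda x: np.exp(-a * x**2 / 2 - lam * x**4)
    num, _ = quad(lambda x: x**2 * weight(x), -np.inf, np.inf)
    den, _ = quad(weight, -np.inf, np.inf)
    return num / den

exact = expectation_x2(lam)
perturbative = 1 / a - 12 * lam / a**3   # leading-order perturbative prediction
print(f"numerical <x^2>     = {exact:.6f}")
print(f"perturbative O(lam) = {perturbative:.6f}")
```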
The numerator of the 3-pt function is given by an analogous set of diagrams.

A neural network with activation function φ and one hidden layer has the output function

f(x) = W_1 φ(W_0 x + b_0) + b_1.

When the weights and biases W_0, W_1, b_0, b_1 are i.i.d. and drawn from Gaussian distributions with mean 0 and standard deviations σ_{w_0}, σ_{w_1}, σ_{b_0}, σ_{b_1} respectively, the kernel or 2-pt function is determined by σ_{b_1}, σ_{w_1}, and the 2-pt function of the post-activation, V_φ[φ(W_0 x + b_0), φ(W_0 x' + b_0)]. The latter can be evaluated by two methods: the first is exact at any width, prescribed in [2]; the second holds in the GP limit, described in [33]. We will refer to the two methods by the superscripts "Williams" and "Yang" respectively. In the Yang prescription, K is the kernel or covariance function of the Gaussian distribution of f in the GP limit. For the three NN architectures discussed in this paper, the kernels evaluated by both prescriptions are shown to agree in the limit of infinite width, i.e. in the GP.

Erf-net
We now turn to the case of the Erf activation,

φ(z) = erf(z) = (2/√π) ∫_0^z dt e^{−t²}.   (B.5)

At any width, the associated kernel can be obtained by the method in [2]. This involves substituting (B.5) into (B.3), followed by computing the Gaussian integral and a transformation of variables. The kernel of Erf-net by this method is obtained to be

K_Erf-net(x, x') = σ²_{b_1} + σ²_{w_1} (2/π) sin⁻¹( 2K(x, x') / √[(1 + 2K(x, x))(1 + 2K(x', x'))] ).   (B.6)

The intermediate expressions, such as K(x, x'), are the kernels of the linear layer W_0 x + b_0, given by

K(x, x) = σ²_{b_0} + σ²_{w_0} x · x,   K(x', x') = σ²_{b_0} + σ²_{w_0} x' · x',   K(x, x') = σ²_{b_0} + σ²_{w_0} x · x'.   (B.7)

In the limit of infinite width, the method in [33] can be used. This involves substituting (B.5) into (B.4), defined in terms of the PDF of the output f(x). Evaluating the Gaussian integral, followed by a change of variables, results in the same expression as (B.6) for V^Yang_Erf-net(x, x'). This shows that the Erf-net kernel is exact at any width, given by (B.6), which we also record as

K_Erf-net(x, x') = σ²_{b_1} + σ²_{w_1} (2/π) sin⁻¹( 2K(x, x') / √[(1 + 2K(x, x))(1 + 2K(x', x'))] ).   (B.8)
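The arcsin factor appearing in (B.6) can be checked by Monte Carlo: drawing jointly Gaussian preactivations with the linear-layer covariances and averaging erf(u) erf(v) should reproduce the closed form. The covariance values below are illustrative, and the code assumes only the standard result of [2] quoted above.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
Kxx, Kxpxp, Kxxp = 1.3, 0.9, 0.6   # illustrative linear-layer covariances
                                   # (must form a valid covariance matrix)
cov = np.array([[Kxx, Kxxp], [Kxxp, Kxpxp]])
u, v = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

mc = np.mean(erf(u) * erf(v))
closed_form = (2 / np.pi) * np.arcsin(
    2 * Kxxp / np.sqrt((1 + 2 * Kxx) * (1 + 2 * Kxpxp)))

print(f"Monte Carlo: {mc:.4f}")
print(f"Closed form: {closed_form:.4f}")
```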

ReLU-net
Next we study the case of the ReLU activation,

φ(z) = max(0, z).   (B.9)

At finite width, (B.9) can be substituted into (B.3), followed by a rearrangement in terms of the Heaviside function Θ(x) = ½(1 + sgn(x)), with the arguments of the Heaviside functions chosen to lie in the first quadrant. After a basis transformation, the result is expressed in terms of K(x, x), K(x, x'), and K(x', x'), the intermediate kernels of the linear layer W_0 x + b_0 defined in (B.7).

Gauss-net
In the infinite width limit, the kernel or 2-pt function of the exponential activation layer is obtained using the PDF of the output of this layer, following [33], with J(y) = [δ(y − x) + δ(y − x')]. Field theory methods, described in the previous section, can be used to simplify this. We see that the kernel of the exponential activation layer, K_exp(x, x'), is exact at any width; this in turn causes the Gauss-net kernel to be exact at any width. The final expression for the Gauss-net kernel is translation invariant.
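The exactness of the exponential-activation kernel rests on the log-normal moment identity E[e^u e^v] = exp((K(x,x) + K(x',x'))/2 + K(x,x')) for zero-mean jointly Gaussian (u, v); the sketch below checks it by Monte Carlo with illustrative covariance values.

```python
import numpy as np

rng = np.random.default_rng(0)
Kxx, Kxpxp, Kxxp = 0.8, 0.5, 0.3   # illustrative covariances (valid matrix)

cov = np.array([[Kxx, Kxxp], [Kxxp, Kxpxp]])
u, v = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

mc = np.mean(np.exp(u) * np.exp(v))
closed_form = np.exp((Kxx + Kxpxp) / 2 + Kxxp)   # log-normal moment identity

print(f"Monte Carlo: {mc:.4f}")
print(f"Closed form: {closed_form:.4f}")
```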