Neural Network Field Theories: Non-Gaussianity, Actions, and Locality

Both the path integral measure in field theory and ensembles of neural networks describe distributions over functions. When the central limit theorem can be applied in the infinite-width (infinite-$N$) limit, the ensemble of networks corresponds to a free field theory. Although an expansion in $1/N$ corresponds to interactions in the field theory, others, such as a small breaking of the statistical independence of network parameters, can also lead to interacting theories. These other expansions can be advantageous over the $1/N$-expansion, for example through improved behavior with respect to the universal approximation theorem. Given the connected correlators of a field theory, one can systematically reconstruct the action order-by-order in the expansion parameter, using a new Feynman diagram prescription whose vertices are the connected correlators. This method is motivated by the Edgeworth expansion and allows one to derive actions for neural network field theories. Conversely, the correspondence allows one to engineer architectures realizing a given field theory by representing action deformations as deformations of neural network parameter densities. As an example, $\phi^4$ theory is realized as an infinite-$N$ neural network field theory.

Figure 1. In a NN-FT correspondence, ideas from one may give insights into the other. In this paper we are primarily interested in understanding when NN-FT exhibit physical principles such as non-Gaussianity and locality, with an eye towards applications in both ML and especially physics in the future.

Recent years have seen an increasing complexity of deep neural networks (NN), both in terms of the number of parameters appearing in them and their architecture. However, despite their empirical success, the theoretical foundations of deep NN are still not fully understood. Natural questions emerge:
• Are ideas from the sciences, such as physics, useful in NN theory?
• As it develops, does ML theory lead to progress in the sciences?
A growing literature (see below) gives an affirmative answer to the first, but the second is less clear; it is applied ML, not theoretical ML, that is primarily used in the sciences.
In this paper we explore both of these questions by further developing a correspondence between NN and field theory (FT). This connection was already implicit in Neal's PhD thesis [5] in the 1990's, where he demonstrated that an infinite-width single-layer NN is (under appropriate assumptions) a draw from a Gaussian process (GP). This is the so-called NNGP correspondence, and in recent years it has been shown that most modern NN architectures [6,7] have a parameter N such that the NN is drawn from a GP in the N → ∞ limit. The NNGP correspondence is of interest from a physics perspective because GPs are generalized non-interacting (free) FT, and NN provide a novel way to realize them. Non-Gaussianities emerge at finite N, which correspond to turning on interactions that are generally non-local, and may be captured by statistical cumulant functions, known as connected correlators in physics. As we will see, since Gaussianity in the N → ∞ limit emerges by the central limit theorem (CLT), non-Gaussianities may be studied more generally by parametrically violating necessary conditions of the CLT.
These results provide a first glimpse that there is a more general NN-FT correspondence that should be developed in its own right, taking inspiration from both physics and ML. In this introduction we will review the central ideas of the correspondence and introduce principles for understanding the literature, which we review in part. Readers familiar with the background are directed to section 1.3 for a summary of our results.

NN-FT correspondence
At first glance, NN and FT seem very different from one another. However, in both cases, the central objects of study are random functions. The random function ϕ associated to a NN is defined by its architecture, which is a composition of simpler functions that involves parameters θ. At initialization, parameters are drawn as θ ∼ P(θ), yielding a randomly initialized NN, i.e. a random function. In FT, the random functions are simply the fields themselves, typically described by specifying their probability density function directly, P(ϕ) = exp(−S[ϕ]), via the Euclidean action functional S[ϕ]; we work in Euclidean signature throughout.
We therefore have two different origins for the statistics of a FT, shown in figure 1. To exemplify the point, consider a FT defined by an ensemble of networks or fields ϕ : R → R of the form

ϕ(x) = a σ(b σ(c x)), (1.1)

where σ : R → R acts element-wise and is generally taken to be non-linear. Here the statistics of the ensemble arise from how it is constructed, rather than from the density exp(−S[ϕ]) over functions from which it is drawn. We will refer to such a description as the parameter space description of a NN FT. The construction of ϕ defined in (1.1) has two parts: the architecture that defines its functional form, and the choice of distributions from which the parameters a, b, and c are drawn. This particular architecture is a feedforward network with depth two, width one, and activation function σ. In this description of the FT, one does not necessarily know the action S[ϕ], but the theory may nevertheless be studied because the architecture and parameter densities define its statistics.
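The parameter space description can be made concrete by sampling. The following is a minimal sketch of the depth-two, width-one architecture above, assuming independent standard-Gaussian densities for a, b, c and σ = tanh (these particular choices are our assumptions, not fixed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_network(rng, sigma=np.tanh):
    """Draw theta = (a, b, c) ~ P(theta) and return the random function phi."""
    # Assumed parameter density: independent standard Gaussians.
    a, b, c = rng.normal(size=3)
    # Depth-two, width-one feedforward architecture phi : R -> R.
    return lambda x: a * sigma(b * sigma(c * x))

# The ensemble of randomly initialized networks is an ensemble of random functions.
ensemble = [sample_network(rng) for _ in range(10_000)]

# The statistics of the FT follow from the construction itself, e.g. a
# Monte Carlo estimate of the two-point correlator E[phi(x1) phi(x2)]:
x1, x2 = 0.5, 1.0
G2 = np.mean([phi(x1) * phi(x2) for phi in ensemble])
```

Note that no action S[ϕ] appears anywhere: the correlator is estimated purely from the architecture and the parameter densities.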
For instance, the correlation functions of a NN-FT can be expressed as

G^(n)(x_1, ..., x_n) = ∫ dθ P(θ) ϕ(x_1) ... ϕ(x_n), (1.2)

where we denote the set of parameters of the NN by θ, and the network / field ϕ depends on the parameters through its architecture. Alternatively, we could provide a function space description of the theory by specifying the action S[ϕ] and express the correlation functions as

G^(n)(x_1, ..., x_n) = (1/Z) ∫ Dϕ e^{−S[ϕ]} ϕ(x_1) ... ϕ(x_n), (1.3)

as in a first course on quantum FT (QFT). These expressions may be derived from the partition function

Z[J] = E[ e^{∫ d^d x J(x)ϕ(x)} ], (1.4)

where the parameter space and function space results arise by specifying how the expectation value is computed,

Z[J] = ∫ dθ P(θ) e^{∫ d^d x J(x)ϕ(x)}, (1.5)

Z[J] = ∫ Dϕ e^{−S[ϕ] + ∫ d^d x J(x)ϕ(x)}. (1.6)

In this work, many calculations will be carried out in terms of a general expectation value E[•] that denotes agnosticism towards the origin of the statistics; explicit calculations may be carried out by replacing E with one description or the other, as in passing from a general expression (1.4) to those of parameter space (1.5) and function space (1.6).
Parameter space and function space provide two different descriptions of a FT, which could be thought of as different duality frames [8]. When one defines a FT by a NN architecture, the parameter space description is readily available, but the action is not known a priori. However, if the parameter distributions are easy to sample then the fields are also easy to sample: one just initializes NN on the computer. On the other hand, in FT we normally proceed by first specifying an action; in this case, the probability of a given field configuration is known because P[ϕ] = exp(−S[ϕ]) is known, but fields are notoriously hard to sample, as evidenced by the proliferation of Monte Carlo techniques in lattice FT.

Example: NNGP correspondence in parameter space and function space
Let us study an example to make the abstract notions more concrete. Consider a fully-connected feedforward network ϕ : R^d → R with depth one and width N,

ϕ(x) = a_i σ(b_{ij} x_j), (1.7)

where σ is an elementwise non-linearity such as tanh or ReLU(z) := max(0, z), and Einstein summation is implied. Here, the set of parameters θ is given by the union of the a-parameters and the b-parameters. As we will see in detail in section 2, if the parameters are drawn independently then the odd-point correlation functions vanish due to a having zero mean, and the higher connected correlation functions are suppressed at large N. In the N → ∞ limit, also known as the GP limit, then, the only non-vanishing connected correlator is the two-point function

G^(2)_c(x_1, x_2), (1.9)

which demonstrates that the theory is Gaussian; this is the NNGP correspondence. Concretely, following [9], we may compute the two-point function as

G^(2)(x, y) = E[ϕ(x)ϕ(y)] = ∫ da db P(a)P(b) a_{i1} σ(b_{i1 j1} x_{j1}) a_{i2} σ(b_{i2 j2} y_{j2}), (1.10)

where we have used Einstein summation and left the details of the Gaussian parameter densities P(a) and P(b) implicit. For a fixed choice of σ one may evaluate this integral analytically or via Monte Carlo sampling, resulting in the two-point function; analytic results for σ = tanh and σ = Erf are presented in [9]. Since the parameter space calculation establishes Gaussianity of the theory, we infer the action

S[ϕ] = (1/2) ∫ d^d x d^d y ϕ(x) G^(2)(x, y)^{−1} ϕ(y), (1.11)

where the inverse of the two-point function satisfies ∫ d^d y G^(2)(x, y)^{−1} G^(2)(y, z) = δ^(d)(x − z).
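The Monte Carlo evaluation of the two-point function mentioned above is straightforward to sketch. Below we assume Gaussian parameter densities with Var(a_i) = 1/N and unit-variance b (a common normalization, left implicit in the text), σ = tanh, and d = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, draws = 2, 1000, 10_000

def phi(x, a, b, sigma=np.tanh):
    # phi(x) = a_i sigma(b_ij x_j), with summation over i and j.
    return a @ sigma(b @ x)

x, y = np.ones(d), np.array([0.5, -1.0])
sx, sy = np.empty(draws), np.empty(draws)
for t in range(draws):
    a = rng.normal(0.0, 1.0 / np.sqrt(N), size=N)  # assumed P(a): Var(a_i) = 1/N
    b = rng.normal(size=(N, d))                    # assumed P(b): unit Gaussian
    sx[t], sy[t] = phi(x, a, b), phi(y, a, b)

G2 = np.mean(sx * sy)  # Monte Carlo estimate of G^(2)(x, y)
G1 = np.mean(sx)       # odd-point functions vanish since E[a] = 0
```

The one-point function estimate should be consistent with zero, reflecting the Z_2 symmetry of the zero-mean density P(a).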
As a concrete example, we refer the reader to section 4.2, which recalls a NN realization of free scalar FT from [10] that uses a cos activation. In that case we have

G^(2)(x, y)^{−1} = δ(x − y)(∇² + m²), (1.12)

which reproduces the usual free scalar action

S[ϕ] = (1/2) ∫ d^d x ϕ(x)(∇² + m²)ϕ(x), (1.13)

in this case realized via a concrete NN architecture. Thus, in the GP limit, both the parameter space and function space descriptions of the FT are readily available. Building on [11] using the Edgeworth expansion, we will see methods for computing approximate actions at finite N, and we will also develop techniques to engineer desired actions.
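A one-dimensional sketch in the spirit of the cos-activation construction: with uniform phases and input weights drawn from a Cauchy density ∝ 1/(b² + m²) (the specific densities below are our assumptions, chosen so the d = 1 computation closes), the two-point function reproduces the free scalar propagator e^{−m|x−y|}/(2m):

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, draws = 1.0, 100, 50_000

def sample_phi(xs):
    # phi(x) = sum_i a_i cos(b_i x + c_i); all densities below are assumptions.
    a = rng.normal(0.0, np.sqrt(1.0 / (m * N)), size=(draws, N))  # Var(a_i) = 1/(m N)
    b = m * rng.standard_cauchy(size=(draws, N))                  # density m/(pi (b^2 + m^2))
    c = rng.uniform(0.0, 2.0 * np.pi, size=(draws, N))            # uniform phases
    return [(a * np.cos(b * x + c)).sum(axis=1) for x in xs]

phi_x, phi_y = sample_phi([0.0, 1.0])
G2 = np.mean(phi_x * phi_y)            # Monte Carlo two-point function
target = np.exp(-m * 1.0) / (2.0 * m)  # free scalar propagator in d = 1
```

The uniform phase c kills the cos(b(x+y)+2c) cross term, and the Cauchy characteristic function gives E[cos(bΔ)] = e^{−m|Δ|}, so G2 should match the target up to Monte Carlo error.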

Organizing principles and related work
We have discussed a foundational principle underlying the NN-FT correspondence: that parameter space and function space provide two different descriptions of the statistics of an ensemble of NN or fields. Though we have given an example, and there are many more, we are still in very general territory and it is not clear where to go. Accordingly, we would like to provide other organizing principles:
• NN-for-FT vs. FT-for-NN: are we aiming to better understand physics or ML?
• Fixed Initialization vs. Learning: are we aiming to understand a fixed NN-FT at initialization, or a one-parameter family of NN-FTs defined by some dynamics, such as ML training dynamics or FT flows?
Much of the existing literature can be classified within each of these principles, and they also set context for discussing our results. We will first review some results for network ensembles at initialization, and then during and after training. With these ideas in place, we will turn to the idea of using NN-FT in service of FT. For literature that is most similar in perspective to this introduction (prior to this reference section), see [10] and the works that preceded it [8,12], by subsets of the authors.

Initialization
A NN with parameters θ and parameter distribution P(θ) is initialized on a computer by drawing θ ∼ P(θ) and inserting them into the architecture, generating a random function ϕ(x) that is sampled from a distribution P(ϕ) that may or may not be known. In the N → ∞ NNGP limit, P(ϕ) is Gaussian. This was shown for feedforward networks in Neal's thesis [5], as well as more recently in [6], and was generalized to a plethora of architectures, e.g. convolutional layers [7,13,14], recurrent layers, graph convolutions [15], skip connections [16], attention [17], batch/layer normalization [18], pooling [14], and transformers [19,20]. The generality of this result arises from the generality with which central limit theorem behavior manifests itself in NN; see [7] for a systematic treatment in the tensor programs formalism.
Since Gaussianity follows from the central limit theorem, one generally expects non-Gaussianities in the form of 1/N corrections. Study of these non-Gaussianities was initiated a few years ago; e.g. [21] computed leading non-Gaussianities via the connected four-point function, [22] showed for deep feedforward networks how P(ϕ) is perturbed by 1/N corrections, [12] proposed using effective FT to model non-Gaussian P(ϕ) for NN, and [23] developed an effective theory approach and an L/N expansion that controls feature learning in deep feedforward networks; for concreteness in our examples, we are interested in the distribution of networks at initialization and take L = 1. This L/N expansion allowed [23] to also study signal propagation through the network, identify universality classes, and tune hyperparameters to criticality.
Methods borrowed from FT have been useful in studying NNs at initialization. For example, perturbative methods like Feynman diagrams were employed in [12,24,25]. Various schemes for renormalization group flow, including non-perturbative ones, were applied to NNs in [26]. Global symmetries of NN-FTs were shown to arise from symmetry invariances of NN parameter distributions in [8]. While the results of this paper were being finalized, a recent paper [27] brought forward a different diagrammatic approach to effective FT in deep feedforward networks.

Learning
Although we do not study the dynamics of learning in this paper, it is a goal for future work. Therefore, we would like to review some of the literature.
NN may be trained to perform useful tasks via a variety of learning schemes, such as supervised learning or reinforcement learning, that utilize a learning algorithm, such as stochastic gradient descent, to update the system. In practice this involves training one or a handful of randomly initialized NN to convergence. However, in general there is nothing special about the initial networks that were trained; in the absence of compute limitations, one would prefer to train all the networks and compute an ensemble average at convergence. Theoretically, this amounts to tracking the distributional flow of the NN ensemble, and in principle it may be done in either parameter space or function space.
In the N → ∞ limit, most known architectures define NN that are draws from GPs. Since the architecture defines a GP, it could be used as a prior in Bayesian inference, the learning algorithm of interest in Neal's original work [5]. On the other hand, gradient descent with continuous time is governed by the neural tangent kernel (NTK) [28], which becomes deterministic and independent of the training time t in the so-called frozen-NTK limit. In this limit, N → ∞ and the NN dynamics is well-approximated by that of a model that is linear in the NN parameters. This frozen behavior is a vast simplification of the dynamics and is known to exist for many architectures, such as convolutional NN [29], graph NN [30], recurrent networks [31], and attention layers [19]. For supervised learning with MSE loss, the NN ensemble trained under gradient descent remains a GP for all times t, including t → ∞, with known mean and covariance; the dynamics becomes that of kernel regression, with kernel given by the frozen-NTK. How is this related to Neal's desire to relate Bayesian inference and trained NN? If all but the last layer's weights are frozen, then the NTK is the NNGP kernel and the distribution of the NN ensemble converges to the GP Bayesian posterior as t → ∞.
In summary, in the N → ∞ limit, the distribution of the NN ensemble is Gaussian. If it undergoes supervised training with MSE loss, it remains Gaussian at all times and converges to the Bayesian GP posterior in a particular case [32]. In general, however, gradient descent induces non-Gaussianities.
At finite N, the NN ensemble is non-Gaussian. In the Bayesian context, this defines a non-Gaussian prior, and inference may be performed for weakly non-Gaussian priors via a 1/N expansion [21]. In the gradient descent context, the NTK is no longer frozen and evolves during training, significantly complicating the dynamics. Work by Roberts, Yaida, and Hanin develops a theory of an evolving NTK in [23]. They apply it in detail to fully-connected networks of depth L, demonstrate the relevance of L/N as an expansion parameter, and develop an effective model for the dynamics. Such 1/N corrections to the dynamical NTK were previously studied by other authors in [24,33]. Bordelon and Pehlevan have developed a systematic understanding of the evolution of the NTK and parametric interpolations between rich and lazy training regimes using the framework of dynamical mean field theory; see [34]. Some of these authors have studied the O(1/N)-suppressed corrections to training dynamics of finite-width Bayesian NNs in [35]. A separate work, [36], presents close-to-Gaussian NN processes including stationary Bayesian posteriors in the joint limit of large width and large data set, using 1/N as an expansion parameter. Moreover, the authors of [37] explore a correspondence between learning dynamics in the continuous time limit and early Universe cosmology, and [38] analyzes connected correlation functions propagating through NN.

NN-for-FT
NN, including the ones we have discussed thus far, generally have R^n as their domain and therefore naturally live in Euclidean signature. They define statistical FT that may or may not have analytic continuations to quantum field theories in Lorentzian signature. Nevertheless, statistical FT are interesting in their own right and NN-FT provides a novel way to study them.
Using an architecture to define a FT enables a parameter space description that makes sampling, and therefore numerical simulation on a lattice, easy. If one can determine an easily sampled NN architecture that engineers standard Euclidean ϕ⁴ theory, for instance, this could lead to improved results on the lattice by avoiding Monte Carlo entirely. This is an engineering problem that is work-in-progress; it is not clear that the ϕ⁴ NN-FT realization in this work is easily sampled. Alternatively, by simply fixing an easily sampled architecture with interesting physical properties such as symmetries and strong coupling, lattice simulation could be performed immediately.
For uses in fundamental and formal quantum physics, one might wish to know when a NN architecture defines a QFT.Since NN architectures are usually defined in Euclidean signature, we may instead ask when a Euclidean FT admits an analytic continuation to Lorentzian signature that defines a QFT.The situation is complicated by the fact that in general we do not know the action, but instead have access to the Euclidean correlation functions, expressed in parameter space.
Fortunately, the Osterwalder-Schrader (OS) theorem [40] of axiomatic FT gives necessary and sufficient conditions, expressed in terms of the correlators, for the existence of a QFT after continuation. The axioms include:
• Euclidean Invariance. Correlation functions must be Euclidean invariant, which becomes Lorentz invariance after analytic continuation. See [10] for an infinite ensemble of NN architectures realizing Euclidean invariance.
• Permutation Symmetry. Correlation functions must be invariant under permutations of their arguments, a collection of points in Euclidean space. This is automatic in NN-FTs with scalar outputs.
• Reflection Positivity. Correlation functions must satisfy a positivity condition known as reflection positivity, which is necessary for unitarity and the absence of negative-norm states in the analytically continued theory.
• Cluster Decomposition. Correlation functions must satisfy cluster decomposition, which says that interactions must shut off at infinite distance. As a condition on connected correlators, cluster decomposition is

lim_{|c|→∞} G^(n)_c(x_1, ..., x_p, x_{p+1} + c, ..., x_n + c) = 0

for any value of 1 ≤ p < n. We have assumed permutation symmetry to simplify notation, putting the shifts into x_{p+1} through x_n.
These ideas were utilized in [10] to define NN quantum FT: a NN-QFT is a NN architecture whose correlation functions satisfy the OS axioms, and therefore defines a QFT upon analytic continuation. To date, the only known example is a NN architecture that engineers a standard free scalar FT in d dimensions, though we improve the situation in this work by developing techniques to engineer local Lagrangians, which automatically satisfy the OS axioms. To make further progress on NN-QFT in a general setting, one especially needs a deeper understanding of reflection positivity and cluster decomposition in interacting NN-FTs; we study the latter.

Summary of results and paper organization
Since there are a number of different themes and concepts in this paper, we would like to highlight some of the major conceptual results:
• Parametric Non-Gaussianity: 1/N and Independence Breaking. Section 2 approaches interactions in NN-FT (non-Gaussianity) by parametrically breaking necessary conditions for the central limit theorem to hold. Violating the infinite-N limit is well studied, but we also systematically study interactions arising from the breaking of statistical independence, and apply these ideas in examples.

• Computing Actions with Feynman Diagrams.
In section 3 we develop a general FT technique for computing the action diagrammatically. The coupling functions are computed with a new type of connected Feynman diagram, whose vertices are the connected correlators. This is a swapping of the normal roles of couplings and connected correlators, which arises from a 'duality' that becomes apparent via the Edgeworth expansion. The technique is also applied to NN-FT, including an analysis of how actions may be computed in the two regimes of parametric non-Gaussianity developed in section 2: 1/N and independence breaking.
• Engineering Actions in NN-FT.
In section 4 we develop techniques for engineering actions in NN-FT. This is to be distinguished from the approach of section 3: instead of fixing an architecture, computing its correlators, and then computing its action via Feynman diagrams, in section 4 we fix a desired action and develop techniques for designing architectures that realize it. Adding a desired term to the action manifests itself in NN-FT by deforming the parameter distribution, which breaks statistical independence if the added term is a non-Gaussianity. Using this technique, local actions may be engineered at infinite N.
• ϕ⁴ as a NN-FT.
In section 4.2 we design an infinite-width NN architecture that realizes ϕ⁴ theory, using the techniques that we developed.
• The Importance of N → ∞ for Interacting Theories. In physics, interesting theories defined by a fixed action S generally have a wide variety of finite-action field configurations, which have non-zero probability density. This is potentially at odds with the universal approximation theorem: if a single finite-action configuration cannot be realized by an architecture A, but only approximated, then any NN-FT associated to A cannot realize the FT associated to S. If 1/N is an expansion parameter for both non-Gaussianities and the degree of approximation, as e.g. with single-layer width-N networks, this simple no-go theorem suggests that exact NN-FT engineering of well-studied theories in physics occurs most naturally at infinite N, as we saw in the case of ϕ⁴ theory.
These are highlights of the paper. For more detailed summaries of results, we direct the reader to the beginning of each section.

Connected correlators and the central limit theorem
Interacting FT with a Lagrangian description are defined by non-Gaussian field densities exp(−S[ϕ]). If the non-Gaussianities are small, the theory is close to Gaussian and weakly interacting, in which case correlation functions may be computed in perturbation theory using Feynman diagrams. The non-Gaussianities are captured by the higher connected correlation functions, which vanish in the Gaussian limit. They are known as cumulants in the statistics literature and may be obtained from a generating functional W[J] as

G^(n)_c(x_1, ..., x_n) = δ^n W[J] / δJ(x_1) ... δJ(x_n) |_{J=0}.

In the absence of a known Lagrangian description, connected correlators still encode the presence of non-Gaussianities, since the theory is Gaussian if

G^(n)_c(x_1, ..., x_n) = 0 for all n > 2.

In this section we systematically study non-Gaussianities in NN-FT. Since the parameter space description exists for any NN-FT, we choose to study non-Gaussianities via connected correlators (rather than actions), which may be studied in parameter space even when the action is unknown. We are interested in non-Gaussianities in NN-FT for a number of reasons. In the NN-for-FT direction, they are important for understanding interactions in the associated FT. Conversely, in the FT-for-NN direction, understanding non-Gaussianities is important for capturing the statistics of finite networks and networks with correlations in the parameter distributions, which generally develop during training.
The essential idea in our approach is to recall the origin of Gaussianity, and then parametrically move away from it. Specifically, many FT defined by NN architectures admit an N → ∞ limit in which they are Gaussian, and the Gaussianity has a statistical origin: the central limit theorem (CLT). The CLT states that the distribution of the standardized sum of N independent and identically distributed random variables approaches a Gaussian distribution in the limit N → ∞. Therefore we may systematically study non-Gaussianities in NN-FT by violating assumptions of the CLT, e.g. via 1/N corrections and breaking the independence condition, both of which affect connected correlators.
There are a number of results and themes in this section, which is organized as follows:
• CLT. In section 2.1 we review the CLT from the perspective of cumulant generating functionals, which will be useful in NN-FT since in general we do not have a simple expression for the action but do have access to cumulants.
• Independence Breaking. In section 2.2 we introduce how non-Gaussianities may also arise by violating the statistical independence assumption of the CLT. We characterize this by a family of joint densities with parameter α that factorize (become independent) when α = 0. We study the α-dependence of cumulants via Taylor series, showing that α controls non-Gaussianities independently of those arising from 1/N corrections. A simple example of independence-breaking induced non-Gaussianities at N = ∞ is given in section 2.2.1.
• Connected Correlators and Interactions in NN-FT. In section 2.3 we study non-Gaussianities in NN-FT, decomposing the field ϕ(x) into N constituent neurons as in [10]. We study the case of independent neurons in section 2.3.1, where we present the N-scaling of connected correlators and also two examples: single-layer Cos-net, which exhibits full Euclidean symmetry in all of its correlators, and d = 1 ReLU-net, which we show exhibits an interesting bilocal structure in its two-point and four-point functions. In section 2.3.2 we turn to breaking neuron independence in NN-FT, building on the independence breaking results of [10], which gives a new source of interactions and a generalized formula for connected correlators. Specifically, we introduce a general formalism for the expansion of the cumulant generating functional in terms of independence-breaking parameters, and therefore the computation of connected correlators. As an example, we deform the Cos-net theory to have non-independent neurons via non-independent input weights, doing the deformation in a way that preserves Euclidean invariance, and compute the independence-breaking correction to the connected four-point function.
• Identical-ness Breaking. Interactions may also arise from breaking the identical-ness assumption of the CLT. See appendix B for an example of a NN-FT with non-Gaussianities arising from identical-ness breaking.
Equipped with two different types of parameters that induce non-Gaussianity in connected correlators, 1/N and independence-breaking parameters, we will see how this may be used to approximate actions in section 3.

Review: CLT from generating functions
In order to understand non-Gaussianities in NN-FTs, it is useful to recall essential aspects of the CLT in the case of a single random variable, since they carry over to the NN-FT case. We will do so using the language of generating functions and cumulants (connected correlators), since we may use them to study Gaussianity and non-Gaussianity even if the NN-FT action is unknown. Of course, the CLT is among the most fundamental theorems of statistics. There are many variants of it in the literature, with different sets of assumptions. Here, we will describe a particularly simple version of it and provide a proof, showing how key assumptions come into play. For a more in-depth discussion of the CLT, see e.g. [41].
Consider N random variables X_i. Assume that they are identical, independent, mean-free, and have finite variance. The CLT states that the standardized sum

ϕ = (1/√N) Σ_{i=1}^N X_i

is drawn from a Gaussian distribution in the limit N → ∞. In other words, even if the X_i are sampled from complicated, non-Gaussian distributions, these details wash out and their sum is drawn from a Gaussian distribution.
To see the Gaussianity in a way that may be extrapolated to NN-FT, it is useful to introduce generating functions. The moment generating function of ϕ is defined as

Z_ϕ[J] = E[e^{Jϕ}],

from which we can extract the moments by taking derivatives,

µ^ϕ_r = (d^r/dJ^r) Z_ϕ[J] |_{J=0}.

In physics language, J is the source, Z_ϕ[J] is the partition function, and µ^ϕ_r is the r-th correlator of ϕ. The cumulant generating functional (CGF) of ϕ is the logarithm of the moment generating functional,

W_ϕ[J] = log Z_ϕ[J],

and the cumulants κ^ϕ_r are computed by taking derivatives of W_ϕ[J],

κ^ϕ_r = (d^r/dJ^r) W_ϕ[J] |_{J=0}.

A random variable is Gaussian only if its cumulants κ^ϕ_{r>2} vanish. Fundamental properties of CGFs include

W_{cX}[J] = W_X[cJ],   W_{X+Y}[J] = W_X[J] + W_Y[J] (X, Y independent),

where c ∈ R is a constant, which imply

κ^{cX}_r = c^r κ^X_r, (2.9)
κ^{X+Y}_r = κ^X_r + κ^Y_r, (2.10)

respectively.
We would like to see the Gaussianity of ϕ under CLT assumptions by computing cumulants. This is possible since κ_{r>2} = 0 is necessary for Gaussianity; conversely, we may study non-Gaussianities in terms of non-vanishing higher cumulants. Specifically, for a sum of independent random variables the moment generating function factorizes,

Z_ϕ[J] = ∏_{i=1}^N Z_{X_i}[J/√N]. (2.11)

Consequently, the CGF and the cumulants become

W_ϕ[J] = Σ_{i=1}^N W_{X_i}[J/√N]. (2.12)

Using the identities in (2.10) we can write the cumulants of ϕ as

κ^ϕ_r = (1/N^{r/2}) Σ_{i=1}^N κ^{X_i}_r. (2.13)

When the X_i are identical this simplifies to

κ^ϕ_r = N^{1−r/2} κ^X_r. (2.14)

The cumulants κ^ϕ_{r>2} vanish in the N → ∞ limit. To establish that ϕ is Gaussian, we also need to show that κ^ϕ_1 and κ^ϕ_2 are finite. As the X_i are mean-free, κ^ϕ_1 = 0, and κ^ϕ_2 = κ^X_2 is finite by assumption. Thus, ϕ is Gaussian distributed. This is the CLT, cast into the language of cumulants.
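The N^{1−r/2} suppression of higher cumulants is easy to see numerically. A minimal sketch, using the excess kurtosis κ_4/κ_2² of a standardized sum of mean-free exponential variables (for which the single-variable excess kurtosis is 6, so one expects ≈ 6/N):

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(s):
    # Sample estimate of kappa_4 / kappa_2^2; zero for a Gaussian.
    s = s - s.mean()
    return np.mean(s**4) / np.mean(s**2) ** 2 - 3.0

draws = 100_000
results = {}
for N in (1, 10, 100):
    # X_i: iid, mean-free, finite variance, non-Gaussian (centered exponential).
    X = rng.exponential(1.0, size=(draws, N)) - 1.0
    phi = X.sum(axis=1) / np.sqrt(N)   # standardized sum
    results[N] = excess_kurtosis(phi)  # expect ~ 6/N, i.e. kappa_4 ~ N^{-1}
```

The measured excess kurtosis should fall by roughly a factor of ten with each tenfold increase in N, in line with the κ^ϕ_4 = κ^X_4/N scaling above.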
We emphasize that this result relies not only on the N → ∞ limit, but also on the independence assumption (2.13).

Non-Gaussianity from independence breaking
We wish to study the emergence of non-Gaussianity by breaking the independence condition.
To do so, we must parameterize the breaking of statistical independence. Let p(X; α) be a family of joint distributions on the X_i, parameterized by a hyperparameter α that must be chosen in order to define the problem. We choose the family of joint distributions to satisfy

p(X; α = 0) = ∏_{i=1}^N p(X_i),

i.e. p(X) is independent in the α → 0 limit, but α ≠ 0 in general controls the breaking of independence. Then we obtain

W_ϕ[J] = log ∫ dX p(X; α) e^{(J/√N) Σ_i X_i},

which when expanded around α = 0 yields

W_ϕ[J] = Σ_i W_{X_i}[J/√N] + Σ_{k≥1} (α^k/k!) ∂_α^k W_ϕ[J] |_{α=0},

where the first term of the log uses independence of p(X; α = 0).
To deal with the α-dependent terms, we generalize a trick appearing regularly in ML, e.g. in the policy gradient theorem in reinforcement learning. There, the fact that p ∂_α log p = ∂_α p allows us to write

∂_α E[O(X)] = E[O(X) ∂_α log p(X; α)]

for any α-independent operator O. Generalizing, we define

Q_k(X; α) := (∂_α^k p(X; α)) / p(X; α),

and note that it satisfies the recursion relation

Q_{k+1} = ∂_α Q_k + Q_k Q_1,

which allows for efficient computation. We can then write (2.18) in terms of expectation values of the Q_k. In the limit α → 0, the X_j become independent, and we have

W_ϕ[J] = Σ_j W_{X_j}[J/√N],

where ϕ is now a sum of N independent variables X_j, and its CGF is the sum of the CGFs of the X_j/√N, as expected; details of the calculations are in appendix C.
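The score-function trick and its higher-order generalization can be checked symbolically. In the sketch below we take Q_k := (∂_α^k p)/p as the generalized score (our notation) and an assumed one-variable Gaussian family whose mean depends on α; the recursion Q_{k+1} = ∂_α Q_k + Q_k Q_1 then follows from the product rule:

```python
import sympy as sp

x, alpha = sp.symbols('x alpha', real=True)
# An assumed smooth family p(x; alpha): Gaussian with alpha-dependent mean.
p = sp.exp(-(x - alpha) ** 2 / 2) / sp.sqrt(2 * sp.pi)

def Q(k):
    # Generalized score function Q_k := (d^k p / d alpha^k) / p.
    return sp.simplify(sp.diff(p, alpha, k) / p)

# Q_1 is the ordinary score, d_alpha log p, as in the policy gradient trick.
assert sp.simplify(Q(1) - sp.diff(sp.log(p), alpha)) == 0

# Check the recursion Q_{k+1} = d_alpha Q_k + Q_k Q_1 for the first few k.
for k in range(1, 4):
    lhs = Q(k + 1)
    rhs = sp.diff(Q(k), alpha) + Q(k) * Q(1)
    assert sp.simplify(lhs - rhs) == 0
```

The recursion holds for any smooth family p(x; α), since ∂_α Q_k = Q_{k+1} − Q_k (∂_α p)/p by the quotient rule; the Gaussian family here is only a convenient test case.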
We have now discussed two mechanisms that result in non-Gaussianities: 1/N corrections and independence breaking. While one can use either or both of these mechanisms to generate and control non-Gaussianities, more caution is required to use independence breaking alone, at infinite N. This is because the non-Gaussianities generated by independence breaking might depend on N as well as α. For example, if the leading corrections to the higher cumulants κ^ϕ_r scale as α N^{a_r} with a_r < 0 for all r > 2, ϕ will be Gaussian regardless of independence breaking, while if a_r > 0, κ^ϕ_r will diverge, which is undesirable. In the following, we will present an example where a_r = 0 for all r and the non-Gaussianities are generated by independence breaking alone.

Example: independence breaking at infinite N
Let us provide an example of independence-breaking non-Gaussianities that persist in the N → ∞ limit, showing how one can control higher cumulants by adjusting the correlations between random variables. Consider the normalized sum of N random variables, where X_i is the product of two random variables a_i and h_i. This architecture can be interpreted as the last layer of a fully connected NN, where the h_i are the outputs of the neurons in the previous layer, the a_i/√N are the weights, and ϕ is the output. First, let us consider the simple case where a_i and h_i are independent, Gaussian random variables, with σ_a and σ_h positive and finite. Since a_i and h_i are independent, so are the X_i; the CLT applies and ϕ is Gaussian.
Next, we will perturb P(⃗a, ⃗h) to break independence. To that end, we introduce an auxiliary random variable H, whose standard deviation we set to σ_h for simplicity, and define a correction term. Putting these together defines the full distribution (2.28). When α = 0, the second term vanishes and both the a_i and h_i are independent. As we turn on α > 0, the a_i remain independent, but correlations are induced between the h_i through a direct coupling to H in P_corr(⃗a, ⃗h).
To quantify the non-Gaussianity of ϕ as a function of α, we compute the CGF (2.29). As P(⃗a, ⃗h, H; α) is Gaussian, (2.29) can be evaluated analytically. The odd cumulants vanish, as the ϕ ensemble has a Z_2 symmetry ϕ → −ϕ (due to the evenness of P(a)), while the even cumulants κ^ϕ_r can be computed by taking derivatives of W_ϕ[J]; for example, the second and fourth cumulants are given in (2.32). In the limit N → ∞, α → 0, the second cumulant is finite while all higher cumulants vanish, and ϕ is Gaussian as expected. At finite α > 0, all even cumulants are finite and in general nonzero. The ability to tune α thus allows one to control the degree of non-Gaussianity of ϕ. Note that breaking independence in the large-N limit is not a particularly efficient way to sample from a non-Gaussian distribution of a single variable.
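A numerical sketch of this mechanism illustrates the point. The correlation structure h_i = √(1−α) ε_i + √α H below, in which all neurons share the auxiliary Gaussian H, is an illustrative variant of the construction above rather than its exact form; even at large N, the excess kurtosis of ϕ remains nonzero whenever α > 0:

```python
import numpy as np

# Independence breaking that survives large N: the neurons h_i share an
# auxiliary Gaussian H, so they are correlated for α > 0. The construction
# h_i = sqrt(1-α) ε_i + sqrt(α) H is an illustrative assumption.
rng = np.random.default_rng(1)
N, M = 128, 100_000       # network width, number of draws of the network

def sample_phi(alpha):
    a = rng.normal(size=(M, N))       # independent output weights
    eps = rng.normal(size=(M, N))     # independent part of each neuron
    H = rng.normal(size=(M, 1))       # shared auxiliary variable
    h = np.sqrt(1 - alpha) * eps + np.sqrt(alpha) * H
    return (a * h).sum(axis=1) / np.sqrt(N)

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

assert abs(excess_kurtosis(sample_phi(0.0))) < 0.2   # ~Gaussian at α = 0
assert excess_kurtosis(sample_phi(0.5)) > 0.5        # non-Gaussian at α > 0
```

For this variant the N → ∞ theory retains a product-of-Gaussians component, so the fourth cumulant stays of order α² rather than vanishing, i.e. a_r = 0 in the language above.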

Connected correlators in NN-FT
We wish to establish that the ideas exemplified above (that non-Gaussianities may arise via finite-N corrections or independence breaking) generalize to continuum NN-FT.
In outline, one may think of this conceptually as passing from a single random variable ϕ (0d FT) to a discrete number of random variables ϕ_i (lattice FT), and finally to a continuum of random variables ϕ(x) (continuum FT), where x ∈ R^d. This is a textbook procedure in the context of the function-space path integral. Here we wish instead to emphasize the general procedure and the parameter space perspective.
Consider the case that the continuum field ϕ(x) is built out of neurons h_i(x) [10]. If the h_i(x) are independent, the CLT states that ϕ(x) is Gaussian in the limit N → ∞; this is the essence of the NNGP correspondence. Motivated by the single-variable case, we will study non-Gaussianities arising from both finite-N corrections and breaking of the independence condition. The CGF of ϕ(x) is obtained by a series expansion in terms of the cumulants, a.k.a. the connected correlation functions G^(r)_c of ϕ, in a straightforward generalization of (2.5) to the continuum. When the odd-point functions vanish, the connected four-point function captures the leading-order non-Gaussianities in many of our examples.
In the following, we will quantify non-Gaussianities in terms of non-vanishing cumulants, as well as directly in the action via an Edgeworth expansion.

Finite-N corrections with independent neurons
We first study non-Gaussianities in the case where the neurons h_i(x) are i.i.d. but N is finite, e.g. for the single hidden layer networks studied in [42]. We can express the CGF (2.34) in terms of the connected correlation functions of the neurons. This result relies on the fact that for independent h_i the expectation of the product is the product of the expectations, which turns the first expression into a sum over neuron CGFs. For identically distributed neurons the sum gives a factor of N, and the normalization 1/√N gives the r-dependent N-scaling. This result lets us express the connected correlators of ϕ(x) in terms of the connected correlators of h_i(x), establishing Gaussianity in the N → ∞ limit.
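The r-dependent N-scaling κ^ϕ_r = N^{1−r/2} κ^X_r can be verified numerically. Here the summands are centered Exp(1) variables, an illustrative choice (not from the paper) with known cumulants κ_2 = 1, κ_3 = 2, κ_4 = 6:

```python
import numpy as np

# N-scaling of cumulants for φ = (1/√N) Σ X_i with i.i.d. X_i:
#   κ_r^φ = N^{1 - r/2} κ_r^X,  so κ_3^φ ~ 1/√N and κ_4^φ ~ 1/N.
# X_i = Exp(1) - 1 is an illustrative choice with κ_2 = 1, κ_3 = 2, κ_4 = 6.
rng = np.random.default_rng(2)
N, M = 64, 400_000
phi = (rng.exponential(size=(M, N)) - 1.0).sum(axis=1) / np.sqrt(N)

c = phi - phi.mean()
k2 = np.mean(c**2)
k3 = np.mean(c**3)
k4 = np.mean(c**4) - 3 * k2**2

assert abs(k2 - 1.0) < 0.02                 # κ_2 is N-independent
assert abs(k3 - 2 / np.sqrt(N)) < 0.03      # κ_3^φ = 2/√N
assert abs(k4 - 6 / N) < 0.08               # κ_4^φ = 6/N
```

The higher the cumulant, the faster it is suppressed, which is why truncating at the four-point function captures the leading non-Gaussianity at large N.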

Examples: single layer Cos-net and ReLU-net
We will now consider two single hidden layer architectures with finite N and i.i.d. parameters. While the methods we describe in this section can be employed to study NNs of arbitrary depth L > 1, which induces statistical correlations among neurons [42], single hidden layer architectures suffice to demonstrate their utility.

ReLU-net
First, we will consider an architecture with a single hidden layer and ReLU activation functions. As ReLU activations are ubiquitous in ML applications, this is a natural example to study. We compute the two-point function in the parameter space description (1.2), obtaining a result with a factorized structure in terms that one might call bi-local: the function depends independently on x and y, regardless of any relation between them. This result is exact and does not receive 1/N corrections. Non-Gaussianities induced by 1/N corrections manifest as a nonzero connected 4-point correlation function, which scales as 1/N.
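The 1/N scaling of the connected 4-point function can be seen empirically. The sketch below uses a single hidden layer ReLU network with all parameters drawn from unit Gaussians, an assumed hyperparameter choice (not the paper's) for which one can check analytically that G^(4)_c(x, x, x, x) = 15/N at x = 1:

```python
import numpy as np

# Connected 4-pt function of a single-hidden-layer ReLU network at one input
# point, estimated over an ensemble of networks; it scales as 1/N.
# Parameter variances (all unit Gaussians) are an illustrative assumption.
rng = np.random.default_rng(3)

def g4_connected(N, M=200_000, x=1.0):
    a = rng.normal(size=(M, N))                     # output weights
    w = rng.normal(size=(M, N))                     # input weights
    b = rng.normal(size=(M, N))                     # biases
    phi = (a * np.maximum(w * x + b, 0.0)).sum(axis=1) / np.sqrt(N)
    # odd moments vanish by the a -> -a symmetry, so G4c = <φ⁴> - 3<φ²>²
    return np.mean(phi**4) - 3 * np.mean(phi**2) ** 2

# For this choice of variances, G4c(x, x, x, x) = 15/N at x = 1.
assert abs(g4_connected(16) - 15 / 16) < 0.2
assert abs(g4_connected(64) - 15 / 64) < 0.2
```

The two-point function of the same ensemble is N-independent, consistent with the statement above that it receives no 1/N corrections.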

Cos-net
Next, let us study a single hidden layer network with cosine activation functions. The NN-FT associated to Cos-net (and its generalizations) is Euclidean-invariant [10], which is interesting on physical grounds, e.g. for satisfying one of the OS axioms needed to establish an NN-QFT. Euclidean invariance may be established using the mechanism of [8] for determining symmetries from parameter space correlators, which absorbs symmetry transformations into parameter redefinitions, yielding invariant correlators when the relevant parameter distributions are invariant under the symmetry. Cos-net was defined in [10], where its 2-point function and connected 4-point function were also computed. As before, the correlation functions are computed in parameter space (1.2). The 2-pt function is manifestly translation-invariant, with ∆x_12 = x_1 − x_2. In the 4-pt correlation function, ∆x_ij := x_i − x_j and P(abcd) denotes the three independent ways of drawing pairs (x_a, x_b), (x_c, x_d) from the list of external points (x_1, x_2, x_3, x_4).
We see the manifest Euclidean invariance of these correlators, and that non-Gaussianities are encoded in G^(4)_{c,Cos} as 1/N corrections.

Generalized connected correlators from independence breaking
We now wish to generalize our theories and connected correlators to include the possibility that non-Gaussianities arise not only from 1/N corrections, but also from independence breaking, e.g. by developing correlations between the neurons h_i(x). Previously, [10, 42] studied mixed non-Gaussianities at finite N with statistical correlations among neurons.
Generalizing our approach from section 2.2, we parameterize the breaking of statistical independence by promoting the distribution of neurons P(h) to depend on a vector of hyperparameters ⃗α ∈ R^q, giving P(h; ⃗α). Since independence is necessary for Gaussianity via the CLT, and we will sometimes wish to perturb around the Gaussian fixed point, we require that the neurons become independent when ⃗α = 0; the hyperparameter vector ⃗α must be chosen as part of the architecture definition. For a general P(h; ⃗α), the CGF is given in (2.46). For small values of ⃗α, we can expand P(h; ⃗α) as in (2.47). Analogous to the single-variable case, we define an operator satisfying the recursion relation (2.49). Finally, we can expand (2.34) in ⃗α, where W_{ϕ,⃗α=0}[J] is given in (2.36). This form of W_ϕ[J] makes it clear how one can tune N and ⃗α to generate and manipulate non-Gaussianities; for details see appendix C.
For an appropriately small independence-breaking hyperparameter ⃗α, and suitable other attributes of the architecture, the ratio of the second term to the first term in the logarithm of (2.50) is small. In such cases, one can approximate (2.50) using the Taylor expansion log(1 + x) ≈ x around x = 0. The resulting cumulants (2.52) are proportional to ⃗α at leading order; the leading-order expression in ⃗α is evaluated in (C.21).

Example: single layer Cos-net
Let us exemplify the non-Gaussianities generated by breaking the statistical independence of the single layer Cos-net architecture given in (2.42). We can break this independence by modifying the distribution from which the weights W⁰_ij (an N × d matrix) are sampled, where c is a normalization constant. The rotational-invariance-preserving term α_IB (Tr(W^{0T} W⁰))²/N² introduces mixing between the weights W⁰_ij, and parametric independence is explicitly broken. The degree of independence breaking can be controlled by tuning α_IB.
We wish to compute the connected correlation functions to quantify the non-Gaussianities generated by independence breaking. In general, this is a difficult problem. However, when α_IB ≪ 1, we can perform a perturbative expansion in α_IB. Setting d = 1 for simplicity, we obtain the cumulants to leading order in α_IB.

Non-Gaussianities at finite N and α_IB ≠ 0 still preserve the translation invariance of the 2nd and 4th cumulants of the Cos-net architecture. We refer the reader to appendix B.2 for details, where we also compute the leading-order non-Gaussian corrections to the first two cumulants of a single hidden layer Gauss-net at α_IB ≠ 0 and finite N, for d = 1.

Computing actions from connected correlators
In section 2 we systematically studied non-Gaussianities in NN-FT by parametrically violating two assumptions of the CLT: infinite N and independence. The study was performed at the level of connected correlators, rather than actions, because every NN-FT admits a parameter space description of its connected correlators, even if an action is not known.
In this section we will develop these techniques for calculating actions from connected correlators, including in terms of Feynman diagrams in which the connected correlators are vertices. More specifically:
• Field Density from Connected Correlators: Edgeworth Expansion. In section 3.1 we review how knowledge of the cumulants of a single random variable may be used to approximate its probability density, and then we generalize to the FT case, which has a continuum of random variables. This gives an expression for the field density in terms of connected correlation functions. We present an explicit example in the case of a single variable.
• Computing the Action with Feynman Diagrams. Given the Edgeworth expansion, we develop a method to compute the action perturbatively via Feynman diagrams, which becomes clear due to a formal similarity between the Edgeworth expansion and the partition function of a FT. This result is applicable to general FT.
• NN-FT Actions. In section 3.3 we specialize the analysis of section 3.2 to the case of NN-FT. We derive the leading-order form of the action for the case of non-Gaussianities induced either by 1/N corrections or by independence breaking.
• NN-FT Examples. In section 3.4 we derive the leading-order action in 1/N for concrete NN architectures.

Field density from connected correlators: Edgeworth expansion
The Edgeworth expansion from statistics (see e.g. [43] for a textbook treatment and [11] for an ML study) can be used to construct the probability density from the cumulants. The key observation that allows the Edgeworth expansion to be applied in a FT is that the usual relation giving the generating function in terms of the action can be inverted to express the action in terms of the generating functional. Adding a source term in the exponent, mapping J → iJ and integrating over J, then deforming the J integration contour back to real J, gives the probability density and action in terms of W[J]. This result can also be thought of as arising from an inverse Fourier transform of the characteristic function.
Then, to apply the Edgeworth expansion for a single random variable ϕ, we write W[J] in terms of cumulants, which lets us express the density P_ϕ as an expansion around the Gaussian with mean κ_1 and variance κ_2; the Gaussian integral has been performed by mapping J → iJ (alternatively, by working with the characteristic function throughout), and we have neglected the normalization factor.
The result may be extended to the FT case, where ϕ is replaced by ϕ(x), a continuum of mean-free random variables, and the GP action S_G defines the Gaussian reference measure. To the extent that there is a perturbative ordering of the correlators through some expansion parameter (such as 1/N or independence breaking), this expression can be evaluated perturbatively to systematically construct an action from the cumulants.
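For orientation, the functional Edgeworth expansion takes the schematic form below (a standard expression with the normalization and mean term omitted; the conventions here are indicative rather than a verbatim reproduction of the paper's equation):

```latex
P[\phi] \;\propto\; \exp\!\left[\,\sum_{r\geq 3}\frac{(-1)^r}{r!}
\int d^dx_1\cdots d^dx_r\; G^{(r)}_c(x_1,\dots,x_r)\,
\frac{\delta}{\delta\phi(x_1)}\cdots\frac{\delta}{\delta\phi(x_r)}\right]
e^{-S_G[\phi]}\,.
```

Each functional-derivative term acts on the Gaussian factor, generating the interaction terms of the action order by order in the cumulants.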

1D example: sum of N uniform random variables
Let us demonstrate the Edgeworth expansion in a simple example. Consider the standardized sum of N i.i.d. random variables sampled from a uniform distribution. The cumulants of X_i are expressed in terms of the Bernoulli numbers B_r. Plugging this into (2.15) gives the cumulants of ϕ. At finite N, the cumulants κ^ϕ_{r>2} are nonzero and ϕ is non-Gaussian. Using these cumulants, we can write down the probability distribution function of ϕ via an Edgeworth expansion. Truncating the sum at r = 4, expanding the exponential, and keeping terms up to O(1/N), we obtain (3.15), where on the second line we absorbed the constant term into the normalization constant Z′. At order O(N⁰) the exponent in (3.15) is quadratic and ϕ is Gaussian distributed; Gaussianity is then broken by a quartic interaction at order O(1/N). It is worth noting that the cumulants of ϕ are given by simple closed-form expressions, see equation (3.13), while P_ϕ involves a perturbative expansion in 1/N. This is in contrast to weakly coupled FT, where we often start from a simple action expressed in closed form and calculate the connected correlation functions via a perturbative expansion in the coefficients of the interaction terms.
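The O(1/N) improvement can be checked deterministically, since the sum of uniforms has a known exact (Irwin-Hall) density. For N = 4, the quartic Edgeworth correction, with κ^ϕ_4 = −6/(5N), reduces the error relative to the plain Gaussian (the value N = 4 and the evaluation grid are illustrative choices):

```python
import numpy as np
from math import comb, factorial, sqrt, pi

# Edgeworth vs. Gaussian approximation for the standardized sum of N = 4
# uniform variables, compared against the exact Irwin-Hall density.
# Cumulants of U(0,1): κ2 = 1/12, κ4 = -1/120, hence κ4 of φ is -6/(5N).
N = 4
sigma = sqrt(N / 12.0)

def irwin_hall_pdf(s):
    # exact density of S = U_1 + ... + U_N on [0, N]
    return sum((-1) ** k * comb(N, k) * max(s - k, 0.0) ** (N - 1)
               for k in range(N + 1)) / factorial(N - 1)

def exact_pdf(x):       # density of the standardized sum φ = (S - N/2)/σ
    return irwin_hall_pdf(N / 2.0 + sigma * x) * sigma

def gauss_pdf(x):
    return np.exp(-x**2 / 2.0) / sqrt(2.0 * pi)

def edgeworth_pdf(x):   # Gaussian corrected by the κ4 (Hermite He4) term
    k4 = -6.0 / (5.0 * N)
    he4 = x**4 - 6.0 * x**2 + 3.0
    return gauss_pdf(x) * (1.0 + k4 / 24.0 * he4)

xs = np.linspace(-3.0, 3.0, 121)
err_gauss = max(abs(exact_pdf(x) - gauss_pdf(x)) for x in xs)
err_edge = max(abs(exact_pdf(x) - edgeworth_pdf(x)) for x in xs)
assert err_edge < err_gauss    # the O(1/N) correction improves the fit
```

The residual error of the corrected density is set by the neglected κ_6 term, which is O(1/N²).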

Computing the action with Feynman diagrams
In a FT, a powerful tool for organizing a perturbative expansion is Feynman diagrams. Just as Feynman diagrams can be used to compute the cumulants perturbatively in an expansion parameter from an action, they can also be used to compute the action perturbatively from the cumulants. To understand the derivation, recall the expression for the partition function (3.16), where we have introduced couplings g_r instead of g_r/r!, and ∆(x_1, x_2) is the free propagator. The expression (3.16) arises by taking the usual expression for the partition function and replacing the ϕ's in the interaction terms by δ/δJ's. Pulling the J-derivatives outside of the ∫Dϕ in (3.18) and performing the Gaussian integral yields (3.16). These manipulations closely mirror the Edgeworth expansion: the Edgeworth expansion (3.7) is related to the partition function (3.16) by a simple change of variables, given in table 1, which one might think of as a duality map between a field picture and a source picture. This relationship immediately tells us that the analogs of the couplings g_r(x_1, ..., x_r) are the connected correlation functions. We may therefore compute the couplings g_r(x_1, ..., x_r) in the same way that we compute the connected correlators, as illustrated in the case of a six-point vertex. Notably, the vertex is itself a function, and lines enter an n-point vertex at n locations.
To compute the coupling g_r(x_1, ..., x_r) in terms of Feynman diagrams, one sums over all connected r-point Feynman diagrams made out of G^(n)_c vertices. By convention, we do not label internal points on the vertices, in order to simplify the combinatorics. For instance, the four-point coupling g_4(x_1, ..., x_4) has a diagram where it is to be understood that connections to internal points in a vertex appear in all possible combinations. Analytic expressions may be obtained from the diagrams via the Feynman rules given in the table; since G^(2)_c(x_i, y_j)^{-1} involves differential operators, it can be evaluated by Fourier transformation, see appendix D. As an example, consider a contribution to the quartic coupling, where the dots represent contributions from other diagrams and 'perms' represents other diagrams obtained from permutations over internal points. A combinatoric factor of 4! from summing over internal points cancels the prefactor 1/4! from the Edgeworth expansion. The Edgeworth expansion (3.7) involves an infinite sum; correspondingly, computing g_r(x_1, ..., x_r) requires summing over infinitely many Feynman diagrams. When all but finitely many terms in the expansion are parametrically suppressed, the expansion can be truncated at finite order to provide an approximation of g_r(x_1, ..., x_r). We will apply these rules to concrete examples later in this section and demonstrate how approximations to g_r(x_1, ..., x_r) can be obtained systematically.
While our focus is on NN-FT, we emphasize that Edgeworth expansions can be utilized in any FT where the connected correlation functions are known and the expansion in (3.7) is not divergent.

Example: non-local ϕ⁴ theory
Aside from any application in NN-FT, it is interesting to study the self-consistency of the Edgeworth expansion. We do so in a famous case, ϕ⁴ theory, generalized to the case of non-local quartic interactions, in order to demonstrate the ability of the Edgeworth method to handle non-locality. Consider the action in which G^(2)_{G,ϕ}(x_1, x_2)^{-1} and λ(x_1, x_2, x_3, x_4) are both totally symmetric. At leading order, the connected 2-point function receives an O(λ) correction with a symmetry factor of 1/2, and similarly for the connected 4-point function; there are no other connected correlators with contributions at O(λ). To perform an Edgeworth expansion, we first need to write down the inverse propagator (3.29); given (3.29), it is easy to verify that it inverts the corrected 2-point function at O(λ). At this point, let us introduce a shorthand notation to improve readability, abbreviating the integrals ∫d^dx. Finally, we obtain the Edgeworth expansion at O(λ) by plugging (3.29) and (3.28) into (3.7), where δ_1 := δ/δϕ(x_1). Expanding the first exponential and performing the derivatives, we obtain an expression with ϕ_x := ϕ(x); the second term does not depend on ϕ and can be absorbed into the normalization factor. We have recovered the ϕ⁴ action at O(λ), as expected.

General interacting actions in NN-FT
We now study the Edgeworth expansion in NN-FT. We adapt the general analysis of the previous section to the case where non-Gaussianities are generated by the two mechanisms described in section 2, namely, violating the assumptions of the CLT via finite-N corrections and via independence breaking.

Interactions from 1/N-corrections
As we discussed in section 2.3.1, non-Gaussianities arising from 1/N corrections result in connected correlation functions with definite N-scalings for a single hidden layer network. At large N, the action can be approximated systematically by organizing the Edgeworth expansion in powers of 1/N, calculating the couplings via Feynman diagrams, and truncating at a fixed order in 1/N.
To do so, we need to know how the couplings scale with N. We have studied a case in (3.25) where only the even-point correlators are non-zero, and clearly there is a 1/N contribution to g_4 from a single G^(4)_c vertex; any higher-order correlator G^(r>4)_c contributes at order 1/N^{r/2−1} and higher. Consider now contributions to the couplings g_{r>4}. There is a tree-level 1/N^{r/2−1} contribution from a single G^(r)_c vertex, and there are 1/N^{n/2−1} contributions from a G^(n>r)_c vertex with an appropriate number of loops; both are more suppressed than the 1/N contribution to g_4. Finally, consider contributions from V vertices of type G^(n<r)_c. Forming a connected diagram requires nV > r, which implies V ⩾ 2, and therefore the contribution is of order 1/N^{⩾ n−1}, which is more suppressed than 1/N since n begins at 3 in the Edgeworth expansion. Therefore, the single-vertex tree-level contribution to g_4 is the leading contribution in 1/N.
The quartic coupling g_4(x_1, x_2, x_3, x_4), at leading order in 1/N, is given by a single G^(4)_c vertex. We may compute this coupling in a NN-FT by first computing G^(4)_c in parameter space. In summary, the leading order in 1/N action for a single layer NN-FT takes a Gaussian-plus-quartic form, where g_4 at O(1/N) is given in (3.39), under the assumption that the odd-point functions are zero, as in the architectures of section 3.4.

Interactions from independence breaking
Non-Gaussianities generated via independence breaking alone are qualitatively different from those arising from 1/N corrections. We wish to determine the leading-order action due to independence breaking. Focusing for simplicity on the case where independence breaking is controlled by a single parameter α, it follows from (2.52) that the connected correlation functions scale linearly in α at leading order, since the cumulants κ^ϕ_r|_{r>2} of the free theory vanish.
As a result, each coupling g_r(x_1, ..., x_r) receives contributions from tree-level diagrams of all connected correlators at leading order in α. More generally, at any given order in α, there are contributions to g_r(x_1, ..., x_r) from infinitely many diagrams involving all connected correlators. For example, in the expansion for g_2, summing over internal points y_i cancels the 1/(2n)! prefactor from each G^(2n)_c vertex, and the terms in the parenthesis constitute an infinite sum.
This structure makes it impossible to systematically approximate g_r(x_1, ..., x_r) with a finite number of terms via a perturbative expansion in α, unless some other structure correlates with it. Note that this is a feature of NN-FT where non-Gaussianities are generated only by independence breaking; approximation via a finite number of terms would be possible in cases where the connected correlation functions scale with both α and 1/N. In the limit N → ∞, the leading order in α action for a NN-FT is obtained with the g_{r>4} computed similarly to (3.43). Such an action cannot be approximated by a finite truncation unless the theory exhibits additional structure.

Example actions in NN-FT
Next, we exemplify the Feynman rules from section 3.2 in a few single layer NN architectures at finite width with i.i.d. parameters, and evaluate the leading order in 1/N quartic coupling and NN-FT action. Since G^(2)_c(x_1, y_1)^{-1} involves differential operators, we use the methods from appendix D to evaluate the quartic coupling g_4.

Single layer Cos-net
Recall the Cos-net architecture introduced earlier. We will consider the case where all parameters are independent and non-Gaussianities arise due to finite-N corrections. To evaluate the leading order quartic coupling for this NN-FT, let us first compute the inverse propagator G^(2)_{c,Cos}(x, y)^{-1} as a translation-invariant operator. Performing a Fourier transformation of the 2-pt function and its inverse, followed by an inverse Fourier transformation, we obtain a differential operator in ∇²_x := ∂²/∂x². Here, we use (D.3) to evaluate the quartic coupling. The NNGP action is local, but the leading order quartic interaction is non-local.
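The Fourier-space inversion used here is easy to illustrate numerically: on a periodic grid a translation-invariant kernel is a circulant matrix, so its inverse is obtained by inverting its Fourier transform mode by mode. The kernel G(∆) = e^{−|∆|} below is an illustrative stand-in, not the Cos-net 2-pt function:

```python
import numpy as np

# Inverting a translation-invariant 2-pt function in momentum space:
# on a periodic grid the kernel G(x - y) is circulant, hence diagonal in
# Fourier modes, and its inverse is ifft(1 / fft(G)). Kernel is illustrative.
n, L = 128, 20.0
dx = L / n
x = dx * np.arange(n)
d = np.minimum(x, L - x)        # periodic distance |x - y| on the circle
G = np.exp(-d)                  # first row/column of the circulant kernel

# Fourier route: invert mode by mode.
Ginv_row = np.fft.ifft(1.0 / np.fft.fft(G)).real

# Direct route: build the full matrix G(x_i - x_j) and invert it numerically.
C = np.array([[G[(i - j) % n] for j in range(n)] for i in range(n)])
Cinv = np.linalg.inv(C)

assert np.max(np.abs(Cinv[0] - Ginv_row)) < 1e-6
```

In the continuum the same manipulation turns 1/Ĝ(p) back into a differential operator acting on δ(x − y), which is how the ∇²_x terms above arise.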

Single layer Gauss-net
As our next example, consider the output of a single layer Gauss-net with parameters drawn i.i.d., with W⁰ Gaussian and b⁰ ∼ N(0, σ²_{b⁰}). The propagator is identical to that of the Cos-net FT, and so is G^(2)_{c,Gauss}(x_1, x_2)^{-1}. We evaluate the Gauss-net quartic coupling g_4 using (D.3), and (B.12) for G^(4)_{c,Gauss}, where P(ab, cd) and P(abcd) are defined as before. Thus, the Gauss-net FT action at O(1/N) differs from the Cos-net FT at the level of the quartic interaction.

Engineering actions: generalities, locality, and ϕ⁴ theory
In section 3 we used the Edgeworth expansion and a 'duality' between fields and sources to compute couplings in the action (including non-local ones) as connected Feynman diagrams whose vertices are given by the usual connected correlators. This general FT result is applicable in NN-FT of fixed architectures, but it does not answer the question of how to engineer an architecture that realizes a given action.
In this section we study how to design actions of a given type by deforming a Gaussian theory by an arbitrary operator. The result is simple and exploits the duality between the parameter-space and function-space descriptions of a FT. The main results are:
• Action Deformations. We develop a mechanism for expressing an arbitrary deformation of a Gaussian action as a deformation of the parameter density of a NN-FT.
• Local Lagrangians. We utilize the mechanism to engineer local interactions.
• ϕ⁴ Theory as a NN-FT. Using a previous result that realizes free scalar FT as a NN-FT, we engineer local ϕ⁴ theory as an NN-FT.
• Cluster Decomposition. We develop an approach to cluster decomposition, another notion of locality that is weaker than locality of interactions.
We also discuss why it might have been expected that ϕ⁴ theory (and other well-studied FTs) arises naturally at infinite N.
To begin our analysis, consider the partition function of a Gaussian theory, where we have labelled both the partition function and the expectation with a G subscript to emphasize Gaussianity. Now we wish to define a deformed theory that differs from the original only by an operator insertion, treating it in both function space and parameter space. The deformed partition function is obtained by inserting an operator O_ϕ, in general non-local (though it may be chosen to be local), whose subscript ϕ denotes that it may depend on ϕ and its derivatives. In function space, the operator insertion corresponds to a deformation of the action. We may treat this theory in perturbation theory in the usual way: correlators in the non-Gaussian theory are expanded perturbatively in λ and evaluated using the Gaussian expectation E_G, which utilizes the Gaussian action when expressed in function space.
How is this deformation expressed in parameter space, i.e. how do we think of this deformation from a NN perspective? In parameter space, the Gaussian partition function is (4.6); we remind the reader that in such a case Gaussianity is not obvious, but requires a judicious choice of parameter density P(θ) and architecture ϕ_θ(x) such that we have a NNGP via the CLT. In parameter space, the deformation yields (4.7), where the operator O_{ϕ_θ} does not involve an explicit ϕ(x), but instead its parameter space representation; we will exemplify this momentarily. Again, correlators may be computed in perturbation theory in λ by expanding and evaluating in the Gaussian expectation, this time in the parameter space formulation. We emphasize that if the function space and parameter space descriptions (4.3) and (4.6) represent the same partition function, then the deformed theories (4.4) and (4.7) are the same theory. That is, we see how an arbitrary deformation of the action induces an associated deformation of the parameter space description. We will use this in section 4.2 to engineer ϕ⁴ theory as a NN-FT, and in section 4.1 we will more explicitly deform a NNGP.
We end our general discussion with some theoretical considerations in NN-FT, interpreting a non-Gaussian deformation O_{ϕ_θ} in terms of the framework of section 2, and also taking into account the universal approximation theorem.
A non-Gaussian deformation O_{ϕ_θ} must violate an assumption of the CLT. The architecture itself is still the same ϕ_θ(x) as in the Gaussian theory; instead, in (4.7) we may interpret the operator insertion as the same architecture, but with a deformed parameter distribution. This makes it clear that our non-Gaussian theory is still at infinite N and therefore cannot receive non-Gaussianities from 1/N corrections. Instead, it receives non-Gaussianities because the deformed parameter distribution breaks independence via the non-trivial relationship amongst the parameters in the deformation. There may also exist schemes for controlling non-Gaussian deformations in 1/N, instead of via independence breaking, but that is beyond our scope. Was it inevitable that systematic control over non-Gaussianities arises most naturally via independence breaking rather than 1/N corrections? The general answer is not clear, but we may use the control over non-Gaussianities to obtain common theories, such as ϕ⁴ theory in the next section. In that context we may ask a related question: was it inevitable that we obtain common interacting theories via independence breaking rather than 1/N corrections? This question has a better answer. Finite-action configurations of a common theory, say ϕ⁴ theory, are not arbitrary functions, since there may be some functions ϕ(x) that have infinite action. However, finite-action configurations are still fairly general functions, and since they have finite action they occur with non-zero probability in the ensemble. On the other hand, there are universal approximation theorems for NNs, where the error in the approximation to a target function may decrease with increasing N. In such a case, this theorem that is usually cited as a feature in ML may actually be a bug: at finite N there exist functions that cannot be explicitly realized by a fixed architecture, but only approximated. We therefore find it reasonable to expect that there is at least one finite-action configuration ϕ(x) in ϕ⁴ theory that cannot be realized by a finite-N NN of fixed architecture; in such a case, a NN-FT realization of ϕ⁴ theory must be at infinite N. This comment only scratches the surface, but we find the interplay between universal approximation theorems and realizable FT at finite N to be worthy of further study.

Non-Gaussian deformation of a NN GP
To make the general picture more concrete, we would like to consider non-Gaussian deformations of any NNGP. The main result is that we may deform any NNGP by any operator we like; the deformation breaks independence by deforming the parameter density, which explains the origin of the non-Gaussianities as a violation of independence.
As before, we consider a field built out of neurons, where the full set of parameters θ comprises the output weights a_i and the parameters θ_h of the post-activations, or neurons, h. This equation forms the field out of a linear output layer with weights a_i acting on the post-activations, which could themselves be considered as the N-dimensional output of literally any NN. If the reader wishes, one may take ϕ to be a single layer network by further choosing the neurons to be an affine function of the input composed with a non-linear activation function σ : R → R, such as ReLU or tanh; with this additional choice, θ_h comprises b-parameters and c-parameters. Taking the parameter densities P_G(a) and P_G(θ_h) to be independent and N → ∞, ϕ(x) = ϕ_θ(x) is drawn from a GP; we have again used a subscript G to emphasize that these are the parameter densities of the Gaussian theory.
Deforming the Gaussian theory by an operator insertion, which in general is non-Gaussian, we may interpret the insertion as deforming the independent Gaussian parameter density P_G(a)P_G(θ_h) to a non-trivial joint density. The partition function is then that of an infinite-N non-Gaussian NN-FT in which the operator insertion deforms the parameter density. At initialization, if one draws the parameters θ_h first, one may think of this as affecting the density from which the a-parameters are drawn; the draws of the a-parameters are no longer independent.
For the sake of concreteness, consider the case of the single layer network and take a general non-local quartic deformation. Then the operator insertion is O_{ϕ_θ} = g_4(x_1, ..., x_4) a_{i_1}···a_{i_4} σ(b_{i_1 j_1} x^{j_1}_1 + c_{i_1})···σ(b_{i_4 j_4} x^{j_4}_4 + c_{i_4})/N², where Einstein summation is implied and we have absorbed the overall λ into the definition of the non-local coupling g_4(x_1, ..., x_4). The result is the partition function of an infinite-N NN-FT, as we impose lim N → ∞, with quartic non-Gaussianity induced by the breaking of independence in the joint parameter density P(a, b, c).
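The deformation can be sampled in practice by reweighting draws from the Gaussian parameter density by e^{−λO}. The sketch below uses a simplified single-point quartic insertion O_ϕ = ϕ(x₀)⁴ and a tanh architecture with unit-Gaussian parameters; all of these choices are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Operator insertion as a parameter-density deformation: expectations in the
# deformed theory are Gaussian-ensemble expectations reweighted by e^{-λ O}.
# Architecture, variances, and O_φ = φ(x0)^4 are illustrative assumptions.
rng = np.random.default_rng(4)
M, N, x0, lam = 100_000, 64, 0.5, 0.3

a = rng.normal(size=(M, N))
b = rng.normal(size=(M, N))
c = rng.normal(size=(M, N))
phi = (a * np.tanh(b * x0 + c)).sum(axis=1) / np.sqrt(N)

w = np.exp(-lam * phi**4)                       # deformation weight per draw
var_free = np.mean(phi**2)                      # Gaussian-theory 2-pt function
var_def = np.mean(w * phi**2) / np.mean(w)      # deformed-theory 2-pt function

assert var_def < var_free    # the quartic deformation suppresses large |φ|
```

This is importance sampling from the deformed joint density P(a, b, c); the same draws of (a, b, c) are used, only their weights change, mirroring the statement that the architecture is unchanged while the parameter distribution is deformed.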

ϕ⁴ theory as a NN-FT
To end this section and demonstrate the power of this technique, we would like to engineer the first interacting theory that any student learns: local ϕ⁴ theory, with its standard action. Following our prescription, we
• Engineer the NNGP. Using the result of [10], we take an architecture in which the sum runs from 1 to N = ∞, b_i is the vector given by the ith row of the matrix b_ij, and the parameter densities of the Gaussian theory are supported on B^d_Λ, a d-dimensional ball of radius Λ. The density P_G(b_i) is not independent in the vector index j, but all that is needed for Gaussianity is independence in the i index, which is clear from the product form of P_G(b). The power spectrum (the Fourier transform of the two-point function) becomes the standard free scalar result 1/(p² + m²) after a trivial rescaling. This NNGP is equivalent to the free scalar theory of mass m in d Euclidean dimensions, where Λ plays the role of a hard UV cutoff on the momentum.
• Introduce the Operator Insertion. Given the NNGP above, or any other NNGP realizing the free scalar FT, we insert the local quartic operator; for ϕ_{a,b,c}(x) it is to be understood that the RHS of (4.26) is inserted, yielding an expression that is only a function of the a's, b's, and c's.
• Write the Partition Function. We then have a partition function for the deformed theory in which, again, the RHS of (4.26) is inserted for ϕ_{a,b,c} and (4.29) for P(a, b, c); there are no explicit fields in the expression, as it depends only on the architecture (which includes the parameters a, b, c) and the joint parameter density.
Thus, the architecture (4.26) and parameter density (4.29) realize local ϕ⁴ theory via the partition function (4.30). We discuss the connections between GPs, locality, and translation invariance in appendix E. Let us briefly address RG flows. Defining a fixed non-Gaussian theory here involves choosing a fixed value of λ, in addition to the fixed value of Λ that was implicit in fixing the GP. From that starting point, decreasing Λ while keeping the correlators fixed induces an RG flow for λ governed by the usual Callan-Symanzik equation. In the language of the NN architecture, this is interpreted as a flow in the parameter density that is necessary to keep the correlators fixed as Λ is decreased.
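As a numerical sanity check on the free-theory construction, one can Monte Carlo estimate the two-point function of a d = 1 cos-net and compare it to the analytic kernel. The normalization, activation, and uniform weight density below are illustrative assumptions standing in for (4.26)-(4.28), not the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(1)

LAM, SIG_A, N, DRAWS = 2.0, 1.0, 256, 40_000

def cosnet_draws(xs):
    # phi(x) = (1/sqrt(N)) sum_i a_i cos(b_i x + c_i), evaluated at each x in xs
    a = rng.normal(0.0, SIG_A, size=(DRAWS, N))
    b = rng.uniform(-LAM, LAM, size=(DRAWS, N))        # uniform on the d=1 "ball" B^1_Lam
    c = rng.uniform(0.0, 2.0 * np.pi, size=(DRAWS, N)) # uniform phase over a full period
    return np.stack([(a * np.cos(b * x + c)).sum(axis=1) / np.sqrt(N) for x in xs], axis=1)

def analytic_2pt(dx):
    # E_c[cos(bx+c)cos(by+c)] = cos(b(x-y))/2, then average b over [-Lam, Lam]
    return SIG_A**2 / 2.0 * (1.0 if dx == 0 else np.sin(LAM * dx) / (LAM * dx))

phi = cosnet_draws([0.0, 1.0])
mc = np.mean(phi[:, 0] * phi[:, 1])
print(mc, analytic_2pt(1.0))  # the two agree up to Monte Carlo error
```

Translation invariance (dependence on x − y only) arises because c is uniform over a full period, and the power spectrum is proportional to the b-density, supported inside the cutoff Λ; choosing that density proportional to 1/(p² + m²) on the ball is what engineers the free scalar.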

Cluster decomposition and space independence breaking
We now turn to a weaker notion of locality: cluster decomposition. Given a field ϕ(x) (or NN, in our context), we say that it satisfies cluster decomposition if all connected correlation functions G^(r)_c(x_1, ..., x_r) asymptote to zero in the limit where the separation between any two space points x_i, x_j, i ≠ j, is taken to ∞. If the probability density function of ϕ has the form P[ϕ] = e^{−∫ d^d x L(ϕ, ∂ϕ, ..., ∂^n ϕ)} / Z, where Z is a normalization constant and n is finite, we say that ϕ(x) has a local Lagrangian density. This is a stronger notion of locality than cluster decomposition: any theory with a local Lagrangian density satisfies cluster decomposition, but the converse is not true [44].
Checking whether a theory satisfies cluster decomposition requires knowledge of the asymptotic behavior of the correlation functions, but not of the probability density function. As calculating the probability density function of an NN-FT is more challenging than computing the correlation functions, checking cluster decomposition is easier than determining whether there exists a local Lagrangian density that describes the system.
The main result we describe in this section is a framework that enables engineering NN architectures that satisfy cluster decomposition.
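Since checking cluster decomposition requires only the connected correlators, a small estimator that extracts them from ensemble draws is all the machinery needed. Here is a sketch for mean-free fields (the function names are ours), sanity-checked on a Gaussian ensemble, whose connected n-point functions with n > 2 must vanish:

```python
import numpy as np

rng = np.random.default_rng(2)

def g2c(samples, i, j):
    # connected 2-pt function of a mean-free field; samples has shape (draws, points)
    return float(np.mean(samples[:, i] * samples[:, j]))

def g4c(samples, i, j, k, l):
    # G^(4)_c = <ijkl> - <ij><kl> - <ik><jl> - <il><jk>  (mean-free field)
    m4 = float(np.mean(samples[:, i] * samples[:, j] * samples[:, k] * samples[:, l]))
    return (m4 - g2c(samples, i, j) * g2c(samples, k, l)
               - g2c(samples, i, k) * g2c(samples, j, l)
               - g2c(samples, i, l) * g2c(samples, j, k))

# sanity check on a correlated Gaussian ensemble at four "space points"
idx = np.arange(4)
cov = 0.6 ** np.abs(idx[:, None] - idx[None, :])   # exponentially decaying covariance
gauss = rng.multivariate_normal(np.zeros(4), cov, size=300_000)
print(g2c(gauss, 0, 1), g4c(gauss, 0, 1, 2, 3))  # ~0.6 and ~0, up to sampling error
```

The same two functions, applied to draws of any NN architecture, give a direct numerical test of the decay of connected correlators with separation.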

Space independent FT
We will perform our analysis by studying, and then moving away from, a case with a very strong assumption: FTs defined by fields that have independent statistics at different space (or spacetime) points x_I. We call these FTs space independent (SI) FTs. While one can still view such fields as random functions defined on a continuously differentiable space, in general the field configurations are discontinuous; avoiding this would require statistical correlations between nearest neighbors, violating the assumption. This 'd-dimensional' FT is really a collection of uncountably many independent 0-dimensional theories. This means that the partition function factorizes into a product over all space points x_I. This form is agnostic about the origin of the statistics and may be specified in either the function-space or the parameter-space description. In parameter space, independent statistics at different space points x_I means that each space point x_I has its own ensemble of NNs ϕ_{θ_I}(x_I), with its own set of parameters θ_I that is independent of θ_J for I ≠ J. In function space, independence means that the action is such that the path integral factorizes. An immediate consequence of this factorization is that the action cannot contain derivatives of ϕ(x_I), as these would depend on the value of ϕ not only at the point x_I but in a local neighborhood around it. The action is therefore built from a potential alone; turning the product into a sum in the exponent gives the more canonical form, whose factors enjoy an S_L permutation symmetry, where the number of space points L is infinite in the continuum limit. Before introducing correlations between the field values at different space points, let us first study the statistics of the SI theory. Denote the cumulants of ϕ at a given point x_I as κ^ϕ_r(x_I). For simplicity, we will assume that the field values at different space points are identically distributed, i.e.
κ^ϕ_r(x_I) = κ^ϕ_r is fixed for all I, which will also be important for translation invariance. We also assume that the field is mean-free, κ^ϕ_1 = 0. Next, we consider the CGF, which takes the form of an integral over space of W[J; x], the CGF of ϕ at the space point x. Just as the partition function Z[J] factorizes into a product of partition functions associated to individual space points, the CGF W[J] = log Z[J] becomes a sum (or, in this case, an integral). The connected correlators are easily computed by taking derivatives¹⁰, with ∂J(x_I)/∂J(x_J) = δ(x_I − x_J), and the connected correlation functions of the SI network ϕ_SI simplify to κ^ϕ_n times a product of n delta functions. The n-point connected correlator is nonzero only when all of its arguments coincide, and its magnitude is determined by κ^ϕ_n. The full correlation functions can be written in terms of the connected correlators. For example, as ϕ(x_1) and ϕ(x_2) are independent and mean-free, the two-point function G^(2)_{ϕ_SI}(x_1, x_2) is nonzero only when x_1 = x_2, and similarly for the four-point function. The statistics of the theory are completely determined by the space independence assumption and the cumulants κ^ϕ_r. The general n-point function can be expressed as a sum over partitions, where S_n denotes the partitions of the set {1, ..., n}.
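These statements are easy to illustrate numerically on a lattice, where the delta functions become Kronecker deltas. The hypothetical exponential site statistics below are our choice, giving κ_2 = 1 and κ_3 = 2:

```python
import numpy as np

rng = np.random.default_rng(3)

# SI field on a few lattice sites: the value at each site is an independent draw
draws, sites = 1_000_000, 3
phi = rng.exponential(1.0, size=(draws, sites)) - 1.0  # mean-free; kappa_2 = 1, kappa_3 = 2

g2_same = np.mean(phi[:, 0] ** 2)               # kappa_2 at coincident points
g2_diff = np.mean(phi[:, 0] * phi[:, 1])        # vanishes: distinct sites are independent
g3_same = np.mean(phi[:, 0] ** 3)               # kappa_3 at coincident points
g3_diff = np.mean(phi[:, 0] ** 2 * phi[:, 2])   # vanishes: independent mean-free factor
print(g2_same, g2_diff, g3_same, g3_diff)
```

Only the fully coincident correlators survive, and their magnitudes are the single-site cumulants, reproducing the delta-function structure of the SI connected correlators.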

Space-time independence breaking
Clearly we do not want to stop with SI theories. We will now introduce correlations between different space points to 'stitch together' the L zero-dimensional theories (associated to the SI fields) into a d-dimensional FT. This requires modifying the theory so that there are non-trivial correlations between field values at different points. One way to do so is to define new field variables Φ(x_I) as a function of the SI fields ϕ(x_I), Φ(x_I) = Φ(ϕ(x_1), ..., ϕ(x_L)). (4.44) As the value of Φ at site x_I in principle depends on the values of ϕ at all space points, Φ(x_I) and Φ(x_J) are correlated in general, even when I ≠ J. The statistics of Φ(x_I) are then determined by the functional form of (4.44), as well as by the statistics of ϕ(x_I). However, such a general formulation (4.44) is unwieldy, and we therefore simplify the picture. We will describe a family of architectures where Φ(x_I) is constructed by a simpler ansatz, a smearing of ϕ(a), a ∈ {x_1, ..., x_L}, across all space points, and write down a necessary and sufficient condition to satisfy cluster decomposition. Consider the architecture in which Φ(x_I) is the smearing of ϕ against f, for some continuous and differentiable function f(x_I − a). First, note that although a generic draw of ϕ(a) is discontinuous due to independence across different points in space, Φ(x_I) is rendered continuous by the smearing. Furthermore, if the function f is nonzero everywhere, Φ(x_I) has correlations between all pairs of lattice sites. We wish to check whether cluster decomposition is satisfied, and therefore need to compute correlation functions of Φ(x). The Φ-correlators are given by (4.46); as f does not depend on ϕ, we can carry out the expectation value over ϕ, expressing the Φ-correlators in terms of the ϕ-correlators integrated against products of smearing functions. The only contribution to the connected correlator of Φ(x) comes from the connected piece of G^(n)_ϕ(a_1, ..., a_n), with its n delta functions¹¹. Evaluating the integral, we obtain the connected Φ-correlators. Cluster decomposition is satisfied if and only if (4.49) asymptotes to zero in the limit where the separation between any two of the space points x_I, x_J is taken to ∞. Any smearing function f(x) that decays faster than 1/x asymptotically satisfies this condition¹².
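The smearing construction is easy to test on a one-dimensional lattice. Below, a kernel f(x) = e^{−|x|} (our illustrative choice, which decays faster than 1/x) stitches an i.i.d. Gaussian SI field into a field with nonzero but decaying correlations; the discretization is a sketch of the continuum ansatz:

```python
import numpy as np

rng = np.random.default_rng(4)

L, dx = 400, 0.1
grid = (np.arange(L) - L // 2) * dx            # lattice covering [-20, 20)
f = lambda x: np.exp(-np.abs(x))               # smearing kernel, decays faster than 1/x

# SI field: an independent standard normal at every lattice site, many ensemble draws
phi = rng.standard_normal(size=(20_000, L))

def Phi(x):
    # smeared field Phi(x) = sum_a f(x - a) phi(a) dx  (discretized smearing)
    return phi @ (f(x - grid) * dx)

cov = [float(np.mean(Phi(0.0) * Phi(r))) for r in (0.0, 2.0, 8.0)]
print(cov)  # nonzero correlations that decay with separation: cluster decomposition
```

For this kernel the lattice covariance approximates dx (1 + r) e^{−r}, which indeed asymptotes to zero at large separation r.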

Example: Gaussian smearing
We now present an example with a particular choice of the smearing function f and show that the resulting theory satisfies cluster decomposition. Let f(x) = e^{−βx²} for some β > 0. As before, we consider a case where the values of ϕ(x) at different space points are identically distributed, with cumulants κ^ϕ_n.¹³ Following equation (4.49), the cumulants of Φ(x) are then given by the corresponding Gaussian integrals. The dependence of the connected correlators (4.52) on the space coordinates x_i is completely determined by the choice of smearing function f, while their magnitudes depend both on f and on the cumulants κ^ϕ_n. Although our main motivation here has been to engineer NN architectures that satisfy cluster decomposition, smearing layers offer great flexibility in manipulating the connected correlators and may be useful in designing NNs with other desired properties.
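For the lowest cumulant in d = 1, the smearing integral can be done in closed form, which makes the cluster-decomposition property explicit (using the kernel f(x) = e^{−βx²} assumed above):

```latex
\begin{aligned}
F_2(x_1,x_2)
  &= \int da\, e^{-\beta (x_1-a)^2}\, e^{-\beta (x_2-a)^2} \\
  &= e^{-\frac{\beta}{2}(x_1-x_2)^2} \int da\, e^{-2\beta\left(a-\frac{x_1+x_2}{2}\right)^2}
   = \sqrt{\frac{\pi}{2\beta}}\; e^{-\frac{\beta}{2}(x_1-x_2)^2},
\end{aligned}
```

so the two-point cumulant of Φ, proportional to κ^ϕ_2 F_2(x_1, x_2), decays as a Gaussian in the separation x_1 − x_2; the higher F_n work analogously, with the decay controlled entirely by f.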

Conclusions
In this paper we continued the development of NN field theory (NN-FT), a new approach to FT in which a theory is specified by an NN architecture and a parameter density. This description enables a parameter-space description of the statistics, yielding a different method for computing correlation functions. For a more detailed introduction to NN-FT, see the introduction and the references therein.
We focused on three foundational aspects of NN-FT: non-Gaussianity, actions, and locality. Via the CLT, many architectures admit an N → ∞ limit in which the associated NN-FT is Gaussian, i.e. a generalized free FT. In the ML literature, these are called NNGPs. In section 2 we demonstrated that interactions arise from parametrically violating assumptions of the CLT, yielding non-Gaussianities arising from 1/N-corrections, as well as from breaking the statistical independence and identicalness assumptions. These interactions are apparent in parameter-space calculations of connected correlation functions, but manifest themselves as non-Gaussianities in the field density P[ϕ] = exp(−S[ϕ]). In section 3 we developed a technique that allows the action to be computed from the connected correlation functions, via connected Feynman diagrams. This is an inversion of the usual approach in FT: we compute coupling functions in terms of connected correlators, rather than the other way around. The technique was applied to NN-FT, including an analysis involving the parametric non-Gaussianities we studied. In section 4 we studied how to design architectures that realize a given action. We do so by deforming an NNGP by an operator insertion that, from a function-space perspective, corresponds to a deformation of the GP action. However, since we know the architecture, we may also express the deformation in parameter space, in which case the non-Gaussianity associated to a given deformation of the action has a natural interpretation as a deformation of the NN parameter density; that is, the interactions arise from independence breaking. We applied this technique to induce local interactions and derived an architecture that realizes ϕ⁴ theory as an infinite-N NN-FT.

B.2. Cos-net cumulants at finite N, non-i.i.d. parameters
The output of a single-hidden-layer, finite-N, fully connected feedforward network with cosine activation function is given by ϕ(x) = (1/√N) Σ_{i=1}^N a_i cos(b_ij x_j + c_i). For simplicity, we focus on the d = 1 case. The statistical independence of the first-layer weights can be broken by a hyperparameter α_IB ≪ 1; the correlated weight distribution is then a joint density with normalization constant c. The cumulative non-Gaussian effects due to finite width and non-i.i.d. parameters alter all correlation functions, including the 2-pt function at finite width. Using perturbation theory at leading order in α_IB, the 2nd and 4th cumulants are evaluated in terms of ∆x_ij := x_i − x_j. The Fourier transform of this cumulant at α_IB = 0 uses the convention e^{i(p_1 x_1 + p_2 x_2 + p_3 x_3 + p_4 x_4)}. Next, we present another example where non-Gaussianities arise due to both finite width and non-i.i.d. parameters.

B.3. Gauss-net at finite N, non-i.i.d. parameters
We define the Gauss-net architecture as a single-hidden-layer, width-N feedforward network with exponential activation function and an overall normalizing factor; the output otherwise takes the same form as Cos-net. We break the statistical independence of the first-layer weights as in the previous example, at d = 1. The 2nd and 4th cumulants at leading order in α_IB are then expressed in terms of X_ij := x_i + x_j and ∆x_ij := x_i − x_j. At α_IB = 0, the Fourier transform of this cumulant follows from the same convention as for Cos-net.

B.4. Non-Gaussianity from non-identical parameter distributions
We have discussed examples of NN architectures where non-Gaussianities arise, at various widths, from the choice of identical but correlated parameter distributions. In addition, it is possible to violate the CLT through independently drawn but dissimilar NN parameter distributions; this too induces non-Gaussianities in the NN output distribution. Let us present an architecture where non-Gaussianities arise in the infinite-width limit due to dissimilar, independent parameter distributions. Consider an NN architecture whose final linear layer carries the prefactor e^{−j²σ²} at its jth node, with parameters drawn from W_L ∼ N(0, σ²_{W_L}) and b_L ∼ N(0, σ²_{b_L}), and where h^{L−1}_j(x_k) denotes the output of the jth neuron in the (L−1)th hidden layer on input x_k. The presence of the prefactor e^{−j²σ²} leads to dissimilarities among the final-layer parameter distributions. Studying the first three leading cumulants in the N → ∞ limit, with h(x) := h^{L−1}(x) and using standard Gaussian moment identities, one finds that all of these cumulants are nonvanishing as N → ∞; similarly, one can show that other higher-order cumulants are non-vanishing too, adding non-Gaussianities to the output distribution.
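Intuitively, the e^{−j²σ²} prefactor leaves only O(1/σ) effectively contributing terms, so the CLT never kicks in as N grows. A toy version of this mechanism, with uniform random summands u_j standing in for the (elided) neuron contributions:

```python
import numpy as np

rng = np.random.default_rng(5)

def excess_kurtosis(z):
    # fourth standardized cumulant; zero for a Gaussian
    z = z - z.mean()
    return float(np.mean(z**4) / np.mean(z**2) ** 2 - 3.0)

sigma, draws = 0.5, 100_000
for N in (10, 100, 300):
    u = rng.uniform(-1.0, 1.0, size=(draws, N))
    w = np.exp(-np.arange(1, N + 1) ** 2 * sigma**2)     # j-dependent, non-identical scaling
    k_iid = excess_kurtosis(u.sum(axis=1) / np.sqrt(N))  # identical weights: -> Gaussian
    k_dec = excess_kurtosis(u @ w)                       # decaying weights: stays non-Gaussian
    print(N, round(k_iid, 3), round(k_dec, 3))
```

The identically weighted sum has excess kurtosis falling like 1/N, while the non-identically weighted sum keeps an N-independent, nonzero kurtosis: non-Gaussianity survives the infinite-width limit.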

Appendix C. CGF and Edgeworth expansion for NNFT
We express the output of a single-hidden-layer, width-N NN as a sum over N contributions, ϕ(x) = Σ_{i=1}^N h_i(x), where the h_i(x) are the outputs of the individual neurons before they are summed into the final output.

C.1. Finite N and I.I.D. parameters
The CGF for i.i.d. parameters factorizes over the N independent neurons into N copies of the single-neuron CGF, where J(x) and h_i(x) are the source current and the output of the ith neuron, respectively; the second-to-last step of the derivation uses the independence and identicalness of the neuron distributions. The last line is obtained by rewriting the contributions of the cumulants of order r ≥ 3 as a differential operator acting on

e^{i ∫dx_1 J(x_1) G^(1)_c(x_1) − ½ ∫dx_1 dx_2 J(x_1) G^(2)_c(x_1, x_2) J(x_2)}, (C.12)

where ∂_r = δ/δϕ(x_1) ⋯ δ/δϕ(x_r). Next, we evaluate the integral associated with the Gaussian process,

∫ DJ e^{−i ∫dx_1 J(x_1) ϕ(x_1) + i ∫dx_1 J(x_1) G^(1)_c(x_1) − ½ ∫dx_1 dx_2 J(x_1) G^(2)_c(x_1, x_2) J(x_2)}, (C.13)

using the change of variables J′(x) = J(x) + i ∫dx′ G^(2)_c(x, x′)^{−1} [ϕ(x′) − G^(1)_c(x′)], which keeps the measure of the source invariant, DJ → DJ′. Completing the square,

−S_G = −i ∫dx J(x)[ϕ(x) − G^(1)_c(x)] − ½ ∫dx_1 dx_2 J(x_1) G^(2)_c(x_1, x_2) J(x_2)
= −½ ∫dx_1 dx_2 J′(x_1) G^(2)_c(x_1, x_2) J′(x_2) − ½ ∫dx_1 dx_2 [ϕ(x_1) − G^(1)_c(x_1)] G^(2)_c(x_1, x_2)^{−1} [ϕ(x_2) − G^(1)_c(x_2)].

An integration over J′ then results in the Gaussian distribution e^{−½ ∫dx_1 dx_2 [ϕ(x_1) − G^(1)_c(x_1)] G^(2)_c(x_1, x_2)^{−1} [ϕ(x_2) − G^(1)_c(x_2)]}, up to normalization, on which the cumulant operator acts in (C.15). We obtain perturbative corrections around the Gaussian field density by expanding the first exponential term in (C.15) as a series; contributions from higher-order cumulants become increasingly less significant in most cases. For appropriately small α⃗, the ratio of the second term in the logarithm to the first is small, and one can Taylor expand log(1 + x) ≈ x, yielding (C.17) evaluated at α⃗ = 0. The 4-pt function G^(4) is then obtained by taking the fourth J-derivative of M and turning off the source. Finally, let us evaluate the expression when G^(2)_c(y_i, x_i)^{−1} involves differential operators. The integrals over y_i cannot be evaluated directly, as the eigenvalues of each G^(2)_c(y_i, x_i)^{−1} are unknown. To avoid this problem, we substitute the operators and the cumulant with their Fourier transforms. Here f̃ denotes the Fourier transform of f; the second line is obtained by evaluating the y_i integrals to produce δ^d(p_i + q_i) and then integrating over the q_i variables.
When G^(2)_c is translation invariant, we have G^(2)_c(x_1, x_2) = G^(2)_c(x_1 − x_2), and the momentum-space expression simplifies further. We exemplify this expression for the Cos-net and Gauss-net architectures.
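The factorization behind the i.i.d. CGF is simply additivity over independent neurons, W_ϕ[J] = Σ_i W_{h_i}[J], so every cumulant of ϕ = Σ_i h_i is N times the single-neuron cumulant. A quick numerical check, with hypothetical exponential "neuron outputs" at a fixed input:

```python
import numpy as np

rng = np.random.default_rng(6)

def low_cumulants(z):
    # kappa_2 and kappa_3 from raw moments of samples z
    m1, m2, m3 = (float(np.mean(z**k)) for k in (1, 2, 3))
    return m2 - m1**2, m3 - 3.0 * m1 * m2 + 2.0 * m1**3

N, draws = 50, 400_000
# neuron outputs h_i(x) at a fixed input x: kappa_2 = 1, kappa_3 = 2 per neuron
h = rng.exponential(1.0, size=(draws, N))
phi = h.sum(axis=1)  # network output phi(x) = sum_i h_i(x)

k2, k3 = low_cumulants(phi)
print(k2, k3)  # ~ N * 1 and ~ N * 2: cumulants are additive over i.i.d. neurons
```

This additivity is what makes the Edgeworth expansion around the Gaussian well controlled at large N: after normalizing by √N, the rth cumulant scales as N^{1−r/2}.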
ϕ(x) = a σ(b σ(c x)), with a ∼ P(a), b ∼ P(b), c ∼ P(c). (1.1)

In a weakly coupled FT, one can compute the connected correlation functions G^(r)_c(x_1, ..., x_r) in terms of the couplings g_r(x_1, ..., x_r) perturbatively via Feynman diagrams. An Edgeworth expansion allows us to do the converse and compute the couplings g_r(x_1, ..., x_r) in terms of the connected correlation functions G^(r)_c(x_1, ..., x_r). The similarity between (3.7) and (3.16) suggests that the terms in the expansion for g_r(x_1, ..., x_r) can be represented by Feynman diagrams whose vertices are connected correlators.

• Absorb the Operator into a Parameter Density Deformation. The non-Gaussian operator insertion, associated to a local λϕ⁴/4! interaction, deforms the parameter density to P(a, b, c) = P_G(a) P_G(b) P_G(c) e^{−λ ∫d^d x ϕ_{a,b,c}(x)⁴}. (4.29)

Z_SI[J] = ∫ ∏_I dϕ(x_I) e^{−∫d^d x_I (V[ϕ(x_I)] − J(x_I)ϕ(x_I))}. (4.37) This is a FT with a potential, but no derivatives. The field values at different points of space are independent random variables; if they are identically distributed, V[ϕ(x_I)] is fixed for all I and the different factors in Z_SI[J] are identical.

Table 1. The Edgeworth expansion for P[ϕ] and the interaction expansion of Z[J] are formally related by a change of variables, given here up to constant factors. Due to this relationship, non-local couplings and connected correlators may both be computed by appropriate connected Feynman diagrams.

Table 2. Feynman rules for computing g_r from each connected diagram with G_c vertices.