Infinite Neural Network Quantum States: Entanglement and Training Dynamics

We study infinite limits of neural network quantum states ($\infty$-NNQS), which exhibit representation power through ensemble statistics and tractable gradient descent dynamics. Ensemble averages of Rényi entropies are expressed in terms of neural network correlators, and architectures that exhibit volume-law entanglement are presented. A general framework is developed for studying the gradient descent dynamics of neural network quantum states (NNQS) using a quantum state neural tangent kernel (QS-NTK). For $\infty$-NNQS the training dynamics simplifies, since the QS-NTK becomes deterministic and constant. An analytic solution is derived for quantum state supervised learning, which allows an $\infty$-NNQS to recover any target wavefunction. Numerical experiments on finite and infinite NNQS in the transverse field Ising model and the Fermi-Hubbard model demonstrate excellent agreement with theory. $\infty$-NNQS opens up new opportunities for studying entanglement and training dynamics in other physics applications, such as finding ground states.

Introduction.-Quantum states are fundamental objects in quantum mechanics. Generically, the dimensionality of a quantum state grows exponentially with the system size, a fundamental challenge for classical simulations of quantum many-body physics. This is the so-called curse of dimensionality, which also regularly arises in machine learning (ML), where a judicious choice of neural network architecture and optimization method can help address the problem.
Inspired by progress in machine learning, neural networks have been proposed [1] as a useful way to represent quantum wavefunctions, an idea known as a neural network quantum state (NNQS). The goal is to find a compact neural network representation of the high-dimensional quantum state, which is possible because neural networks are universal function approximators [2,3]; furthermore, they also give exact representations of certain quantum states [4][5][6][7][8][9][10][11], demonstrating their representation power. Recent research has demonstrated that NNQS can achieve state-of-the-art results for computing ground states and real-time dynamics of closed and open quantum systems across a variety of domains, including condensed matter physics, high energy physics, and quantum information science [8,9]. Despite this progress, there is ample room for an improved understanding of the representation power and training dynamics of NNQS.
The neural tangent kernel (NTK) [36] has recently emerged as a theoretical tool for understanding the gradient descent dynamics of large neural networks. NTK theory utilizes architectures with a discrete hyperparameter N, such as the width of a fully-connected network. In general, gradient descent updates to the network are controlled by a parameter-dependent NTK, but in the infinite-N limit the network evolves as a linear model, with dynamics governed by an ordinary differential equation with a deterministic, constant NTK [36][37][38]. This ODE becomes linear and analytically solvable for a mean-squared-error loss (see the Supplementary Material for a review of the NTK). Similarly, in the infinite-N limit, networks are often drawn from Gaussian processes [39][40][41][42], in which case they may be trained with Bayesian inference via another deterministic constant kernel, the neural network Gaussian process (NNGP) kernel [39].

* Corresponding author: diluo@mit.edu
In this work we study infinite neural network quantum states (∞-NNQS), which exhibit both representation power through ensemble statistics and tractable training dynamics. Specifically, we derive bounds on ensemble-averaged entanglement entropies in terms of neural network correlation functions. For appropriate ∞-NNQS, the ensemble statistics are Gaussian and the correlators are exactly computable. Architectures are presented that approach Gaussian i.i.d. wavefunctions with volume-law entanglement. Furthermore, we develop a general framework for the gradient descent dynamics of NNQS, using a quantum state neural tangent kernel (QS-NTK). Our framework is general and may be applied to various learning setups, such as ground state optimization, quantum state tomography, and quantum state supervised learning. In appropriate infinite limits, gradient descent of the ∞-NNQS is governed by a constant deterministic QS-NTK. In the case of quantum state supervised learning, we prove that an ∞-NNQS trained with a positive-definite QS-NTK can recover any target wavefunction. We experimentally demonstrate that the QS-NTK can predict the training dynamics of ensembles of finite-width NNQS.
Infinite Neural Network Quantum States.-Consider a quantum state |ψ⟩ represented by a neural network with continuous learnable parameters θ and a discrete hyperparameter N . The wavefunction is ψ θ,N : D → C, where the domain D is problem-dependent. The subscripts θ, N will often be implicit.
An infinite neural network quantum state (∞-NNQS) is a neural network representation in the N → ∞ limit. There are many such limits, according to the identification of a candidate N in a given network architecture. We study cases where this limit is useful either for understanding the entanglement of an ensemble of wavefunctions, via increased control over their statistics, or their gradient descent dynamics. For instance, in many architectures the N → ∞ limit is also one in which the network is drawn from a Gaussian process (GP), where, e.g., N is the width of a fully-connected network [39][40][41][42] or the number of channels in a CNN [43,44]. The existence of such NNGP limits is quite general [45][46][47], and allows for training with Bayesian inference [39,41].
Quantum State NNGP and Entanglement.-NNQS exhibit unique and interesting entanglement properties [6,10,48,49]. The statistical control offered by the NNGP correspondence allows us to study the entanglement entropy properties of ensembles of ∞-NNQS. Consider an ensemble of normalized NNQS {|ψ θ ⟩}. We split the input domain D into a subregion A and its complement B as D = A ∪ B, so that the wavefunction takes two arguments, $x_A$ and $x_B$, from subregions A and B.
Denote the ensemble average of the n-th Rényi entanglement entropy as $\langle S_n \rangle \equiv E_\theta S_n$, where $S_n = \frac{1}{1-n}\log \mathrm{Tr}\,\rho_{\theta A}^n$ is the n-th Rényi entropy over a subregion A. According to Jensen's inequality, for n > 1,

$\langle S_n \rangle \ge \frac{1}{1-n}\log E_\theta \mathrm{Tr}[\rho_{\theta A}^n] .$

It provides a lower bound for the entanglement entropy, which can be computed from $E_\theta \mathrm{Tr}[\rho_{\theta A}^n]$ using the replica trick [50,51]:

$E_\theta \mathrm{Tr}[\rho_{\theta A}^n] = \sum_{\{x_A^k,\, x_B^k\}} G^{(2n)}\big(x_{AB}^{1,1},\, x_{AB}^{2,1},\, x_{AB}^{2,2},\, x_{AB}^{3,2},\, \ldots,\, x_{AB}^{n,n},\, x_{AB}^{1,n}\big),$

where $G^{(2n)}(x_1,\ldots,x_{2n}) := E_\theta[\psi(x_1)\psi^*(x_2)\cdots\psi(x_{2n-1})\psi^*(x_{2n})]$ are the NNQS correlation functions, $x_{AB}^{i,j} := (x_A^i, x_B^j)$, and the sum runs over all k and all possible $x_A^k$ and $x_B^k$. This provides a means for analyzing the different entanglement entropies. The entanglement entropy bound is particularly tractable for ∞-NNQS, since in the GP limit the correlation functions are determined in terms of the two-point function (GP kernel) via Wick's theorem. See the Supplementary Materials for more details.
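The direction of the Jensen bound can be illustrated numerically. The following is a minimal numpy sketch in which random dense wavefunctions stand in for an NNQS ensemble; the subsystem dimensions and ensemble size are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dA, dB, n_ensemble = 4, 8, 2000

S2, tr_rho2 = [], []
for _ in range(n_ensemble):
    # random normalized complex wavefunction psi(x_A, x_B), reshaped as a dA x dB matrix
    psi = rng.normal(size=(dA, dB)) + 1j*rng.normal(size=(dA, dB))
    psi /= np.linalg.norm(psi)
    rhoA = psi @ psi.conj().T            # reduced density matrix on subregion A
    p2 = np.real(np.trace(rhoA @ rhoA))  # Tr rho_A^2
    tr_rho2.append(p2)
    S2.append(-np.log(p2))               # Renyi-2 entropy, n = 2

lhs = np.mean(S2)                        # ensemble average <S_2>
rhs = -np.log(np.mean(tr_rho2))          # Jensen lower bound, -log E_theta Tr[rho_A^2]
assert lhs >= rhs
```

The bound is tight when the ensemble has little spread in $\mathrm{Tr}\,\rho_A^2$, which is the regime relevant for the GP limits discussed below.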
Consider $\psi(x) = \psi_1(x) + i\psi_2(x)$, where both $\psi_1(x)$ and $\psi_2(x)$ are drawn from any NN architecture. For example, we analyze the Cos-net [52] NNQS, where $\psi_1(x)$ and $\psi_2(x)$ each come from the functional form

$f(x) = \sum_{i=1}^{N} a_i \cos\Big(\sum_{j=1}^{d} w_{ij} x_j + b_i\Big),$

where d is the input dimension, N is the hidden width, $a_i \sim \mathcal{N}(0, \sigma_a^2/N)$, $w_{ij} \sim \mathcal{N}(0, \sigma_w^2/d)$, and $b_i \sim U[-\pi, \pi]$. It has been shown that in the infinite-N limit, f(x) gives rise to the two-point function [52]

$G^{(2)}(x, x') = \frac{\sigma_a^2}{2}\, \exp\Big(-\frac{\sigma_w^2\, |x - x'|^2}{2d}\Big).$

Tuning $\sigma_w \to \infty$ yields a zero-mean Gaussian process whose values at distinct inputs decorrelate, so that $\psi_1(x)$ and $\psi_2(x)$ are both drawn i.i.d. from a Gaussian for different values of x. After normalization, such an ensemble of wavefunctions is known to reach the Page value of the entanglement entropy and exhibits volume-law entanglement [53,54]. We compare the von Neumann entanglement entropy of Cos-net with N = 400, 1000, 4000 against the Page-value subsystem scaling in Fig. 1, which demonstrates good agreement between our theory and simulations. More details on the simulations can be found in the Supplementary Materials.
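The $\sigma_w \to \infty$ endpoint can be checked directly without any network: one samples normalized wavefunctions whose real and imaginary parts are i.i.d. Gaussians over basis elements and compares the mean von Neumann entropy to the Page value. The sketch below uses illustrative subsystem dimensions and the leading-order Page formula $S_{\mathrm{Page}} \approx \ln d_A - d_A/(2 d_B)$ for $d_A \le d_B$.

```python
import numpy as np

rng = np.random.default_rng(1)
dA, dB, n_ensemble = 4, 16, 500     # subsystem dimensions, dA <= dB

def entanglement_entropy(psi, dA, dB):
    # von Neumann entropy of subregion A from the Schmidt spectrum
    s = np.linalg.svd(psi.reshape(dA, dB), compute_uv=False)
    p = s**2
    p = p[p > 1e-12]
    return -np.sum(p * np.log(p))

S = []
for _ in range(n_ensemble):
    # sigma_w -> infinity limit: psi_1, psi_2 i.i.d. Gaussian over basis elements
    psi = rng.normal(size=dA*dB) + 1j*rng.normal(size=dA*dB)
    psi /= np.linalg.norm(psi)
    S.append(entanglement_entropy(psi, dA, dB))

page = np.log(dA) - dA/(2*dB)       # Page value, leading order for dA <= dB
assert abs(np.mean(S) - page) < 0.05
```

For $d_A \ll d_A d_B$ the Page value approaches the maximal entropy $\ln d_A$, which is the volume-law behavior referenced above.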
More generally, neural networks provide a means for defining ensembles of wavefunctions whose ensemble-averaged entanglement entropy bounds are expressed in terms of NN correlators, even away from the GP limit. This provides a new mechanism for engineering ensembles of wavefunctions whose typical states could have interesting entanglement properties. In general, finite-N effects introduce non-Gaussianities into the ensemble [55,56] that correct the entanglement entropies. For instance, Gauss-net [56] and Cos-net yield dual GPs as N → ∞ [57], but have different statistics and even symmetries [58] at finite N. This opens up the possibility of entanglement engineering of NNQS and provides a framework for studying the entanglement structure of NNQS.
Quantum State Neural Tangent Kernel.-∞-NNQS also have interesting gradient descent properties.
We begin with a study of gradient descent for general NNQS. The dynamics of the network are governed by the parameter update

$\dot\theta_i = -\eta\, \frac{\partial L}{\partial \theta_i}, \qquad L = \sum_{x' \in B} \mathcal{L}(x'),$

where we have expressed the update in terms of a total loss L and a pointwise loss $\mathcal{L}$, summed over a batch B. Applying the chain rule,

$\dot\theta_i = -\eta \sum_{x' \in B} \Big[ \frac{\partial \mathcal{L}}{\partial \psi(x')}\, \frac{\partial \psi(x')}{\partial \theta_i} + \frac{\partial \mathcal{L}}{\partial \psi^*(x')}\, \frac{\partial \psi^*(x')}{\partial \theta_i} \Big],$

where $x'$ is data from B and the loss derivatives are also evaluated on the batch; the structure of B will be further specified in examples, including any labels associated to $x'$. The associated wavefunction update is

$\dot\psi(x) = \sum_i \frac{\partial \psi(x)}{\partial \theta_i}\, \dot\theta_i = -\eta \sum_{x' \in B} \Big[ \Theta(x, x')\, \frac{\partial \mathcal{L}}{\partial \psi(x')} + \Phi(x, x')\, \frac{\partial \mathcal{L}}{\partial \psi^*(x')} \Big],$

where $\Theta(x, x') = \sum_i \frac{\partial \psi(x)}{\partial \theta_i} \frac{\partial \psi(x')}{\partial \theta_i}$ is the neural tangent kernel (NTK) [36].
Since we are using a complex-valued neural network to represent quantum wavefunctions, we also see the appearance of

$\Phi(x, x') = \sum_i \frac{\partial \psi(x)}{\partial \theta_i}\, \frac{\partial \psi^*(x')}{\partial \theta_i},$

which we call the Hermitian neural tangent kernel (HNTK), since it is Hermitian, $\Phi(x, x')^* = \Phi(x', x)$. Putting the wavefunction and its conjugate on equal footing, we write $\Psi(x) := (\psi(x), \psi^*(x))^T$ and

$\frac{d}{dt}\begin{pmatrix} \psi(x) \\ \psi^*(x) \end{pmatrix} = -\eta \sum_{x' \in B} \begin{pmatrix} \Theta(x,x') & \Phi(x,x') \\ \Phi^*(x,x') & \Theta^*(x,x') \end{pmatrix} \begin{pmatrix} \partial\mathcal{L}/\partial\psi(x') \\ \partial\mathcal{L}/\partial\psi^*(x') \end{pmatrix}, \quad (8)$

and for simplicity re-express it as a matrix ODE

$\dot\Psi(x) = -\eta \sum_{x' \in B} \Omega(x, x')\, \frac{\partial \mathcal{L}}{\partial \Psi(x')}, \quad (10)$

where Ω(x, x') is the block matrix in Eq. 8. We call Ω(x, x') the quantum state neural tangent kernel (QS-NTK), as it determines the gradient descent dynamics of NNQS, and more generally of complex functions. In general, it depends on the parameters θ_i and the initialization of ψ(x), though we will see that in appropriate limits the QS-NTK is deterministic and frozen during training. See also [59], which utilizes a quantum NTK in the context of variational quantum circuits, and appeared while we were finishing this work.
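The kernel structure above can be verified on a small example. The sketch below uses a toy complex network, an illustrative stand-in rather than any architecture from the paper, and checks that Θ is symmetric, Φ is Hermitian, and that one small gradient-descent step on the supervised loss reproduces the kernel equation to first order.

```python
import numpy as np

rng = np.random.default_rng(2)
N, eta = 50, 1e-5
a, b, c = rng.normal(size=(3, N))
xs = np.linspace(-1.0, 1.0, 5)          # batch points

def psi(a, b, c, x):
    # toy complex NNQS: psi(x) = (1/sqrt(N)) sum_j (a_j + i b_j) tanh(c_j x)
    return ((a + 1j*b) @ np.tanh(np.outer(c, x))) / np.sqrt(N)

def grad(a, b, c, x):
    # d psi(x) / d theta for theta = (a, b, c), stacked into one vector of length 3N
    t = np.tanh(c*x)
    return np.concatenate([t, 1j*t, (a + 1j*b)*(1 - t**2)*x]) / np.sqrt(N)

J = np.stack([grad(a, b, c, x) for x in xs])       # Jacobian, shape (5, 3N)
Theta = J @ J.T                                    # NTK
Phi = J @ J.conj().T                               # HNTK
assert np.allclose(Theta, Theta.T)                 # Theta is symmetric
assert np.allclose(Phi, Phi.conj().T)              # Phi is Hermitian

# one small gradient-descent step on L = sum_x |psi - psi_T|^2 matches the kernel equation
psi0 = psi(a, b, c, xs)
psiT = rng.normal(size=5) + 1j*rng.normal(size=5)
eps = psi0 - psiT                                  # dL/dpsi* ; dL/dpsi = eps.conj()
g = 2*np.real(J.conj().T @ eps)                    # dL/dtheta for real parameters
a2, b2, c2 = a - eta*g[:N], b - eta*g[N:2*N], c - eta*g[2*N:]
dpsi_actual = psi(a2, b2, c2, xs) - psi0
dpsi_kernel = -eta * (Theta @ eps.conj() + Phi @ eps)
assert np.allclose(dpsi_actual, dpsi_kernel, atol=1e-6)
```

The agreement is first order in the learning rate, which is why a small η is used here.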
In practice, instead of representing the wavefunction as one complex output of the neural network, it is also common to have the network output the real and imaginary parts separately. In this case, we have the real-imaginary NNQS representation $\Psi_{RI} := (\psi_1, \psi_2)^T$, such that

$\dot\Psi_{RI}(x) = -\eta \sum_{x' \in B} \Omega_{RI}(x, x')\, \frac{\partial \mathcal{L}}{\partial \Psi_{RI}(x')},$

where $\Omega_{RI}$ is the neural tangent kernel in the real-imaginary representation; see the Supplementary Materials. The QS-NTK is generic and may be applied to the various NNQS learning schemes, which correspond to the choice of loss function L. For variational Monte Carlo study of ground states associated to a given Hamiltonian H, $L = \langle\psi|H|\psi\rangle / \langle\psi|\psi\rangle$. For quantum state tomography, with observables $|x\rangle\langle x|$ in a different basis rotation, $L = -\sum_x \log |\langle x|\psi\rangle|^2$. For quantum state supervised learning with a target wavefunction $\psi_T$, $L = \|\psi - \psi_T\|^2$. In general, Eq. 10 is a nonlinear ODE with rich structure. In this work, we focus on the quantum state supervised learning setup, which yields a linear ODE. The study of other loss functions is left for future exploration.
QS-NTK for ∞-NNQS. Let $\psi_{\theta,N}$ be a NNQS and $\Omega_N(x, x')$ the associated quantum state neural tangent kernel. For many architectures, the infinite-N QS-NTK $\Omega_\infty(x, x')$ is parameter-independent at initialization. This is established by the kernel trick, which turns $\Omega_\infty(x, x')$ into an expectation value over parameters via the law of large numbers. See the Supplementary Materials for a concrete example and a discussion of generality, using NTK results. Utilizing this trick generally requires i.i.d. parameters, a property generally spoiled by training.
Fortunately, the initialization QS-NTK plays a special role that can resolve the issue. Consider the linearized model associated to Ψ(x),

$\Psi_l(x) := \Psi_0(x) + \nabla_\theta \Psi(x)\big|_{\theta = \theta_0} \cdot (\theta - \theta_0),$

where $\theta_0$ are the parameters at initialization and $\Psi_0(x) := \Psi(x)|_{\theta=\theta_0}$ is the initialization wavefunction. The linearized model is the truncated first-order Taylor expansion of Ψ(x) around $\theta_0$; we emphasize the model is linear in parameters, not inputs. Its QS-NTK is

$\Omega_l(x, x') = \Omega(x, x')\big|_{\theta = \theta_0},$

which is a crucial conceptual result. It says that the QS-NTK $\Omega_l$ associated to $\Psi_l$ is the QS-NTK Ω of Ψ(x) at initialization, which is parameter-independent.
In summary, an ∞-NNQS Ψ with parameter-independent QS-NTK has a linearization $\Psi_l$ that evolves under gradient descent according to a parameter-independent, time-independent QS-NTK $\Omega_l(x, x')$, with dynamics governed by Eq. 10, but with Ψ (Ω) replaced by $\Psi_l$ ($\Omega_l$). This is a remarkable simplification.
Quantum State Supervised Learning.-We focus on quantum state supervised learning. This technique has important applications, such as initializing states for ground state and real-time simulations, as well as understanding the representation power of the neural network architecture [60]. The loss function of quantum state supervised learning for a target wavefunction $\psi_T$ is the mean-square loss

$L = \sum_{x \in B} |\psi(x) - \psi_T(x)|^2 .$

Given a target quantum state $\psi_T$ and a batch of samples B, the dynamics Eq. 10 become

$\dot\Psi_l(x) = -\eta \sum_{x' \in B} \Omega_l(x, x')\, M\, \big(\Psi_l(x') - \Psi_T(x')\big), \quad (13)$

where $M = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, we have used $\Psi^* = M\Psi$, and Ψ (Ω) have been replaced by $\Psi_l$ ($\Omega_l$) in Eq. 10.
The exact solution to this linear ODE is given by

$\Psi_{l,x}(\tau) = \mu_x(\tau) + \gamma_x(\tau), \quad (15)$

where

$\mu_x(\tau) = \Omega_{xi}\, M\, \big[(\Omega M)^{-1}\big(1 - e^{-\eta\, \Omega M\, \tau}\big)\big]_{ij}\, \Psi_{T,j},$

$\gamma_x(\tau) = \Psi_x(0) - \Omega_{xi}\, M\, \big[(\Omega M)^{-1}\big(1 - e^{-\eta\, \Omega M\, \tau}\big)\big]_{ij}\, \Psi_j(0).$

We use subscripts to denote input dependence, with x for a test point and Latin indices as batch indices; for instance, $\Omega_{xi} := \Omega_l(x, x_i)$ for $x_i \in B$, and $\Omega M$ is evaluated on the batch. The initial wavefunction appears only in $\gamma_x(\tau)$. This analytic solution for an ∞-NNQS deserves comment. First, when the QS-NTK is positive definite (see the Supplementary Materials), the solution converges as τ → ∞ and the converged wavefunction agrees with the target on every train point. Therefore, if the batch B is the entire domain, the ∞-NNQS trained with the QS-NTK perfectly reproduces the target wavefunction. This is a NNQS analog of a major result from the NTK literature, which can be understood with geometric intuition via projection from high-dimensional spaces [61]. Equivalently, one can view ΩM as an effective Hamiltonian, in which case Eq. 13 is the analog of imaginary time evolution and converges to the ground truth. Second, for many architectures, the expectation value of the ensemble of initial wavefunctions is $E[\Psi_x(0)] = 0$, in which case $E[\Psi_{l,x}(\tau)] = \mu_x(\tau)$. In such a case, $\mu_x(\tau)$ is the mean function of the ensemble at time τ, and therefore $\mu_x(\infty)$ is the mean function of the infinite ensemble of converged infinite neural network quantum states.
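The closed-form solution can be checked against direct integration of the linear ODE on a batch. In the sketch below the kernels are built from a random Jacobian rather than a trained network, purely for illustration; the matrix exponential is evaluated by eigendecomposition to keep the dependencies to numpy.

```python
import numpy as np

rng = np.random.default_rng(3)
nb, P, eta = 4, 200, 0.1

# random Jacobian J_xI of a toy complex model defines Theta, Phi on a batch of nb points
J = (rng.normal(size=(nb, P)) + 1j*rng.normal(size=(nb, P))) / np.sqrt(P)
Theta, Phi = J @ J.T, J @ J.conj().T
Omega = np.block([[Theta, Phi], [Phi.conj(), Theta.conj()]])   # QS-NTK on the batch
I = np.eye(nb)
M = np.block([[0*I, I], [I, 0*I]])

psi0 = rng.normal(size=nb) + 1j*rng.normal(size=nb)            # initialization
psiT = rng.normal(size=nb) + 1j*rng.normal(size=nb)            # target
Psi0 = np.concatenate([psi0, psi0.conj()])
PsiT = np.concatenate([psiT, psiT.conj()])

A = eta * Omega @ M
evals, V = np.linalg.eig(-A)

def solution(tau):
    # closed form Psi(tau) = Psi_T + exp(-eta Omega M tau) (Psi_0 - Psi_T)
    return PsiT + (V * np.exp(evals*tau)) @ np.linalg.solve(V, Psi0 - PsiT)

# forward-Euler integration of d Psi / d tau = -eta Omega M (Psi - Psi_T) up to tau = 2
Psi, dt = Psi0.copy(), 1e-4
for _ in range(20000):
    Psi = Psi - dt * (A @ (Psi - PsiT))

assert np.allclose(solution(0.0), Psi0)
assert np.allclose(Psi, solution(2.0), atol=1e-3)
```

On the batch itself the solution reduces to $\Psi(\tau) = \Psi_T + e^{-\eta\,\Omega M \tau}(\Psi(0) - \Psi_T)$, making the imaginary-time-evolution analogy explicit.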
Either $\Psi_{l,x}(\tau)$ or $\mu_x(\tau)$ could be utilized to make predictions relative to targets. This motivates two different losses:

$L_\mu = \sum_x \big|\mu_x(\infty) - \Psi_{T,x}\big|^2,$

which uses the converged mean for predictions, or

$L_\gamma = \frac{1}{K} \sum_{k=1}^{K} \sum_x \big|\Psi_{l,x}^{(k)}(\infty) - \Psi_{T,x}\big|^2, \quad (18)$

which takes the average of losses for an ensemble of K linearized networks trained to convergence, where $\Psi_l^{(k)}$ is the k-th network in the ensemble; expanding around the ensemble mean, the last term becomes the variance of the linearized model in the K → ∞ limit. Notice that Eq. 15 shows that both $L_\mu$ and $L_\gamma$ converge to zero on the training set in infinite time, which implies that ∞-NNQS will be perfectly optimized. For the test set, both $L_\mu$ and $L_\gamma$ converge to a finite value at infinite time, which in practice provides an indicator of the performance of an ensemble of finite neural networks.

Numerical Experiments.-We perform numerical simulations for ∞-NNQS and ensembles of finite-N NNQS in two important models in quantum many-body physics: the spin-1/2 transverse field Ising model, with Hamiltonian $H_s$, and the Fermi-Hubbard model, with Hamiltonian $H_f$. For the transverse field Ising model, we consider $H_s$ on a 3 × 4 lattice with J = 0.1. The target state $|\psi_T\rangle$ is prepared through $|\psi_T\rangle = e^{-iH_s\tau}|\psi_0\rangle$, with $|\psi_0\rangle$ the fully polarized state $|+\rangle^{\otimes n}$ and τ = 2.1. There are in total 4096 basis elements in the target wavefunction. For the Fermi-Hubbard model, we consider $H_f$ on a 3 × 4 lattice with 2 spin-up fermions and 2 spin-down fermions. The target state in the Fermi-Hubbard model is prepared through $|\psi_T\rangle = e^{-iH_f\tau}|\psi_0\rangle$, where $H_f$ has U = 8, $|\psi_0\rangle$ is the ground state of $H_f$ with U = 4, and τ = 2.1. There are 4356 basis elements in the target wavefunction. We choose $|\psi_T\rangle$ in the above way such that the targets are complex-valued and related to real-time quench experiments with different coupling parameters.
For the numerical simulations, we consider two independent neural networks that represent the real part and the imaginary part of the wavefunction, $\psi(x) = \psi_1(x) + i\psi_2(x)$; this is the case of decoupled dynamics (see the Supplementary Materials). Since both models utilize 12 lattice sites, the input is encoded in a 12-d vector. For the transverse field Ising model, spin-up and spin-down configurations take values ±1. For the Fermi-Hubbard model, a hole, spin-down, spin-up, and double occupancy take values in {−1.5, −0.5, 0.5, 1.5}, respectively. For the training dataset, we uniformly draw basis elements with dataset sizes 2400, 3200, 4000 from the target wavefunctions, and leave the rest (the basis complement) as the test dataset. For each experiment, we train an ensemble of 10 finite-width neural network quantum states with full-batch gradient descent and compare with the quantum state neural tangent kernel predictions. The learning rate is chosen to be 0.9 times the maximum NTK learning rate [62], which ensures that the finite networks evolve in a linearized regime. We do not need to train the ∞-NNQS, because the exact solution Eq. 15 makes predictions for all epochs. All simulations are implemented with the neural-tangents library [62]. Fig. 2 compares the training dynamics of finite NNQS and ∞-NNQS in both the transverse field Ising model and the Fermi-Hubbard model. The finite NNQS training dynamics agree rather well with the QS-NTK predictions. The training loss for the ∞-NNQS drops to zero as τ → ∞, while the test losses converge to a finite number, represented by the dashed line in the figure, which is the NTK prediction Eq. 18 in the infinite-time limit. Fig. 3 shows the total MSE loss over various training dataset sizes and finite-width NNQS ensembles. As the training batch size increases, the overall performance of the different ensembles improves, as expected.
As the finite width increases, the performances of the neural network quantum state ensembles converge to the NTK prediction, which is the infinite width limit.
Conclusion.-In this work, we introduced infinite neural network quantum states (∞-NNQS). We demonstrated that bounds on ensemble-averaged entanglement entropies may be computed in terms of neural network correlators. For appropriate ∞-NNQS, these calculations become tractable due to the NNGP correspondence, and we demonstrated that certain architectures, such as the Cos-net NNQS, exhibit volume-law entanglement. We also developed the quantum state neural tangent kernel (QS-NTK) as a general framework for understanding the gradient descent dynamics of neural network quantum states (NNQS). Appropriate ∞-NNQS have a parameter-independent QS-NTK at initialization, which in the linearized regime is frozen to its initialization value throughout training, leading to tractable training dynamics. In quantum state supervised learning, we proved that training a linearized ∞-NNQS with a positive-definite QS-NTK allows for the exact recovery of any target wavefunction. In numerical experiments, we showed that these new techniques yield accurate predictions for the training dynamics of ensembles of finite-width NNQS. Systematic studies from the infinite-network literature [63] suggest that NTK or NNGP Bayesian training for ∞-NNQS may outperform finite networks.
More broadly, our work provides theoretical insights for understanding the training dynamics of neural network quantum states. It also offers practical guidance for choosing neural network architectures: convergence rates during training depend on the spectrum of the QS-NTK, evaluated on the training data. This development also opens up various interesting research directions for understanding neural network quantum state optimization in other physics contexts, such as quantum state tomography and variational Monte Carlo studies of neural network quantum states. Another interesting direction is to significantly generalize the NNQS architecture beyond the fully-connected case by using Tensor Programs [64], a flexible language for connecting general architectures with NTK limits. Recently, there have been applications and generalizations of neural tangent kernels to quantum computation and quantum machine learning [59,[65][66][67], and it will be interesting to integrate the QS-NTK into hybrid classical-quantum machine learning.
Acknowledgments.-

Note Added.-Refs. [59,66] on quantum neural tangent kernels in the context of quantum circuits were posted to arXiv four weeks prior to this manuscript, while our work focuses on the study of neural network quantum states.

I. Review of Neural Tangent Kernel
In this Section we wish to give a brief introduction to the neural tangent kernel (NTK) [36], a recent breakthrough in the theoretical machine learning community that provides new understanding of training neural networks via gradient descent. For the sake of pedagogy, we consider the case of a neural network with one-dimensional input and one-dimensional output, though the analysis trivially extends to other dimensions. We also emphasize that the notation in this section is self-contained. To that end, consider a neural network $f_\theta$ with parameters θ. In general $f_\theta$ is a "big" function, in the sense that it is a composition of simpler functions. Henceforth, we suppress the subscript θ, and it is to be understood that the neural network depends on parameters θ.
The way in which f is composed out of simpler functions is known as the neural network architecture. At initialization, the parameters θ are drawn from some distribution θ ∼ P (θ) and then updated to achieve some objective, such as minimizing a scalar loss functional L.
We consider the case of training a neural network with gradient descent. In the continuous training-time limit, gradient descent is given by

$\frac{d\theta_I}{dt} = -\eta\, \frac{\partial L}{\partial \theta_I} = -\eta \sum_{x' \in B} \frac{\partial l(x')}{\partial \theta_I},$

with Einstein summation on I implied in what follows. Here l(x') is a loss associated to each train point x' that together sums up to L, and B is a batch of train points. Note that this is full-batch gradient descent, not stochastic gradient descent. By one more application of the chain rule, we have

$\frac{df(x)}{dt} = \frac{\partial f(x)}{\partial \theta_I}\, \frac{d\theta_I}{dt} = -\eta \sum_{x' \in B} \Theta(x, x')\, \frac{dl(x')}{df(x')},$

where

$\Theta(x, x') = \frac{\partial f(x)}{\partial \theta_I}\, \frac{\partial f(x')}{\partial \theta_I}$

is a fundamental object appearing in the gradient descent dynamics, the NTK. Due to the sum over parameters, and the fact that modern neural networks have millions of parameters, this is a complicated object that, though fundamental, is in general difficult to compute. Conceptually, this is the kernel function that encodes how the function-space gradient descent update dl(x')/df(x') at a train point x' gets communicated to the test point x.
Alternatively, one may think of it as the function that relates parameter-space and function-space gradient descent. A central observation of [36] is that the NTK simplifies significantly in an appropriate N → ∞ limit, where N is an appropriate width hyperparameter of the neural network. In that limit, the so-called frozen-NTK limit, Θ becomes a deterministic function $\bar\Theta$ that is training-time independent. It is deterministic because the sum over parameters may be reinterpreted as an expectation value $E_\theta[\cdot]$ by the law of large numbers. It is time-independent because wide neural networks evolve as linear models [36,37]. This frozen-NTK limit substantially improves the tractability of the training dynamics, and if l(x') is the MSE loss, the dynamics are solvable.
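The concentration of the NTK with width can be seen in a few lines. The sketch below uses an illustrative one-hidden-layer tanh network with NTK parameterization and checks that the empirical NTK at a fixed pair of inputs fluctuates less across initializations as N grows.

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_ntk(N, x1, x2):
    # Theta(x1, x2) for f(x) = (1/sqrt(N)) sum_j a_j tanh(b_j x + c_j), fresh init per call
    a, b, c = rng.normal(size=(3, N))
    def grad(x):
        t = np.tanh(b*x + c)
        return np.concatenate([t, a*(1 - t**2)*x, a*(1 - t**2)]) / np.sqrt(N)
    return grad(x1) @ grad(x2)

def spread(N, trials=200):
    # standard deviation of the empirical NTK across random initializations
    return np.std([empirical_ntk(N, 0.3, 0.7) for _ in range(trials)])

assert spread(4096) < spread(16)   # the NTK concentrates as the width N grows
```

The initialization-to-initialization spread shrinks like $1/\sqrt{N}$, consistent with the law-of-large-numbers argument above.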
We refer the reader to [36,37] for more details on this material; we have emphasized the essentials. In our work we extend the theory to the case of neural network quantum states.

II. Quantum State Neural Tangent Kernel in Real Formulation, Mixing Kernels, and Decoupled Dynamics
It is illustrative to also consider the system in the real-imaginary formulation, writing the wavefunction as $\psi(x) = \psi_1(x) + i\psi_2(x)$. Defining

$\Psi_{RI}(x) := \begin{pmatrix} \psi_1(x) \\ \psi_2(x) \end{pmatrix}, \qquad \Psi(x) = R\, \Psi_{RI}(x), \qquad R = \begin{pmatrix} 1 & i \\ 1 & -i \end{pmatrix},$

we wish to determine the gradient descent dynamics of the real and imaginary parts. From Eq. 8, an appropriate change of variables gives

$\frac{d}{dt}\begin{pmatrix} \psi_1(x) \\ \psi_2(x) \end{pmatrix} = -\eta \sum_{x' \in B} \begin{pmatrix} \Theta_1(x,x') & \Theta_{12}(x,x') \\ \Theta_{12}(x',x) & \Theta_2(x,x') \end{pmatrix} \begin{pmatrix} \partial\mathcal{L}/\partial\psi_1(x') \\ \partial\mathcal{L}/\partial\psi_2(x') \end{pmatrix},$

which we write compactly as

$\dot\Psi_{RI}(x) = -\eta \sum_{x' \in B} \Omega_{RI}(x, x')\, \frac{\partial \mathcal{L}}{\partial \Psi_{RI}(x')}.$

In $\Omega_{RI}(x, x')$, we note the appearance of the NTKs $\Theta_1$ and $\Theta_2$ associated with $\psi_1(x)$ and $\psi_2(x)$, as well as a new object that we call the mixing kernel,

$\Theta_{12}(x, x') = \sum_i \frac{\partial \psi_1(x)}{\partial \theta_i}\, \frac{\partial \psi_2(x')}{\partial \theta_i}.$

The mixing kernel is not symmetric in 1 and 2, causing the transpose $\Theta_{12}^T := \Theta_{12}(x', x)$ to also appear in $\Omega_{RI}$. The NTK and HNTK of Ψ are related to those of $\psi_1$ and $\psi_2$ as

$\Theta = \Theta_1 - \Theta_2 + i\,(\Theta_{12} + \Theta_{12}^T), \qquad \Phi = \Theta_1 + \Theta_2 + i\,(\Theta_{12}^T - \Theta_{12}),$

where all are functions of (x, x'), and the QS-NTK Ω is related to $\Omega_{RI}$ by

$\Omega = R\, \Omega_{RI}\, R^T . \quad (S12)$

The mixing kernel may be simplified by partitioning the set of parameters into subsets $\theta = \theta_1 \cup \theta_2 \cup \theta_s$, the parameters of only $\psi_1$, only $\psi_2$, and shared parameters, respectively. Then the mixing kernel simplifies to

$\Theta_{12}(x, x') = \sum_{i \in \theta_s} \frac{\partial \psi_1(x)}{\partial \theta_i}\, \frac{\partial \psi_2(x')}{\partial \theta_i},$

i.e., it only depends on the shared parameters $\theta_s$. The mixing kernel affects the dynamics: it mixes $\partial\mathcal{L}/\partial\psi_2(x')$ into the update for $\psi_1$, and vice versa. We can achieve decoupling of the dynamics of $\psi_1$ and $\psi_2$ under an additional assumption. The pointwise loss may be decomposed as

$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_{12},$

according to how the different pieces depend on $\psi_1$ and $\psi_2$; it is natural to call $\mathcal{L}_{12}$ the mixing loss. Then we have

Definition: Decoupled Dynamics. Let $\psi_1$ and $\psi_2$ be the real and imaginary parts of a NNQS with zero mixing kernel and mixing loss, $\Theta_{12} = \mathcal{L}_{12} = 0$. Then $\psi_1$ and $\psi_2$ evolve independently under gradient descent.
Decoupled dynamics arises, for instance, in the case of state recovery studied in quantum state supervised learning.
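The shared-parameter structure of the mixing kernel can be checked exactly on a small example. The two-head toy network below (one shared hidden layer, separate output heads) is an illustrative assumption, not an architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
N, x1, x2 = 50, 0.3, 0.7
w = rng.normal(size=N)               # hidden weights shared by psi_1 and psi_2
a1, a2 = rng.normal(size=(2, N))     # separate output heads

def head_grads(a, x):
    # gradients of psi(x) = a . tanh(w*x) / sqrt(N) with respect to (a, w)
    t = np.tanh(w*x)
    return t/np.sqrt(N), a*(1 - t**2)*x/np.sqrt(N)

da1, dw1 = head_grads(a1, x1)        # gradients of psi_1 at x1
da2, dw2 = head_grads(a2, x2)        # gradients of psi_2 at x2

# stack gradients over the full parameter set theta = (theta_1, theta_2, theta_s) = (a1, a2, w)
g1 = np.concatenate([da1, np.zeros(N), dw1])   # psi_1 does not depend on a2
g2 = np.concatenate([np.zeros(N), da2, dw2])   # psi_2 does not depend on a1
theta_12 = g1 @ g2                             # mixing kernel Theta_12(x1, x2)

# only the shared block theta_s = w contributes, as in the simplified formula above
assert np.isclose(theta_12, dw1 @ dw2)
```

With fully separate parameter sets the shared block is empty and $\Theta_{12}$ vanishes identically, which is the decoupled case used in the supervised-learning experiments.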

III. Deterministic Quantum State Neural Tangent Kernel: A Simple Example
We now demonstrate a simple architecture with deterministic QS-NTK at N = ∞. Consider a single-layer network of width N, defined by an element-wise nonlinearity σ : R → R and parameters θ = {a, b, c} with $a_i, b_i, c_i \sim \mathcal{N}(0, 1)$:

$\psi(x) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} (a_i + i\, b_i)\, \sigma(c_i x).$

Its kernels are sums over i.i.d. terms; for instance,

$\Phi_N(x, x') = \frac{1}{N} \sum_{i=1}^{N} \Big[ 2\, \sigma(c_i x)\, \sigma(c_i x') + (a_i^2 + b_i^2)\, \dot\sigma(c_i x)\, \dot\sigma(c_i x')\, x x' \Big].$

When the quantities in the sums are i.i.d., the law of large numbers in the N → ∞ limit gives

$\Phi_\infty(x, x') = E_\theta\Big[ 2\, \sigma(c_i x)\, \sigma(c_i x') + (a_i^2 + b_i^2)\, \dot\sigma(c_i x)\, \dot\sigma(c_i x')\, x x' \Big],$

with no sum on i, and similarly for $\Theta_\infty$. In such a case $\Theta_\infty$ and $\Phi_\infty$ are parameter-independent, and therefore so is the QS-NTK $\Omega_\infty$. While the i.i.d. criterion is not always satisfied, it usually is at initialization. This property holds for the NTK Θ for many architectures, but a similar analysis applies also to the HNTK Φ, and therefore the QS-NTK Ω should also be parameter-independent at initialization for many architectures. More generally, obtaining a deterministic QS-NTK is simple in the decoupled limit. There an ∞-NNQS is $\psi = \psi_1 + i\psi_2$, and if the NTKs $\Theta_1$ and $\Theta_2$ of $\psi_1$ and $\psi_2$ are deterministic, then so is the QS-NTK. One simply chooses $\psi_1$ and $\psi_2$ to have architectures that realize a deterministic NTK in the infinite limit; see [47] for deterministic NTKs in a wide variety of architectures. We expect that deterministic QS-NTKs are similarly general away from the decoupling limit, where one must additionally show that the mixing kernel becomes deterministic in the infinite-width limit.
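The concentration of both kernels for this kind of single-layer complex network can be checked empirically; the tanh nonlinearity and the test points below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
x1, x2 = 0.4, -0.8

def kernels(N):
    # psi(x) = (1/sqrt(N)) sum_j (a_j + i b_j) tanh(c_j x), with a, b, c ~ N(0, 1)
    a, b, c = rng.normal(size=(3, N))
    def grad(x):
        t = np.tanh(c*x)
        return np.concatenate([t, 1j*t, (a + 1j*b)*(1 - t**2)*x]) / np.sqrt(N)
    g1, g2 = grad(x1), grad(x2)
    return g1 @ g2, g1 @ g2.conj()   # Theta_N(x1, x2), Phi_N(x1, x2)

def stats(N, trials=200):
    vals = np.array([kernels(N) for _ in range(trials)])
    return np.abs(vals[:, 0]).mean(), vals[:, 1].std()

mean_theta_16, std_phi_16 = stats(16)
mean_theta_4096, std_phi_4096 = stats(4096)
assert mean_theta_4096 < mean_theta_16   # Theta_N -> 0 here, since E[(a+ib)^2] = 0
assert std_phi_4096 < std_phi_16         # Phi_N concentrates on a deterministic Phi_inf
```

For this particular architecture the a- and b-contributions cancel in $\Theta_N$, so $\Theta_\infty$ vanishes while $\Phi_\infty$ remains finite and deterministic.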

IV. Quantum State Neural Tangent Kernel Spectra for Decoupled Dynamics
We have demonstrated that ∞-NNQS training via gradient descent is governed by a quantum state neural tangent kernel (QS-NTK), and that convergence is related to the spectrum of the QS-NTK. We now study the spectrum of the QS-NTK in the decoupled limit, which suffices for ensuring the existence of a positive definite QS-NTK. In the decoupled limit, the real and imaginary parts of the wavefunction, $\psi_1(x)$ and $\psi_2(x)$, evolve independently under gradient descent according to their associated NTKs $\Theta_1$ and $\Theta_2$. In this case, the QS-NTK is positive definite if $\Theta_1$ and $\Theta_2$ are positive definite.
Accordingly, we now review cases in which the limiting NTK is positive definite (PD); i.e.,

$\int dx\, dx'\, f(x)\, \Theta(x, x')\, f(x') > 0$

for any non-zero real-valued function f. Any kernel satisfying this constraint, when evaluated on a finite set of inputs, becomes a (Gram) matrix that is positive definite. Since in practice neural networks are trained on finite data sets, this Gram matrix is of particular importance, and is positive definite when the number of parameters $\theta_i$ in the neural network exceeds the number of inputs x. The results below refer to Eq. S19. Some of our results depend crucially on Bochner's theorem, which takes different forms depending on context. We use that of [68,69]: Theorem (Bochner). A continuous translation-invariant kernel k(x, y) = k(x − y) on $R^d \times R^d$ is positive definite if and only if it is the Fourier transform of a non-negative measure.
Case 2: Random Fourier Features. Consider the random Fourier feature (RFF) map $\gamma : R^d \to R^{2d}$,

$\gamma(v) = \big(a_0 \cos(2\pi b_0^T v),\; a_0 \sin(2\pi b_0^T v),\; \ldots,\; a_{d-1} \cos(2\pi b_{d-1}^T v),\; a_{d-1} \sin(2\pi b_{d-1}^T v)\big),$

with $a_k \in R$, $b_k \in R^d$, $k \in \{0, \ldots, d-1\}$. The $a_k$ and $b_k$ are tunable hyperparameters set at initialization, and γ(v) lives on a hypersphere. Since cos(α − β) = cos(α) cos(β) + sin(α) sin(β), we have

$\gamma(v_1) \cdot \gamma(v_2) = \sum_k a_k^2 \cos\big(2\pi\, b_k^T (v_1 - v_2)\big),$

which, notably, is translation invariant. We now construct another neural network that uses the RFFs. Prepending the RFF map to g, we arrive at $f : R^d \to R^D$ as f(v) = g(γ(v)). Let $\Theta_g(x_1, x_2)$ be the NTK associated to g. Inside f, g acts only on x of the form x = γ(v), i.e., points on the hypersphere. Restricted to the hypersphere, the NTK can often (e.g., if g is a multi-layer perceptron) be represented as a dot-product kernel, $\Theta_g(x_1, x_2) = h_g(x_1 \cdot x_2)$. Then the NTK of f is given by [70]

$\Theta_f(v_1, v_2) = h_g\Big(\sum_k a_k^2 \cos\big(2\pi\, b_k^T (v_1 - v_2)\big)\Big).$

We see that by prepending g with RFFs to obtain f, the NTK associated to f is translation invariant. Additionally, convergence may be optimized by tuning the $a_k$ and $b_k$, which in turn tunes the spectrum of $\Theta_f$; [70] demonstrated this with strong success in concrete computer vision applications. We emphasize instead that this technique gives another angle on PD NTKs: given an architecture g with even input dimension, this construction yields a canonical architecture f with translation-invariant NTK that may be checked for positive-definiteness by Bochner's theorem, as we did for Gauss-net.
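The two properties used above, translation invariance of the RFF kernel and positive semi-definiteness of its Gram matrices, can be verified directly; the amplitudes and frequencies below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 3
a = np.ones(d)                        # tunable amplitudes a_k (all set to 1 here)
B = rng.normal(size=(d, d))           # tunable frequencies b_k, one row per k

def gamma(v):
    # RFF map: (a_k cos(2 pi b_k . v), a_k sin(2 pi b_k . v)) for k = 0, ..., d-1
    phase = 2*np.pi * B @ v
    return np.concatenate([a*np.cos(phase), a*np.sin(phase)])

# the induced kernel depends only on v1 - v2: translation invariance
v1, v2, shift = rng.normal(size=(3, d))
k12 = gamma(v1) @ gamma(v2)
assert np.isclose(k12, gamma(v1 + shift) @ gamma(v2 + shift))

# Gram matrices of the kernel are positive semi-definite, consistent with Bochner's theorem
V = rng.normal(size=(10, d))
G = np.array([[gamma(x) @ gamma(y) for y in V] for x in V])
assert np.linalg.eigvalsh(G).min() > -1e-9
```

Since the kernel is an explicit feature-map inner product, its Gram matrices are PSD by construction; Bochner's theorem is what upgrades this to a statement about the translation-invariant kernel itself.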
Case 3: Original Literature. The NTK was defined in [36], which also arrived at the first PD NTK. Let f be a deep fully-connected network with input dimension d and non-polynomial Lipschitz nonlinearity σ. Then the restriction of the NTK to the unit sphere S d−1 is PD.

V. Details on Entanglement Entropy Calculation
Here we provide details on the calculation of the entanglement entropy. To be specific, we consider the Rényi-2 entropy calculation, i.e., n = 2:

$\mathrm{Tr}[\rho_{\theta A}^2] = \sum_{x_A^1, x_A^2, x_B^1, x_B^2} \psi(x_{AB}^{1,1})\, \psi^*(x_{AB}^{2,1})\, \psi(x_{AB}^{2,2})\, \psi^*(x_{AB}^{1,2}),$

where $x_{AB}^{i,j} := (x_A^i, x_B^j)$. This is also known as the replica trick [50,51]. The ensemble average satisfies $\langle S_2 \rangle \ge -\log E_\theta \mathrm{Tr}[\rho_{\theta A}^2]$, where the right-hand side can be computed by

$E_\theta \mathrm{Tr}[\rho_{\theta A}^2] = \sum_{x_A^1, x_A^2, x_B^1, x_B^2} G^{(4)}\big(x_{AB}^{1,1},\, x_{AB}^{2,1},\, x_{AB}^{2,2},\, x_{AB}^{1,2}\big),$

with $G^{(4)}(x_1, x_2, x_3, x_4) = E_\theta[\psi(x_1)\psi^*(x_2)\psi(x_3)\psi^*(x_4)]$. For a mean-zero Gaussian ensemble with $E_\theta[\psi(x)\psi(y)] = 0$, Wick's theorem reduces this to two-point functions,

$G^{(4)}(x_1, x_2, x_3, x_4) = G^{(2)}(x_1, x_2)\, G^{(2)}(x_3, x_4) + G^{(2)}(x_1, x_4)\, G^{(2)}(x_3, x_2),$

where $G^{(2)}(x, y) = E_\theta[\psi(x)\psi^*(y)]$.
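The Wick reduction can be checked by Monte Carlo in the simplest Gaussian case. The sketch below assumes a delta-function two-point kernel, $G^{(2)}(x, y) = g\, \delta_{xy}$, corresponding to i.i.d. complex Gaussian amplitudes (the $\sigma_w \to \infty$ Cos-net limit of the main text); summing the two Wick pairings over the replica indices then gives $E_\theta \mathrm{Tr}[\rho_{\theta A}^2] = g^2\, d_A d_B (d_A + d_B)$ for the unnormalized state.

```python
import numpy as np

rng = np.random.default_rng(8)
dA, dB, n = 2, 3, 20000
g = 2.0                               # G2(x, x') = g * delta_{x, x'} for these amplitudes

tr2 = []
for _ in range(n):
    # i.i.d. complex Gaussian amplitudes with E|psi|^2 = g = 2 (unnormalized state)
    psi = rng.normal(size=(dA, dB)) + 1j*rng.normal(size=(dA, dB))
    rhoA = psi @ psi.conj().T         # unnormalized reduced density matrix
    tr2.append(np.real(np.trace(rhoA @ rhoA)))

wick = g**2 * dA * dB * (dA + dB)     # sum of the two Wick pairings of G4
assert abs(np.mean(tr2)/wick - 1) < 0.1
```

The two terms $g^2 d_A d_B^2$ and $g^2 d_A^2 d_B$ come from the two pairings, matching the two-term Wick formula above.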