Towards a theory of machine learning

We define a neural network as a septuple consisting of (1) a state vector, (2) an input projection, (3) an output projection, (4) a weight matrix, (5) a bias vector, (6) an activation map and (7) a loss function. We argue that the loss function can be imposed either on the boundary (i.e. input and/or output neurons) or in the bulk (i.e. hidden neurons) for both supervised and unsupervised systems. We apply the principle of maximum entropy to derive a canonical ensemble of the state vectors subject to a constraint imposed on the bulk loss function by a Lagrange multiplier (or an inverse temperature parameter). We show that in an equilibrium the canonical partition function must be a product of two factors: a function of the temperature and a function of the bias vector and weight matrix. Consequently, the total Shannon entropy consists of two terms which represent respectively a thermodynamic entropy and a complexity of the neural network. We derive the first and second laws of learning: during learning the total entropy must decrease until the systems reaches an equilibrium (i.e. the second law) and in the equilibrium an increment in the loss function must be proportional to an increment in the thermodynamic entropy plus an increment in the complexity (i.e. the first law). We calculate the entropy destruction to show that the efficiency of learning is given by the Laplacian of the total free energy which is to be maximized in an optimal neural architecture, and explain why the optimization condition is better satisfied in a deep network with a large number of hidden layers. The key properties of the model are verified numerically by training a supervised feedforward neural network using the method of stochastic gradient descent. We also discuss a possibility that the entire universe on its most fundamental level is a neural network.


Introduction
Despite of many attempts [1][2][3][4][5][6] the effectiveness of deep learning has so far no clear explanation. This is rather surprising given that a neural network is a very simple and a well-defined mathematical object [7][8][9]. What makes it difficult to analyze is that the deep neural networks are typically described with a very large number of parameters, e.g. weight matrix, bias vector, training data, etc. For such systems most of the analytical techniques are not very useful and one must rely on numerics. The situation is very similar to what happens in physics. Physical systems (both classical and quantum) can often be solved exactly when the number of degrees of freedom is small, but the problem becomes intractable when the number of degrees of freedom is large. Fortunately, there is a set of ideas which proved very useful for analyzing physical systems with many degrees of freedom. It is statistical mechanics. The main point of the present paper is to apply the methods of statistical mechanics to machine learning. In the remainder of this section, we will summarize the main results as it might help the reader to navigate through the paper.
In Sec. 2 we set the stage by defining a neural septuple (i.e. state vector, input/output projection operators, weight matrix, bias vector, activation map and loss function). The septuple is not equivalent to the standard neural architecture used in machine learning, but it does include such systems as a special limit. There are three main motivations to define these more general structures. First of all, we want to develop a unified treatment of different types of learning algorithms, i.e. supervised, unsupervised, etc. Secondly, we want the very structure of hidden layers to be a dynamical variable in addition to the weight matrices and bias vectors. And finally, we want to have a theoretical framework which is suitable for a statistical description.
In Sec. 3 we address the main problem of unsupervised learning, namely, what should be an appropriate loss function if the training dataset contains only input, but no output data. We claim that an answer can be obtained by defining a local error and a local objective for hidden neurons (or in the bulk) instead of a more conventional error for output neurons (or on the boundary). The boundary loss is usually given by a sum over errors on the boundary (i.e. over input/output neurons), but the bulk loss could be a sum over both local errors and local objectives in the bulk (i.e. over hidden neurons). A simple example of a local objective for a neuron is a binary classification of an incoming signal, and then an outgoing signal with values closer to lower-and upper-bounds are rewarded and values in-between are penalized.
In Sec. 4 we consider two statistical ensembles over state vectors: a micro-canonicaltype ensemble and a canonical-type ensemble. We expect that in the limit of a large number of neurons the two ensembles are equivalent, as is usually the case in statistical physics, but the latter ensemble (i.e. canonical ensemble) is a lot easier to handle analytically. Moreover, we show that the canonical ensemble can be derived from the Jaynes' maximum entropy principle [10,11]. The principle states that the probability distribution (in our case statistical ensemble) which best represents the current state of knowledge is the one with the largest Shannon entropy.
In Sec. 5 we define perhaps the most important object in statistical mechanics -a partition function. For a bulk loss with (or without) a quadratic local objective the canonical ensemble can be approximated by a Gaussian integral and then the (canonical) partition function can be calculated analytically. A minor complication is that the range of integration (which is set by the range of an activation function) is finite in contrast to the Gaussian integral whose range is infinite. Nevertheless, the problem can be solved by replacing the sharp cut-off on the boundaries of integration with a smooth Gaussian window function. In this section we also define an operatorĜ whose spectrum determines the canonical partition function and plays a central role in everything that follows.
In Sec. 6 we define a time-invariant state of equilibrium (or what we call a learning equilibrium) and show that in such state the partition function must factorize into product of two factors: a function of the temperature and a function of the bias vector and weight matrix. Among other things it implies that the total free energy is a difference of two terms: a familiar thermodynamic free energy and an unfamiliar product of the temperature and a complexity function. While the thermodynamic free energy is a function of only temperature and usually decreases, the total free energy is expected to increase with learning due to a decrease in the complexity function. This might sound odd, but, in fact, is a mare consequence of an openness of the learning system where the entropy is flowing out of the system.
In Sec. 7 we calculate the total entropy of the canonical ensemble and argue that in an equilibrium it must be a sum of a familiar thermodynamic entropy and of an unfamiliar complexity function which is directly related to a dynamical dimensional reduction of the state space. We then argue that the total entropy must decrease until the systems reaches an equilibrium (i.e. the second law of learning (7.6)) and in an equilibrium an increment in the loss function must be proportional to an increment in the thermodynamic entropy plus an increment in the complexity (i.e. the first law of learning (7.11)).
In Sec. 8 we calculate a non-equilibrium production of the Shannon entropy (of a probability distribution function over weight matrices and bias vectors) and argue that in an optimal neural architecture the entropy production must be maximized (or, more precisely, the entropy destruction should be minimized). This is in a complete agreement with the socalled stationary entropy production principle that was used in [14] to derive an approximate Schrödinger equation from a highly constrained stochastic process and in [15] to derive an approximate Einstein equation from non-equilibrium thermodynamics of the metric tensor. We then show that the rate of the entropy production is proportional to the Laplacian of the free energy in the configuration space of the weight matrices and bias vectors.
In Sec. 9 we used the criteria of minimizing the entropy destruction (or minimizing the negative Laplacian of the free energy) to derive an expected dynamics of a spectrum of operatorĜ in an optimal neural architecture. In particular, we show that most of the eigenvalues of the operator logĜ should remain near zero, a small fraction of the largest eigenvalues should move to positive values 0 and a very small fraction of eigenvalues should move to very small values 0. This implies that the effectiveness of a neural network can be translate into a skewness of the distribution of eigenvalues of logĜ, i.e. the larger the skewness the better a neural network is expected to perform. Then we show that the skewness in a deep architecture is much larger than in a shallow architecture which demonstrates why the deep architecture is preferred.
In Sec. 10 we discuss the main results of numerical experiments conducted using the TensorFlow Python library [16] and MNIST database of handwritten images [17]. Two different neural networks (a deep network with two hidden layers and a shallow network with a single hidden layer) were trained using the method of stochastic gradient descent and then the numerical data were analyzed in context of the analytical calculations carried out in the paper. More specifically, the training evolution of the bulk and boundary loss functions progressed as expected, the predicted dynamics of the spectrum of operatorĜ was established, and the anticipated relaxation of the complexity function towards equilibrium was confirmed.
And finally in Sec. 11 and Sec. 12 we discuss a possibility that the entire universe on its most fundamental level is a neural network. For such an ambitious proposal to actually work we claim that the three components: quantum mechanics, general relativity and macroscopic observers must all emerge from a microscopic neural network. For the time being, we leave aside the problem of observers (see, however, [33]) and study a possible emergence of quantum mechanics and general relativity. In particular, we show that approximate Schrödinger equation (see Sec. 11) and Einstein equations (see Sec. 12) can indeed emerge from a network with a large number of neurons not too far from a learning equilibrium.

Neural septuple
We start by introducing all of the essential ingredients of a neural network or, what we shall call, a neural septuple consisting of: (1) x, a state vector which describes the state of all neurons, (2)P in , an input projection which describes a subspace spanned by input neurons, (3)P out , an output projection which describes a subspace spanned by output neurons, (4)ŵ, a weight matrix which describes directed connections between all pairs of neurons, (5) b, a bias vector which describes biases in the inputs of all neurons, (6) f , an activation map which describes a non-linear transformation, and (7) H, a loss function which describes a learning objective of the entire network. Consider a collection of N neurons described by a column 1 state vector, x ∈ R N , whose components are real numbers, x i ∈ R, but one can also generalize the construction to complex numbers or other fields. There are three types of neurons: input neurons, hidden neurons and output neurons, and so it is convenient to define three subspaces of the state space: input subspace V in , output subspace V out and hidden subspace V hid . We shall also refer to the direct sum of the input and output subspaces, V in ⊕ V out , as a boundary and to the hidden subspace, V hid , as a bulk. A neural network is trained by specifying the components of x in the boundary subspace which represent only the input and output neurons. The boundary components can be described with two projection operators (or matrices): an input projection P in and an output projectionP out . These operators can be used to project a state vector x to either input subspace, i.e.P in x ∈ V in , output subspace, i.e.P out x ∈ V out , boundary subspace, i.e.
The neurons are connected into a neural network with connections described by a weight matrix,ŵ, which is also an adjacency matrix of a weighted directed graph with individual neurons representing the nodes of the graph. For the neural networks considered here the components of the weight matrix are assumed to be real numbers, w ij ∈ R, but one can also generalize the construction to complex numbers or other fields. In addition, for the so-called feedforward neural network with L layers (i.e. one input layer, one output layer and L − 2 hidden layers) the weight matrix is taken to be nilpotent, i.e. The state vector can only change when either a new training dataP in x ∂ ∈ V in are entered, or the new data propagate through the network The vector b ∈ R N is a bias vector and f : R N → R N is an activation map which acts separately on each component, i.e.
For the input neurons with no incoming connections to remain fixed, i.e.
an additional condition must also be imposed on the input bias, This condition is satisfied for a feedforward neural network, but need not be satisfied for more general learning systems. After a finite number of steps t the state vector x(t) may converge to a fixed state x(t) =x defined by a fixed point equation For example, in a deep feedforward neural network with L layers the fixed state would be reached after L − 1 steps, i.e. x(L − 1) =x, given that the condition on the input bias (2.8) is satisfied. For more general systems the state may or may not converge to a fixed point depending on the activation transformation (2.5) and initial conditions (2.4).
The final ingredient of a neural septuple is a loss function. In a feedforward neural network the loss function is usually defined by projecting the fixed statex to the output subspaceP outx ∈ V out and then by comparing the result with a desired output stateP out x ∂ ∈ V out . For example, one can define a loss function as a squared error of the output neurons, Since there is no error on the input neurons (2.7) we can also rewrite it as a squared error on all boundary (i.e. input and output) neurons For this reason, we shall refer to H ∂ as a boundary loss function.

Supervised vs. unsupervised
In the pervious section we defined a neural network as a neural septuple x,P in ,P out ,ŵ, b, f , H where x is a state vector of all (input, output and hidden) neurons,P in x is a state of only input neurons,P out x is a state of only output neurons,ŵ is a weight matrix between all pairs of neurons, b is a bias vector for all neurons, f (y) is an activation map and H(x, b,ŵ) is a loss function. A simple example of a loss function is the boundary loss (2.11) which is known to work very well in a supervised learning. Unfortunately, the boundary loss cannot be used in unsupervised systems where the output subspace is empty, V out = ∅, and thus the boundary loss is always zero, H = H ∂ = 0. 2 For this reason, in unsupervised systems (beyond auto-encoders) we must consider other loss functions which are, perhaps, more general than the boundary loss.
A key observation is that in equation (2.11) the boundary loss was due to a mismatch in the output conditions or (together with input conditions) in the boundary conditions, i.e.
(P in +P out )x = x ∂ , but the fixed point equation (2.9) was satisfied exactly. Alternatively, we can assume that the boundary conditions are satisfied exactly (P in +P out )x = x ∂ , but the fixed point equation is only approximate. Then we can define a bulk loss (as opposed to the boundary loss) as a sum of squares of errors in the fixed point equation, i.e.
wherex is the value of x at the minima of H(x, b,ŵ) subject to boundary conditions (P in + This is the simplest bulk loss 3 which is still zero for unsupervised feedforward neural networks, but can be easily generalized to functions which can be used in both supervised and unsupervised learning. The main idea is that, from the point of view of an individual neuron, a (more general) learning objective can be modeled as a minimization of a local error and at the same time a maximization of a local objective. It is convenient to think of the local error as a supervised quantity (e.g. (x i − f i ( j w ijxj + b i )) 2 in the bulk loss function (3.1)) and of the local objective as a (yet to be defined) unsupervised quantity. Then even if the local error is already at its minima (as is always the case for unsupervised feedforward neural networks) there is still another quantity which needs to be extremized, i.e. the local objective. This does not mean that the inclusion of the local objective would only benefit an unsupervised learning. Once an appropriate local objective is identified it can be incorporated into a (bulk or boundary) loss function to improve the convergence of a learning algorithm.
For example, the local objective might be a binary classification of an incoming signal j w ijxj +p j and then the values ofx i closer to lower-and upper-bounds should be rewarded and values in-between should be penalized. Such a classification objective can always be modeled with an appropriately chosen "potential" term for each neuron. For example if then the bulk loss function can be defined as By minimizing this loss function we accomplish both tasks: the minimization of the local error and maximization of the local objective (in this case the binary classification objective). Note that the "tachyonic potential" (3.3) does not lead to any runaway solutions if the range ofx i is bounded by the range of the activation function. For example, if the activation function is f (y) = tanh(y) then x i ∈ (−1, 1). More generally, any two (or more) neurons might have a common objective and then the potential term must also include "interactions", i.e.
In either case, according to (3.2), the corresponding learning objective remains the same, i.e. we must adjustŵ and b in such a way that for a given set of boundary condition (P in +P out )x = x ∂ the bulk loss function is minimized. What is, however, different is that the bulk loss function H(x, b,ŵ) is now given by (3.4) which contains both a local error (a supervised or a kinetic term) and a local objective (an unsupervised or a potential term). As a result, the corresponding bulk loss function is well-defined and (generically) non-zero for both supervised and unsupervised systems. Unfortunately, a solution of equation (3.2) is difficult to obtain exactly and so the statistical methods must be employed.

Statistical ensembles
There are basically two ways to proceed: experimental (based on numerics) or theoretical (based on statistics). We will start with statistical approach as it might assist us in numerical searches. For starters, consider a statistical ensemble of boundary conditions or, more precisely, a probability distribution p ∂ (x ∂ ) over components of the state vector in the boundary subspace x ∂ = (P in +P out )x. Such a distribution can, for example, be extracted from a training dataset. Then, instead of minimizing a loss function for individual boundary data, the learning objective could be to minimize an ensemble-averaged loss function, i.e.
where N ∂ ≤ N is the dimensionality of the boundary subspace. If we now extend the probability distribution into the bulk by defining wherex is given by (3.2), then the ensemble-averaged loss function is simply Of course, all that we did is moved the difficulty of calculatingx into p 0 (x), but that does not solve the main problem. It is still a computationally intensive task to calculate U 0 (b,ŵ) exactly and this is where statistical mechanics comes to rescue. The key idea is to replace the micro-canonical-type ensemble (4.2) with a canonical-type ensemble (4.10). Note that for sufficiently large systems (in our case large neural networks) one can often show that the two ensembles are equivalent (i.e. predictions are identical) but the canonical ensemble is much easier to handle analytically. Consider a statistical ensemble of neural networks, or a probability distribution p(x), over state vectors x. Let's say we do not know how the network was trained (i.e. which algorithm was used), but we do know that the ensemble-averaged bulk loss H(x, b,ŵ) was reduced to some fixed value, Then according to the principle of maximum entropy [10,11], the most reasonable distribution p(x) is a distribution which has the largest Shannon entropy subject to constraint (4.4). The maximization problem can be solved using the method of Lagrange multipliers. If we define a "Lagrangian" then at a maxima of L(p, β, ν) the variations with respect to p(x), β and ν must vanish Therefore, the maximum entropy distribution must be given by with Lagrange multipliers β and ν determined from the constraint (4.8) and normalization condition (4.9). In what follows we shall refer to the distribution (4.10) as a canonical ensemble.

Partition function
The partition function for the canonical ensemble (4.10) is defined as Z ≡ exp(1 + ν) and can be expressed as an integral over the state space where ... should remind us of any additional variables (e.g. m) which could determine the functional form of H. To calculate the partition function (5.1) for a bulk loss (3.4) we can approximate the integral with a Gaussian. This can be done by first expanding the activation function where the ensemble-averaged state vector is and a diagonal matrix of first derivatives of the activation function is Then to the first order in perturbation theory the bulk loss function is Next, we note that the domain of x is bounded by the range of the activation function. For example, if the activation function is f (x) = tanh(x) ∈ (−1, 1), then the partition function is The sharp boundaries can be approximated with a smooth Gaussian window function, i.e.
and then the overall partition function is a Gaussian and can be easily evaluated, The spectrum ofĜ is defined by an eigenvalue equation where λ i are the real eigenvalues and v i are the respected eigenvectors. Then the log of partition function is In the limit of a large number of neurons N , the average components are small a 2 i 1, the second term in (5.11) is subdominant in comparison to the first term and can be dropped. Then the partition function is given by (5.9) but without an exponential factor, i.e.
and the log of the partition function is Note, however, that this is a very rough estimate of the true partition function borne out of our statistical description, but the hope is that this approximation is rich enough to explain at least some aspects of machine learning.

Learning equilibrium
Given the partition function (5.1) the average bulk loss can be calculated by simple differentiation, where we have explicitly shown that U (β, b,ŵ) depends on the Lagrange multiplier β (or, what we shall call, an inverse temperature parameter). If the neural network was already trained for a very long time, then the weight matrix and the bias vector must be in a state which minimizes the average loss U (β, b,ŵ) and then its variations with respect toŵ and b must vanish, We shall call this state, a state of the learning equilibrium or just an equilibrium state. For a generic system, the degeneracy of an equilibrium state or the dimensionally of an equilibrium manifold (or the number of "Goldstone" modes) can be quite large, but is still much smaller than N . A very important property of an equilibrium, which follows from (6.2), is that the partition function must be a product of two terms or that the total free energy must decompose into a sum of two terms The first term is a familiar thermodynamic free energy and, as we shall argue in the following section, the second term is related to a complexity of the neural networks. However, the free energy obtained from (5.14) (with the local objectives parameter m set for simplicity to zero) is which does not in general decompose into a sum of two terms as in (6.4). This suggest that in an equilibrium some additional restrictions must be imposed on the eigenvalues λ i . One possibility (that we shall verify numerically) is that where N > is the number of eigenvalues λ i that are much greater than β −1 . Then the free energy can indeed be decomposed as in equation (6.4), and Recall that λ i 's are the eigenvectors ofĜ = Î −f ŵ T Î −f ŵ and, thus, λ i 's are functions of b andŵ.

Thermodynamics of learning
The total Shannon entropy of the canonical ensemble can be obtained from the canonical partition function (5.14), Just like the free energy, in a learning equilibrium (6.2), the entropy must also decompose into a sum of two terms, 2) The first term depends on only the inverse temperature parameter β and we shall refer to it as a thermodynamic entropy For the total free energy (6.6) it is given by As the learning progresses, the average loss, U (β), decreases, the temperature parameter, β −1 , decreases and, thus, according to (7.4) one might expect that the thermodynamic entropy, S 0 , should also decrease. However, it is not the thermodynamic entropy, S 0 , but the total Shannon entropy S (whose exponent describes accessible volume of the configuration space for x) should become smaller with learning. We shall call it the second law of learning (or perhaps the minus second law): Second Law of Learning: the total entropy of a learning system can never increase during learning and is constant in a learning equilibrium, In the long run the system is expected to approach an equilibrium state with the smallest possible total entropy S which corresponds to the lowest possible sum of the thermodynamic entropy, S 0 , and of the complexity function C(b,ŵ) that we shall discuss next. In a feedforward neural network the weight matrix,ŵ, is nilpotent (2.1) and, therefore, the eigenvalues of the operatorŵ are all zeros. This also implies that the eigenvalues of operatorÎ −f ŵ are all ones, but that does not tell us much about the eigenvalues ofĜ. On the other hand, the determinant ofĜ is simply related to the determinant ofÎ −f ŵ, If we assume that near equilibrium N > does not change significantly, then a decrease in C(b,ŵ) implies that the largest eigenvalues λ i β −1 log(λ i ) of the operatorĜ must increase and at the same time, according to (7.8), the smallest eigenvalues λ i β −1 log(λ i ) must decrease. Therefore, as the learning progresses, the operatorĜ becomes better and better approximated by eigenvectors v i with only largest eigenvalues, i.e.
This is what one might call a dynamical dimensional reduction of the state space (a subspace of dimension N > < N is sufficient to describe a state vector x), or a reduction in the complexity of interconnections between neurons (a subspace of dimension N 2 > < N 2 is sufficient to describe a weight matrixŵ) or a complexity of computations that a given neural networks performs (a subspace of dimension N 2 > < N 2 is sufficient to describe a linearized evolution operator (Î −f ŵ)). For this reason we shall refer to C(b,ŵ) as a measure of complexity or just complexity.
For a system in an equilibrium at constant temperature T = 1/β, variations of the free energy must vanish, dF = 0, and then (6.4) takes the from of the first law, or what we shall call the first law of learning (or perhaps the minus first law): First Law of Learning: in an equilibrium the increment in the loss function is proportional to the increment in the thermodynamic entropy plus the increment in the complexity dU = T dS 0 + T dC. (7.11) This law describes how the learning system should behave in a learning equilibrium, but in order to understand which neural architectures would be the most optimal we must take one step further and consider a non-equilibrium dynamics of the learning system.

Optimal architecture
Consider a family of bias vectors b(Q) and weight matricesŵ(Q) parametrized by dynamical parameters Q k 's where k ∈ (1, ..., K). Typically the number of parameters K is much smaller than N + N 2 (i.e. the number of parameters required to describe a generic vector b and a generic matrixŵ) and the art of designing a neural architecture is to come up with functions b(Q) andŵ(Q) which are most efficient in finding solutions. To make the statement more quantitative, consider an ensemble of neural networks described by a probability distribution p(β, Q) which evolves with "time" β according to where the parameters Q k 's evolve in the direction which maximizes the free energy The Shannon entropy of the distribution p(β, Q) with continuous variables Q (not to confuse with entropy S(β, b,ŵ) defined in the previous section) is, where M is a fixed normalization parameter. Large the entropy S(β), larger the accessible volume of the configuration space exp(S(β)), and therefore larger the rate with which new solutions for b andŵ can be found. Then an optimal architecture (describe by b(Q) and w(Q)) is the one for which the entropy destruction is minimized or, equivalently, the entropy production is maximized. We shall call it the principle of the minimum entropy destruction: Principle of Minimum Entropy Destruction: The path taken by an optimal learning system is the one for which the entropy destruction is minimized (or the entropy production is maximized).
Note that the principle is the opposite of the minimum entropy production principle [12,13] that is often used in context of non-equilibrium thermodynamics, but is consistent with the stationary entropy production principle that was recently used in context of emergent quantum mechanics [14] and emergent gravity [15]. In context of the learning systems, a useful expression for the entropy production can be obtained from (8.1), (8.2) and (8.3), where we assumed that p vanishes at the boundary and so the integrations by parts can be performed. If we choose the normalization parameter not to be too large M p −1 ∼ 2 N , then − log(M p) 1 and the entropy production is This integral equation can be rewritten as a local differential equation for the entropy density σ(β, Q) = −p(β, Q) log (M p(β, Q)) .

(8.7)
Its solution is given by an exponential where σ(0, Q) is determined from initial conditions at some fixed β and the Laplacian operator is defined as usual ∆ ≡ k To better understand the optimization condition we can choose the parameters Q i 's to be given by eigenvalues of the operatorĜ, i.e. Q i = λ i . Then the free energy (5.14) (with m set for simplicity to zero) can be approximated as and its Laplacian as Evidently, the Laplacian is always negative and, thus, the entropy density (8.11) must always decrease with learning, Therefore, to improve learning efficiency we must choose an architecture such that the Laplacian is as close to zero as possible and the entropy destruction is minimized (i.e. the principle of minimum entropy destruction). If we, once again, split all of the eigenvalues into large and small, then the Laplacian is approximately given by the number N < of smallest eigenvalues Note that while the largest eigenvalues λ i β −1 are responsible for reducing complexity of the already obtained solutions (6.7), the smallest eigenvalues λ i β −1 are responsible for searching for new solutions.

Deep vs. shallow
We are now ready to tackle one of the biggest mysteries of machine learning. Why do deep neural networks perform so well? We believe the answer is hidden in the free energy F . As we have argued in the previous section (8.11) the Laplacian ∆F describes the rate with which the entropy density σ decays and by minimizing, we maximize the efficiently of a neural network to find solutions (i.e. the principle of minimum entropy destruction). To solve the minimization problem, we can consider a neural network with a small fraction γ 1 of eigenvalues λ i ∼ λ and a larger fraction 1 − γ of large eigenvalues at λ i ∼ λ In this phase only a decreasing fraction, γ < (β + 1) −2 , of the small eigenvalues, log λ < 3 log − 1 2 + 1 4 + 1 β , can still move to even smaller values, but the motion is terminated when the smallest values, log λ 3 log − 1 2 + 1 4 + 1 β , reach the plateau. However, since detĜ = 1, it is expected that the sum of the largest eigenvalues would continue to grow and This is what happens in an optimal system, however, by enforcing an architecture on a neural network (e.g. deep or shallow) we impose additional constraints on the free energy (and on its Laplacian) which limits the ability of a network to explore the space of solutions. For example, in a feedforward neural network with many input neurons and few output/hidden neurons, most of the eigenvalues are set to log λ i = 0 and only a small fraction of eigenvalues is free to move to smaller and larger values. Clearly, the larger the number of the dynamical eigenvalues a neural architecture has, the better it is for learning. Therefore, in order to compare "apples to apples" we must first fix the number of the dynamical eigenvalues, and then look for an architecture which is flexible enough to support a skewed distribution of log λ i 's. As we have argued in the previous paragraph, want we want is to be able to start with a single peak with all eigenvalues at log λ i ∼ 0 and then to gradually grow a second peak with the largest eigenvalues log λ i 0 > − log β which is to be balanced by the smallest eigenvalues log λ i < − log β that are dragged to smaller and smaller values. With this respect, a better architecture is the one which supports a larger variance µ 2 ≡ T r logĜ 2 (9.5) and a more skewed distribution or a more negative (9.6) In a feedforward neural network the weight matrix is nilpotent (2.1) as well as a product of the weight matrix,ŵ and a diagonal matrix of first derivatives,f , i.e.
where L is the number of layers. For starters, consider a vary shallow network with no hidden layers (i.e. L = 2) and thus ŵ Tf 2 = f ŵ 2 = 0. Then there must exist functions F 1 (x) and F 2 (x) such that and, therefore, would have a product of unequal number of F 1 (f ŵŵ Tf )f ŵ and F 2 (ŵ Tf f ŵ)ŵ Tf terms which must be traceless. As we shall see in the next section, even with a single hidden layer (i.e. L = 3) the second powers of operatorsf ŵ andŵ Tf are very small and the skewness is still very small µ 3 ≈ 0. What this means is that the effective number of dynamical eigenvalues is half of what it would have been if all eigenvalues were free to move without having to respect the symmetry of the distribution. However, as we keep adding more hidden layers the skewness of distribution grows larger, the eigenvalues become less constrained and the efficiency of learning is greatly improved. This might be why the deep learning is so efficient: hidden layers are essential for larger skewness µ 3 and, as a result, for less negative Laplacian ∆F (and a slower decay of the entropy density σ) which we claim is necessary for efficient learning.

Numerical experiments
A direct numerical calculation of the distribution p(x) is a computationally intensive task, but the main advantage of our statistical description is that the canonical ensemble (4.10) can be viewed as purely phenomenological object. Then the main problem should be to come up with a model of the bulk loss function, H(x, b,ŵ), which best describes the canonical ensemble and, consequently, the canonical partition function, Z(β, b,ŵ), and other thermodynamic quantities. On the other hand, the analysis of the preceding sections already suggests certain forms of the bulk loss function and of the partition function which we can easily verify numerically. In this section, we will check to what extent a feedforward neural network can be modeled by the bulk loss function without local objectives (i.e. (3.4) with m = 0) or with the corresponding thermodynamic quantities: (a) average bulk loss (estimated in (7.5)), complexity function (estimated in (6.7)), (c) thermodynamic entropy (estimated in (7.4)), where N > is the number of eigenvalues λ i 's much larger than β −1 . In addition, we will verify the expected dynamics of the eigenvalues and the anticipated dependence of the variance (9.5) and skewness (9.6) parameters on the performance of the neural networks obtained in the previous sections. All of the numerical experiments were carried out using the TensorFlow Python library [16] and MNIST database of handwritten images [17]. Unfortunately, in the TensorFlow library the hidden layers are not dynamical and must be set prior to training. Nevertheless, we were able to obtain the desired results by running two different programs: the first one with two hidden layers (or what we shall call a "deep" neural network) and one with a single hidden layer (or what we shall call a "shallow" neural network). In the deep network we used an input layer with 784 neurons, the first hidden layer with 40 neurons, the second hidden layer with 20 neurons and the output layer with 10 neurons; and in the shallow network we used the same number of neurons on the input and output layers (i.e. 784 and 10), but only a single hidden layer with 60 neurons. Altogether there are N = 784 + 40 + 20 + 10 = 784 + 60 + 10 = 854 neurons in each neural network and so the state vectors x and the bias vectors b are 854dimensional vectors. The weight matrixŵ has 854 × 854 components w ij , but most of them are zero due to the predetermined architecture of hidden layers. The input layers represent a handwritten image of a number from 0 to 9 which is passed to 28 × 28 = 784 input neurons. One of the 10 output neurons is to be activated only if the corresponding number is on the image. The activation function on all neurons is f (y) = tanh(y) and so the diagonal matrix f has diagonal elements given by The training was carried out using the method of stochastic gradient descent for 30, 000 epochs with 6, 000 samples in the training dataset. On Fig. 2 we plot (log of the ensemble-averaged) bulk loss U = H (blue line) and (log of the ensemble-averaged) boundary loss U ∂ = H ∂ from the deep neural network. As expected, the bulk loss remains few orders in magnitude smaller than the boundary loss, but both functions decrease with time. For training the neural network we used the (more familiar, but less general) boundary loss function H ∂ , but a similar result is expected even if the (less familiar, but more general) bulk loss function H would have been used instead. On Fig. 3 we plot the bulk loss log U vs. the boundary loss log U ∂ from the same deep network. Note that at late times the bulk loss keeps decreasing while the boundary loss remains almost constant (inside of red oval on Fig. 3). This behaviors continue for about 10, 000 (!) training epochs until the network finally finds a better solution and the boundary loss jumps to a smaller value (inside of green oval on Fig. 3). And then essentially the same behavior continues, i.e. bulk loss decreases monotonically, but boundary loss makes sudden jumps. There is a simple explanation of the phenomena. The boundary loss is stuck in a saddle point with a large number of nearly flat directions for a very long time before it finds a way out. As the learning progress the system keeps moving along the flat directions and that does not reduce the boundary loss considerably, but the bulk loss and, as we shall see shortly, complexity keep decreasing with roughly the same pace. This shows that the bulk loss function has a lot fewer flat directions and with this respect a much better loss function.
In addition, as we have argued in Sec. 3, it can be defined beyond supervised systems, e.g. for unsupervised learning. On Figs. 4 and 5 we plot histograms of the dynamical eigenvalues of operator logĜ from the, respectively, shallow and deep networks. As expected, most of the eigenvalues remain near origin and only a fraction of eigenvalues is displaced significantly from log λ i ∼ 0. The distribution of the eigenvalues in the shallow network is almost completely symmetric, the skewness after 10000 epochs is µ 3 ≈ −2.7 × 10 −10 , the learning efficiency is suppressed and, as a result, the variance remains small µ 2 ≈ 1.84. In contrast, the distribution of eigenvalues in the deep network is not symmetric, the skewness after 10000 epochs is more negative µ 3 ≈ −1.7, the learning efficiency is enhanced and the variance grows larger µ 2 ≈ 2.44. There is a clear gap between the smallest and larger eigenvalues (marked by red arrows on Fig. 5) which can be seen after 10000 epochs at log λ ≈ −3.0, after 1000 epochs at log λ ≈ −2.5 and may be even after 100 epochs at log λ ≈ −2.3. This gap is expected to be at unstable maxima defined by equation (9.4) which implies that after 100 epochs β ≈ 1.37, after 1000 epochs β ≈ 1.40 and after 10000 epochs β ≈ 1.47. In the shallow network the gap cannot be clearly identified since it is closer to the origin and the inverse temperature parameter β is smaller. Also note that while the smallest eigenvalues move to smaller values, to satisfy (7.8) the largest eigenvalues must move to larger values. Recall, that the largest eigenvalues describe the complexity of the network (10.2) and the increase of the largest eigenvalues represents a decrease in the complexity of the network.
In the previous paragraph, we estimated the values of β by identifying a gap (marked by red arrows on Figs. 5) between the smallest eigenvalues and the rest. However, as one can see from Fig. 5 the smallest eigenvalues log λ i < − log β keep moving to smaller values together with − log β. This suggests that (instead of using equation (10.2)) we can try to define an approximate complexity by a sum of a fixed number of the largest eigenvalues, where it is assumed that the eigenvalues are ordered λ 1 ≥ λ 2 ≥ ... ≥ λ N . On Fig. 6 we plot    two (limiting) complexities by summing over twenty largest eigenvalues, C 20 (b,ŵ), (blue line) and by summing over all but twenty smallest eigenvalues, C 834 (b,ŵ), (red line). Evidently, up to an additive constant, the behavior of both curves is similar and so either one (or anyone in-between) can be used to study a relaxation of the system towards equilibrium. On Fig.  7 we plot the same complexities, but as a function of the bulk loss log U . Both functions are nearly linear with slopes of order one: C 20 (b,ŵ) ≈ 769 + 1.25 log U for the blue line and C 834 (b,ŵ) ≈ 772 + 1.41 log U for the red line. In fact the linear dependance is in agreement with the second law of learning (7.6) which states that the total entropy must decay with learning. Recall the total entropy is a sum of the complexity (10.2) and thermodynamic entropy (10.3) (which scales linearly with log U whenever N > remains constant) and so it is expected that both quantities would decay with roughly the same rate.
There are certainly many other numerical experiments that we could have done, but it should already be evident that the statistical description developed in the paper might actually shed light on what is happening behind scenes in machine learning. We now switch to the most speculative part of the paper by asking if the entire universe on its most fundamental level could be described by a neural network.

Entropic mechanics
Quantum mechanics is a remarkably successful paradigm for modeling physical phenomena observed on a wide range of scales ranging from 10 −19 meters (i.e. high-energy experiments) to 10 +26 meters (i.e. cosmological observations.) The paradigm is so successful that it is widely believed that on the most fundamental level the entire universe is governed by the rules of quantum mechanics and even gravity should somehow emerge from it. This is known as the problem of quantum gravity that so far has not been solved, but some progress has been made in context of AdS/CFT correspondence [21][22][23], emergent gravity [15,24,25], quantum entanglement [26][27][28] and holographic complexity [18][19][20]. 4 Although extremely important, the problem of quantum gravity is not the only problem with quantum mechanics. The quantum framework also starts to fall apart with introduction of observers. Everything seems to work very well when observers are kept outside of a quantum system, but it is far less clear how to described macroscopic observers in a quantum system such as the universe itself. The realization of the problem triggered an ongoing debate on the interpretations of quantum mechanics, which remains unsettled to this day. On one side of the debate, there is an increasing number of proponents of the many-worlds interpretation claiming that everything 4 There seems to be an interesting connection between thermodynamics of learning systems (see Sec. 7) and thermodynamics of holographic complexity. In Ref. [19] the authors showed that the quantum computational complexity of a holographic states on the anti-de Sitter boundary is dual to an action over a Wheeler-de Witt patch in the bulk. In the learning systems, the complexity function C(b,ŵ) is the quantity which best describes the complexity of a boundary state (for example, in a feedforward network C(b,ŵ) is the complexity of a neural network which maps the input boundary data to output boundary data with the smallest error) and a thermodynamic free energy A(β) is the quantity which best describes the state of the local degrees of freedom in the bulk. According to the First Law of learning (7.11) the two quantities are in fact related if not in an absolute sense, then at least in a relative sense, i.e. dA = dU − T dS0 = T dC.
(11.1) Also note that even the Second Law of learning (7.6) is somewhat related to the recently proposed Second Law of complexity of quantum states [20] with the main difference that during learning the complexity C decreases not on its own, but together with the thermodynamic entropy S0. This suggests that there might be a deep connection between learning systems and holographic systems that we still have not figured out.
in the universe (including observers) must be governed by the Schrödinger equation [29], but then it is not clear how classical probabilities would emerge. One the other side of the debate, there are proponents of the hidden variables theories [30], but there it is also unclear what is the role of the wave-function in a purely statistical system. It is important to emphasize that a working definition of observers is necessary not only for settling some philosophical debates, but for understanding the results of real physical experiments and cosmological observations. In particular, a self-content, paradoxes-free definition of observers would allow us to understand the significance of Bell's inequalities [31] and to make probabilistic prediction in cosmology [32].
To resolve the apparent inconsistency (or incompleteness) in our description of the physical world, we shall entertain a (not so new) idea of having a more fundamental theory than quantum mechanics. A working hypothesis is that on the most fundamental level the dynamics of the entire universe is described by a microscopic neural network. If correct, then not only macroscopic observers should emerge from the microscopic neural network (see, for example, [33]), but, more importantly, the equations of quantum mechanics and general relativity should correctly describe an emergent dynamics of the corresponding learning system. Our main goal in this section is to show that quantum mechanics (or, more precisely, Schrödinger equation) indeed provides a good description of an optimal neural network not too far from an equilibrium and we postpone the discussion of general relativity until the next section. Note that most of the results in the remainder of this section were originally obtained in Ref. [14], but in a slightly different context and with slightly different assumptions.
Recall that equation (8.1) describes evolution of a probability distribution p(β, Q) where Q k 's (for k ∈ (1, ..., K)) parametrize the weight matrixŵ(Q) and the bias vector b(Q). The equation works well away from an equilibrium, but at an equilibrium the first derivatives of the free energy vanish dQ k dβ = α ∂F ∂Q k ∼ 0 (11.2) and the dominant contribution to the entropy production comes from "diffusion". Then we can study evolution of the system along the equilibrium manifold of dimension K −K K (i.e. the number of "Goldstone" modes is K −K) defined by a (degenerate) maxima of the free energy F (Q). More formally, the equilibrium manifold can be defined by a set of equations Θk(Q) = 0 (11.3) wherek ∈ (1, ...,K). These equations are satisfied only along the maxima of the free energy, but in our statistical description we shall only insist that they are satisfied on average, i.e. Note that instead of studying the dynamics in β we switched to a new parameter t (e.g. the number of training epochs) for which the dynamics can be described by a Fokker-Planck equation with a time-independent diffusions coefficient D. For simplicity, we also assume that the diffusion coefficient D does not depend on Q and so no factor ordering problems arise. Then the main problem is to solve for p(t, Q) subject to constraints (11.4) which is exactly the type of problems considered in Ref. [14]. There it was shown that the constrained dynamics can be described by an approximate Schrödinger equation, with Lagrange multipliers playing the role of phases, if the number of the constraints is large (i.e.K ∼ K) and the so-called principle of stationary entropy production is satisfied: Principle of Stationary Entropy Production: The path taken by the system is the one for which the entropy production is stationary.
However, in Sec. 8 we argued that in an optimal architecture a closely related principle (of minimum entropy destruction) should be satisfied and, therefore, all that we need to assume is that the microscopic neural network has an optimal architecture. The optimization problem can then be solved by combining into a single functional S(p, Φ) two terms: the total entropy production from time t = 0 to time t = T and constraints (11.4) imposed by time-dependent Lagrange multipliers Φk(t)'s, i.e.
After integrating by parts and ignoring the boundary terms (assuming that p vanishes at the boundaries of integration) we obtain where the wave-function is defined as Evidently, upon varying (11.7) we arrive at a Schrödinger equation whose solutions extremize the functional S(p, Φ) or, in other words, describe a trajectory in the configuration space which minimizes entropy destruction. (See Ref. [14] for details). Therefore, we conclude that quantum mechanics (or at least Schrödinger equation) can in fact emerge from a microscopic neural network with an optimal architecture near equilibrium.
-25 -Now we turn to gravity. 5 If the microscopic neural network has an optimal architecture then it is still the case that the principle of minimum entropy destruction (or, the more general, principle of stationary entropy production) should be satisfied and so the relevant quantity to extremize is still (11.6), which contains both entropy production (first term) and constraints (second term). What is, however, different is that we must allow for the larger system to be further away from a learning equilibrium and so the number of constraintsK can be much smaller than the number of parameters K. In other words, the dimensionality of the equilibrium manifold (or, if you wish, the number of symmetries) remains very high. This implies that the probability distribution p(t, Q) should have a higher degree of symmetry and thus can be parametrized p(ĝ, Q) with a (relatively) small number of auxiliary parameterŝ g(t). For example, if the probability distribution is parametrized by Gaussian distributions, p(ĝ(t), Q) ∝ exp − k,k g kk (t)Q k Q k , then the optimization problem is to findĝ(t) and Φ(t) which extremize (11.6), The first term represents the entropy production and depends only onĝ and the second term represents constraints and depends on bothĝ and Lagrange multipliers Φ. This is what might be happening on a microscopic level, but our task in this section is to only develop a phenomenological model gravity based on what we already know about general relativity. In gravitational theories the dynamical degrees of freedom are described by a metric tensor g µν (x) and other fields Φ(x) all of which are functions of four spacetime coordinates x = (x 0 , x 1 , x 2 , x 3 ). From that perspective a better parametrization of the probability distribution is given by g µν (x) and of the Lagrange multipliers by Φ(x). Then (12.1) can be expressed phenomenologically as S(g µν , Φ) = d D+1 x |g| − 1 2κ R(g µν (x)) + Λ + d D+1 x |g| L(g µν (x), Φ(x)) (12 .2) where, as before, the first term represents the entropy production and the second term represents the constraints. Several comments are in order. First of all, in equation (12.1) the parameterĝ was a finite dimensional matrix, but in equation (12.2) the parameter g µν (x) is a continuous function and so at best it is an approximate mapping which should break down at some UV scale (e.g. Planck scale). Secondly, even if the metric tensor g µν (x) is defined only on some very fine-grained lattice, there is a sense of distance between gravitational degrees of freedom which is not present in a neural network. This would be true for a general learning system, but we expect that for a clever choice of local objectives the weight matrixŵ (which is also an adjacency matrix describing the strength of connections between neurons) could be attracted towards a three-dimensional lattice (see [35] for a possible mechanism) and then the space-time locality would emerge. Thirdly, any lattice-like structure would break a general covariance which is known to be a very precise symmetry of nature. Therefore, we must 5 As far as we know the only attempt to describe gravity in terms of quantum neural networks was made in Ref. [34]. However, the main difference with our approach is that the microscopic neural network considered here is not quantum, but statistical. On the other hand, as we have argued in Sec. 11, the quantum behavior of the microscopic neural network is expected and so it is possible that the two systems are equivalent. also assume that the local objectives of neurons are such that the general covariance would emerge on large scales (see [36] for a possible mechanism), but exactly how this might work is presently unknown.