Strong ensemble nonequivalence in systems with local constraints

The asymptotic equivalence of canonical and microcanonical ensembles is a central concept in statistical physics, with important consequences for both theoretical research and practical applications. However, this property breaks down under certain circumstances. The most studied violation of ensemble equivalence requires phase transitions, in which case it has a `restricted' (i.e. confined to a certain region in parameter space) but `strong' (i.e. characterized by a difference between the entropies of the two ensembles that is of the same order as the entropies themselves) form. However, recent research on networks has shown that the presence of an extensive number of local constraints can lead to ensemble nonequivalence even in the absence of phase transitions. This occurs in a `weak' (i.e. leading to a subleading entropy difference) but remarkably `unrestricted' (i.e. valid in the entire parameter space) form. Here we look for more general manifestations of ensemble nonequivalence in arbitrary ensembles of matrices with given margins. These models have widespread applications in the study of spatially heterogeneous and/or temporally nonstationary systems, with consequences for the analysis of multivariate financial and neural time-series, multi-platform social activity, gene expression profiles and other Big Data. We confirm that ensemble nonequivalence appears in `unrestricted' form throughout the entire parameter space due to the extensivity of local constraints. Surprisingly, at the same time it can also exhibit the `strong' form. This novel, simultaneously `strong and unrestricted' form of nonequivalence is very robust and imposes a principled choice of the ensemble. We calculate the proper mathematical quantities to be used in real-world applications.


I. INTRODUCTION
In statistical physics, systems with different constraints can be described by different ensembles. For example, systems with fixed energy can be described by the microcanonical ensemble, where all microscopic configurations have precisely the same value of the energy and are equiprobable, thereby modelling large isolated systems. In this case, the energy is treated as a 'hard' constraint enforced separately on each configuration. By contrast, systems with fixed temperature (which is the 'dual' thermodynamic quantity conjugate to the energy) can be described by the canonical ensemble, where individual microscopic configurations can have different values of the energy and are assigned different probabilities, but in such a way that the average value of the energy coincides with the one defining the corresponding microcanonical ensemble [1]. This ensemble represents systems that can exchange energy with their environment, and the energy is in fact treated as a 'soft' constraint which is enforced only as an ensemble average.
When the size of the system is finite, the two ensembles are necessarily different. However, in the simplest and most traditional situation, the microcanonical description as a function of the energy becomes equivalent to the canonical description as a function of the temperature in the thermodynamic limit (i.e., when the number of particles in the system tends to infinity). This phenomenon is called ensemble equivalence (EE) and is a basic concept in statistical mechanics, already established by Gibbs [2]. The property of EE justifies the replacement of the (typically unfeasible) asymptotic calculations in the microcanonical ensemble with the corresponding (much easier) calculations in the canonical ensemble, i.e. it allows one to choose the ensemble based on mathematical convenience.
However, over the past decades, the breakdown of EE has been observed in various physical systems, including models of gravitation, fluid turbulence, quantum phase separation, and networks [3][4][5]. When a system exhibits ensemble nonequivalence (EN), the microcanonical description can no longer be replaced by the canonical description in the thermodynamic limit. In this situation, many assumptions and calculations that are based on EE in statistical mechanics no longer hold. Thus checking for the breaking of EE is important for both practical applications and theoretical research. Quantitatively, EN can be defined as a nonvanishing relative entropy density between the microcanonical and canonical probability distributions of microscopic configurations [5,6]. This is equivalent to a nonvanishing difference between the canonical and microcanonical entropy densities [7]. Technically, this is the so-called measure-level EN, which (under mild assumptions) has been shown to coincide with other definitions as well [6]. Importantly, the traditional criterion for EE based on the vanishing of the relative canonical fluctuations of the constraints has been recently found to break down when the latter are local in nature and extensive in number [8].
Indeed, for the most studied systems in statistical physics, the number of constraints defining the ensembles of interest is finite. Traditional physical examples are global constraints such as the total energy and the total number of particles. Non-physical examples of systems under global constraints have also been considered, e.g. networks with given total numbers of edges and triangles [9]. In order to observe the breakdown of EE in these systems, one typically needs to introduce long-range interactions implying the non-additivity of the energy and possibly associated with the onset of phase transitions [10] (in the example of networks, the underlying mechanism is a sort of 'frustration' in the simultaneous realizability of the desired numbers of edges and triangles [9]). In this form of EN, the relative entropy between canonical and microcanonical ensembles is of the same order as the canonical entropy itself [6,9]. This is what we will refer to as a 'strong' form of EN. At the same time, this form of EN is also 'restricted', because it is confined to a selected (e.g. critical) region of the space of parameters. Outside this region, EE is restored.
Recently, a new manifestation of EN has been observed in a different class of network ensembles, where a constraint is enforced on the degree (number of links) and/or the strength (total weight of all incident links) of each node [5,8,11,12]: unlike systems with global constraints, in these models the number of constraints grows linearly in the number of nodes. This crucial difference implies that, at variance with the more 'traditional' situation described above, the onset of EN in this class of models is completely unrelated to phase transitions and is instead the result of the presence of an extensive number of local constraints [7,13,14]. This situation is far less studied, because systems with local constraints are not the traditional focus of statistical physics and have attracted attention only recently as models of complex systems with built-in spatial heterogeneity [15] and/or temporal non-stationarity [16]. In this different form of EN, the relative entropy between microcanonical and canonical ensembles is, at least for the cases studied so far, of lower order (i.e. subleading) with respect to the canonical entropy. For this reason, we may refer to this situation as a 'weak' (i.e. weaker than the one found in the presence of phase transitions) form of EN. However, this form of EN is 'unrestricted', precisely because it is not confined to specific values of the control parameters and holds in the entire parameter space. Rather than a property of a phase (or a phase boundary), in this case EN appears to be an intrinsic property of the system itself. In these models, no parameter value can restore EE.
The above results indicate that, so far, EN has manifested itself either in a 'strong but restricted' form (under a finite number of global constraints, but in the presence of phase transitions), or in an 'unrestricted but weak' form (under an extensive number of local constraints, but without phase transitions). Clearly, a number of questions remain open. How general is the manifestation of EN under local constraints, both in terms of the underlying mechanism and in terms of the strength of the resulting effect? Besides networks, can the breaking of EE be observed in additional systems characterized by local constraints? If so, do these systems necessarily exhibit only the weak form of EN, or can the strong form appear as well? Finally, is there a (possibly modified) way to exploit the canonical ensemble in order to bypass the challenge of unfeasible microcanonical calculations even when EE breaks down, i.e. even when the two ensembles can no longer be treated interchangeably according to mathematical convenience?
In this paper, we will address these problems by exploring the effects of the presence of an extensive number of local constraints on more general ensembles than the ones that have been considered so far to model random networks with given node degrees [5,11,13,17]. In particular, we consider the general setting where each of the n units of the system has a number m of 'state variables' (or 'degrees of freedom'), and where constraints are defined as sums over these state variables. Surprisingly, besides confirming the onset of an unrestricted form of EN in the thermodynamic limit where n diverges, we also find its simultaneous manifestation in strong form. This happens when each element of the system retains only a finite number m of degrees of freedom in the thermodynamic limit. For brevity, we will denote this situation as the 'strong and unrestricted' form of EN. To the best of our knowledge, this finding provides the first evidence that EN, even in its strong form, does not need phase transitions and can appear in the entire parameter space as an intrinsic property of the system, if the latter is subject to an extensive number of local constraints. This simultaneously 'strong and unrestricted' form of EN is the most robust among the ones studied so far. Spatial heterogeneity and temporal non-stationarity are simple candidate mechanisms that can lead to this phenomenon.
To emphasize the general and important consequences of this form of EN for a diverse range of practical applications, we consider generic ensembles of random matrices with fixed margins. These ensembles, which include matrices with 0/1 (or equivalently ±1) and nonnegative integer entries subject to global or local constraints, arise for instance in studies of multi-cell gene expression profiles [18], multiplex (online) social activity [19], multi-channel communication systems [20], complex networks [21], and multivariate time series in finance [16], neuroscience [22] or other disciplines. Our results imply that, in many practical situations, the assumption of EE is incorrect and leads to mathematically wrong conclusions. For the benefit of the aforementioned applications, we compensate for the 'disconnection' between the two ensembles by calculating explicitly the correct canonical and microcanonical quantities of interest via a generalized relationship that is either analytically computable or asymptotically determined by the covariance matrix of the constraints in the canonical ensemble. These calculations represent a practical tool for properly dealing with the consequences of EN in all real-world situations.

A. Matrix ensembles
A discrete n × m matrix ensemble is a set 𝒢 of available configurations for an n × m integer-valued matrix G, endowed with a suitably chosen probability distribution P(G) over such configurations, such that Σ_{G∈𝒢} P(G) = 1. An entry of the matrix is denoted by g_ij (with 1 ≤ i ≤ n, 1 ≤ j ≤ m). We distinguish two main cases: the binary case, where g_ij takes one of the two values {0, 1}, and the weighted case, where g_ij takes a value in the set {0, 1, 2, . . . } of non-negative integers. The number n of rows in each matrix represents the number of elements (i.e. the size) of the system being modelled. The number m of columns represents instead the number of state variables, or degrees of freedom, of each element.
In general, each matrix G can represent one of the possible states of a (large) real-world system. In a statistical description, one no longer focuses on the specific microstate G* (which is not empirically accessible), but rather on the macrostate defined by the collection of microscopic configurations compatible with the empirical value C* ≡ C(G*) of a certain observable quantity C(G) playing the role of a constraint. The choice of C(G) determines the probability distribution P(G) over 𝒢 conditional on our knowledge of C*. In other words, it determines how our estimate of the microstate of the system concentrates around the compatible configurations once we observe C*. Intuitively, before anything is observed, P(G) is uniform on 𝒢.
In ordinary statistical physics, the quantity C(G) is typically scalar (e.g. the total energy) or at most low-dimensional (e.g. a vector containing the total energy and the total number of particles), reflecting a few global conservation laws applying to a large homogeneous system at thermodynamic equilibrium. However, in models of complex systems C(G) can be high-dimensional, as it may encode a large number of local constraints reflecting separate conservation laws imposed by spatial heterogeneity and/or temporal non-stationarity. For instance, if G* is the observed configuration of a complex network with n nodes (i.e. the empirical n × n adjacency matrix), it is well known that the knowledge of purely global properties such as the overall number of links is insufficient to produce a statistical ensemble of networks with properties similar to those found in G*. Indeed, enforcing only the total number of links produces the popular Erdős-Rényi random graph model, whose topological properties are far too homogeneous compared with those of real-world networks. By contrast, if the number of links of each node is enforced separately (as in the so-called configuration model), the resulting ensemble of graphs is found to successfully replicate many higher-order empirical topological properties [21]. As another example, if G* represents a set of synchronous time series produced by the n components of a non-stationary system observed over m time steps, then the statistical properties of these time series will change over time. As a result, overall time-independent constraints will not be enough to produce ensembles of multivariate time series with properties close to those of G*, and time-dependent (i.e. local in time) constraints will in general be needed [16].
In our setting, we consider the general case where the distribution P(G) defining the (binary or weighted) matrix ensemble is induced by a K-dimensional vector C(G) of constraints imposed on the matrices. We will assume that the K constraints are all non-redundant, e.g. they are not trivial copies or linear combinations of each other [7]. We will consider both global and local constraints. As a global constraint we will consider the scalar quantity t(G) defined as the total sum of all the entries of the matrix G, i.e.

  t(G) = Σ_{i=1}^{n} Σ_{j=1}^{m} g_ij.

The number of constraints in this case is K = 1 and the 'empirical' value of t will be denoted as t* ≡ t(G*). As local constraints, we will consider two possibilities: one-sided local constraints and two-sided local constraints. A one-sided local constraint is the n-dimensional vector r(G) whose entry r_i(G) = Σ_{j=1}^{m} g_ij (i = 1, . . . , n) represents the sum of the entries of the matrix G along its i-th row. The number of constraints is in this case K = n and the empirical value of r will be denoted as r* ≡ r(G*). A two-sided local constraint is a pair of vectors (r(G), c(G)), where r(G) is still the n-dimensional vector of the n row sums of G, while c(G) is the m-dimensional vector of the m column sums of G, i.e. each entry c_j(G) = Σ_{i=1}^{n} g_ij (j = 1, . . . , m) is the sum of the entries of G along its j-th column. The number of constraints is therefore K = n + m and the empirical value of the pair (r, c) will be denoted as (r*, c*) ≡ (r(G*), c(G*)).

[Fig. 1 caption: The collection of all possible states of the system is the set of all such matrices. Typically, real-world systems exhibit strong heterogeneity or non-stationarity. This empirical fact implies that their possible states are not sufficiently characterized by the knowledge of a single global constraint (t: solid orange box). More informative ensembles can be constructed by specifying one-sided (r: solid blue boxes) or two-sided (r, c: solid blue and dashed red boxes) local constraints.]

A visual illustration of these constraints for possible data structures of practical interest is shown in Fig. 1.
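As a concrete illustration, the three types of constraints can be read off a data matrix with a few lines of NumPy. The 3 × 4 binary matrix below is a made-up toy example:

```python
import numpy as np

# Toy illustration of the three constraint types on a made-up 3x4 binary
# data matrix G* (n = 3 units, m = 4 state variables each).
G = np.array([[1, 0, 1, 1],
              [0, 1, 0, 0],
              [1, 1, 1, 0]])

t = G.sum()        # global constraint t(G*): total sum, K = 1
r = G.sum(axis=1)  # one-sided local constraints r(G*): row sums, K = n
c = G.sum(axis=0)  # column sums c(G*); the pair (r, c) is two-sided, K = n + m

print(t, r, c)  # 7 [3 1 3] [2 2 2 1]
```

Note that t = Σ_i r*_i = Σ_j c*_j, so the global constraint is implied by either local one; this is why the local ensembles are strictly more informative.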
Purely global constraints lead to completely homogeneous expectations for the entries of the matrices in the ensemble. This result follows intuitively from symmetry arguments, and will be confirmed explicitly in the specific cases considered later. By contrast, local constraints lead to different expectations for entries in different rows and/or columns. Since, as we mentioned above, real-world complex systems are generally very heterogeneous in space and/or time, the only models that can capture the main features of such systems are those constructed from (one- or two-sided) local constraints. This is very important because, as we will show, it is precisely in the presence of local constraints (of either type) that the property of EE breaks down. This result implies that spatial heterogeneity and/or temporal non-stationarity might be natural origins for the breaking of EE.

C. Soft constraints: canonical ensemble
Any constraint, whether global or local, can be enforced either as a soft constraint (canonical ensemble) or as a hard constraint (microcanonical ensemble). We start with the case of soft constraints, i.e. when one imposes that the ensemble average of C is fixed to the specific value C*. The functional form of the resulting canonical probability P_can over 𝒢 is found by maximizing Shannon's entropy functional

  S_n[P] ≡ −Σ_{G∈𝒢} P(G) ln P(G)    (2)

(where the subscript n indicates that the entropy is calculated for given n), subject to the condition ⟨C⟩ = C*. The result [23] of this constrained maximization problem is the parametric solution

  P_can(G|θ) = e^{−H(G,θ)} / Z(θ),    (3)

where θ is a vector of Lagrange multipliers coupled to the constraint C, the Hamiltonian H(G,θ) = θ · C(G) is a linear combination of the constraints, and the partition function Z(θ) = Σ_{G∈𝒢} e^{−H(G,θ)} is the normalization constant.
The numerical values of the canonical probability are found by setting

  P*_can(G) ≡ P_can(G|θ*),    (4)

where θ* is the unique parameter value that realizes the 'soft' constraint

  ⟨C⟩_{θ*} = C*,    (5)

where the symbol ⟨·⟩_θ denotes an ensemble average with respect to P_can(G|θ), i.e.

  ⟨X⟩_θ ≡ Σ_{G∈𝒢} X(G) P_can(G|θ).    (6)
Equivalently, the unique value θ* is the one that maximizes the log-likelihood function

  L*(θ) ≡ ln P_can(G*|θ) = −H(G*,θ) − ln Z(θ),    (7)

where G* is the empirical configuration, or equivalently any configuration that realizes the empirical constraint exactly, i.e. such that C(G*) = C* [23]. The uniqueness of θ* follows whenever L*(θ) can be differentiated at least twice [7], as we confirm below for all the models considered in this paper. Inserting Eq. (4) into Eq. (2), we obtain the value of the canonical entropy

  S*_can ≡ S_n[P*_can] = −Σ_{G∈𝒢} P*_can(G) ln P*_can(G) = −ln P*_can(G*),    (8)

where we have omitted the dependence on n to simplify the notation. The last equality is very useful, as it shows that S*_can can be calculated by simply evaluating P*_can(G) on the single configuration G* [7].
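For one-sided constraints on binary matrices, the above maximization is solvable in closed form: the canonical distribution factorizes into independent Bernoulli entries with row-wise probabilities p_i = e^{−θ_i}/(1 + e^{−θ_i}), and the soft constraint ⟨r_i⟩ = m p_i = r*_i gives p_i = r*_i/m. A minimal sketch (toy sizes and margins are made up):

```python
import numpy as np

# Sketch of the canonical ensemble for binary n x m matrices under one-sided
# (row-sum) constraints.  The Hamiltonian H = sum_i theta_i r_i(G) makes the
# entries independent Bernoulli variables with row-wise probabilities
# p_i = e^{-theta_i} / (1 + e^{-theta_i}), so the soft constraint
# <r_i> = m p_i = r*_i is solved analytically by p_i = r*_i / m.
# (Toy sizes; the margins r* are made up.)
n, m = 3, 4
r_star = np.array([3, 1, 3])

p = r_star / m                     # solves the soft constraint exactly
theta_star = -np.log(p / (1 - p))  # corresponding Lagrange multipliers

assert np.allclose(m * p, r_star)  # check <r_i> = r*_i

# Canonical entropy S*_can = -ln P*_can(G*) for any G* realizing r* exactly
S_can = -np.sum(r_star * np.log(p) + (m - r_star) * np.log(1 - p))
print(round(S_can, 3))  # 6.748
```

In less tractable cases (e.g. two-sided constraints), θ* has to be found numerically by maximizing L*(θ), but the logic is identical.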

D. Hard constraints: microcanonical ensemble
In the case of hard constraints, one requires that each individual configuration realizes the value C*. This means that the 'soft' constraint in Eq. (5) is replaced by the much stricter constraint

  C(G) = C*    (9)

for each allowed configuration G. The microcanonical probability P_mic is found by enforcing this stronger requirement, while still maximizing the entropy S_n[P] defined in Eq. (2). The result is the uniform distribution

  P_mic(G) = 1/Ω_{C*} if C(G) = C*, and P_mic(G) = 0 otherwise,    (10)

where Ω_{C*} is the number of configurations in 𝒢 realizing the 'hard' constraint in Eq. (9). The corresponding microcanonical entropy is obtained by inserting Eq. (10) into Eq. (2):

  S*_mic ≡ S_n[P_mic] = ln Ω_{C*},    (11)

which is also known as the Boltzmann entropy. Crucially, in order to define the microcanonical ensemble it is necessary that Ω_{C*} > 0, i.e. that there is at least one configuration realizing the constraint. In other words, the value of C* should be realizable by (at least) one single configuration, and not only as an ensemble average. This requirement is not strictly necessary for the canonical ensemble (even though our interpretation of C* as the 'empirical' value makes the requirement always natural). In any case, since in this paper we are going to study the (non)equivalence of the two ensembles, we need both of them to be well defined, for a given value of C*, in order to compare them. Therefore we are going to assume that the value of C*, irrespective of the ensemble considered, is always realizable by at least one configuration, i.e. that Ω_{C*} > 0.
Notably, calculating Ω_{C*} (especially in the presence of many constraints and because of the discrete nature of the problem of interest for us) can be a complicated enumeration problem. Therefore the microcanonical ensemble is typically much more difficult to deal with mathematically than the canonical ensemble. For this reason, if the property of EE holds, one prefers to operate in the canonical ensemble and work out its asymptotics in the limit of large system size, trusting that the result returns the correct asymptotics for the microcanonical ensemble as well. The above approach is at the core of many standard calculations in statistical mechanics textbooks, where the property of EE is typically assumed to hold in general (at least in the absence of phase transitions and long-range interactions). However, when EE breaks down, this approach leads to mathematically incorrect results. We will study this problem in detail, for the ensembles considered, in the rest of the paper. To do so, we first need to define what we mean by thermodynamic limit.
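At toy sizes the enumeration is still feasible by brute force. A minimal sketch for binary matrices with one-sided (row-sum) constraints and made-up margins, which also exhibits the gap S*_can > S*_mic directly (the closed-form p_i = r*_i/m used for S*_can holds in this factorized case):

```python
import itertools, math

# Brute-force microcanonical count for binary 3x4 matrices with hard
# one-sided (row-sum) constraints r* = (3, 1, 3); toy, made-up numbers.
n, m = 3, 4
r_star = (3, 1, 3)

omega = 0
for bits in itertools.product((0, 1), repeat=n * m):
    rows = tuple(sum(bits[i * m:(i + 1) * m]) for i in range(n))
    if rows == r_star:
        omega += 1

# Rows are constrained independently, so Omega factorizes into binomials
assert omega == math.prod(math.comb(m, r) for r in r_star)  # 4 * 4 * 4 = 64

S_mic = math.log(omega)  # Boltzmann entropy S*_mic = ln Omega

# Canonical entropy of the same constraints (p_i = r*_i / m):
S_can = -sum(r * math.log(r / m) + (m - r) * math.log(1 - r / m) for r in r_star)
print(omega, round(S_mic, 3), round(S_can, 3))  # 64 4.159 6.748
```

Even at n = 3 the canonical entropy visibly exceeds the microcanonical one; the question addressed below is how this gap scales with n.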

E. The thermodynamic limit
We will consider the thermodynamic limit defined as n → +∞, i.e. when the size of the system diverges. However, the limit is not completely defined until we also specify how both m and C * behave as n grows.
First of all, we consider two possibilities for the behaviour of m as n diverges:

• m remains finite as n → ∞: in this case we have m = O(1), where O(x) indicates a quantity that has a finite limit when divided by x as n → +∞, i.e. O(x) is asymptotically of the same leading order as x;

• m diverges as n → ∞: in this case, for simplicity and realism, we assume that m can diverge at most as fast as n, i.e. m is at most O(n); it is indeed difficult to imagine a physical situation where the number m of state variables characterizing each of the n units grows faster than the number n of units themselves.
In simple words, the above assumptions mean that the number of state variables should be either (asymptotically) independent of the number n of units being added to the system (as in the case of 'intrinsic' observations, e.g. for multivariate time series) or at most proportional to n (as in the case of 'relational' observations, e.g. for networks). We will show that these two situations lead to very different asymptotic results in terms of the strength of EN. Importantly, the requirement that m grows at most proportionally to n implies that the number K of both one-sided (K = n) and two-sided (K = n + m) local constraints is always extensive, i.e. K = O(n) grows linearly in the size n of the system, irrespective of the behaviour of m. A separate, equally important consideration concerns the scaling of the value of the constraint C* in the thermodynamic limit n → +∞. Here too, we distinguish between two situations, which we denote as the sparse and the dense regimes.
• We define as sparse those matrices for which each of the m column sums (irrespective of whether such sums are chosen as constraints) is finite in the thermodynamic limit, i.e. c*_j = O(1) (j = 1, . . . , m). This implies that, in the canonical ensemble, the expected value of any entry g_ij of the matrix G is on average O(1/n); correspondingly, in the microcanonical ensemble the allowed matrices are dominated by zeroes (whence the name 'sparse matrices'). Note that for the row sums one has r*_i = O(m/n) for all i. If m grew slower than n, these row sums would vanish as n → +∞, which would imply that asymptotically no microcanonical configuration would realize the local constraints. Since we require Ω_{C*} > 0 (see above), this means that in the sparse case we necessarily need m = O(n). Consequently, r*_i = O(1) (i = 1, . . . , n) and t* = O(n).

• By contrast, we define as dense those matrices for which each of the m column sums (again, irrespective of whether they are chosen as constraints) diverges proportionally to n in the thermodynamic limit, i.e. c*_j = O(n) (j = 1, . . . , m). In the canonical ensemble, the expected value of g_ij is therefore O(1), which makes the allowed matrices in the corresponding microcanonical ensemble 'dense'. The row sums are now r*_i = O(m), so that r*_i/m = O(1) (for all i) and t*/(mn) = O(1). In this case, we consider m as either remaining finite, in which case we have m = O(1), r*_i = O(1) (i = 1, . . . , n) and t* = O(n), or diverging proportionally to n (see above), in which case we have m = O(n), r*_i = O(n) (i = 1, . . . , n) and t* = O(n²).

[Footnote: The 'big-O' notation we use here is not always used with the same meaning throughout the literature: some authors prefer the 'big-Θ' notation Θ(x) to indicate a quantity that is of the same leading order as the argument x, and reserve the 'big-O' notation to indicate only an upper bound for it.]
• Note that, in principle, in the weighted case we may even consider a sort of superdense regime where some of the individual entries of the matrix diverge in the thermodynamic limit. This possibility is related to a Bose-Einstein condensation concentrating a finite fraction of the total weight t * of the matrix in a finite number of entries [8]. However, we will not consider this extreme case here for simplicity, as it would not arise in most real-world applications.
Combined with the scaling of C * , the behaviour of m as a function of n in the thermodynamic limit can determine different asymptotic regimes, and in particular lead to the weak or strong form of EN. The strong form, for the cases considered below, turns out to be possible in the regime where the matrices are dense and m is finite as n → +∞.

F. Ensemble (non)equivalence
There are various ways to mathematically define the property of ensemble (non)equivalence. These include the notions of EE in the thermodynamic, macrostate and measure sense which, under mild assumptions, can be proven to be equivalent [6]. We will adopt the definition in the measure sense, which states that the ensembles are equivalent if the relative entropy

  S_n[P*_mic||P*_can] ≡ Σ_{G∈𝒢} P*_mic(G) ln [P*_mic(G)/P*_can(G)]    (12)

(which is the Kullback-Leibler divergence, for given n, between the microcanonical and canonical distributions and is guaranteed to be non-negative [24]), when rescaled by n, vanishes in the thermodynamic limit [6], i.e. if the specific relative entropy vanishes:

  lim_{n→+∞} S_n[P*_mic||P*_can]/n = 0,    (13)

or equivalently

  S_n[P*_mic||P*_can] = o(n),    (14)

where o(x) indicates a quantity that goes to zero when divided by x as n → +∞. Importantly, it can be shown [5,7] that

  S_n[P*_mic||P*_can] = S*_can − S*_mic.    (15)

The inequality S_n[P*_mic||P*_can] ≥ 0, which is a general property of the relative entropy, therefore implies S*_can ≥ S*_mic and indicates the presence of an 'extra entropy' in the canonical ensemble. This extra entropy is due to the fact that, while in the microcanonical ensemble the constraint C is a deterministic quantity fixed to the value C* through the hard constraint introduced in Eq. (9), in the canonical ensemble it is a random variable fluctuating around the expected value C* as dictated by the soft constraint defined in Eq. (5). With respect to the canonical ensemble, the hardness of the constraint in the microcanonical ensemble implies additional dependencies (i.e. smaller entropy) among the entries of G. The definition of EE in Eq. (13) states that, if the extra entropy S_n[P*_mic||P*_can], once divided by n, vanishes in the thermodynamic limit, then the ensembles are equivalent.
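Equation (15) admits a one-line derivation (a sketch, using only the definitions above): since H(G, θ*) depends on G only through C(G), the canonical probability P*_can(G) takes the same value P*_can(G*) on every configuration in the support of P*_mic, i.e. on all G with C(G) = C*. Hence

```latex
S_n[P^*_{\rm mic}\|P^*_{\rm can}]
  = \sum_{G\in\mathcal{G}} P^*_{\rm mic}(G)\,\ln\frac{P^*_{\rm mic}(G)}{P^*_{\rm can}(G)}
  = -S^*_{\rm mic} - \ln P^*_{\rm can}(G^*)
  = S^*_{\rm can} - S^*_{\rm mic},
```

where the last step uses the fact, noted above, that S*_can can be computed by evaluating P*_can on the single configuration G*.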
From Eq. (15) it is clear that Eq. (13) is equivalent to the condition

  S*_can − S*_mic = o(n),    (16)

or in other words to the asymptotic (for large n) relation

  S*_mic = S*_can + o(n).    (17)

This implies

  Ω_{C*} = e^{S*_mic} = e^{S*_can + o(n)},    (18)

i.e. Ω_{C*} is approximated by e^{S*_can} up to a subexponential (in n) correction factor. The above asymptotics is used in statistical mechanics textbooks whenever the property of EE is believed to hold, i.e. in the absence of phase transitions or long-range interactions. When EE does not hold, Eq. (18) breaks down. In this case, the extra entropy in the canonical ensemble grows at least as fast as n. Recent research has shown that this breakdown can happen even in the complete absence of phase transitions, hence also in situations where EE was typically believed to hold. Here we are going to show that, additionally, the breakdown can occur with previously undocumented strength, i.e. the extra entropy can grow as fast as the entropy itself.
Combining Eqs. (15) and (11), one obtains the following exact generalization of Eq. (18), valid irrespective of whether EE holds:

  Ω_{C*} = e^{S*_can − S_n[P*_mic||P*_can]}.    (19)

Clearly, the above expression reduces to Eq. (18) in case of EE, i.e. when Eq. (14) holds. Although exact, Eq. (19) is not very useful unless one can calculate S_n[P*_mic||P*_can] explicitly. An equivalent exact expression, which only requires the knowledge of P*_can and is again valid even when EE does not hold, has been derived [7]:

  Ω_{C*} = e^{S*_can} ∫ (dψ/(2π)^K) [Z(θ* + iψ)/Z(θ*)] e^{iψ·C*},    (20)

where ∫ dψ ≡ Π_{k=1}^{K} ∫_{−π}^{+π} dψ_k. We will confirm that the above expression provides the exact result in cases where the complex integral can be calculated explicitly and Ω_{C*} can be evaluated independently via combinatorial enumeration. Indeed, Eq. (20) highlights a beautiful connection between canonical and microcanonical probabilities through an extension to complex numbers.
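The complex integral can be checked numerically in the simplest case of a single global constraint t(G), where the exact count is a binomial coefficient: for a 2 × 2 binary matrix and t* = 2, there are exactly C(4, 2) = 6 allowed configurations. A sketch with these made-up toy numbers:

```python
import numpy as np
from math import comb

# Numerical check of the complex-integral formula for Omega (toy case):
# K = 1 global constraint t(G) on a 2x2 binary matrix, so the canonical
# partition function is Z(theta) = (1 + e^{-theta})^q with q = n*m = 4
# independent entries, and the exact count is Omega_{t*} = binom(q, t*).
q, t_star = 4, 2
p = t_star / q                   # soft constraint <t> = q*p = t*
theta = -np.log(p / (1 - p))     # Lagrange multiplier theta* (= 0 for p = 1/2)
S_can = -(t_star * np.log(p) + (q - t_star) * np.log(1 - p))

Z = lambda th: (1 + np.exp(-th)) ** q
psi = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
integrand = Z(theta + 1j * psi) / Z(theta) * np.exp(1j * psi * t_star)

# (1/2pi) * integral over [-pi, pi] = mean of the integrand over one period
omega = np.exp(S_can) * integrand.mean().real
print(round(omega), comb(q, t_star))  # 6 6
```

Here the integral picks out exactly the canonical probability that t(G) = t*, so the product with e^{S*_can} reproduces the combinatorial count to machine precision.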
When the integral in Eq. (20) cannot be calculated directly, it is still possible to use a saddle-point technique leading to [7]

  Ω_{C*} = e^{S*_can} Π_{k=1}^{K} [1 + O(1/λ*_k)] / √det(2πΣ*),    (21)

where Σ* is the covariance matrix of the K constraints in the canonical ensemble, whose entries are defined as

  Σ*_ij ≡ ⟨C_i C_j⟩_{θ*} − ⟨C_i⟩_{θ*} ⟨C_j⟩_{θ*}    (i, j = 1, . . . , K),    (22)

or equivalently as the second derivatives

  Σ*_ij = −∂²L*(θ)/∂θ_i ∂θ_j |_{θ=θ*},    (23)

and {λ*_k}_{k=1}^{K} are the eigenvalues of Σ*. We recall that covariance matrices are positive-semidefinite, so all their eigenvalues are non-negative. If λ*_k is finite, then the quantity O(1/λ*_k) in Eq. (21) cannot in general be calculated explicitly, although it generates a correction that does not change the leading order of Ω_{C*} and S*_mic. If λ*_k is infinite (i.e., if it diverges in the thermodynamic limit), then O(1/λ*_k) vanishes asymptotically and we have 1 + O(1/λ*_k) = 1 + o(1). This implies that, if all the eigenvalues of Σ* diverge, then Eq. (21), when inserted into certain expressions, can lead to an exact result. This includes the case of local constraints, for which K diverges in the thermodynamic limit. We will therefore discuss the asymptotic behaviour of the eigenvalues of Σ* in each of the examples considered later.
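A quick sanity check of the leading saddle-point term, Ω ≈ e^{S*_can}/√det(2πΣ*), against exact enumeration, for binary matrices with one-sided row-sum constraints: in this factorized case Σ* is diagonal with λ*_i = m p_i(1 − p_i). The small, made-up margins below keep the eigenvalues finite, so the 1 + O(1/λ*_k) corrections remain visible:

```python
import math
import numpy as np

# Leading saddle-point estimate vs. exact count for binary matrices with
# one-sided (row-sum) constraints.  Rows are constrained independently, so
# Sigma* is diagonal with lambda*_i = m * p_i * (1 - p_i) and the exact
# count factorizes into binomials.  Toy, made-up margins.
n, m = 3, 6
r_star = np.array([3, 2, 4])
p = r_star / m

S_can = -np.sum(r_star * np.log(p) + (m - r_star) * np.log(1 - p))
lam = m * p * (1 - p)                                # eigenvalues of Sigma*
estimate = np.exp(S_can) / np.sqrt(np.prod(2 * np.pi * lam))

exact = math.prod(math.comb(m, int(r)) for r in r_star)
print(exact, round(float(estimate)))  # 4500 vs roughly 5166
```

The ~15% discrepancy is exactly the finite-λ correction factor in Eq. (21); it shrinks as the eigenvalues grow, and leaves the leading order of ln Ω unchanged.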
Equation (21) generalizes Eq. (18) to the case where EE does not necessarily hold. Note that our initial assumption that the K constraints are non-redundant implies that λ * k > 0 for all k, i.e. Σ * is positive-definite [7].
Keeping this assumption also in the thermodynamic limit (as ensured by our choice of both global and local constraints defined above), we note two consequences. First, since Eq. (23) shows that Σ* is the Hessian matrix of second derivatives of −L*(θ), the fact that Σ* is positive-definite implies that θ* is a unique global maximum for L*(θ) [7], confirming what we had anticipated previously. Second, the product in Eq. (21) is at most of the same order as the denominator. Therefore, in full generality, we can exploit Eq. (21) to rewrite Eq. (19) as

  Ω_{C*} = e^{S*_can − α_n + Σ_{k=1}^{K} ln[1 + O(1/λ*_k)]},    (24)

where we have defined [7]

  α_n ≡ (1/2) ln det(2πΣ*) = (1/2) Σ_{k=1}^{K} ln(2πλ*_k).    (25)

We can now make three important considerations. First, Eq. (24) means that

  S_n[P*_mic||P*_can] = O(α_n),    (26)

showing that the speed of growth of S_n[P*_mic||P*_can] with n can be calculated explicitly through Eq. (25) using the knowledge of Σ*, which requires only the canonical ensemble. This is useful when microcanonical calculations are unfeasible. Second, if K is finite, or if K diverges but all (except possibly a finite number of) the eigenvalues of Σ* diverge, then the product inside Eq. (21) gives a subleading contribution to S_n[P*_mic||P*_can], which therefore has the same asymptotic behaviour as α_n:

  S_n[P*_mic||P*_can] = α_n [1 + o(1)].    (27)

This result, which is stronger than Eq. (26), means that in such a case one can obtain exact estimates of quantities that depend on S_n[P*_mic||P*_can], using only the knowledge of α_n. Third, Eq. (26) shows that the definition of EE given by Eq. (14) coincides with

  α_n = o(n),    (28)

which, again, can be ascertained by evaluating only Σ* and avoiding any microcanonical calculation. Indeed, Eq. (28) can be formulated as an equivalent definition of EE in the measure sense [7]. If instead α_n is of order n or larger, the system is under EN.
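To see how α_n produces the 'strong and unrestricted' regime, consider again binary matrices with one-sided row-sum constraints in the dense case with m fixed: both α_n and S*_can grow linearly in n, so their ratio tends to a positive constant at every parameter value. A numeric sketch (illustrative parameters: m = 4 and p_i = 1/2 for every row):

```python
import numpy as np

# Scaling sketch of the 'strong and unrestricted' regime: binary matrices,
# one-sided row-sum constraints, dense case with m fixed.  In this factorized
# case Sigma* is diagonal with eigenvalues lambda*_i = m * p_i * (1 - p_i),
# which stay finite as n grows, so alpha_n = (1/2) sum_i ln(2 pi lambda*_i)
# is O(n) -- the same order as S*_can.
m, p = 4, 0.5

def alpha_and_S_can(n):
    lam = np.full(n, m * p * (1 - p))  # n finite eigenvalues of Sigma*
    alpha = 0.5 * np.sum(np.log(2 * np.pi * lam))
    S_can = -n * m * (p * np.log(p) + (1 - p) * np.log(1 - p))
    return alpha, S_can

for n in (10, 100, 1000):
    alpha, S_can = alpha_and_S_can(n)
    # alpha_n / n and alpha_n / S*_can are constant in n: EN is strong
    print(n, round(alpha / n, 4), round(alpha / S_can, 4))
```

Here α_n/n never vanishes (so EE is broken for every n, with no parameter tuning), and α_n/S*_can approaches a nonzero constant (so the breakdown is strong); by contrast, when all λ*_k diverge, as in the regimes with diverging m, α_n/S*_can vanishes and only the weak form survives.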

III. WEAK AND STRONG ENSEMBLE NONEQUIVALENCE
In this section we illustrate the main results, i.e. we identify systems for which the breaking of EE occurs in a form that is at the same time 'strong' and 'unrestricted' and we calculate the relative entropy in various such systems. To this end, we first make some general considerations leading to a rigorous definition of 'strong' EN and subsequently study specific examples within our matrix ensembles.
A. Relative entropy ratio

Equation (21) reveals that the asymptotic behaviour of Ω_{C*} depends on that of K and of the eigenvalues of the covariance matrix Σ*. We can indeed convince ourselves of this fact by looking at results of previous studies from a novel perspective.
Specifically, if K = o(n) and if we exclude phase transitions, then Eq. (21) leads to Eq. (18), i.e. the ensembles are equivalent. This includes the traditional situation where one has a finite number of constraints, as well as more complicated cases where the number of constraints is subextensive (e.g. random graphs with constraints on a subextensive subset of node degrees [17]). In order to break EE in this case, one needs phase transitions corresponding to singularities of the partition function [6]. For instance, in the case of graphs with fixed numbers of edges and triangles (or wedges) [25], there is a region in parameter space where one gets S_n[P*_mic||P*_can] = O(n²) and therefore Ω_{C*} = e^{S*_can − O(n²)}. Since S*_can and S*_mic are also O(n²) in this case, it follows that

S_n[P*_mic||P*_can] = O(S*_can)   (29)

(note that in general S*_can ≥ S*_mic, due to the non-negativity of the Kullback-Leibler divergence and to Eq. (15), therefore O(S*_can) is necessarily the leading order). This is what we have previously referred to as a form of EN that is 'restricted' (i.e. valid only in a certain region of parameter space arising from a phase transition, outside which EE is restored) but 'strong' (i.e. where the relative entropy is of the same order as the entropy itself).
If K = O(n), then Eq. (18) is in general no longer valid. For instance, in the case of sparse graphs with fixed degrees (K = n), all the eigenvalues of Σ* are finite in the thermodynamic limit [5,13]; one indeed obtains S_n[P*_mic||P*_can] = O(n) [5] and hence Ω_{C*} = e^{S*_can − O(n)}. Note that in this case the product in Eq. (21) (which in general cannot be calculated exactly) is of the same order as the denominator and should be taken into account. In the case of dense graphs with fixed degrees (again K = n), all the eigenvalues of Σ* are instead O(n) [13]; one indeed obtains S_n[P*_mic||P*_can] = O(n ln n) [13] and hence Ω_{C*} = e^{S*_can − O(n ln n)}. The product in Eq. (21) is in this case negligible with respect to the denominator, which can be calculated exactly. In any case, since S*_can and S*_mic are still O(n²) for both sparse and dense networks with fixed degrees, these situations correspond to

S_n[P*_mic||P*_can] = o(S*_can),   (30)

i.e. to what we have defined 'weak' EN. On the other hand, this type of EN is not associated with phase transitions (which are indeed absent in the mentioned examples of graphs with fixed degrees) and is therefore 'unrestricted', i.e. valid in the entire parameter space. The above considerations suggest that, in order to rigorously define the strength of EN, we may define the ratio

R_n ≡ S_n[P*_mic||P*_can] / S_n[P*_can]   (31)

between the relative entropy and the canonical entropy, calculated for fixed n, and consider its limit as n → +∞, i.e.

R_∞ ≡ lim_{n→+∞} R_n.   (32)
For brevity, we will call R_n the relative entropy ratio and R_∞ the limiting relative entropy ratio. Note that the inequality S*_can ≥ S*_mic ≥ 0 implies 0 ≤ R_n ≤ 1 for all n > 0. The condition characterizing our notion of strong EN in Eq. (29) coincides with R_∞ being strictly positive. The value of R_∞ in that case exactly quantifies the asymptotic proportionality between S_n[P*_mic||P*_can] and S_n[P*_can], which is otherwise left unquantified by Eq. (29) alone. We will therefore adopt the strict inequality

R_∞ > 0   (33)

(which in turn implies the breakdown of Eq. (28), the converse being in general not true) as our definition of strong EN. By contrast, the condition characterizing our notion of weak EN in Eq. (30) can be rephrased as the equality R_∞ = 0. Note that one may have R_∞ = 0 also in cases where the ensembles are equivalent. We will therefore adopt the condition R_∞ = 0, in conjunction with the violation of Eq. (28), as our definition of weak EN. Note that our discussion following Eq. (21) implies that, if all but at most a finite number of the eigenvalues of Σ* diverge, then the exact value of R_∞ can be retrieved by replacing S_n[P*_mic||P*_can] with α_n given by Eq. (25), i.e. using only the canonical covariances between the constraints, without microcanonical calculations.
Note that Eq. (19) implies

Ω_{C*} = e^{(1 − R_n) S*_can}.   (34)

So, in the presence of strong nonequivalence (R_∞ > 0), Ω_{C*} is of strictly smaller order compared with the ordinary estimate in Eq. (18). This is actually due to the canonical ensemble having a much bigger entropy than the microcanonical one: indeed, Eq. (15) implies

S*_mic = (1 − R_n) S*_can   (35)

and, inverting,

S*_can = S*_mic / (1 − R_n).   (36)

Note that the factor 1/(1 − R_n) can be arbitrarily large, since R_n can be arbitrarily close to 1.
Given the above definitions of 'weak' and 'strong' EN in terms of the limiting relative entropy ratio, in what follows we will consider the specific ensembles of matrices introduced in the previous section, under both global and local constraints, and calculate the value of α n and R ∞ in each case.

B. Global constraints
As already discussed, ensembles of (binary or weighted) n × m matrices with a global constraint are defined by requiring that the single quantity t(G) = ∑_{i=1}^{n} ∑_{j=1}^{m} g_ij takes, either 'hardly' or 'softly', a specific value t* ≡ t(G*). For this simple choice of the constraint, both S*_can and S*_mic can be calculated exactly. This allows us to check that the complex integral in Eq. (20) indeed provides the exact value of Ω_{t*}. Moreover, we can confirm the correctness of the asymptotic formula in Eq. (21). All these approaches show that, for both binary and weighted matrices with a global constraint, the canonical and microcanonical ensembles are equivalent.

Binary matrices under a global constraint
Let us start with the case when the global constraint t* is imposed on binary matrix ensembles characterized by g_ij ∈ {0, 1}. The calculation of the canonical entropy S*_can is straightforward (see Appendix): one first calculates the likelihood

ln P_can(G*|θ) = −θ t* − mn ln(1 + e^{−θ})   (37)

and then looks for the value θ* that maximizes P_can(G*|θ) or, equivalently, realizes the soft constraint ⟨t⟩_{θ*} = t*. The result is

θ* = ln(mn/t* − 1).   (38)

Using Eq. (8), we can then easily evaluate S*_can from Eqs. (37) and (38) as

S*_can = −t* ln(t*/mn) − (mn − t*) ln(1 − t*/mn).   (39)

The calculation of the microcanonical entropy S*_mic is in this case even simpler than that of the canonical one, since the number Ω_{t*} of configurations realizing the hard constraint t(G) = t* is simply the number of ways in which t* 'ones' can be placed in the mn available positions, i.e. the binomial coefficient Ω_{t*} = C(mn, t*), where C(a, b) denotes 'a choose b'. This implies

S*_mic = ln C(mn, t*).   (40)

Importantly, it is possible to confirm that, upon extending the argument of the likelihood to the complex domain and calculating P_can(G*|θ* + iψ), the integral formula in Eq. (20) returns a value of Ω_{t*} that produces the exact value of the microcanonical entropy S*_mic given in Eq. (40); the (instructive) calculation justifying this equality is reported in the Appendix. Combining the expressions for S*_mic and S*_can into Eq. (15), we obtain the relative entropy between the two ensembles:

S_n[P*_mic||P*_can] = S*_can − ln C(mn, t*).   (42)

In this simple example, the inequality S_n[P*_mic||P*_can] > 0 clearly arises from the presence of dependencies among the entries of G in the microcanonical ensemble and the absence of such dependencies in the canonical one. Indeed, while in the microcanonical ensemble the hard constraint t(G) = t* makes all the entries of G mutually dependent, in the canonical ensemble the soft constraint ⟨t⟩_{θ*} = t* leaves each entry g_ij independent and identically (Bernoulli-)distributed with probability

p* = t*/mn   (43)

(see Appendix).
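The exact expressions above are straightforward to evaluate numerically. The following minimal Python sketch (function name is ours, not from the paper) computes S*_can, S*_mic and their difference for given n, m and t*:

```python
from math import comb, log

def entropies_binary_global(n, m, t):
    """Canonical and microcanonical entropies for binary n x m matrices
    with a global constraint t* on the total sum of the entries."""
    p = t / (m * n)                       # Bernoulli parameter p* = t*/mn
    S_can = -m * n * (p * log(p) + (1 - p) * log(1 - p))
    S_mic = log(comb(m * n, t))           # ln of the binomial coefficient
    return S_can, S_mic, S_can - S_mic    # last value: relative entropy

S_can, S_mic, S_rel = entropies_binary_global(20, 20, 100)
assert S_can >= S_mic >= 0 and S_rel > 0
```

Note that `math.comb` returns an exact integer, so `log(comb(...))` gives S*_mic without floating-point overflow even for fairly large mn.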
Consequently, while in the microcanonical ensemble the constraint t(G) is a deterministic quantity fixed to the value t*, in the canonical ensemble t(G) is a random variable with expected value t* and variance

Σ* = t*(1 − t*/mn),   (44)

where Σ* is the only entry (recall that here K = 1) of the covariance matrix Σ* introduced in Eq. (23). As discussed in Subsection II F, S_n[P*_mic||P*_can] and Σ* are asymptotically related through Eq. (26), and the (non)equivalence of canonical and microcanonical ensembles is decided by the asymptotic behaviour of these two quantities. We will confirm both results in the particular case under consideration here. However, for compactness, we do this in conjunction with the weighted case, after introducing the latter below.

Weighted matrices under a global constraint
We now consider the case when the global constraint t* is enforced on weighted matrices where each g_ij is a nonnegative integer. As we show in the Appendix, in the canonical ensemble the likelihood can be calculated as

ln P_can(G*|θ) = −θ t* + mn ln(1 − e^{−θ})   (45)

and is maximised by the parameter value

θ* = ln(1 + mn/t*)   (46)

(note the change of sign with respect to the binary case) realizing the soft constraint ⟨t⟩_{θ*} = t*. The canonical entropy is therefore

S*_can = (t* + mn) ln(t* + mn) − t* ln t* − mn ln(mn).   (47)

In the microcanonical ensemble, the number Ω_{t*} of configurations realizing the hard constraint t(G) = t* coincides with the number of so-called weak compositions of the positive integer t* into exactly mn parts, i.e. the number of ways of writing t* as the sum of an ordered sequence of mn nonnegative integers (note that two sequences that differ in the order of their terms represent different configurations). This number is given by the negative binomial

Ω_{t*} = C(t* + mn − 1, t*).   (48)

In this case as well, one can confirm that the integration of the complex quantity P_can^{−1}(G*|θ* + iψ) as specified in Eq. (20) produces precisely the same value of Ω_{t*} used in Eq. (48) (see Appendix), thus retrieving the exact entropy

S*_mic = ln C(t* + mn − 1, t*).   (49)

The relative entropy S_n[P*_mic||P*_can], calculated using Eq. (15), equals

S_n[P*_mic||P*_can] = S*_can − ln C(t* + mn − 1, t*).   (50)

Again, the origin of a non-zero relative entropy lies in the presence of dependencies among all the entries of G in the microcanonical ensemble, where they are coupled by the hard constraint t(G) = t*, and in the absence of such dependencies in the canonical ensemble, where each entry g_ij is independent and now geometrically (see Appendix) distributed with parameter

q* = t*/(t* + mn).   (51)

As a consequence, while in the microcanonical ensemble the constraint t(G) is fixed to the constant value t*, in the canonical ensemble it is a random variable with expected value t* and variance

Σ* = t*(1 + t*/mn).   (52)
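As in the binary case, the weighted expressions can be checked numerically; the sketch below (hypothetical function name) evaluates the canonical entropy of mn i.i.d. geometric entries and the log-count of weak compositions:

```python
from math import comb, log

def entropies_weighted_global(n, m, t):
    """Canonical and microcanonical entropies for weighted n x m matrices
    (nonnegative-integer entries) with a global constraint t*."""
    M = m * n
    # entropy of M i.i.d. geometric entries with q* = t*/(t*+M)
    S_can = (t + M) * log(t + M) - t * log(t) - M * log(M)
    # ln of the number of weak compositions of t* into M parts
    S_mic = log(comb(t + M - 1, t))
    return S_can, S_mic, S_can - S_mic

S_can, S_mic, S_rel = entropies_weighted_global(20, 20, 100)
assert S_can >= S_mic >= 0 and S_rel > 0
```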

Ensemble equivalence for matrices under a global constraint
We can now study, in a combined fashion, the (non)equivalence of the canonical and microcanonical ensembles of both binary and weighted matrices with a global constraint t*. To this end, we preliminarily notice that the reason why the quantity C(k+l−1, l) is called a negative binomial is the fact that it can be formally rewritten as the following binomial coefficient with negative signs:

C(k+l−1, l) = (−1)^l C(−k, l).   (53)

The above relation allows us to conveniently rewrite the relative entropy for the weighted case appearing in Eq. (50) as

S_n[P*_mic||P*_can] = S*_can − ln[(−1)^{t*} C(−mn, t*)].   (54)

Upon comparison with the corresponding Eq. (42) valid in the binary case, we can express the relative entropy in general as

S^±_n[P*_mic||P*_can] = S*^±_can − ln Ω^±_{t*},  with Ω^+_{t*} = C(mn, t*) and Ω^−_{t*} = (−1)^{t*} C(−mn, t*),   (55)

where the superscript '+' applies to binary matrices (note that t* ≤ mn in this case) and the superscript '−' applies to weighted matrices. Note that the expression for the weighted case can be formally retrieved by changing the sign of m in the expression valid for the binary case.
As we discussed in Subsection II F, checking for (non)equivalence requires studying the asymptotic behaviour of the relative entropy. In this case, we can calculate the asymptotic behaviour of S^±_n[P*_mic||P*_can] explicitly from the exact expression given by Eq. (55). Note that, in both the sparse and the dense case (see Subsection II E), t* and mn diverge in the thermodynamic limit. We can therefore apply Stirling's formula

k! ≃ √(2πk) (k/e)^k   (56)

to Eq. (55), which yields

S^±_n[P*_mic||P*_can] ≃ (1/2) ln[2π t*(1 − t*/(±mn))].   (57)

For purely pedagogical reasons, we check that the asymptotic behaviour found above is consistent with the one we would retrieve by using the expansion in Eq. (21), which leads to Eq. (26) and reduces the problem of the calculation of S^±_n[P*_mic||P*_can] to that of its leading order α_n. To this end, we note that in this case the matrix Σ*, being a 1 × 1 matrix, coincides with its only eigenvalue

(λ*)^± = t*(1 − t*/(±mn)),   (58)

where we have used Eq. (44) for binary (+) and Eq. (52) for weighted (−) matrices. Therefore

α^±_n = (1/2) ln[2π t*(1 − t*/(±mn))],   (59)

which has indeed the same leading order as Eq. (57), thereby confirming the correctness of the saddle-point calculation. As an even stronger result, we are under the conditions for which Eq. (27) holds, a relationship that can be confirmed by comparing Eqs. (57) and (59). It should also be noted that, since t* diverges in the thermodynamic limit, so does (λ*)^±, and Eq. (21) leads to

Ω^±_{t*} = e^{S*^±_can − α^±_n} [1 + o(1)],   (60)

which is precisely what we get by applying Eq. (56) to the binomial and negative binomial coefficients appearing in the exact expression for Ω^±_{t*} in the binary and weighted case respectively.
As stated in Eq. (28), checking whether the ensembles are equivalent boils down to checking whether α_n = o(n). Note that the only effect of the asymptotic scaling of t* is that the quantity 1 − t*/(±mn) in Eq. (59) converges to 1 in the sparse case t*/mn = O(1/n) and to a different, but still finite and positive, constant in the dense case t*/mn = O(1) (see Subsection II E). Therefore in both cases we have α_n = O(ln t*) = O(ln mn). This implies α_n = o(n) independently of the asymptotic behaviour of m. This result shows that, in the presence of a global constraint, both binary and weighted matrices are under EE, irrespective of the scaling of t* and m. This finding confirms, in a generalized setting, the result obtained for networks with a given total number of links [5]. Since EE is preserved, we avoid the calculation of the limiting relative entropy ratio R_∞ defined in Eq. (32) in this case.
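This convergence can be checked numerically. A minimal sketch (binary case, our own function names) verifies that the exact relative entropy approaches α_n and that α_n/n vanishes, confirming EE:

```python
from math import comb, log, pi

def rel_entropy_binary(n, m, t):
    """Exact relative entropy S*_can - S*_mic for a binary global constraint."""
    p = t / (m * n)
    S_can = -m * n * (p * log(p) + (1 - p) * log(1 - p))
    return S_can - log(comb(m * n, t))

def alpha_binary(n, m, t):
    """alpha_n = (1/2) ln(2*pi*lambda*), with lambda* = t*(1 - t*/mn)."""
    return 0.5 * log(2 * pi * t * (1 - t / (m * n)))

# Eq. (27): the relative entropy converges to alpha_n ...
for n in (10, 100, 400):
    t = (n * n) // 4                     # dense regime, t*/mn = 1/4
    assert abs(rel_entropy_binary(n, n, t) - alpha_binary(n, n, t)) < 0.01
# ... and alpha_n = o(n): ensemble equivalence (Eq. (28))
assert alpha_binary(400, 400, 40000) / 400 < 0.02
```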

C. One-sided local constraints
We now consider ensembles of binary and weighted n × m matrices with one-sided local constraints, i.e. under the requirement that the n-dimensional vector r(G) with entries r_i(G) = ∑_{j=1}^{m} g_ij (i = 1, …, n) takes a specific value r* ≡ r(G*). Note that, unlike the case of global constraints, here the number of constraints is extensive. As in the case with global constraints, it turns out that both S*_can and S*_mic can still be calculated exactly. Therefore we can again confirm the correctness of both the exact integral formula in Eq. (20) and the asymptotic expansion in Eq. (21). Although these extensions are mathematically straightforward, we find a deep physical difference with respect to the case with global constraints: the presence of an extensive number of local constraints implies the breaking of the equivalence of canonical and microcanonical ensembles for both binary and weighted matrices. The calculation of the limiting relative entropy ratio R_∞ allows us to quantify the strength of nonequivalence and also to identify the conditions leading to the 'strong and unrestricted' form.

Binary matrices under one-sided local constraints
Let us first examine the case when the one-sided local constraints r* are imposed on ensembles of binary matrices. As we show in the Appendix, in the canonical ensemble the likelihood is

ln P_can(G*|θ) = −∑_{i=1}^{n} [θ_i r*_i + m ln(1 + e^{−θ_i})]   (61)

and reaches its maximum when the parameter vector θ takes the value θ* with entries

θ*_i = ln(m/r*_i − 1),   (62)

corresponding to the soft constraint ⟨r⟩_{θ*} = r*. Substituting Eq. (62) into Eq. (61), we obtain the canonical entropy as

S*_can = −∑_{i=1}^{n} [r*_i ln(r*_i/m) + (m − r*_i) ln(1 − r*_i/m)].   (63)

Let us now turn to the microcanonical ensemble. Since the constraints are only one-sided, it is immediate to realize that the number Ω_{r*} of configurations realizing the hard constraint r(G) = r* is a product of row-specific binomial coefficients, so that the microcanonical entropy S*_mic can still be calculated exactly as the following simple generalization of Eq. (40):

S*_mic = ∑_{i=1}^{n} ln C(m, r*_i).   (64)

For the same reason, S*_mic can also be exactly retrieved by explicitly integrating the complex-valued quantity P_can(G*|θ* + iψ) as prescribed by Eq. (20) (see Appendix). Combining the above results, we can calculate the relative entropy from Eq. (15) as

S_n[P*_mic||P*_can] = S*_can − ∑_{i=1}^{n} ln C(m, r*_i).   (66)

The above quantity encodes the following difference between the two ensembles: in the microcanonical ensemble, the hard constraint r(G) = r* makes all the entries in each row of G mutually dependent, while leaving different rows independent of each other; on the other hand, in the canonical ensemble the soft constraint ⟨r⟩_{θ*} = r* leaves all entries of the matrix independent. As in the case with a global constraint, each entry g_ij is still Bernoulli-distributed, but now with the row-specific probability

p*_i = r*_i/m,   (67)

as we show in the Appendix. Correspondingly, in the microcanonical ensemble r is a deterministic vector fixed to the value r*, while in the canonical ensemble it is a random vector with expected value r*. The covariance matrix Σ* of the entries of r (i.e. of the n constraints) in the canonical ensemble is a diagonal matrix with entries

Σ*_ij = δ_ij r*_i (1 − r*_i/m)   (68)

and eigenvalues λ*_i = r*_i (1 − r*_i/m). Again, we are going to discuss the (non)equivalence of the two ensembles together with the corresponding case of weighted matrices, after studying the latter below.
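The one-sided formulas are easy to evaluate numerically. The following Python sketch (hypothetical function name) computes S*_can, S*_mic and the saddle-point term α_n built from the diagonal covariance matrix, for an arbitrary row-sum vector in the binary case:

```python
from math import comb, log, pi

def one_sided_binary(r, m):
    """Exact entropies for binary matrices with row sums fixed to r
    (Eqs. (63)-(64)), plus alpha_n from the diagonal covariance matrix."""
    S_can = sum(-(ri * log(ri / m) + (m - ri) * log(1 - ri / m)) for ri in r)
    S_mic = sum(log(comb(m, ri)) for ri in r)    # product of row-wise binomials
    alpha = 0.5 * sum(log(2 * pi * ri * (1 - ri / m)) for ri in r)
    return S_can, S_mic, alpha

S_can, S_mic, alpha = one_sided_binary([3, 5, 2, 7, 4], m=10)
S_rel = S_can - S_mic      # relative entropy: grows like n for finite m
assert S_can >= S_mic >= 0
assert S_rel >= alpha      # the relative entropy is at least alpha_n here
```

Since every row contributes a finite amount to S_rel, the relative entropy is extensive in n whenever m stays finite, which is the mechanism behind the breaking of EE discussed below.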

Weighted matrices under one-sided local constraints
We now move to the case when the one-sided local constraints r* are imposed on ensembles of weighted matrices. The canonical ensemble under such constraints is characterized by the likelihood

ln P_can(G*|θ) = ∑_{i=1}^{n} [m ln(1 − e^{−θ_i}) − θ_i r*_i],   (70)

which is maximized by the parameter value θ* with entries

θ*_i = ln(1 + m/r*_i),   (71)

realizing the soft constraint ⟨r⟩_{θ*} = r* (see Appendix).
If we insert Eq. (71) into Eq. (70), we get

S*_can = ∑_{i=1}^{n} [(m + r*_i) ln(m + r*_i) − r*_i ln r*_i − m ln m].   (72)

The microcanonical entropy S*_mic is instead given by the following generalization of Eq. (48):

S*_mic = ∑_{i=1}^{n} ln C(r*_i + m − 1, r*_i),   (73)

where we have expressed the number Ω_{r*} of configurations realizing the hard constraint r(G) = r* as a product of row-specific negative binomial coefficients. Again, the microcanonical entropy can be obtained equivalently from Eq. (20) (see Appendix). The relative entropy, which can be obtained from Eq. (15) as usual, equals

S_n[P*_mic||P*_can] = S*_can − ∑_{i=1}^{n} ln C(r*_i + m − 1, r*_i)   (75)

and encodes the difference between the microcanonical ensemble, where the entries of each row of G are mutually coupled by the hard constraint r(G) = r* (while different rows are independent), and the canonical ensemble, where all entries of G are independent and geometrically (see Appendix) distributed with the row-dependent parameter

q*_i = r*_i/(m + r*_i).   (76)

As a consequence, while in the microcanonical ensemble the constraint r is fixed to the value r*, in the canonical ensemble it is a random vector fluctuating around r* according to the diagonal covariance matrix Σ* with entries (see Appendix)

Σ*_ij = δ_ij r*_i (1 + r*_i/m)   (77)

and eigenvalues λ*_i = r*_i (1 + r*_i/m).

Ensemble nonequivalence for matrices under one-sided local constraints
We can now compactly discuss the (non)equivalence of canonical and microcanonical ensembles of both binary and weighted matrices under one-sided local constraints. As for the case of global constraints discussed in Subsection III B 3, we still have exact knowledge of the canonical entropy, the microcanonical entropy, and the relative entropy. Moreover, these quantities can all be written, using Eq. (53), in compact expressions formally valid for both binary (+) and weighted (−) matrices. Indeed, the canonical entropy can be expressed by combining the expressions for S_n[P*_can] = S*_can in Eqs. (63) and (72) into the unified formula

S*^±_can = −∑_{i=1}^{n} [r*_i ln(r*_i/(±m)) + (±m − r*_i) ln(1 − r*_i/(±m))],   (79)

and, similarly, the microcanonical entropy can be obtained by formally combining Eqs. (64) and (73) into

S*^±_mic = ∑_{i=1}^{n} ln[(±1)^{r*_i} C(±m, r*_i)].   (80)

The above expressions can be used to calculate the relative entropy as

S^±_n[P*_mic||P*_can] = S*^±_can − S*^±_mic,   (81)

and consequently:

• In the sparse case where r*_i = O(1) (for all i) and m = O(n) (see Subsection II E), one obtains

S^±_n[P*_mic||P*_can] = O(n).   (83)

• In the dense case where both r*_i (for all i) and m are O(n), as discussed in Subsection II E (so that r*_i/m converges to a finite constant), we can use Stirling's formula to expand both m! and r*_i! in Eq. (81) to obtain

S^±_n[P*_mic||P*_can] = O(n ln n)   (84)

for binary (+) (in which case r*_i ≤ m) and weighted (−) matrices.
• In the dense case where both m and r*_i are finite, there is no asymptotic expansion that allows one to simplify Eq. (81) in general, so S^±_n[P*_mic||P*_can] has to be evaluated explicitly (simple examples are provided below). The important general consideration is that, irrespective of the specific values of m and r*_i,

S^±_n[P*_mic||P*_can] = O(n).   (85)

Again, we can confirm that the above asymptotic expressions are consistent with what we would obtain from Eq. (26), which follows from the saddle-point approximation given by Eq. (21). To see this, noting that here K = n and that Eqs. (68) and (77) indicate that (Σ*)^± is a diagonal matrix with entries δ_ij r*_i (1 − r*_i/(±m)) for binary (+) and weighted (−) one-sided constraints respectively, we can compactly express Eq. (25) as

α^±_n = (1/2) ∑_{i=1}^{n} ln[2π r*_i (1 − r*_i/(±m))],   (86)

which reproduces the asymptotic behaviours found above for the other two regimes, under the respective assumptions on the scaling of m and r*_i. Coincidentally, we also see that the stronger result in Eq. (27) turns out to be a very good approximation for the relative entropy even in these two regimes where, technically, the required conditions are not met. This means that, for the one-sided dense case with finite m, we can rewrite Eq. (26) asymptotically (i.e. for large n) as

S^±_n[P*_mic||P*_can] ≃ C_1(m) α^±_n,   (87)

where C_1(m) is a finite and positive constant. Moreover, from the known inequality e^k k!/k^k ≥ √(2πk) for the factorial, we see that C_1(m) ≥ 1, as implied by comparing Eqs. (81) and (87). Finally, we also know from Stirling's approximation that C_1(m) is not much bigger than 1, i.e. C_1(m) ≳ 1, and that it rapidly approaches 1: indeed, when m diverges Eq. (27) holds exactly, which implies

lim_{m→∞} C_1(m) = 1.   (88)

The fact that, in all regimes, S^±_n[P*_mic||P*_can] (or equivalently α_n) is at least of order O(n) shows that Eq. (28) is violated and that EE breaks down for both binary and weighted matrices under one-sided local constraints, irrespective of the density and of the behaviour of m.
This important finding generalizes the result, documented so far only for ensembles of binary graphs with a given degree sequence [5,13,14] (and possibly modular structure [11]) and weighted graphs with a given strength sequence [8], that EE breaks down in the presence of an extensive (i.e. growing like n) number of local constraints. Here, this result is extended to more general ensembles of matrices, i.e. asymmetric, rectangular matrices describing e.g. bipartite graphs, multivariate time series, multiplex social activity, multi-cast communication systems and multicell gene expression profiles with variable m. More importantly, this generalized setting allows a qualitatively new phenomenon to emerge, namely the onset of 'strong' EN, as we now show.
Indeed, we can investigate the 'strength' of nonequivalence by comparing the asymptotic behaviour of the relative entropy with that of the canonical entropy given by Eq. (79). This expression can be evaluated in the usual three regimes as follows.
In the sparse case with r*_i = O(1) and m = O(n), noticing that asymptotically (for large n) each term of Eq. (79) is O(ln n), we have

S*^±_can = O(n ln n),   (89)

which dominates over the order O(n) of the corresponding relative entropy S^±_n[P*_mic||P*_can] calculated previously in Eq. (83) for the sparse case. This implies that the limiting relative entropy ratio defined in Eq. (32) is R^±_∞ = 0 for both binary (+) and weighted (−) constraints, meaning that in this case the breaking of EE is still 'weak', as in the case of graphs with local constraints.
In the dense case with r*_i = O(n) and m = O(n), Eq. (79) can be evaluated as

S*^±_can = O(n²),   (90)

which, again, dominates over the order O(n ln n) of the corresponding relative entropy calculated in Eq. (84). Therefore we still have R^±_∞ = 0 (weak nonequivalence). Finally, the dense case where both m and r*_i remain finite as n → ∞ is the subject of the rest of this Section. Equation (79) implies that

S*^±_can = O(n),   (91)

which, upon comparison with Eq. (85), shows that now the relative entropy grows as fast as the canonical entropy, signalling the 'strong' form of EN. Using the combined expressions given in Eqs. (79) and (80), we can explicitly calculate the relative entropy ratio introduced in Eq. (31) as

R^±_n = 1 − S*^±_mic / S*^±_can.   (93)

Using Eqs. (87) and (88), we obtain the alternative asymptotic (for large n) expression

R^±_n ≃ C_1(m) α^±_n / S*^±_can.   (94)

Comparing Eqs. (93) and (94) confirms that, as noticed above, C_1(m) ≈ 1 also for finite m.
In general, taking the thermodynamic limit n → ∞ in Eq. (93) or (94) requires the specification of the value of r*_i for all i. For the sake of illustration, we can consider the simplest case where each constraint has the same value r*_i = r* (i = 1, …, n). Note that the resulting canonical entropy of matrices with constant one-sided constraint r*, given by Eq. (79), coincides with the canonical entropy of matrices with the implied global constraint t* = nr*, given by Eqs. (39) and (47) in the binary and weighted case respectively. However, the microcanonical entropy in the one-sided case, given by Eq. (80), is strictly smaller than the corresponding one for matrices with the implied global constraint t* = nr*. In this case Eq. (93) yields the strictly positive limit

R^±_∞ = 1 − ln Ω^±_{r*} / s^±_can,   (95)

where Ω^+_{r*} = C(m, r*), Ω^−_{r*} = C(r* + m − 1, r*), and s^±_can is the per-row canonical entropy, confirming strong nonequivalence as defined in Eq. (33). To gain numerical and visual insight into the behaviour of R^±_∞ in Eq. (95), let us consider the binary and weighted cases separately.
In the binary case, Eq. (62) implies that, if r*_i = r* for all i, then the Lagrange multipliers θ*_i are all equal to

θ*_+ = ln(m/r* − 1).   (96)

Then, writing p* ≡ r*/m = e^{−θ*_+}/(1 + e^{−θ*_+}) ∈ (0, 1), from Eq. (95) we obtain

R^+_∞ = 1 + [ln C(m, p*m)] / [m (p* ln p* + (1 − p*) ln(1 − p*))].   (97)

Using the above expression, in Fig. 2 we plot R^+_∞ as a function of either p* (for fixed m) or m (for fixed p*). We see that, for a wide range of values of p*, R^+_∞ remains appreciably large for values of m up to one hundred. Moreover, values of p* closer to 0 or 1 than to 1/2 make R^+_∞ larger. So, for empirical applications where the level of 'multiplexity' is moderate (i.e. small m), and especially away from the uniform case (p* = 1/2), there is a significant entropy reduction from the canonical to the microcanonical ensemble. By contrast, as m increases while p* remains fixed, R^+_∞ decreases like (ln 2πm)/m, as can be easily realized by applying Stirling's formula to Eq. (97). This coincides with the system progressively moving to the different regime where both m and r*_i grow as n grows, which results in weak EN and R^+_∞ = 0, as previously noticed. Similarly, if r*_i remains finite while m grows, we enter the sparse regime, for which R^+_∞ = 0 as previously noticed.
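The limiting ratio is simple to evaluate; the following sketch (our own function name) reproduces the qualitative behaviour just described, with R^+_∞ appreciable for small m and vanishing as m grows at fixed p*:

```python
from math import comb, log

def R_inf_binary(m, r):
    """R+_inf = 1 - ln C(m, r*) / [m * H(r*/m)] for constant row sums r*
    and finite m, where H is the binary entropy in nats."""
    p = r / m
    H = -(p * log(p) + (1 - p) * log(1 - p))
    return 1 - log(comb(m, r)) / (m * H)

# strong EN: the ratio is finite and appreciable for small m ...
assert 0.3 < R_inf_binary(4, 2) < 0.4
# ... and decays towards 0 as m grows at fixed p* = 1/2 (weak-EN crossover)
vals = [R_inf_binary(m, m // 2) for m in (2, 10, 100, 1000)]
assert all(a > b for a, b in zip(vals, vals[1:])) and vals[-1] < 0.01
```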
In the weighted case, Eq. (71) implies that, if r*_i = r* for all i, then θ*_i = θ*_− for all i, with

θ*_− = ln(1 + m/r*).   (98)

Then, writing q* ≡ e^{−θ*_−} = r*/(m + r*) ∈ (0, 1), from Eq. (95) we obtain

R^−_∞ = 1 − [ln C(r* + m − 1, r*)] / [−m ln(1 − q*) − r* ln q*].   (99)

Figure 3 shows the behaviour of R^−_∞ either as a function of q* (with m fixed) or as a function of m (with q* fixed). Considerations similar to the binary case apply. The main difference is that, while R^+_∞ has a symmetric behaviour around the value p* = 1/2 (arising from the fundamental symmetry of exchanging p* with 1 − p* and g_ij = 1 with g_ij = 0 in the binary case), R^−_∞ decreases monotonically as a function of q* (due to the lack of any symmetry of that sort in the weighted case). So now R^−_∞ is larger for smaller m and q*. The above results illustrate what we had anticipated previously, i.e. that if m is finite (and the matrices are necessarily dense) then the ensembles feature a strong form of EN. Here, this form of EN is also 'unrestricted', as it holds irrespective of the value of C* or θ*, i.e. throughout the parameter space. To the best of our knowledge, this is the first evidence of a situation where EN occurs in a simultaneously 'strong and unrestricted' form, i.e. the most robust manifestation of the breaking of EE documented so far. Its ultimate origin is the presence of an extensive number of local constraints, and not of phase transitions.
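The weighted ratio R^−_∞ admits an equally direct numerical check (hypothetical function name; the per-row canonical entropy is that of m i.i.d. geometric entries):

```python
from math import comb, log

def R_inf_weighted(m, r):
    """R-_inf = 1 - ln C(r*+m-1, r*) / s_can for constant row sums r* and
    finite m, where s_can = (m+r*)ln(m+r*) - r*ln(r*) - m ln(m) is the
    per-row canonical entropy of m i.i.d. geometric entries."""
    s_can = (m + r) * log(m + r) - r * log(r) - m * log(m)
    return 1 - log(comb(r + m - 1, r)) / s_can

assert 0.55 < R_inf_weighted(2, 2) < 0.65      # strong EN at small m
# the ratio decreases with m at fixed r* (towards the sparse, weak-EN regime)
vals = [R_inf_weighted(m, 2) for m in (2, 10, 100)]
assert all(a > b for a, b in zip(vals, vals[1:]))
```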

D. Two-sided local constraints
We now discuss binary and weighted matrices under two-sided local constraints (r(G), c(G)), where r(G) is still the n-dimensional vector of row sums while c(G) is the m-dimensional vector of column sums, with entries c_j(G) = ∑_{i=1}^{n} g_ij (j = 1, …, m). We constrain both vectors to a given value (r*, c*) ≡ (r(G*), c(G*)). Note that the number of constraints is still extensive. Unlike the case of one-sided constraints, for two-sided constraints it is not possible to calculate the exact number Ω_{r*,c*} of configurations in the microcanonical ensemble. By contrast, all canonical calculations can still be carried out analytically (although the value of the Lagrange multipliers can be determined only via implicit expressions). Therefore, as we show below, the matrix Σ* of canonical covariances between the constraints becomes a crucial tool to calculate the asymptotic behaviour of the relevant microcanonical quantities.

Binary matrices under two-sided local constraints
As usual, we start from the binary case. As shown in the Appendix, in the canonical ensemble the likelihood is

ln P_can(G*|α, β) = −∑_{i=1}^{n} α_i r*_i − ∑_{j=1}^{m} β_j c*_j − ∑_{i,j} ln(1 + e^{−α_i−β_j})   (100)

and is maximized by the parameter values (α*, β*) defined implicitly by the following set of n + m coupled nonlinear equations:

r*_i = ∑_{j=1}^{m} e^{−α*_i−β*_j}/(1 + e^{−α*_i−β*_j})  (i = 1, …, n),   (101)
c*_j = ∑_{i=1}^{n} e^{−α*_i−β*_j}/(1 + e^{−α*_i−β*_j})  (j = 1, …, m).   (102)

Unfortunately, in general these equations cannot be solved analytically to express (α*, β*) as an explicit function of (r*, c*). This is due to the fact that the presence of both row and column constraints couples all parameters. However, the equations can be solved numerically, and the unique solution (α*, β*) can then be inserted into P_can(G|α, β). This gives complete analytical control over the canonical ensemble. In particular, the canonical entropy is

S*_can = ∑_{i=1}^{n} α*_i r*_i + ∑_{j=1}^{m} β*_j c*_j + ∑_{i,j} ln(1 + e^{−α*_i−β*_j}).   (103)

Note that, since this model has additional constraints with respect to the one-sided case with the same row sums r*, the canonical entropy above cannot be larger than the corresponding one-sided canonical entropy given by Eq. (63), i.e. we have the upper bound

S*_can(r*, c*) ≤ S*_can(r*).   (104)

On the other hand, the microcanonical entropy S*_mic cannot be computed analytically, although the asymptotic formulas based on Eqs. (24) and (25) can be used to estimate it from the canonical covariance matrix Σ*. As we show later, this leads to an asymptotic estimate of the relative entropy based on Eq. (26). As for the canonical entropy, the microcanonical one cannot be larger than the corresponding one given by Eq. (64) in the one-sided case with the same row sums, with the only difference that now the resulting upper bound is strict:

S*_mic(r*, c*) < S*_mic(r*).   (105)

Indeed, the configurations matching both the row and the column constraints in the two-sided case form a proper subset of the configurations matching only the row constraints in the one-sided case.
It should be noted that in the microcanonical ensemble both r and c are deterministic vectors fixed to the values r* and c* respectively, while in the canonical ensemble they are random vectors with expected values r* and c*. In the microcanonical ensemble, the hard constraints (r(G), c(G)) = (r*, c*) create mutual dependencies among all the entries of G. In the canonical ensemble, the soft constraints (⟨r⟩_{α*}, ⟨c⟩_{β*}) = (r*, c*) leave all entries of G independent. As in all other canonical binary ensembles considered above, each entry g_ij is Bernoulli-distributed, but now with its specific parameter

p*_ij = e^{−α*_i−β*_j}/(1 + e^{−α*_i−β*_j}),   (106)

as shown in the Appendix. For illustration, we consider the special case where the column sums are all equal to each other, i.e. c*_j = c* for all j. In this case, since the corresponding Lagrange multipliers must also be all equal to each other (β*_j = β* for all j), it is indeed possible to solve for the parameters explicitly. Indeed, Eqs. (101) and (102) reduce to the n + 1 independent equations

r*_i = m e^{−α*_i−β*}/(1 + e^{−α*_i−β*})  (i = 1, …, n),  c* = ∑_{i=1}^{n} r*_i/m,   (107)

where the second equation is simply the consistency condition c* = ∑_{i=1}^{n} r*_i/m implied by the first one. This means that the parameter β* is actually redundant, as it could in principle be reabsorbed into a shift of all the α*_i's. In any case, the combination α*_i + β* is found explicitly by inverting Eq. (107):

α*_i + β* = ln(m/r*_i − 1).   (108)

Note that, inserting this value into the expression for S*_can in Eq. (103), we obtain exactly the canonical entropy found previously in Eq. (63) for the binary ensemble with one-sided (row) constraints specified by the same vector r*, i.e. the bound in Eq. (104) is fully saturated:

S*_can(r*, c*) = S*_can(r*).   (109)

Indeed, the two canonical ensembles are indistinguishable and all their properties are the same.
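For generic margins, Eqs. (101) and (102) must be solved numerically. A minimal sketch of one possible alternating fixed-point scheme is given below (this is our own illustrative choice, not necessarily the algorithm used in the paper); it works in the variables x_i = e^{−α_i}, y_j = e^{−β_j}:

```python
import numpy as np

def solve_binary_margins(r, c, n_iter=10000):
    """Find x_i = exp(-alpha*_i), y_j = exp(-beta*_j) such that
    p_ij = x_i*y_j / (1 + x_i*y_j) reproduces the row sums r* and
    column sums c* as canonical averages (Eqs. (101)-(102))."""
    r = np.asarray(r, dtype=float)
    c = np.asarray(c, dtype=float)
    x, y = np.ones_like(r), np.ones_like(c)
    for _ in range(n_iter):
        P = np.outer(x, y)
        x = r / (y / (1.0 + P)).sum(axis=1)           # enforce row sums
        P = np.outer(x, y)
        y = c / (x[:, None] / (1.0 + P)).sum(axis=0)  # enforce column sums
    P = np.outer(x, y)
    return P / (1.0 + P)                              # matrix of p*_ij

p = solve_binary_margins([2, 1, 1], [2, 1, 1])
assert np.allclose(p.sum(axis=1), [2, 1, 1], atol=1e-6)
assert np.allclose(p.sum(axis=0), [2, 1, 1], atol=1e-6)
```

Note that the solution is unique only up to the gauge freedom x → γx, y → y/γ, which is the same redundancy of β* discussed above; the matrix p*_ij is gauge-invariant.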
However, the corresponding microcanonical ensembles remain very different, because in the two-sided case each of the m column sums has to match the exact value c* separately, while in the one-sided case only the total sum mc* of all the m column sums (which is necessarily implied by the row constraints) has to be matched exactly. Indeed, Eq. (105) is a strict bound that cannot be saturated. Similarly, the covariance matrix Σ* is now an (n + m) × (n + m) matrix (calculated later) and its determinant is different from the one obtained in the one-sided case, where the matrix is n × n. Again, we are going to discuss the (non)equivalence of the two ensembles together with the corresponding case of weighted matrices, after studying the latter below.
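The strictness of this inclusion can be checked by brute force on a small example. The following sketch (with sizes and margins chosen purely for illustration) enumerates all 3 × 3 binary matrices and compares the one-sided and two-sided microcanonical counts:

```python
from itertools import product
from math import comb, log

n = m = 3
r_star = [2, 2, 2]          # row sums
c_star = [2, 2, 2]          # column sums (constant, c* = 2)

one_sided = two_sided = 0
for flat in product([0, 1], repeat=n * m):
    G = [flat[i * m:(i + 1) * m] for i in range(n)]
    if [sum(row) for row in G] == r_star:
        one_sided += 1
        if [sum(col) for col in zip(*G)] == c_star:
            two_sided += 1

assert one_sided == comb(3, 2) ** 3      # 27 matrices match the rows only
assert two_sided == 6                    # complements of the 3! permutation matrices
assert log(two_sided) < log(one_sided)   # two-sided S*_mic is strictly smaller
```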

Weighted matrices under two-sided local constraints
We now discuss EN in weighted matrices with two-sided local constraints. The likelihood (see Appendix) is maximized by the unique parameter values (α*, β*) defined implicitly through n + m coupled nonlinear equations that can be solved numerically. The solution (α*, β*), when inserted into P_can(G|α, β), completely characterizes the canonical ensemble, including the resulting canonical entropy, for which an upper bound is provided by the canonical entropy given in Eq. (72) for the one-sided case with the same row constraints r*. As in the binary two-sided case, the microcanonical entropy S*_mic cannot be computed explicitly, but it can still be evaluated asymptotically from the determinant of the canonical covariance matrix Σ* using Eqs. (24) and (25). Correspondingly, the relative entropy can be computed using Eq. (26). The microcanonical entropy given by Eq. (73) for the corresponding one-sided case is still a strict upper bound for the two-sided entropy. As in the corresponding binary case, in the microcanonical ensemble both r and c are deterministic and fixed to the values r* and c*, while in the canonical ensemble they are random with expected values r* and c*. The coupled hard constraints (r(G), c(G)) = (r*, c*) create mutual dependencies among all the entries of G in the microcanonical ensemble. By contrast, the soft constraints leave all entries of G independent in the canonical ensemble. In the latter, as for all weighted matrices discussed so far, each entry g_ij is geometrically distributed over g_ij ∈ {0, 1, 2, . . . }, but now with its specific parameters, as we show in the Appendix.
Here as well, the special case where the column sums are all equal to each other (c*_j = c* for all j) provides a nice example. The corresponding Lagrange multipliers are in this case all equal to each other (β*_j = β* for all j) and this allows us to solve for all parameters explicitly. In particular, Eqs. (112) and (113) reduce to n + 1 independent equations where, again, the second equation is equivalent to the consistency condition $c^* = \sum_{i=1}^{n} r_i^*/m$. Inverting Eq. (118), we obtain the explicit solution which, if inserted into the expression for S*_can in Eq. (114), produces exactly the canonical entropy found previously in Eq. (72) for the weighted ensemble with one-sided (row) constraints specified by the same vector r*: the upper bound in Eq. (115) is therefore fully saturated. Again, while the canonical ensembles are identical for the two cases, the microcanonical ensembles remain very different and the microcanonical entropy under two-sided constraints is strictly smaller than the one under one-sided constraints: the upper bound in Eq. (116) cannot be saturated. Similarly, the determinant of the covariance matrix Σ*, which here is an (n + m) × (n + m) matrix (that we calculate later on), is different from the one obtained in the one-sided case. The (non)equivalence of the two ensembles is discussed below, in conjunction with the case of two-sided binary matrices.
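The saturation of the canonical bound in this special case can be checked numerically. The sketch below (margins chosen for illustration) uses the fact that, with constant column sums, each entry of row i is geometric with mean r*_i/m, exactly as in the one-sided ensemble with the same r*; the canonical entropy then takes the standard closed form for a product of geometric distributions, $S = \sum_i m[(1+\mu_i)\ln(1+\mu_i) - \mu_i\ln\mu_i]$ with $\mu_i = r_i^*/m$ (this closed form is standard, not quoted from the text):

```python
import numpy as np

def geometric_entropy(mu, cutoff=10000):
    """Shannon entropy of a geometric distribution p(g) = (1-q) q^g with
    mean mu (q = mu/(1+mu)), computed by direct summation up to a cutoff."""
    q = mu / (1.0 + mu)
    g = np.arange(cutoff)
    p = (1 - q) * q**g
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

n, m = 4, 3
r_star = np.array([2.0, 1.0, 1.5, 1.5])
c_star = np.full(m, r_star.sum() / m)      # constant column sums, c* = sum_i r_i*/m
assert np.isclose(c_star[0], 2.0)

# entropy summed entry by entry vs. the one-sided closed form
mu = r_star / m
S_two = sum(m * geometric_entropy(x) for x in mu)
S_closed = float((m * ((1 + mu) * np.log(1 + mu) - mu * np.log(mu))).sum())
assert abs(S_two - S_closed) < 1e-6
```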

Ensemble nonequivalence for matrices under two-sided local constraints
To investigate EN in the two-sided case, it is convenient to first combine the results obtained so far in the binary (+) and weighted (−) cases as follows.
The canonical entropy S±_n[P*_can] can be evaluated by combining Eqs. (103) and (114), as well as the corresponding upper bounds given by Eqs. (104) and (115), into a single expression (see Eq. (79) for a comparison). It is easy to check that, in all three regimes considered (sparse, dense with diverging m, dense with finite m), this canonical entropy has the same qualitative behaviour as the corresponding quantity obtained previously in Eq. (79) for the one-sided case.
Unlike the one-sided case, the microcanonical entropy cannot be evaluated exactly, either through a direct combinatorial formula or via the complex integral approach; we only have the strict upper bounds given by Eqs. (105) and (116) in the binary and weighted cases respectively, which we can combine accordingly (see Eq. (80) for a comparison).
We can now discuss EN in a combined fashion for binary and weighted matrices. While we cannot calculate the relative entropy exactly, we can correctly evaluate its asymptotic scaling via Eq. (26), because the canonical covariance matrix (Σ*)± between the constraints can still be calculated analytically as a function of the parameters (α*, β*), in both the binary and weighted cases. In particular, it is easy to see that the entries (Σ*_ij)± are arranged into a block structure, with a square n × n diagonal block (i, j ∈ [1, n]) representing the covariance matrix between pairs of row sums, a square m × m diagonal block (i, j ∈ [n + 1, n + m]) representing the covariance matrix between pairs of column sums, and two rectangular (n × m and m × n) off-diagonal blocks representing the covariances between row and column sums (i ∈ [1, n], j ∈ [n + 1, n + m] and i ∈ [n + 1, n + m], j ∈ [1, n]). As we show in the Appendix, these entries generalize Eq. (86) to the case of two-sided constraints. Once the values of r* and c* are specified, one can calculate the determinant of this matrix and, through Eq. (26), the leading order of the relative entropy S±_n[P*_mic||P*_can]. As we show in the Appendix, the order of α±_n confirms the same scalings for the relative entropy found previously in Eqs. (83), (84) and (85) for the one-sided case: namely, α±_n = O(n) in the sparse regime, α±_n = O(n ln n) in the dense regime with m = O(n), and α±_n = O(n) in the dense regime with finite m.
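The block structure just described can be made concrete in a few lines. The sketch below (binary case, with arbitrary illustrative parameters p_ij) assembles the (n + m) × (n + m) matrix from its four blocks and checks its basic properties, including the zero mode (1, . . . , 1, −1, . . . , −1) that reflects the exact linear dependence Σ_i r_i = Σ_j c_j among the constraints (and hence the redundancy of one Lagrange multiplier):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
P = rng.uniform(0.1, 0.9, size=(n, m))     # illustrative Bernoulli parameters p_ij
V = P * (1 - P)                            # entry variances sigma^2_ij

# assemble the (n+m) x (n+m) block covariance of the constraints (r, c)
A = np.diag(V.sum(axis=1))                 # n x n: covariances among row sums (diagonal)
D = np.diag(V.sum(axis=0))                 # m x m: covariances among column sums (diagonal)
B = V                                      # n x m: Cov[r_i, c_j] = sigma^2_ij
Sigma = np.block([[A, B], [B.T, D]])

assert np.allclose(Sigma, Sigma.T)                   # symmetric
assert np.all(np.linalg.eigvalsh(Sigma) > -1e-10)    # positive semi-definite
# since sum of row sums = sum of column sums, (1,...,1,-1,...,-1) is a zero mode
v = np.concatenate([np.ones(n), -np.ones(m)])
assert np.allclose(Sigma @ v, 0.0)
```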
In practice, unlike the one-sided case, calculating the values of S±_n[P*_mic||P*_can] and R±_∞ (or bounds for them) as explicit functions of the constraints is not easy in general. It is however possible, and instructive, to consider a special case where S±_n[P*_mic||P*_can] and R±_∞ in this two-sided case (given the vectors r* and c*) can be related to the corresponding values obtained in the one-sided case with the same vector r* (but without a constraint on c*). Indeed, if we consider again the special case with constant column constraints (c*_j = c* for j = 1, . . . , m), then from our previous results in Eqs. (110) and (121) we recall that, for any given value of n, the two-sided canonical entropy S±_n[P*_can] is exactly equal to the one-sided canonical entropy given in Eq. (79) corresponding to the same vector r*, while of course the two-sided microcanonical entropy S±_n[P*_mic] is strictly smaller than the one-sided one given in Eq. (80). This automatically implies that S±_n[P*_mic||P*_can] in the two-sided case is strictly larger than the corresponding one-sided relative entropy given in Eq. (81). This proves that the scaling of the relative entropy is always at least O(n), irrespective of the density and of the value of m: in all regimes, EE breaks down for binary and weighted matrices under two-sided local constraints, as found in the one-sided case. The presence of the extra column constraints does not change the qualitative behaviour of the relative entropy, but only its numerical value. Since the assumption of constant column sums only changes the values, but not the order, of the relative entropy, we expect that the scalings remain unchanged in the general case as well.
Moreover, EN again takes the strong form (R±_∞ > 0) in the dense regime with finite m, because the value of R±_n = 1 − S±_n[P*_mic]/S±_n[P*_can] in the two-sided case is strictly larger than the corresponding one calculated previously for the one-sided case. In particular, we can use Eq. (93) to establish a lower bound in the two-sided case with constant column constraints and finite m, and this inequality proves strong EN in this case as well. Again, we expect that relaxing the assumption of constant column sums will change only the value of R±_n, but not its strict positivity.
Finally, we can also establish an upper bound for R±_n by rewriting Eq. (26) asymptotically for large n, in analogy with Eq. (88), in terms of a finite positive constant C_2(m), and by noticing that, since the covariance matrix (Σ*)± is positive-definite, we can use Hadamard's inequality, which states that the determinant of a positive-definite matrix is less than or equal to the product of its diagonal entries. Using Eq. (124), this yields a corresponding bound. Now, using Eqs. (109) and (120) in the binary and weighted case respectively, it is easy to show that, in the two-sided case with constant column sums, the first n diagonal entries of (Σ*)± are identical to the n diagonal entries of the covariance matrix in the corresponding one-sided case given by Eq. (86).
Inserting the above expression into Eq. (128), and noticing that the last sum in the latter is strictly positive, we obtain a bound which, combined with Eqs. (126), (127) and (130), yields the upper bound (for large n) in Eq. (131), where we have used Eqs. (93) and (94) established in the one-sided case. Comparing Eq. (131) with Eq. (125), we see that we must have C_2(m)/C_1(m) > 1. We conjecture that, in analogy with C_1(m) in the one-sided case, C_2(m) ≳ 1. Moreover, here as well we know that lim_{m→∞} C_2(m) = 1, as in Eq. (89). This means that we expect that, for large n, C_2(m)/C_1(m) ≈ 1, so that the upper bound in Eq. (131) approaches the lower bound in Eq. (125), which is therefore a very good estimate of the actual value of R±_n in the two-sided case with constant column constraints. Upon comparison with Eq. (93), we see that R±_n remains practically unchanged with respect to the one-sided case with the same value of r*.
Note that the above result means that the decrease ∆_mic in microcanonical entropy introduced by the extra column constraints is subleading with respect to the canonical entropy. Indeed, denoting with {·}_h a quantity evaluated in the h-sided case (where h = 1, 2), exploiting again the identity {S±_n[P*_can]}_1 = {S±_n[P*_can]}_2 of the canonical entropies and our conjecture C_2(m)/C_1(m) ≈ 1, we can use Eqs. (15) and the results obtained so far to express the decrease in microcanonical entropy, which turns out to be of order O(ln n), while the canonical entropy is of order O(n) in the dense case with finite m considered here. Clearly, if we additionally consider constant row constraints, i.e. r*_i = r* for i = 1, . . . , n (where necessarily r* = c*m/n), then in analogy with Eq. (95) we can establish an explicit lower bound for the value of R±_∞ in the two-sided case with constant row and column constraints. Our expectation in Eq. (132) suggests that this lower bound is a very good approximation for the actual value of R±_∞, leading to the same result as in Eq. (95) for the one-sided case.
The above results generalize the finding of strong EN to the two-sided case, again in the dense regime with finite m. The results do not change qualitatively, and apparently only slightly quantitatively, with respect to the one-sided case. This again points to the fact that it is the extensivity of the constraints that plays the key role for EN: adding a finite number m of (column) constraints does not substantially change the picture already obtained in the one-sided case.
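The Hadamard inequality invoked in the upper-bound argument above is easy to verify numerically. The sketch below (a generic positive-definite matrix constructed purely for illustration) checks that det ≤ product of diagonal entries, with equality for a diagonal matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# a generic positive-definite matrix, built as B B^T plus a small ridge
B = rng.normal(size=(6, 6))
S = B @ B.T + 1e-6 * np.eye(6)

det_S = np.linalg.det(S)
prod_diag = np.prod(np.diag(S))
assert det_S <= prod_diag + 1e-12      # Hadamard: det <= product of diagonal entries

# equality holds when the matrix is diagonal
D = np.diag(np.diag(S))
assert np.isclose(np.linalg.det(D), np.prod(np.diag(D)))
```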

IV. DISCUSSION AND CONCLUSIONS
We have studied the problem of EN in the general context of n × m matrices with given constraints. Such matrices can represent high-dimensional data such as multivariate time series, expression profiles, multiplex social activity, and other relational or structured data encountered in many settings. Their entries can be either binary (Boolean) or weighted (non-negative integers). The constraints imposed on these matrices represent sums over either all the entries of the matrix (single global constraint), over individual rows (local one-sided constraints), or additionally over columns (local two-sided constraints). These constraints take the form of linear terms in the Hamiltonian at the exponent of the maximum-entropy probability distribution characterizing the matrix ensemble.
Global constraints do not account for the heterogeneity (either spatial or temporal, i.e. nonstationarity) of the physical data-generating process, as they lead to probability distributions with identical parameters for all the entries of the matrix. By contrast, local constraints produce probability distributions with different local (row- and possibly column-specific) parameters. Most modern data structures are heterogeneous and/or nonstationary, and are therefore characterized by (at least) the type of local constraints considered here. Indeed, maximum-entropy ensembles with local constraints are increasingly used, either as null models for pattern detection or even as generative models and inference methods whenever only partial, local information is available about the system [15,21].
We have shown that local constraints break the asymptotic (i.e. for large n) equivalence of canonical and microcanonical ensembles, where the constraints are enforced in a soft and hard manner respectively. By contrast, global constraints preserve EE. Mathematically, EE is encountered when the relative entropy between the canonical and microcanonical probability distributions is o(n). Importantly, the breakdown of EE observed here under local constraints occurs without phase transitions, which would require nonlinear constraints in the Hamiltonian and are therefore deliberately excluded from the cases we considered. The form of EN we observe under local constraints is also 'unrestricted', i.e. it holds for any value of the model parameters (here, for any graphical value of the constraints), while the mechanism for EN based on phase transitions requires specific parameters or phases. Our results hold in all regimes of density and for all values of m, and therefore generalize a recently discovered, alternative mechanism for the breakdown of EE observed so far in ensembles of binary graphs with given degree sequence [5,11,13,14] and weighted graphs with given strength sequence [8].
At the same time, our results highlight a qualitatively new finding. While the systems with local constraints studied in the past exhibited a 'weak' degree of EN (where the relative entropy is of smaller order than the canonical entropy, while still growing at least linearly in n), here we identified a regime for which EN is as 'strong' as in the presence of phase transitions (i.e. with the relative entropy being of the same order as the canonical entropy). This regime is obtained when both m and the expected value of each entry of the matrix are finite, i.e. O(1). In practice, this means that the data structure is one where n grows as the size of the system grows, while m remains finite. This circumstance is naturally encountered, e.g., when n represents a large number of timesteps during which a small number m of synchronous time series are observed (e.g. EEG signals), or when n represents a large number of genes for which expression levels are observed in a small number m of cells at the same time, or when n represents a large number of users whose activities or preferences are recorded for a small number m of platforms, items, or other dimensions.
To the best of our knowledge, the simultaneously 'strong' and 'unrestricted' form of EN discussed here has never been documented before. Indeed, in all the settings studied previously, m was necessarily equal to n, since the matrices represented the special case of square adjacency matrices of graphs; the regime leading to strong EN could therefore not be observed.
EN has important practical consequences. A traditional expectation in statistical physics is that, in the absence of phase transitions or long-range interactions, ensembles are equivalent and it is therefore legitimate to freely choose the ensemble to work with, e.g. based purely on mathematical or computational convenience. For instance, if the ensemble is used as a null model for a real system, one may either want to randomize the data numerically while keeping certain quantities fixed (in which case the microcanonical ensemble is the most efficient choice) or prefer an exact mathematical characterization of the probability of each configuration in the ensemble (in which case the canonical ensemble is the easiest to work with). This view has been challenged by the recent discovery of EN under local constraints. Nonequivalence imposes a principled choice of the ensemble, which can no longer be based on practical convenience. For instance, if one has reasons to believe that the hypothesis underlying the null model, or the partial information available about the system, should be treated as a hard constraint, then one is forced to choose the microcanonical ensemble. By contrast, if one believes the constraints should be treated as soft (for instance to account for possible measurement errors leading to noisy values of the constraints in the data), then one should take the canonical route.
Our observation of strong EN shows that the quantitative differences between the two descriptions of the same system are much bigger than previously encountered in the case of weak EN. These big quantitative differences are exemplified by Eqs. (34), (35) and (36). A 'wrong' choice of the ensemble can therefore lead to major errors in the estimation of the probability distribution characterizing the ensemble, of the resulting entropy, of the expected values of higher-order properties that are nonlinear functions of the constraints, etc. Conclusions of statistical analyses can therefore be highly biased.
However, besides this warning, the findings presented here are also intended to offer a constructive solution. The fact that it is possible to rigorously quantify the differences between the two ensembles, via the explicit calculation of the relative entropy ratio R_n and its limiting value R_∞, implies that one can still make a convenient choice of the ensemble, while at the same time being able to retrieve the desired results for the other ensemble via the calculated value of R_∞. In other words, besides being a warning signal for strong EN, R_∞ is also a concrete tool allowing researchers to switch more easily between alternative descriptions of the same system, by compensating for their irreducible differences. The calculations carried out here can hopefully serve as useful references for future quantitative research in a variety of domains.

Appendix A: Global constraints
The Hamiltonian in Eq. (A1) is the same for both binary and weighted matrices under a global constraint. However, the calculation of the partition function (and hence of all the other properties) is different in the two cases.

Binary matrices under a global constraint
Let us consider binary matrices first (g_ij = 0, 1). The partition function can be calculated as $Z(\theta) = \sum_G e^{-\theta t(G)} = (1 + e^{-\theta})^{nm}$. This leads to the canonical probability $P_{\rm can}(G|\theta) = e^{-\theta t(G)}\,(1 + e^{-\theta})^{-nm}$ and to Eq. (37) in the main text. Notice that Eq. (A3) reveals that each entry g_ij of the matrix G is a Bernoulli-distributed random variable taking the value g_ij = 1 with probability p(1|θ) = e^{−θ}/(1 + e^{−θ}) and the value g_ij = 0 with probability p(0|θ) = 1/(1 + e^{−θ})
(the entries of G are therefore i.i.d.). The expected value of g_ij is ⟨g_ij⟩_θ = e^{−θ}/(1 + e^{−θ}), while its variance is Var_θ[g_ij] = ⟨g_ij⟩_θ(1 − ⟨g_ij⟩_θ). The resulting expected value and variance of the constraint t(G) are ⟨t⟩_θ = nm⟨g_ij⟩_θ and Var_θ[t] = nm Var_θ[g_ij], the latter identity following from the fact that, since all the entries of G are mutually independent, the variance of the constraint t(G) is the sum of all variances. Now, we have to find the parameter value θ* that solves Eq. (5) or, equivalently, maximizes the log-likelihood ln P_can(G*|θ). This can be done by setting the expected value ⟨t⟩_{θ*} equal to the desired value t*, which leads to $e^{-\theta^*} = t^*/(nm - t^*)$. The calculation of the microcanonical entropy S*_mic is in this case trivial and given by Eq. (40) in the main text. Given the simplicity of this example, it is instructive to show explicitly that the integral formula in Eq. (20), which can be calculated exactly in this case, gives the correct value of Ω_{t*}. To do this, we use Eq. (A3) to obtain the required complex-valued quantity and use Eq. (20) to calculate Ω_{t*} as an integral over ψ. To calculate this integral, we change variable from ψ to z ≡ e^{−(θ*+iψ)}, so that dz = −iz dψ and dψ = i dz/z. Applying the binomial formula, each integral in the resulting sum can be calculated using Cauchy's residue theorem, from which we know that the integral is non-zero only when the exponent of z is −1, in which case it equals −2πi. This selects the only value k = mn − t* in the sum, so that Ω_{t*} coincides with the binomial coefficient used in Eq. (40).
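This counting result can be verified by brute force for small matrices. The sketch below (sizes chosen purely for illustration) checks that the number of binary matrices with total sum t* is the binomial coefficient of Eq. (40), and that the maximum-likelihood condition ⟨t⟩_{θ*} = t* gives the Bernoulli parameter p* = t*/(nm):

```python
from itertools import product
from math import comb, exp, log, isclose

n, m, t_star = 2, 2, 2

# brute-force count of binary n x m matrices with total sum t*
omega = sum(1 for flat in product([0, 1], repeat=n*m) if sum(flat) == t_star)
assert omega == comb(n*m, t_star)          # = 6, the binomial coefficient of Eq. (40)

# maximum-likelihood condition <t> = nm e^{-theta}/(1+e^{-theta}) = t*
theta_star = log((n*m - t_star) / t_star)
p_star = exp(-theta_star) / (1 + exp(-theta_star))
assert isclose(p_star, t_star / (n*m))     # each entry is Bernoulli with p* = t*/(nm)
```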

Weighted matrices under a global constraint
We now consider the case of weighted matrices (g_ij = 0, 1, 2, . . . , +∞) with a global constraint t*. The Hamiltonian is still given by Eq. (A1), while the partition function is now calculated differently, as $Z(\theta) = \sum_G e^{-\theta t(G)} = (1 - e^{-\theta})^{-nm}$ (with θ > 0 for convergence). The resulting canonical probability leads to Eq. (45) in the main text. Equation (A19) shows that all the entries of G are i.i.d. random variables, in this case distributed according to a geometric distribution with success probability e^{−θ}. The expected value of g_ij is now ⟨g_ij⟩_θ = e^{−θ}/(1 − e^{−θ}) and its variance follows accordingly (note the change of sign at the denominator with respect to Eq. (A6)), from which we calculate the expected value and variance of the constraint t(G) as ⟨t⟩_θ = nm⟨g_ij⟩_θ and Var_θ[t] = nm Var_θ[g_ij]. The maximum-likelihood parameter value θ* is found by setting the expected value ⟨t⟩_{θ*} equal to t*, resulting in $e^{-\theta^*} = t^*/(nm + t^*)$. Again, it is instructive to show that the complex integral in Eq. (20) gives the exact result corresponding to the microcanonical entropy reported in Eq. (48). We first perform the change of variable y ≡ e^{−(θ*+iψ)}, dψ = i dy/y, and rearrange the integral. Then we perform a second change of variable z ≡ y/(1 − y), dy = dz/(z + 1)², and apply the binomial formula in Eq. (A15) twice. Using again the residue theorem, the only non-zero integral is obtained for h = t* − k, which selects a single value of the sum (where we have used the generalized Vandermonde identity). The calculation retrieves exactly the negative binomial coefficient used in Eq. (48).
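The negative binomial count can likewise be checked by brute force on a small example (sizes chosen purely for illustration); since each entry can be at most t*, truncating the enumeration range at t* is exact:

```python
from itertools import product
from math import comb

n, m, t_star = 2, 2, 3

# brute-force count of non-negative integer n x m matrices with total sum t*
omega = sum(1 for flat in product(range(t_star + 1), repeat=n*m)
            if sum(flat) == t_star)
# stars-and-bars: C(t* + nm - 1, t*) = 20, the negative binomial coefficient of Eq. (48)
assert omega == comb(t_star + n*m - 1, t_star)
```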

Appendix B: One-sided local constraints
We now consider the case of one-sided local constraints on ensembles of binary and weighted n × m matrices. The constraint C(G) is now an n-dimensional (K = n) vector r(G), where the entry $r_i(G) = \sum_{j=1}^{m} g_{ij}$ is the i-th row sum of the matrix G. Correspondingly, there is an n-dimensional vector θ of Lagrange multipliers defining a Hamiltonian that is the same for both binary and weighted matrices. The calculation of the resulting properties of binary and weighted ensembles is discussed separately below.

Binary matrices under one-sided local constraints
In the binary case, the partition function Z(θ) can be calculated from Eq. (B1) according to a generalization of Eq. (A2). The resulting canonical probability leads to Eq. (61) in the main text. As in the case of binary matrices under a global constraint, each entry g_ij of the matrix G is a Bernoulli-distributed random variable. However, while all these entries are still independent, the parameter of the distribution depends on the row being considered. Consequently, Eqs. (A5) and (A6) generalize row by row, and we can calculate the expected value of each constraint r_i(G) and the variance of r_i accordingly, while all covariances between different constraints are zero because of the independence of distinct entries of G. We can combine Eqs. (B8) and (B9) into a single expression involving the Kronecker delta, where δ_ij = 1 if i = j and δ_ij = 0 if i ≠ j.
Now, the parameter value θ* that maximizes the log-likelihood is found by equating the expected value ⟨r⟩_{θ*} with the desired value r*, which is achieved by inverting Eq. (B7). The microcanonical entropy S*_mic can be directly calculated as Eq. (64) in the main text. We can still confirm that its value is exactly retrieved by using the integral formula in Eq. (20). Starting from Eq. (B3) and using Eq. (20), we can calculate Ω_{r*} by exploiting again the binomial theorem, a change of variables (z_i ≡ e^{−(θ*_i+iψ_i)}, dz_i = −iz_i dψ_i) and the residue theorem, as in Eqs. (A14), (A16) and (A17); the result coincides with Eq. (64).
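The factorization of the one-sided microcanonical count over independent rows can be checked by brute force (sizes and margins chosen purely for illustration):

```python
from itertools import product
from math import comb, prod

n, m = 2, 3
r_star = [1, 2]

# brute-force count of binary matrices whose i-th row sums to r*_i
omega = sum(1 for flat in product([0, 1], repeat=n*m)
            if all(sum(flat[i*m:(i+1)*m]) == r_star[i] for i in range(n)))
# rows are independent, so the count factorizes: prod_i C(m, r_i*) = 3 * 3 = 9
assert omega == prod(comb(m, r) for r in r_star) == 9
```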

Weighted matrices under one-sided local constraints
In the weighted case, the partition function is given by a generalization of Eq. (A18). The resulting canonical probability leads to Eq. (70) in the main text. As in the case of weighted matrices under a global constraint, each entry g_ij of the matrix G is an independent and geometrically distributed random variable. On the other hand, as in the case of binary matrices under local constraints, the parameter of the distribution depends on the row being considered. The resulting expected value and variance of g_ij are given by the corresponding generalizations of Eqs. (A21) and (A22), from which the expected value of each constraint r_i(G) and the covariances between different constraints follow. As usual, note the change of sign at the denominator of Eqs. (B21) and (B22) with respect to the corresponding Eqs. (B7) and (B10) valid in the binary case. Using Eq. (B21), we set ⟨r⟩_{θ*} = r* and solve for θ*, finding the parameter value that maximizes the log-likelihood. From the resulting expression, we get Eq. (71) and, using Eq. (B17), Eq. (72) in the main text. Similarly, the expression for the entries of the n × n covariance matrix Σ* given in Eq. (68) in the main text follows from combining Eqs. (B8), (B9) and (B11). Note that Eq. (68) can also be obtained by differentiating the logarithm of Eq. (B2) as prescribed by Eq. (23), in analogy with the binary case. The microcanonical entropy S*_mic can be directly calculated as Eq. (73) in the main text. We can still confirm that its value is correctly retrieved by using the integral formula in Eq. (20). Starting from Eq. (B17) and using Eq. (20), we can calculate Ω_{r*} by exploiting again the binomial theorem. We then use the change of variables y_i ≡ e^{−(θ*_i+iψ_i)}, dψ_i = i dy_i/y_i, the relation (1 − y_i)^{−m} = (1 + y_i/(1 − y_i))^m, and another change of variables z_i = y_i/(1 − y_i).
Now, according to Cauchy's residue theorem, we get a non-zero value only when l + k = r*_i. This allows us to further simplify the result, which coincides with Eq. (74) in the main text.
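As in the binary case, the weighted one-sided count factorizes over independent rows into negative binomial coefficients, which can be checked by brute force (sizes and margins chosen purely for illustration):

```python
from itertools import product
from math import comb, prod

n, m = 2, 2
r_star = [2, 1]

# brute-force count of non-negative integer matrices whose i-th row sums to r*_i
# (entries of row i are at most r*_i, so truncating the range at max(r*) is exact)
omega = sum(1 for flat in product(range(max(r_star) + 1), repeat=n*m)
            if all(sum(flat[i*m:(i+1)*m]) == r_star[i] for i in range(n)))
# rows factorize into negative binomial coefficients C(r_i* + m - 1, r_i*) = 3 * 2
assert omega == prod(comb(r + m - 1, r) for r in r_star) == 6
```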

Appendix C: Two-sided local constraints
We now discuss ensembles of binary and weighted n × m matrices with two-sided local constraints. In this case C(G) is (n + m)-dimensional (K = n + m) and specified by the two vectors (r(G), c(G)), where r(G) is still the n-dimensional vector of row sums of the matrix G (as in the one-sided case) and, additionally, c(G) is the m-dimensional vector of column sums of G, with entries $c_j(G) = \sum_{i=1}^{n} g_{ij}$ (j = 1, . . . , m). The corresponding Lagrange multipliers take the form (α, β), where α is n-dimensional and coupled to r(G), while β is m-dimensional and coupled to c(G), and together they define the corresponding Hamiltonian. As usual, the binary and weighted cases are discussed separately below.

Binary matrices under two-sided local constraints
Starting from the Hamiltonian in Eq. (C1), the partition function of the canonical binary matrix ensemble can still be calculated exactly as a simple generalization of Eq. (B2). As in the case of binary matrices under global and one-sided local constraints, each entry g_ij of the matrix G is still an independent and Bernoulli-distributed random variable, now controlled by the two parameters α_i and β_j, from which the probability, expected value and variance of g_ij and the expected values of the constraints follow. The unique parameter values (α*, β*) that maximize the likelihood are found as usual by imposing that the expected values (⟨r⟩_{α*,β*}, ⟨c⟩_{α*,β*}) match the desired values (r*, c*). Unfortunately, in this case the values (α*, β*) cannot be determined analytically as a function of (r*, c*), but they are defined implicitly by this matching condition, which leads to Eqs. (101) and (102) in the main text.

Weighted matrices under two-sided local constraints
In the canonical ensemble of weighted matrices under two-sided local constraints, the partition function is the corresponding generalization of Eq. (B16). As in the case of weighted matrices under global and one-sided local constraints, each entry g_ij of the matrix G is an independent geometrically distributed random variable, defined by the probability $p(g_{ij}|\alpha, \beta) = e^{-(\alpha_i+\beta_j)g_{ij}}[1 - e^{-(\alpha_i+\beta_j)}]$, which is now controlled by the entry-specific pair of parameters α_i, β_j. The expected value of g_ij is $\langle g_{ij}\rangle_{\alpha,\beta} = e^{-(\alpha_i+\beta_j)}/[1 - e^{-(\alpha_i+\beta_j)}]$ (C13), and its variance follows accordingly, as do the expected values of the constraints. As in the two-sided binary case, the values (α*, β*) maximizing the likelihood cannot be determined analytically as a function of the empirical values (r*, c*), but they are defined implicitly by imposing the equality (r*, c*) = (⟨r⟩_{α*,β*}, ⟨c⟩_{α*,β*}) (C17) between the empirical and the expected values of the constraints. This equality leads to Eqs. (112) and (113) in the main text.

Determinant of the covariance matrix for two-sided local constraints
The covariance matrix (Σ*)± in binary (+) and weighted (−) ensembles of matrices under two-sided local constraints is an (n + m) × (n + m) matrix. It contains all covariances among the n row sums, all covariances among the m column sums, and all covariances between row and column sums. We order the constraints by considering first the n row sums r* and then the m column sums c*, arranged into the (n + m)-dimensional vector C* = (r*, c*), and combine the corresponding covariance expressions following Eq. (22). It is easy to see that (Σ*)± is a combination of four blocks, where each block has entries as described below. What determines these entries is the elements g_ij of the (binary or weighted) adjacency matrix G that different constraints have in common. The covariance between constraints that have no g_ij in common is zero (as such constraints are independent), while the covariance between constraints that share a term g_ij receives from that term a contribution obtained by combining Eqs. (C6) and (C14).
• Similarly, block (C*)± is the m × n matrix of covariances between column sums and row sums, and is therefore the transpose of (B*)±, as follows also from the fact that (Σ*)± must be symmetric.
• Finally, block (D*)± is the m × m matrix of covariances among the column sums, with entries $(D^*_{ij})^\pm = \mathrm{Cov}^\pm_{\alpha^*,\beta^*}[c_i, c_j] = \partial^2 \ln Z^\pm(\alpha, \beta)/\partial\beta_i \partial\beta_j$.
The determinant of (Σ*)± can be expanded as a sum over permutations σ of the first n + m integers that exchange (without replacement) each of these integers i with another such integer j = σ_i, where Z_{n+m} is the set of all such (n + m)! permutations and the symbol sgn(σ) represents the parity of σ: sgn(σ) = +1 when σ is an even permutation (i.e. obtained by combining an even number of pairwise exchanges of the type j = σ_i and i = σ_j) and sgn(σ) = −1 when σ is an odd permutation (i.e. obtained by combining an odd number of pairwise exchanges). Let us call σ_0 the identity permutation, i.e. the one such that σ_0(i) = i for all i, and Z⁰_{n+m} ≡ Z_{n+m}\σ_0 the set of all other permutations. Clearly, sgn(σ_0) = +1, because σ_0 involves an even number (zero) of exchanges. We can therefore rewrite Eq. (C26) as the sum of ∆_0, the product of the diagonal entries of (Σ*)±, and a remainder ∆′ collecting the contributions of all other permutations. We are going to show that ∆′ is at most of the same order as ∆_0. Setting c_n = 1/n in the sparse regime (for which we recall that m = O(n) necessarily) and c_n = 1 in the dense regime (for which m can be either finite or O(n)), we note that each entry of the blocks (B*)± and (C*)± is of order O(c_n), while each of the diagonal entries of block (A*)± is of order O(c_n m) and each of the diagonal entries of block (D*)± is of order O(c_n n). The order of ∆_0 is therefore O((c_n m)^n (c_n n)^m). To estimate ∆′, we note that each permutation σ appearing in Eq. (C29) can be expressed as a combination of a certain number (say a > 0) of exchanges of pairs of the first n + m integers.
It is easy to see that all the n(n − 1)/2 exchanges of pairs of the first n integers give a zero contribution to ∆′, because they lead to terms of the type (Σ*_{i,j})± = 0 with i, j ∈ [1, n] and i ≠ j (combinations of exchanges that lead back to σ_i = i do not produce new permutations: they are already accounted for by permutations with a lower value of a). Similarly, all the m(m − 1)/2 exchanges of pairs of the next m integers give a zero contribution to ∆′, because they lead to terms of the type (Σ*_{i,j})± = 0 with i, j ∈ [n + 1, n + m] and i ≠ j. Therefore the only exchanges leading to nonzero contributions to ∆′ are the nm exchanges across the first n integers and the next m integers, i.e. those that lead to terms (Σ*_{i,j})± > 0 with i ∈ [1, n] and j ∈ [n + 1, n + m], or j ∈ [1, n] and i ∈ [n + 1, n + m], in Eq. (C29). Each of these nontrivial contributing permutations involves a (unrepeated) exchanges of integers, where a ∈ [1, nm]. Compared with the identity σ₀, each of these permutations replaces a of the first n diagonal entries and a of the next m diagonal entries of (Σ*)± appearing in Eq. (C28) with 2a non-zero off-diagonal entries taken from the blocks (B*)± and (C*)± (see Fig. 4).
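The vanishing of within-block contributions can be seen directly on a toy block matrix with diagonal A and D blocks and a positive cross block B (the numbers below are illustrative only, not derived from any ensemble): every permutation that displaces an index within the first n (or within the last m) integers picks up a zero factor and drops out of ∆′.

```python
import itertools
import numpy as np

# Toy covariance matrix: diagonal blocks A (2x2) and D (2x2), positive
# cross block B; all values are illustrative.
n, m = 2, 2
B = np.array([[0.3, 0.1],
              [0.2, 0.4]])
Sigma = np.block([[np.diag([2.0, 3.0]), B],
                  [B.T, np.diag([4.0, 5.0])]])

for perm in itertools.permutations(range(n + m)):
    term = np.prod(Sigma[np.arange(n + m), perm])
    # an index is displaced "within a block" if it moves but stays on the
    # same side of the row/column split
    within_block = any(i != j and (i < n) == (j < n)
                       for i, j in enumerate(perm))
    if within_block:
        assert term == 0.0  # such permutations cannot contribute to Delta'
```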
Each such permutation therefore gives a contribution of order (c_n)^{2a} (c_n m)^{n−a} (c_n n)^{m−a} to the summation in Eq. (C29).
Individually, each such contribution is subleading with respect to the term ∆₀. However, collectively all the contributions involving the same number a of exchanges contribute a term of order E_a (c_n)^{2a} (c_n m)^{n−a} (c_n n)^{m−a}, where E_a is the number of unrepeated choices of a exchanges. The order of E_a is given by the number of distinct choices of a exchanges out of the nm possible ones, which is the binomial coefficient C(nm, a). This estimate does not control for the fact that, for each a, some combinations of exchanges reduce to simpler permutations already accounted for by smaller values of a; the leading order, however, is correct. Since nm is large, we can apply Stirling's approximation to (nm)! and estimate the order of E_a as

  E_a = O((nm)^a / a!).   (C34)

Since (c_n)² nm / [(c_n m)(c_n n)] = 1, summing over a bounds the ratio ∆′/∆₀ by a term of order Σ_{a≥1} 1/a! = e − 1, which is finite. In other words, the off-diagonal terms of (Σ*)± do not alter the order obtained by multiplying the diagonal terms. We therefore have

  α_n± = ln det[2π(Σ*)±] = O(m ln(c_n n) + n ln(c_n m)).
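Both counting estimates can be checked numerically with a toy value of nm (the size below is arbitrary): the binomial coefficient is well approximated by (nm)^a / a! for a ≪ nm, and the resulting sum stays bounded by e − 1.

```python
from math import comb, factorial

# Toy check of the Eq. (C34)-style counting; nm is an arbitrary large value.
nm = 10_000

# binom(nm, a) ~ (nm)^a / a! for a << nm
for a in (1, 2, 5, 10):
    ratio = comb(nm, a) / (nm**a / factorial(a))
    assert 0.99 < ratio <= 1.0

# sum_{a>=1} binom(nm, a) * (nm)^{-a} = (1 + 1/nm)^{nm} - 1 < e - 1,
# accumulated via the ratio of successive terms to avoid huge integers
total, term = 0.0, 1.0
for a in range(1, nm + 1):
    term *= (nm - a + 1) / (a * nm)
    total += term
assert 1.71 < total < 1.72   # approaches e - 1 ~= 1.71828
```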
In the sparse case, where c_n = 1/n and m = O(n), we have α_n± = O(n), while in the dense case with c_n = 1 and m = O(n) we have α_n± = O(n ln n), and finally in the dense case with c_n = 1 and finite m we have α_n± = O(n), confirming the same scalings for the relative entropy obtained in Eqs. (83), (84) and (85) for the one-sided case.
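For concreteness, the three regimes of the leading-order expression above can be evaluated numerically (toy sizes; the function below simply evaluates m ln(c_n n) + n ln(c_n m)):

```python
import math

# Leading-order expression for alpha_n in the three regimes; n and m are toy values.
def alpha(n, m, c):
    return m * math.log(c * n) + n * math.log(c * m)

n = 10**6
a_sparse = alpha(n, 2 * n, 1.0 / n)   # c_n = 1/n, m = O(n)
a_dense  = alpha(n, 2 * n, 1.0)       # c_n = 1,  m = O(n)
a_finite = alpha(n, 3, 1.0)           # c_n = 1,  m finite

assert 0 < a_sparse / n < 10                  # O(n)
assert 2 < a_dense / (n * math.log(n)) < 4    # O(n ln n)
assert 0 < a_finite / n < 10                  # O(n)
```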