Correlation, response and entropy approaches to allosteric behaviors: a critical comparison on the ubiquitin case

Correlation analysis and its close variant principal component analysis are tools widely applied to predict the biological functions of macromolecules in terms of the relationship between fluctuation dynamics and structural properties. However, since this kind of analysis does not necessarily imply causation links among the elements of the system, its results run the risk of being biologically misinterpreted. By using as a benchmark the structure of ubiquitin, we report a critical comparison of correlation-based analysis with the analysis performed using two other indicators, response function and transfer entropy, that quantify the causal dependence. The use of ubiquitin stems from its simple structure and from recent experimental evidence of an allosteric control of its binding to target substrates. We discuss the ability of correlation, response and transfer-entropy analysis in detecting the role of the residues involved in the allosteric mechanism of ubiquitin as deduced by experiments. To maintain the comparison as much as free from the complexity of the modeling approach and the quality of time series, we describe the fluctuations of ubiquitin native state by the Gaussian network model which, being fully solvable, allows one to derive analytical expressions of the observables of interest. Our comparison suggests that a good strategy consists in combining correlation, response and transfer entropy, such that the preliminary information extracted from correlation analysis is validated by the two other indicators in order to discard those spurious correlations not associated with true causal dependencies.


Introduction
In proteins undergoing allosteric regulation [1], the binding of a ligand to the regulatory site affects the catalytic activity of the active site, generally placed at a distant location from the binding region [2]. The term 'allostery' was coined by Monod and Jacob [3] just to define such long-range effects activated across a molecule by the binding event to a specific site.
This sort of biological 'remote switching process' [4] is assumed possible as protein native states are not only structurally stable, but also susceptible enough to transfer signals among far away sites through longrange correlated fluctuations [5][6][7][8]. In practice, the release of the binding energy can trigger structural and/or dynamical changes in far regions of the biomolecule, thus allowing a fine control of the active site [9]. In this perspective, it is still unclear whether the coordinated motion of aminoacids in protein native states is a universal element to interpret such a wide phenomenology, and if allostery is a further manifestation of the structure-function relationship [10]. The theoretical methods generally applied to infer the structural origin of allostery in macromolecules are based on normal mode analysis (NMA) that can be performed either through fullatom simulations [11,12] or within less expensive coarse-grained approaches, such as the elastic network models (ENM) [13]. Other approaches deal with allostery as a problem of information transport across the network of interactions (contacts) defined by the topology of the native structure [14][15][16]. A typical scheme assigns Markov-chain like transition rules for exploring the network and then identifying allosteric paths connecting regulating and active sites, as the most probable ones [17]. This technique has been successfully applied to the study of the electro-mechanical coupling between voltage sensor domain and pore domain in voltage-gated potassium channels [18][19][20].
A popular tool to analyze the residue-residue coherent dynamics in a molecule with N residues is the equilibrium 3N × 3N-covariance matrix [10], where ∆r i (t) = r i (t) − R i indicates the instantaneous displacement of residue i from its native position, R i , taken as equilibrium position. The direct study of the correlations of the free molecule (apo-structure) and their variations after the binding (holo-structure) can show how the docking of a ligand to the regulatory site produces observable changes to the dynamics of the target site [12]. Under a physical-like interpretation, allostery can be discussed in terms of the response of proteins and enzymes to a local perturbation generated by the binding of a ligand. The reaction to binding determines the release of the mechanical strain that converts into the transfer of signal at the molecular scale from the source to the target [21]. In other words, according to the ensemble model scenario [22,23], allostery emerges from a modification of the native free-energy landscape of proteins under different 'effects': ligand binding, mechanic or chemical excitations, and environmental changes in pH or temperature.
More generally, allostery can also be associated with the notion of 'causation' in which regulatory and active sites are linked by a series of cause-effect pathways. There are two possible definitions of causeeffect relationship which correspond either to the interventional view or to the observational view of causation.
In the first, one directly performs local or structural modifications of a system and measures how they affect the behavior of specific system variables. This definition of causality was proposed by Pearl [24]. Conversely, the second consists in determining whether and to what extent the simple observation of certain variables is useful to predict the future of others, without manipulating the system.
The analysis of correlations, equation (1), belongs to the observational approach. However, 'correlations do not imply causation' , as they measure only associations among variables without explaining their cause-effect relationship [25]. For instance, two residues i and j can move jointly not because they are in direct interaction, but because they are driven by a shared group of other residues [25][26][27].
As a consequence, it is reasonable to suspect that a mere investigation of allostery based on correlations only could overlook relevant biological information.
On the contrary, response theory seems not only the natural framework to understand how a static [28] or dynamic perturbation [29,30] propagates from source to target site in proteins, but also the most reasonable approach to establish the causal influence between source and target. The detection of causation based on response theory recalls the interventional approach, according to which the coordinate x j causally influences the coordinate x i , if a perturbation of x j results in a variation of the measured value of x i . In formulae, we will say that i.e. a small perturbation on x j (τ ) at time τ results in a non-zero future variation on the average of x i (t + τ ) over its unperturbed evolution. In equation (2), we assume steady dynamics. If δx j is small enough, it is well known that the quantity (2) can be related to the spontaneous correlations in the unperturbed dynamics by one of the pillars of non-equilibrium statistical mechanics, the fluctuation-response theorem (FRT) [31], also known as fluctuation-dissipation theorem (FDT). The analysis of allostery communication beyond correlations has been already suggested in the literature. For instance, some authors looked at the structural properties of molecules to predict their allosteric behaviours [32], others [33,34] employed the transfer entropy (TE), a measure of causation borrowed from information-theory and introduced by Schreiber [35] and Paluš et al [36] in the context of stochastic processes and dynamical systems. The entropy transfer from the evolution of the coordinate x j (t) to the evolution of the coordinate x i (t) determines the information (uncertainty) that we gain (lose) on the future states of x i , if we not only consider the past history of x i , but we also include the past of x j . It quantifies the causal influence of x j on x i . In formulae, where H(a|b) indicates the conditional Shannon entropy [37] of the state a given the state b. Equation (3) assumes stationary processes. Notice that TE is by definition asymmetric, thus naturally incorporating a direction of the entropy/information transfer from x j → x i , that is generally different from For the sake of completeness, it should be mentioned that allosteric processes are often analyzed also in terms of pairwise mutual information and its high-order generalization called interaction information which are useful indicators of causal dependency [38]. In particular Interaction Information is appropriate when the complexity of the allosteric process involves clusters of variables at the same time so it cannot be decomposed into elementary pairwise dependencies [39]. However, since these quantities are generally used in pure statistical formulation which does not take into account the temporal evolution, they will not be discussed here.
In this work, motivated by the importance of allosteric mechanisms, we compare the behavior of the three mentioned indicators: correlation, response, and TE when applied to the human ubiquitin (Ub), a simple and very well-studied protein, which, according to recent experiments [40], has shown regulation of allosteric nature.
Since each of the three indicators extracts specific features from the fluctuations of the residues, their combined use should provide a more robust view of those coordinated movements that play a possible role in allosteric mechanism.
Especially from the analysis of the response, we will try to identify sites of Ub that are more susceptible to perturbation, and understand if they are possibly involved in the known allosteric pathways of Ub. Following [29,30], we employ the dynamic version of the indicators to exploit the information contained in their time dependence and in their propagation across the structure.
The exact comparison of these indicators in realistic all-atom simulations is severely limited by the system size and by the need of collecting very long time series (dimensional curse). To overcome this limitation, we describe the native state fluctuations of Ub via a coarse-grained approach, portraying the native state as a mechanical system made of nodes, representing the position of the α-carbons (C α ) of the aminoacids, connected by harmonic springs. This description is referred to as the Gaussian network model (GNM) introduced in [41]. The main advantage of using a Gaussian model relies on its full solvability, allowing the analytical expressions of correlations, responses, and TEs to be worked out.
In the literature, GNM and more generally coarsegrained elastic networks constitute simplified lessexpensive alternatives to all-atom NMA, generally used for fast and easy characterization of slow and large-scale dynamics of native structures. They allow the identification of flexible and rigid regions in huge single and multi-domain proteins and are proven to be meaningful in the prediction of functional modes relevant for a comprehension of the structurefunction relationship of biopolymers [42][43][44][45]. In addition, when the size of the system becomes inaccessible to all-atom NMA, the ENM represent the only viable approach.
It is important to recall that the GNM presents two crucial limitations. First of all, it does not apply to processes where a molecule changes its shape to visit other conformational states in order to perform its function. Therefore, the approach we follow here describes only 'allostery without conformational changes' proposed by Cooper and Dryden [22], and Ub falls into this category. Actually, there are generalizations of the ENM including also molecule transitions from one state to the other [46] by defining a GNM representation for each state and then assigning the transition rate among such states.
The second limitation of the standard GNM approach arises from neglecting the side chains, a common approximation of many coarse-graining methods. As we shall see in section 3, to partially relieve this crude approximation, we use a variant of GNM which is based on the 'heavy-atom contact map' (briefly heavy-map), that incorporates some effects of the side-chain presence.
Even if the small rearrangements in Ub allostery justify the use of the GNM-like approach improved by heavy-map representation, it should be noted that for allosteric processes occurring via transition among different metastable states, or involving not negligible nonlinear fluctuations [47], full-atomistic molecular dynamics, possibly supported by enhanced sampling techniques (e.g. conformational flooding [48]), remains the leading investigation tool.
The paper is organized as follows. In section 2 we briefly recall some structural properties of Ub and its allosteric control. section 3 reports the simplified theoretical framework for the description of Ub and the mathematical properties of the indicators (correlation, response, and entropy transfer) that we used to characterize the interplay among Ub fluctuation modes. Sections 4-6 contain the results obtained by the analysis based on these three indicators. In section 7, we show the results on the complex Ub-USP, ubiquitin and ubiquitin-specific peptidase (USP) to see how the Ub internal motion is modified by the interaction with one of its natural substrates. Final discussion and conclusions are drawn in section 8.

Allostery-like behavior of ubiquitin
In this section, we briefly summarize the principal biological information on Ub that has been used to orient our theoretical analysis. Post-translational modifications are covalent modifications altering the functional state of a protein; typical post-translational modifications involve the attachment of small chemical groups like acetyl, phosphate or methyl groups. Ubiquitylation, the attachment of Ub to its target substrates, can be considered an extreme case where the chemical group attached to the target protein is itself a small protein. Ubiquitylated proteins are normally targeted to degradation in the 26S proteasome, but ubiquitylation can also induce trafficking or endocytosis. Ub is bound to each target protein through The USP is shown in the background, while Ub is displayed in red and green, the green region represents Ub binding interface with the USP (as defined through a distance cutoff of 3.0 Å). The C-terminal tail of Ub is also part of the binding interface with USP. The red spheres represent Ile23 and the Asp52-Gly53 peptide-bond, while the Glu24 and Gln49 are shown in green, and the blue rod indicates the hydrogen-bond Glu24-Gly53. Panel (c) shows that the conformational switch (Asp52-Gly53) and the Ub-USP interface are located on the opposite sides of Ub emphasizing the allosteric nature of the regulation mechanism. the sequential action of three enzymes that ultimately connect the COO − terminal group of Ub with the side chain of a Lysine residue of the target. Since Ub also comprises several Lysine residues, it can become the target of further ubiquitylation reactions that create a linear chain of Ub units bound to the target protein. The geometry and bonding pattern of these chains determines a different fate of the marked molecule.
A recent work by Smith et al [40], based on NMR relaxation dispersion experiments highlighted a conformational switch of peptide bond Asp52-Gly53 of allosteric significance. As sketched panels (a) and (b) of figure 1, showing portions of structures from PDB files 1UBI [49] and 2IBI respectively, this bond flips between two states referred to as NH-out (1UBI) and NH-in (2IBI).
In the NH-out state, the NH group of the peptide bond points towards the bulk where it forms hydrogen bonds with water molecules (black arrow). As a result, the only possible interaction with the neighboring Glu24 residue is a hydrogen bond between the CO group of the Asp52-Gly53 bond and the NH group of Glu24. The sidechain of Glu24 is therefore not involved in this interaction. By contrast, in the NH-in configuration the NH group of the peptide bond points towards the interior of the protein (black arrow), where it can H-bond the sidechain of Glu24, that is also hydrogen bonded by the NH group of Glu24 itself. To assess the functional significance of this conformational switch Smith et al performed a bioinformatics analysis on a database of Ub experimental structures. This analysis suggested a correlation between the NH-in conformation and the binding of the Ub to the ubiquitinase USP (ubiquitinspecific protease). Figure 1(c) shows the structure of the complex Ub-USP. This result was completely unexpected since neither Asp52 nor Gly53 is directly involved in Ub-USP binding. The finding thus led to the hypothesis that the switch of the Asp52-Gly53 bond might induce an allosteric rearrangement of the USP binding region of Ub. Indeed, further analysis showed that the NH-in and NH-out states are respectively associated to the contraction and expansion of the ubiquitinase binding interface. Moreover, it was shown that a contracted binding interface allows fewer steric clashes, energetically promoting the binding of USP. The mechanism can thus be summarized as follows: the NH-in state allosterically induces the contraction of the binding interface reducing the number of clashes and favoring the USP binding. The residues more affected by this interface deformation are Gly35 and Gln49. Interestingly, these results agree with an older work by Massi et al [50] that identified chemical exchange processes affecting Ile23, Asn25 (flanking Glu24), Thr55 (close to the Asp52-Gly53 bond) and Val70.
The allosteric regulation of Ub can be seen as a propagation of perturbation from the couple of aminoacids (Glu24,Gly53) that we consider as source to the couple (Gly35,Gln49), that instead acts as target. In the following, for convenience, we shall refer to these sites as the allosteric set, ASet = {24,35,49,53}.

Model and methods
In a coarse-grained approach, the native state of the Ub is described as a mechanical system made of nodes, representing the position of the α-carbons (C α ) of the aminoacids, connected by harmonic springs. The potential energy of GNM is very simple and reads where {∆r 1 , ∆r 2 , . . . , ∆r N } are the instantaneous displacements of the N C α atoms from their native positions taken as equilibrium states. The quantity g defines the adjustable energy scale that can be set by matching the theoretical mean square displacement of the C α from their native positions with the experimental crystallographic B-factors [51][52][53]. The coefficients K ij are the elements of the coupling matrix K, often termed Kirchhoff matrix, which is defined through the contact matrix elements, A ij (also connectivity matrix of the network) through the relation where A 0 is a factor weighting the strength of the harmonic interaction along the chain (backbone) over the off-chain interaction. A 0 ∼ 10 seems a reasonable value to distinguish the role of the backbone links from the rest of the network. The effect of the side chains can be partially included by using a GNM approach based on a heavyatom contact-map [54] that excludes hydrogens. In this scheme, a pair of residues i − j is connected by a spring if, they have at least a couple of heavy atoms a, b in contact in the native state of Ub (PDB: 1 ubi). In formulae, this means that A ij = 1 if the relation holds, with a cutoff r c = 5 Å, where Θ(u) is the unitary step function. Panel (a) of figure 2 shows the heavy contactmap representing the native interactions, panel (b) shows the topological diagram of Ub secondary structure that includes five beta-strands S1, ..., S5 and two helices H1, H2.
In the following, we will redefine ∆r i → r i to simplify the notation.
The full NMA of Ub would require the computation of the Hessian matrix, obtained by computing the second partial derivatives of the force-field potential on the equilibrium state. The less complex Hessian of the GNM turns to be decomposed into blocks where r i denotes the position of the ith C α and µ and ν indicate the generic component x, y, z. In practice, the three coordinates of a C α becomes formally equivalent x i ≡ y i ≡ z i , hence the position vector r i = x i e is virtually a scalar (with e = (1, 1, 1)), and Therefore GNM approach reduces a system of 3N degrees of freedom to a system in N degrees of freedom only; equivalently, it deals with protein fluctuations as a problem of scalar elasticity.
The equation of motion for each GNM coordinate in the overdamped regime reads here, γ denotes the friction, k B the Boltzmann constant and ξ i is a zero-average and time deltacorrelated Gaussian process. Hereafter, we set µ = g/γ. From the solution (in vector form) of equation (7) x correlation and response are straightforward to obtain as where C(0) = ⟨x(0) x T (0)⟩ is the equal-time correlation matrix also termed equal-time covariance matrix and the average is over the thermal noise. The advantage of using the GNM relies on its full solvability, since it defines a multivariate Onrstein-Ulehnbeck process (7) whose explicit solution only requires numerical diagonalization of the sparse matrix K. K is symmetric and diagonalizable with all real not negative eigenvalues, but not invertible because it has one vanishing eigenvalue, due to the translation invariance of the system. It can be represented in the form where Λ is the diagonal matrix containing all the eigenvalues {0, λ(2), . . . , λ(N)} of K. λ(k) is the kth eigenvalue of K associated to its kth eigenvector, (1), . . . , u(N)} is the diagonalizing matrix whose columns are the eigenvectors, U † is its transpose, moreover UU † = I, I being the identity N × N-matrix.

Correlation
The Gaussian nature of the GNM implies that the equilibrium covariance matrix C(0) is proportional to pseudo-inverse K * of the matrix K, K * replaces the standard matrix inversion, as K is not invertible because of its vanishing determinant due to the translation invariance of the system. By definition K * = UDU † , where the diagonal matrix D is According to equation (9), the explicit elements of the time dependent correlation matrix are the above sum excludes the λ(1) = 0 eigenvalue.

Response
The response definition equation (2) assumes a clear and general expression when the process x(t) is stationary with invariant probability density function (PdF) P s (x), [56] Where the average ⟨· · · ⟩ is taken over the unperturbed system, whose invariant PdF, P s (x), is required to never vanish for any x. R(t) is the matrix of the linear response functions (at time t) of the considered system. Equation (13) shows the existence of a rigorous link among responses and correlations, so that we can use such a relation to infer a degree of dependence from the observation of time series of x(t), without actually perturbing the system. It should be remarked that equation (13) holds for systems with an invariant PdF and in general cases, it expresses the response in terms of complicated multivariate correlation functions. However, in systems governed by stochastic linear dynamics, such as equation (7), even with no Gaussian noise, the response turns out related only to the two-time correlation function [25,57], C −1 (0) indicates the inverse or of the correlations matrix at zero lag (i.e. the covariance matrix). Let us note that the translational invariance of the GNM does not allow to apply directly formula (13) to compute the response functions, as the invariant PdF, P s (x) = P s (r 1 , r 2 , . . . , r N ), cannot exist because is not normalizable. However, we can use equation (14), providing to employ the pseudoinverse of C(0). The explicit expression of the response in equation (10) is obtained by the representation R(t) = Ue −µKt U T , which in matrix elements reads In equation (15), the term 1/N corresponds to the vanishing eigenvalue λ(1) = 0, indeed because of the translation invariance, the zero-mode is the uniform vector with components u i (1) = 1/ √ N, for all i. As a consequence, R ij (∞) = 1/N, while, by definition, in the opposite limit R(0) coincides with the identity matrix, R ij (0) = δ ij .

Transfer entropy
According to the definition (3) the explicit TE expression between residues j → i, with τ = 0, is the quantity where the angular-brackets indicate the average over the joint probability density P[ ] are the conditional probability densities of x i (t) conditioned to the previous values x i (0) and x j (0). Obviously, TE is not symmetric, TE j→i ̸ = TE i→j , and identically van- Notice that the asymmetry of TE stems from the presence of conditional probabilities, for which the conditioning and conditioned variables (or events) cannot be exchanged because they do not play equivalent roles. For a Gaussian system like the GNM, there is a simple way to evaluate the TE among residues by using the correlations [58], equation (12), where and see [58] and appendix for a sketch of the derivation. Formula (17) is a more compact notation of the expression reported in [33]. The asymmetry α ij (t) ̸ = α ji (t) and β ij (t) ̸ = β ji (t) is an immediate consequence of the TE asymmetry emerging also in the Gaussian formulation.
It is immediate to show, that T j→i (∞) = 0, either by definition equation (16) invoking the independence of events far away in time, or using the correlation decay at large times in equation (17). Analogously, one expects T j→i (0) = 0.
In the following, we perform correlation analysis, response analysis, and transfer-entropy analysis of the GNM of the Ub, and when possible, we shall attempt a critical comparison among them. Figure 3 reports the equilibrium residue-residue covariance matrix for Ub in Pearson's form, (i.e. normalized by the variances)

Correlation analysis
using a temperature color code. The positive and negative correlations are plotted separately to facilitate the analysis. High values correspond to hotter and low values to darker colors. The lighter regions of the density plots show that positive correlations 'nucleate' around the native contacts, panel (a) of figure 3. In particular, the region around the residues Glu24 and Gly53 shows a correlation presumably driven by the native contacts in those segments of the protein. Also, the region around 40-70 results highly correlated and it involves the residues around the hydrophobic patch (HP) (Leu8-Ile44-Val70) [59]. This may suggest a sort of concerted motion among the elements of the HP associated with the binding to enzyme E1 [59]. Anti-correlated motion seems to involve mainly residues that are not in native contact and spatially distant, green and red regions of figure 3(b). Figure 4 reports the time behavior of pairwise correlations among residues of ASet that are considered interesting to the allosteric communication in Ub. Since these residues do not form any native contact with each other, for comparison, we also report the correlation between sites Ile23-Gly53, forming a native contact. The slowest and fastest autocorrelation decay (not shown), C i,i (t) ∼ exp(−t/τ i ), allows us to select the interval of time, 0.065 ⩽ τ i ⩽ 0.726, which can be considered the optimal time range for sampling the observables, because fluctuations are to be considered still active (see the vertical dashed lines in figure 4). In the following, we shall choose the sampling times t = (0.20, 0.25, 0.30, 0.35) that are equally spaced in this interval. In allosteric communication it is common to define source (allosteric site that binds the effector) and the target or active site. Similarly, we consider the allosteric regulation of Ub, as a propagation of perturbation from a couple of aminoacids (24,53) (source) to the couple (35, 49) (target), which participate in the binding interface Ub-USP. Accordingly, figure 5 plots the correlation profiles, C p,j (t), from the sites in ASet to every site j = 1, . . . , N = 76 of the Ub, at different times to monitor the evolution of the correlation spreading through the protein structure. The times at which the profiles are sampled are such that the correlations are still significantly different from zero.
The peaks of figure 5 indicate the sites that are mainly correlated, and obviously, the highest peak refers to autocorrelation C p,p (t) that decays as C p,p (t) ∼ exp(−t/τ p ) at large time.  In the chosen time interval, the global structure of the profile remains stable during the time evolution, including positive and negative correlations. For example from the top panel, a positive correlation Glu24-Gly53 is always established, whereas Glu24 is mainly negatively correlated with Thr9, Gln40, and the C-terminus tail 70-76.

Response analysis
The correlation analysis of the previous section identifies the presence of simultaneous coordinated displacements among residues, however, it says nothing about their effective dependencies. A better indicator for such a purpose is the response function. In analogy with correlation analysis, we plot in figure 6. the time-behavior of the responses R ij (t) among the usual set of residues of ASet. The threshold 1/N in equation (13) identifies two behaviors of response: a class of curves that grow monotonically in time to 1/N from below, e.g. Glu24-Gly35, Glu24-Gln49, Gly35-Gln49, and Gly35-Gly53, nonmonotonic responses which may exhibit either a single or double peak. All the native responses show a common pattern with either a single pronounced peak or a less pronounced maximum at short times and a broader maximum at larger times. However, also a pair of sites not in contact can show a peaked response, as there could be a path of native contacts connecting them, but their peak is always smaller than the peak associated with a native pair. This is an obvious consequence of the 'causal nature' of the response which is more sensitive to direct interactions. We can say that the sites which can be really considered active from a response scenario are those with responses crossing the line 1/N, reaching the maximum and decaying to 1/N from above.
Likewise to correlation analysis, to understand how a perturbation in a site 'p' spreads through the Ub structure, it is convenient to plot the profile R p,j (t) ≡ R j,p , for j = 1, . . ., N = 76. These profiles for the residues ASet are shown in figure 7, at the same four times sampled in the correlation study ( figure 5). A common feature of such profiles is the presence of a sort of 'response localization' generating peaks that survive at different times assessing the robustness of the scenario. Obviously, the most pronounced peak corresponds to the site 'p' where the perturbation is initially applied. This peak decays exponentially and spreads in time, since the perturbation naturally propagates along the backbone, covering a region of about ten sites centered on p, an immediate effect of the chain connectivity. To verify how the Ub backbone drives the response propagation, we turned off all the non-bonded interactions so that the Ub becomes a pure harmonic chain with nearest-neighbor interactions only. The grayshadowed regions in figure 7 indicate the spreading of the perturbation along such a harmonic chain.
As already noticed, only the sites with peaks above the threshold 1/N (indicated by a dashed horizontal line) are considered to develop a significant response to the perturbation.
If the site p = Glu24, which is a source in Ub allosteric control, is perturbed, the maximum of the response is felt by the other source site Gly53, in agreement with NMR relaxation dispersion experiments [40].
With reference to the target sites Gly35 and Gln49, we can say that they do not have a reciprocal response action, however, the target Gln49 has a causal link with both sources Glu24 (actually site '23') and Gly53 which is included in the peak around Gln49.
Gln49, in turn, responds to perturbation of Gly53 either via the backbone or through the path Gln49-Tyr59-Gly53, in which only the link Gln49-Tyr59 is a native contact.   ij , at which each response function R ij (t) attains its maximum, providing a view of how the response spans the Ub structure. The response propagates by nucleation either from native contacts or from the backbone, and coalescence, see for instance the block, 1 ⩽ i ⩽ 30 and 50 ⩽ j ⩽ 70. The coalescence process involves the strands S1, S2 and S5 of the secondary structure reported in figure 2.
A perturbation on the site Gly35 involves mainly sites Ile13 and Leu71, although the meaning of this interplay is not biologically clear. This might induce us to suppose that the site Gln49 is somehow more relevant in the allosteric response than Phe35.
To provide a global view of how the elastic network of interactions supports the spreading of response across the Ub structure, we compute the time at which each response R ij (t) becomes maximal: figure 6 for the definition of t * ij . The result reported in figure 8 shows that the response propagates along the proteins by nucleation from native contacts and from the backbone region as well, then it expands across the structure, giving rise to a sort of coalescence process among certain secondary structure elements. In particular in the region 1 ⩽ i ⩽ 30 and 50 ⩽ j ⩽ 70, such a response coalescence involves the strands S1, S2 and S5 of the secondary structure depicted in figure 2. Interestingly, the coalescence puts in connection the two clusters, CluA and CluB, in which Ub is structurally partitioned [55,60].

Transfer-entropy analysis
Response functions are indicators that are easily interpreted in terms of causal dependencies, however in a GNM model at equilibrium, responses are symmetric (likewise correlations) and cannot distinguish the role of source and target of causation. For this reason, it is natural to complement response analysis with TE computation, equation (17), that is becoming an increasingly popular tool for characterizing the role of coordinated fluctuations in allostery processes [33,34]. The behavior of some representative TEs as a function of the time-lag is reported in figure 9. Since by definition, TE i→j is different from TE j→i , thus discriminating the concept of 'donor' and 'acceptor' of entropy, we consider both curves for each couple of residues already considered in figures 4 and 6.
This asymmetry has been used in [33]. to give each residue of the pair i − j the role of 'driver' or 'driven' , depending if the difference TE i→j − TE j→i is positive (i driver, j driven) or negative (j driver, i driven).
The plots exhibit a typical skewed bell shape growing from zero at t = 0 and decaying to zero as t → ∞, as discussed in section 3.3, and characterized by a single well-defined peak. Moreover, TEs between two residues that are in native contact  are larger than TEs of nonnative couples. The dependence of TEs on the time-lag can raise ambiguities in their comparison or ranking, which can be strongly affected by the chosen time interval. Since every choice of an optimal time can be questioned, it is necessary to repeat the analysis at different stages to confirm that the ranking is conserved at different times.
Instead of plotting the dominant direction of information flow TE p→j − TE j→p as suggested in [33, 34], we prefer to report in figure 10 the profiles TE p→j and TE j→p separately, assuming that the reference site 'p' can act as a source (donor) or target (acceptor) of information transfer, respectively. In our case, this choice is due to the high sensitivity of TE differences from the sampling times, which prevents their robust interpretation.
The TE p→p is trivially zero by definition, while just below and above p, the profiles show peaks corresponding to a significant flow of entropy associated with the activity of the backbone.
The TE analysis basically confirms the scenario drawn from the response functions, i.e. Gly53 and Glu24 are linked both as donors and acceptors of entropy. However, a new feature is highlighted by the TE, as the region around 1-20 (strands S1-S2 of figure 2(b)) and the C-terminal tail 70-76 displays a sensible entropy transfer, a detail not accounted for by the response analysis. This aspect is important since, as we will see in the following, such residues participate in formations of the complex Ub-USP and their capability of entropy transfer is dramatically affected by binding. The presence of humps of TE at the level of C-terminal tail might be relevant since this is the binding region of the Ub-activating enzymes, the tail fluctuations are thus expected to be tightly regulated by the allosteric switching.
Concerning Gly35 and Gln49, we can say that these two residues show a distributed entropy exchange over many segments of the Ub, including the terminal and interface region of the Ub-USP complex. In particular, panel (b) of figure 10. seems to suggest a sort of causal control between Gly35 and the wide region 45-68 of Ub, comprising the secondary motifs S4-H2-S5 of figure 2(b). Unfortunately, this connection remains unexplained in terms of mere structural properties. Conversely, Gln49 exchanges entropy with a large portion of the N-terminal region S1-S2 and with Tyr59, where there is a hump.
The profile analysis seems to indicate that TE is more sensitive than response function to the details of both molecular structure and modeling, so TE predictions need careful pondering.

Ubiquitin in complex with ubiquitinase: Ub-USP
In this section, we consider the complex Ub-USP to show how the behavior of the above indicators modifies upon binding and check whether this affects the role of source (24,53) and target (35,49) residues. While the molecular switch (peptide bond between Asp52-Gly53 and Glu24) is internal in the Ub and does not participate in the binding process, the residues Gly35 and Gln49, modulated by the switch, belong to the interface region Ub-USP, and thus are expected to undergo a higher perturbation due to the binding.
The GNM is applied to the full structure Ub-USP (pdb-id: 2IBI) that has been reconstructed via Modeller [61] because in the pdb-file 2IBI there were three gaps (three missing residues) Ser382, Tyr450, Asn537. The three gaps were filled using a homology modeling approach whereby the available structure USP was used as a template to model the conformation of the whole USP sequence (including the three residues whose structure was not resolved in the PDB file). Even if our GNM analysis involves the whole Ub-USP complex, we only present the results restricted to the Ub chain.
In the Ub-USP complex the correlation profile of the considered Ub residues becomes positive (figure 11), while in the isolated Ub, these residues were positively and negatively correlated with the rest of the molecule ( figure 5). This evident difference in correlation behavior among internal Ub residues when coupled to USP can be explained by observing that correlations are sensitive to the collective fluctuations of the whole complex. Except for this shift toward positive values, a comparison of figure 11(a) with the profiles in figure 5(b) suggests that the landscape of correlations established by residue Glu24 with the rest of the Ub remains basically unaltered, the same happens for Gly53, not shown, since such two residues do not participate in the binding surface. On the contrary, the profiles of residue Gly35 and Gln49 are significantly modified in the bonded Ub with respect to the case of free Ub.
The inspection of the response profile of Ub within the complex Ub-USP in figure 12 does not show relevant differences with respect to the profiles  We can conclude that, unlike correlations, the responses only vary because of the new interactions established by Ub with USP, most of them located at the interface Ub-USP. Thus in the Ub-USP, the response functions of the internal sites of Ub are only slightly modified by the presence of the USP, which acts as an external common driving. This is a further indication that the response function is less affected by spurious effects. In summary, one can conclude that correlations are also sensitive to long-range effects carried by collective fluctuations, whereas responses are expected to be more susceptible to relatively short paths of direct interactions.

Conclusions
The common use of correlation-based methods for inferring functional modes among the coordinated fluctuations of biomolecular systems, although favored by its ease of implementation, is limited by the fact that correlation not necessarily implies causation.
Therefore, especially to understand allosteric control, it could be convenient to go beyond correlation through the employment of response-function and transfer-entropy that have the advantage of inferring the causal relationships among source and target sites.
In this work, we compared correlation, response and entropy-transfer analysis taking as a benchmark the ubiquitin protein that is widely studied and whose allosteric regulation has been recently discovered. The purpose was to test the capability of these three indicators to recognize the relationships among residues identified in the experiments as underlying the Ub allosteric behavior.
To make this comparison free from the artifacts of poor statistics, we describe the Ub fluctuations around its native state through the GNM [41] that, being solvable, provides exact expressions for any observable, easily accessible to the numerical computation. GNM is considered a reasonable coarsegrained approximation of the all-atom NMA, as it is able to capture the relevant features of slowest functional modes in proteins and enzymes [43,44].
Here, to partially include the effects of the side chains, we used the atom-wise definition of the GNM connectivity matrix, termed heavy-atom contact map, see equation (6). Of course, this represents only a caricature of the real side-chain effects which nevertheless improves the GNM reliability.
The analysis of correlation within the GNM suggests a scenario in agreement with the allostericswitch mechanism triggered by the flipping of Asp52-Gly53 peptide bond that modulates the interaction of ubiquitin and ubiquitinase. The correlation profiles along the Ub chains show several positive and negative peaks that identify a certain set of residues including those involved in the allosteric process. However, this positive result obtained by correlations could not be considered conclusive and required validation and further insight from response theory. Response-based analysis localizes on a subset of the positive and negative peaks present in the correlation profiles. These represent the pair of residues in which the correlated motion can be safely associated with a causal relationship.
However, the responses of a GNM at equilibrium are symmetric, therefore they do not distinguish whether a residue behaves as a driver (donor) or is driven (acceptor). Therefore it has been necessary to complement response results with the TE analysis.
The TE profiles indicate that the status of donor and acceptor for a residue is not intrinsic, but may depend on the time lag between two consecutive observations. A comparison of response and TE profiles shows that TE is more sensitive than response to the finer details of the molecular structure and to the modeling approach, thus making the interpretation of the results quite delicate.
It is important to observe that when the estimation of TE has to be carried out from data, the results can be affected by various factors, such as: the dimension of the space of states, the length of timeseries of data, and the procedure adopted to estimate high-dimensional conditional probabilities [62][63][64]. In addition, TE might be altered by spurious information arising from shared or common external inputs and drivings. Estimation of response instead is less affected by the above factors, and formula (14) allows us to have a proxy of the response from an easy reelaboration of the observed correlations of a system, thus reconciling the observational and interventional definitions of causation.
We remark that our comparison of causal indicators has been carried out in the case of harmonic approximations through NMA, however, there are frequent cases in which the nonlinear effects in allosteric processes are so important that NMA cannot be representative. In these conditions, the statistical indicators have to be estimated from very long molecular dynamics simulations in their atomistic [33,65,66] or coarse-grained [67,68] formulation.
Finally, we can conclude that the best strategy to gain insight into allosteric sites and pathways on protein structures is a proper combination of response and transfer-entropy analysis, where correlation could represent a preliminary but unnecessary step.

Data availability statement
The data will be made publicly upon request to authors The data that support the findings of this study are available upon reasonable request from the authors.

Appendix. Derivation of the Gaussian TE
This appendix briefly shows how equation (17)  where Ω = ⟨x(t) x T (t)⟩ is the covariance matrix of x(t). When computing the TE, we refer to the following coupling between the processes x i (t) and x j (t), represented schematically as follows where the arrows indicate that not only the ith variable, but also the jth variable is able to influence the future state x i (t). In the following, for the sake of shorthand notation, we set, x + i = x i (t), x i = x i (0) and x j = x j (0).
The Gaussian formulation is defined once the covariance matrix is specified where C µν (0)=⟨x µ (t)x ν (t)⟩=⟨x µ (0)x ν (0)⟩ denotes the correlation at equal times (zero-lag) for a stationary process, while, C µν (t) = ⟨x µ (t)x ν (0)⟩ = ⟨x µ (0)x ν (t)⟩ is the correlation at lag t, which, at equilibrium, is symmetric under index exchange. For convenience of notation we drop the argument '0' in the zero-lag correlations by defining: C µν (0) = σ µν . Using the definition of conditional distribution of a multivariate Gaussian process, we can express the TE in the form where Ω[x + i |x i ] and Ω[x + i | x i , x j ] are the covariance matrices of the conditioned Gaussian distribution, also termed conditioned covariance matrices, that can be expressed via the fundamental identity for Gaussian variables [69], In explicit form we can write ] .

(A.3)
Analogously using the conditional variance, we can write in equation (A.2) Finally the inversion of the matrix Ω (b, b), and the explicit computation of the matrix products provide the result . Now by adding and subtracting the term σ 2 ij C ij (t) to the denominator of the above expression, and by defining we can recast TE j→i in the form (17).