Structure of the space of folding protein sequences defined by large language models

Proteins populate a manifold in the high-dimensional sequence space whose geometrical structure guides their natural evolution. Leveraging recently-developed structure prediction tools based on transformer models, we first examine the protein sequence landscape as defined by an effective energy that is a proxy of sequence foldability. This landscape shares characteristics with optimization challenges encountered in machine learning and constraint satisfaction problems. Our analysis reveals that natural proteins predominantly reside in wide, flat minima within this energy landscape. To investigate further, we employ statistical mechanics algorithms specifically designed to explore regions with high local entropy in relatively flat landscapes. Our findings indicate that these specialized algorithms can identify valleys with higher entropy compared to those found using traditional methods such as Monte Carlo Markov Chains. In a proof-of-concept case, we find that these highly entropic minima exhibit significant similarities to natural sequences, especially in critical key sites and local entropy. Additionally, evaluations through Molecular Dynamics suggests that the stability of these sequences closely resembles that of natural proteins. Our tool combines advancements in machine learning and statistical physics, providing new insights into the exploration of sequence landscapes where wide, flat minima coexist alongside a majority of narrower minima.


I. INTRODUCTION
Protein evolution can be described as a stochastic process in the space of sequences.Although it is not possible to predict exactly the course of this process [11], evolution is strongly constrained by functional requirements.One of them is that of foldability, namely that most proteins must display a unique and well-defined native conformation to be functional.This is quite a robust requirement that filters out the vast majority of protein sequences [22].
The space of folding sequences is then a subset of the space of all sequences, whose properties affect the evolution of the protein.Proteins displaying a given function tend to conserve their structure (within an RMSD of 2.5 Å) even among very distant homologous [20].Consequently, it is reasonable to assume that conformational similarity to the wildtype protein is a feature that contributes to the functionality of a mutant, and thus to its evolutionary fitness.Such conformational similarity can be quantified by a distance from the structure of a reference wild-type protein, thus defining a landscape in which evolution is expected to take place through low conformational distance trajectories.This landscape in sequence space is analogous to the energy landscape of other complex systems studied in physics.In the case of disordered systems, like spin glasses, the energy landscape is rugged [16] and its minima are separated by high barriers that prevent diffusion across their conformational space [13].Although the excluded volume of the amino acids is like geometric frustration in glasses and in jamming problems, proteins are quite different from prototypical models of disordered systems.Proteins are small, the hydrophobic amino acids superpose a ferromagnetic-like interaction to the other disordered interactions, and their backbone makes their physical properties quite peculiar.
The space of sequences of proteins that fold to a stable native conformation was studied using minimal protein models that, although not realistic from the biochemical point of view and thus not predictive, display some of the complexity of natural proteins [7].The properties of a sequence in these models are only determined by its native energy because the rest of the conformational spectrum is self-averaging [21]; the thermodynamic properties of a sequence are thus determined essentially by the energy E N of the native conformation.
Monte Carlo techniques that control E N are then a suitable tool for sampling the space of sequences.In this way, it was shown that stable proteins display a complex hierarchical organization with regions not connected by single-point mutations and conserving few mu-tually interacting residues [24].Nonetheless, it was shown that folding sequences connected by neutral paths can visit vast regions of the space [6].
The recent development of machine-learning algorithms [10,12] able to predict the native structure of an input sequence paves the way to studying the space of folding sequences in the context of a realistic, predictive model.Using these algorithms, one can bypass the need of using effective energies, which are not always reliable, to characterize the foldability of a sequence.
We use a fast language model for structure prediction and we combine it with different exploration algorithms.Our method is designed to navigate efficiently through regions of sequence space that have high local entropy (neutral regions).We have put this method to the test on a well studied protein structure, generating predictions that are validated against existing data or through molecular dynamics simulations.The objective of this study is to demonstrate how various innovative approaches, such as language models and algorithms driven by local entropy, can be effectively merged.
The paper is organized as follows: first we describe the methods used to define the effective energy and sample the associated space within the framework of the canonical ensemble.
Then, we present the results obtained varying the selective temperature of the system.After selecting a realistic value of the selective temperature, we describe the structure of the energy minima in sequence space, focusing particularly on the width of the corresponding basins.
Inspired by the techniques used in connection with artificial intelligence, we finally test an algorithm that can identify large energy minima.We discuss the relevance of these results for protein evolution.

A. Sampling the effective energy of a sequence
In order to sample the space of sequences folding to a given reference conformation r 0 , we employed a canonical ensemble formalism where each sequence is characterized by an effective energy defined as the fraction of contacts that its native conformation has in common with r 0 , where ∆ ij (r) (∆ ij (r 0 )) is the contact map of the native conformation r (r 0 ), whose elements are 1 if any heavy atom of amino acid i is within 4 Å from any heavy atom of amino acid j and 0 otherwise, with |j − i| > 1 in order to eliminate the contribution of trivial contacts.
Thus, the energy ranges between 0, when all contacts of a sequence are the same as in r 0 , and 1 if all contacts are different.
The native conformation associated to a generic sequence of amino acids σ was predicted by ESMFold [12], a transformer protein language model defined by approximately 15 billion parameters trained over 65 million protein sequences.The same model was also employed to predict the structure r 0 = r(σ 0 ) of the reference sequence σ 0 .
The sampling was carried out with a Metropolis algorithm [15] at different temperatures T s (expressed in energy units), that here have the meaning of evolutionary bias towards good (i.e.low-energy) folding sequences.Throughout the simulation, at each step, a random single-site mutation was proposed and the newly generated mutant was accepted or rejected based on its energy, that is the Metropolis rate is here w(σ where the a priori probability is uniform for pairs of sequences with only one different site.

B. Ratcheted sampling
In order to estimate the energy barriers along trajectories from a sequence σ A to a sequence σ B , we employed a Metropolis algorithm which starts from σ A and damps the fluctuations in the direction opposite to σ B .This is based on the principle of the ratchet and the paw and it was used to generate trajectories in the space of protein conformations that resemble physical trajectories [3].
The Metropolis algorithm is applied with an energy that is given by Eq. (1) summed to where d m (t) ≡ min t ′ <t d(σ(t ′ ), σ B ) is the minimum Hamming distance to σ B encountered along the trajectory.This time-dependent energy favors the moves towards σ B , without exerting work to push the system.In this way, the system crosses the lowest energy barriers as in the unbiased trajectories [25].

C. Local Entropy
The local entropy has been introduced as a tool for analyzing complex energy landscapes in which flat regions coexist with rugged ones, in the context of non-convex neural networks [1].The local entropy of a discrete system is defined as and is meant to quantify the width of the energy basin around a sequence σ .Here γ is a Lagrange multiplier that controls the average distance from σ.
To define the neighborhood of a sequence, we define a distance that keeps into account the chemical similarity between amino acids, where is the transition rate from the β to the α amino acid as defined by the PAM1 matrix [4], setting the diagonal elements P (α, α) = 1.A further advantage of d P AM with respect to the Hamming distance d is that it varies essentially as a real variable.
From the identity and keeping in mind that one can derive the local entropy difference ∆S Ts,γ (σ) ≡ S Ts,γ (σ) − S Ts,∞ (σ) with respect to the single sequence σ.This is found by calculating the integral that can be estimated numerically from simulations performed at different values of γ.

D. Replica simulations
As discussed in ref. [2], from the local entropy measure one can derive several entropydriven search algorithms.Here we consider Monte Carlo algorithms, as presented in the Methods section, in which it is shown that a replicated MC process focuses on flat dense regions.
Each replica evolved by a Metropolis algorithm based on the coupling potential where E(σ i ) is the effective energy of the i-th replica (see Eq. 1) and γ * = γ T s , with γ being the Lagrange multiplier indicated in Eq. ( 3).
In each simulation, we increased slowly that value of γ, until the y replicas collapsed on a single high-entropy sequence.Eventually, we obtained a single simulation from a simulation.
We repeated the whole procedure to collect more sequences.

A. Thermodynamics of the space of sequences
We performed samplings of the sequence space at different temperatures T s for protein G, a widely-studied small protein [14] made of an alpha helix and two beta hairpins.Each simulation lasted for at least ∼ 3 • 10 5 steps (see some examples in Fig. S1 and Fig. S2 in the Supp.Mat.); we calculated from them the average energy and the specific heat using a multiple-histogram algorithm [5].The system displays a marked transition at temperature between sequences whose native structure has more than 80% common contacts with the reference structure to sequences with less than 50% of predicted contacts (upper panel in Fig. 1).The specific heat also displays a broad shoulder centered at , at which the average similarity between the contact maps is approximately 95%.It should be noted that the typical relative error in the prediction of the contacts of experimentally known structures is approximately 0.1 (cf.Fig. S4 in the Supp.Mat.), so in the low-temperature phase (i.e.T s ≤ 3.2 • 10 −3 ), native conformations are indistinguishable from the experimental one.
It is worth mentioning that, although the space of sequences is combinatorially large, the quantities of interest seem to have reached convergence in the simulation time.To check this, we removed the first steps of the simulation at which the auto-correlation of the Hamming distance from the reference sequence was above 0, we then divided the rest of the simulation into non-overlapping time blocks and we calculated the average and the standard deviation in each block, showing that they reach stationary values (cf.Fig. S1 and Fig. S2 in the Supp.Mat.).
ESMFold quantifies the degree of confidence in the predicted position of each atom with the pLDDT parameter [10].We calculated the average pLDDT over all the C α atoms of each sequence sampled at a defined temperature.The average pLDDT (lower panel of Fig. 1) of sequences sampled at low temperatures is comparable with that of extant sequences obtained from the pdb, which is 80.1 ± 12.6 (cf.Fig. S4 in the Supp.Mat.), suggesting that the algorithm is confident that the predicted structures correspond to the unique native state of the protein.The pLDDT is roughly constant at the value of approximately 85 for s , and then it starts decreasing.However, it remains above 70, which is commonly regarded as the threshold for a good prediction, up to T c s .At higher temperatures, it drops to 40, which is considered a mark of disorder [26].This suggests that at high temperatures not only sequences are not folding to structures different from the reference one, but they are not folding at all.Interestingly, at all temperatures, the sequence can step away from the initial, reference sequence (cf.Fig. S3 in the Supp.Mat.).Even at the lowest simulated temperature T s = 8•10 −4 at which the effective energy is essentially zero, the average similarity from the initial reference sequence is q(σ, σ 0 ) = 0.37.This result agrees with the experimental observation that real proteins can change up to ∼ 75% of their sequence while still maintaining their function [20].
To have a qualitative validation of the ability of ESMFold to predict correctly the folding properties of sequences that display low energy but are remarkably different than natural ones, we performed some molecular dynamics simulations of three sequences generated by ESMFold (Table II) with a protein model that is regarded as realistically predictive [19], starting from the putative native conformation for 200ns at T = 310K.The three sequences display an average RMSD to the initial conformation of 0.18 ± 0.04 nm, 0.22 ± 0.06 nm and 0.25 ± 0.09 nm, respectively (cf.Fig. S11 in the Sup.Mat).These values are the typical mutual similarities of homologous proteins [20], and they are lower than the value 0.36±0.10nm found for a selected high-energy sequence.

B. Structure of the space of sequences
A standard tool used to study the energy landscape of complex systems is the distribution of similarity q between the sampled states [16].In the present case, the value of q between two sequences is defined as the fraction of sites that host the same kind of amino acids.The  The results of the fit of p(q) with the model of Eq. 9.The lowest temperature cannot be fitted with a binomial.sampled distribution p(q) displays a unimodal shape in all simulations, whose maximum q EA increases at lower temperatures (Fig. 2).
A preliminary interpretation of these curves can be obtained using a very simple model in which the amino acids of the protein of length N = 56 can vary with uniform probability, except for a number N f of them that are fixed and identical in all sequences.This gives a binomial distribution where n aa is the number of different types of amino acids.
At high temperature (T = 1.71 • 10 −2 ) we find N f ≈ 0 and n aa ≈ 20 (see Table I), with the distribution peak centered at q EA ≈ 1/20.This is compatible with a state in which amino acids vary essentially at random.Thus, we conclude that for T s > T c s the system displays a single disordered phase.2), the p(q) is still compatible with a binomial distribution.The first two of them (where T n s < T s < T c s ) are fitted with the parameters N f = 0 and n aa < 20 indicating that, according to the minimal model, all residues of the chain can still change, but there is a selection on the type of amino acids they can host.
At the lowest temperature T s = 8 • 10 −4 (T s < T n s , at which the predicted structure is essentially identical to the reference one), the shape of p(q) is more irregular, with a tail reaching as maximum similarity q M = 1.This suggests that the explored manifold is more complex than at larger temperatures, with energy minima at any mutual distance.The fact that q M = 1 indicates that the number of minima is small enough that the probability that the system returns to the same sequences is not negligible.To rule out the possibility that these results are artifacts due to long correlation times in the sampling, we have downsampled the data, obtaining a distribution almost identical to the original one (dotted line in Fig. 2).
To challenge the minimal binomial model, we have then analyzed the degree of conservation of the sites of the protein as a function of the temperature.The site entropy S(i) ≡ − α p i (α) log p i (α), where p i (α) is the probability of observing the amino acid of kind α at site i, is zero if the site is perfectly conserved and log 20 ≈ 3 if it displays a uniform probability of hosting the 20 amino acids.At high temperatures (T s > T c s ), the distribution of amino acids is uniform in all sites (Fig. 3).At the lowest temperature (T s < T n s ), there are 7 sites that are never mutated and another 9 that are highly conserved, their entropy being lower than 1.Interestingly, approximately one-third of sites display an entropy larger than 2, comparable to that of high-temperature sequences.At the intermediate temperatures (T n s < T s < T c s ) there is still a (variable) number of low-entropy, highly conserved sites, and a majority of sites whose entropy is comparable to that of high-temperature sequences.
The picture that emerges is that, at all temperatures T s < T c s , there is a clear partitioning between highly and poorly conserved sites, and that the main effect of temperature is to define the ratio between the two.
As a consequence, in the case of protein sequences the distribution p(q) (Fig. 2) may not contain all the relevant information and it could be misleading, erroneously suggesting that even at low temperatures the system is in a highly disordered state.To overcome this problem, we defined a new distance d W that weights differently high and low-entropy sites, where K is the set of sites that display zero entropy at the lowest temperature T s = 8 • 10 −4 in Fig. 3, i.e. 5, 14, 26, 30, 41, 43 and 54.This distance accounts for sequence differences in two disjoint scales, namely units when differences are in K-sites and fractions of units when they are not in K-sites.
At the lowest temperature T s = 8 • 10 −4 (T s < T n s ), the distribution p(d W ) spans only one unit (Fig. 4), since all K-sites are conserved by definition.Raising the temperature to T s = 1.6 • 10 −3 (T s < T n s ), the distribution shifts towards greater values of d W , while still maintaining a peak at d W < 1.This indicates that visited combinations of the K-sites amino acids are closely distributed in the sequences space, contrary to the information carried by the standard Hamming similarity distribution (cf.Fig 2).The distribution of d W among sequences with the same residues in the K-sites (d W < 1) has a peak around 0.7 and a tail to 0, with the same shape as the case T s < T n s .This shape suggests that not all the pairs of sequences display the same distance, analogously to the replica symmetry breaking in spin glasses.On the other hand, peaks at d W > 1, corresponding to different amino acids in the K-sites, are more binomial-like, suggesting that the other sites are uncorrelated.Above vanishes (in agreement with N f ≈ 0 found from the binomial model fit, see Table I), while minor peaks arise at intermediate distances, d W ∼ 2 and d W ∼ 3.This is due to the fact that the number of relevant sites for the structural properties of the sequence diminishes as the temperature rises, since the structural constraint on the amino acid sequences becomes less and less rigid.This is also visible in Fig. 3, where at T s = 3.2 • 10 −3 , S(i) > 2 for the K-sites i = 14, 30 and 54.At T c s , the K-sites vary as all other sites.

C. Comparison with experimental data
Sequences produced by the natural evolution of protein G can be obtained from the PFAM database [18].The main statistical observable that can be calculated from these data and compared with the simulations is the site entropy (dashed curve in Fig. 3).
There are at least two important reasons that make the comparison difficult.First, PFAM sequences are not an unbiased ensemble that reflects the evolution of organisms but they are affected by the choices of researchers to study specific homologous.Moreover, simulations only require that a sequence folds to the correct native state but does not add any functional requirement.This simplification is likely to increase the entropy of sites that lie on the surface of the protein and that are involved in interactions with the cellular environment.
For this reasons, there is no value of T s at which the entropy of the simulated sequences matches that of the experimental data.Natural sequences conserve non-K-sites much more than any simulation.On the other hand, K-sites are partially conserved similar to what simulations do in the intermediate temperature range.
Studying the Pearson correlation coefficient between the experimental and the simulated entropy per site (cf.Fig. S6 in the Supp.Mat.), it is clear that there is not a significant difference in correlation for temperatures In what follows, we shall focus our attention on temperature T s = 3.2•10 −3 , which belongs to the intermediate regime as experimental data seem to do; at the same time, it is high enough that simulations are computationally fast.

D. Ruggedness of the landscape is mainly determined by changes in K-sites
An interesting feature of the energy landscape of sequences is its ruggedness.To investigate this point, we generated some artificial low-energy sequences, starting from random ones and quenching the temperature (see Table II), studying the energy landscape along the trajectories that link them to the protein G sequence (1PGB in Table II).
We used a ratchet algorithm (see Methods, Sect.II B) to generate trajectories at T s = 3.2 • 10 −3 between pairs of sequences.This algorithm does not push the sequence toward its target but only dumps fluctuations in the opposite direction.Consequently, we expect that it will not force the system to cross barriers higher than those that would cross spontaneously by thermal fluctuations [25].
Trajectories can leave the initial sequence in a few thousand mutations and reach the target sequence in less than 10 5 mutations (upper panel in Fig. 5).
The maximum energy reached by the simulation is in the range between 0.09 and 0.14 (lower panel in Fig. 5), which is larger than the spontaneous fluctuations that the system displays at this temperature, at which the mean effective energy is E = 0.074 ± 0.016 (cf.Fig. 1).This fact indicates that the system can encounter relevant energy barriers along its motion.The peak in the energy is close to the time when the K-sites approach that of the target sequence (dashed vertical lines) (cf.Fig. S8 in the Supp.Mat.).The peak is largest for sequence S2, which displays the most different K-sites combination from that of the protein G (cf. Tab.II).
After the K-sites are changed, all sequences take several tens of thousands generations to reach the target sequence.However, mutations of amino acids in sites that are not K-sites do not generate energy barriers comparable to thermal fluctuations.
Summing up, trajectories between low-energy sequences are neutral except for changes in K-sites, which generate barriers that are anyway surmountable by thermal fluctuations.

E. Local entropy of the basins
Proteins are expected to tolerate random mutations in order to be evolutionary fit [8].
Such tolerance can be characterized by the width of the neighborhood of a protein sequence σ, quantified by its local entropy difference ∆S Ts,γ (σ) (Eq.7), as done in the energy landscape of other kinds of complex systems [2].
We calculated the local entropy of the basins defined by natural sequences and by lowenergy sequences sampled by the Monte Carlo algorithm but not present in nature (cf.Table  II).

II).
For each natural sequence (see Table II), we first ran ∼ 10 4 steps at T s = 3.2 • 10 −3 in order to obtain, for each basin, typical sequences for that temperature which still maintain the same K-sites of the starting natural ones.The sequences produced by this equilibration process, which is necessary to compare correctly the width of the basins within the framework of the canonical ensemble (cf.Fig. S10), are labeled with an overbar (cf.Fig. 6 and Table S1).We then proceeded to calculate the local entropy difference ∆S Ts,γ as a function of the Lagrange multiplier γ that controls the average distance from the representative sequence, using Eq.(7).
Interestingly, at any value of γ the local entropy of the basins defined by natural sequences is significantly larger than those of artificial sequences (see Fig. 6).So, at any length scale around each σ, natural sequences display a wider energy basin than artificial ones, while the associated energies are similar (cf.Table S1).
The entropy ∆S Ts,γ is for all folding sequences markedly lower than that of the binomial model (dashed curve in Fig. 6).This is not unexpected, as the manifold sampled at realistic T s is not convex, as would be if the binomial approximation were correct.
Matching the dependence of ∆S Ts,γ on γ with that of q on γ, one can infer the dependence of ∆S Ts,γ as a function of q (see Fig. S9 in the Supp.Mat.).This curve displays a linear growth, indicating that the number of low-energy sequences in the neighborhood of each σ grows exponentially with the distance from it.

F. Searching for high-local entropy sequences
A relevant question is then whether there is a way to find efficiently sequences in wide energy basins, avoiding those that lie in narrow minima.In the field of artificial neural network, this goal was achieved sampling the space of the network parameters with replicas whose mutual distances are coupled together by the Lagrange multiplier γ and varying (annealing) slowly γ until the system converges to a unique set of parameters [2].We have applied the same strategy to the space of protein sequences, as described in Sect.II D.
Starting from random sequences, the system can converge to a unique sequence of low energy with annealings of the order of ∼ 10 5 steps (Fig. 7a).We compared these sequences with those generated with quenches from infinite temperature to T s = 3.2 • 10 −3 (Fig. 7b), recording a sequence when its energy reaches the average value at this temperature (cf.Fig.  I).

1).
In particular, we compared the K-sites of ten sequences obtained from the replica simulations with sequences obtained from ten temperature quenches.We defined q K as the maximum Hamming similarity between the K-sites of a simulated sequence with those of any natural sequence taken from the PFAM database.In this way, q K (σ) = 1 if there is at least a natural sequence displaying the same K-sites of the simulated sequence σ.The average q K of the sequences obtained from the replica simulations is 0.71 ± 0.21, which is significantly larger than the value 0.39 ± 0.20 obtained from the quenches (cf.Fig. 7c,d).
Furthermore, the p-value (obtained from a t-test) for the two distributions is 2.4•10 −3 .Thus, sequences in large basins are more similar from the point of view of K-sites combinations to natural sequences than other ones selected at random with comparable energy.
Molecular-dynamics simulations of a sequence found by the replica algorithm (cf.Fig. S11 in the Sup.Mat) display a RMSD to the putative native structure of 0.14 ± 0.02 nm, which is the more similar to the wild-type sequence 1pgb than those obtained from a temperature quench.
It is also worth mentioning that if sequences are let evolve for ≈ 3 • 10 5 generations after the quench, they can reach the widest basins where natural sequences lie (cf.Fig. S12 in the Sup.Mat.).This makes the case of protein sequences different from other complex systems for which the replica algorithm was originally developed, where the system is unable to reach wide basins with simple Monte Carlo moves [2].The main reason for this difference seems to be that the relevant dimensions of the sequence space are just the seven ones that define the K-sites.Thus, the effective dimension of this system is much smaller and it can therefore be sampled much more easily, than that of other complex systems.
We stress that the advantage of the replica algorithm to find wide basins is not much that of computational efficiency, that is marginal for a protein as small as protein G, but that of guaranteeing to ignore narrow basins, allowing us to distinguish the properties of wide minima (in the present case, the composition of K-sites) from those of minima at large.
Our proof of concept however shows that the algorithm can be easily extended to the case of more complex, larger proteins.

IV. DISCUSSION
Characterizing the space of sequences folding to a well-defined native structure is useful to define the constraints that bind evolutionary trajectories of proteins.Recently developed machine-learning models like ESMFold are an efficient tool to define the landscape of foldable sequences.
A concern intrinsically associated with this approach is the reliability of the predictions of the machine-learning model for sequences that are far from natural ones.A feature of ESM-Fold that makes it particularly suitable for our goal is that, at variance with other predictors, it does not use information from alignments of homologous sequences.Its prediction does not stem from what amino acids are observed in the very same sites in evolutionary-related proteins, but from the overall information coming from all available protein structures, which gives good predictions even for test sets not containing sequences homologous to those used for the training (cf.appendix B in ref. [9]).Consequently, its performance is not expected to drop as the sampling departs from the set of naturally-observed sequences.In fact, the molecular dynamics simulations we did starting from the predicted native conformation of sequences very far from the natural ones proved very stable.On the contrary, AlphaFold [10] did not produce any reasonable structure given the same sequences, since no homologous are found.
The picture that emerges from our sampling of the sequence space of protein G is that foldable sequences form a wide basin that contains all natural homologous and a constellation of smaller basins that display similar effective energy as the main one but that are narrower, displaying lower entropy.The different basins are characterized by different combinations of amino acids in a limited number of specific sites, here termed K-sites.The other amino acids, not belonging to K-sites, are rather free to mutate, thus generating a connected set of well-folding sequences up to very large Hamming distance from the wild-type.Only mutations in the K-sites seems to generate energy barriers corresponding to poorly-folding sequences.
The presence in the sequence landscape of different basins characterized by specific choices of amino acids in few key sites of the protein was already found in minimal models [24] and in simplified models with knowledge-based potentials [23].This fact suggests that it is not a consequence of the particular energy function used here, but it is a general feature of this kind of systems.Differently from what suggested in the case of simplified protein models, the key sites we found are not those critical for folding kinetics [14].
An important result of this study is that natural sequences folding to the structure of protein G belong to a wide basin, which maximizes the local entropy.One could hypothesize that being able to accumulate several mutations while maintaining the same native structure is an evolutionary advantage for a protein.
Wide energy basins can be found very efficiently with algorithmic schemes borrowed from the theory of complex systems.These do not seem to mimic in any way the evolutionary dynamics of proteins but are indeed a fast computational tool.For a small protein like protein G, we saw that it is possible to find the widest basin simply with a Monte Carlo algorithm in a manageable computational time, even without resorting to local entropy minimization.However, it seems unlikely that the same can be done for larger system, in which the dimension of the sequence space is larger.On the other hand, the algorithm for entropy-driven search can be made more efficient than in the present proof of concept in a number of ways [17].

FIG. 1 :
FIG. 1: Upper panel: the measured average effective energy E (blue circles) and the associated specific heat C v (red circles) calculated from the samplings at different temperatures T s .The solid lines indicate E and C v as estimated from a multiple-histogram algorithm.Lower panel: the values of pLDDT (green circles) representing the average degree of confidence associated to a sequence for its predicted native structure at the different temperatures T s .

FIG. 2 :
FIG. 2:The distribution of sequence similarity q between the pairs of sequences sampled at different temperatures T s .The dot-dashed line is the binomial fit of Eq. 9.The dotted line at T s = 8 • 10 −4 is obtained by decimation of the data, taking one sequence every 10 4 steps.

FIG. 3 :
FIG. 3: The degree of conservation of the sites quantified by the associated site entropy for the sequences sampled at different temperatures T s (coloured lines) and for the experimental homologous of protein G (dashed line).

FIG. 4 :
FIG. 4: The distribution of distances d W between the sequences sampled at different temperatures T s .

FIG. 5 :
FIG.5:The Hamming similarity parameter q (upper panel) and the effective energy E (lower panel) along trajectories generated with a ratchet algorithm between the system and the 1PGB sequence at T s = 3.2 • 10 −3 .The Hamming similarity parameter q is calculated from the initial (dashed curve) and from the target 1PGB (solid curve) sequence along each trajectory.The grey line and the shaded area in the lower panel are the mean energy and the associated standard deviation, respectively, at T s = 3.2 • 10 −3 .The vertical dotted lines mark the time when the Ksites approache the target combination of the 1PGB sequence.In the legend, the starting sequences for each trajectory (cf.TableII).

FIG. 6 :
FIG. 6: The local entropy difference ∆S Ts,γ (σ) plotted as a function of the parameter γ for (equilibrated) natural sequences and artificial ones.The grey dotted line represents the theoretical local entropy difference for the binomial model at T s = 3.2 • 10 −3 (see TableI).

FIG. 7 :
FIG. 7: (a) The effective energy of replicated simulations with y = 5.Transparent curves are the effective energy of each replica sequence, while the solid lines represent their mean.The vertical grey dotted lines indicate the value of γ along the annealing.(b) The effective energy of quenched simulations.The horizontal grey dotted line is the mean effective energy at T s = 3.2•10 −3 (reported in the bottom left corner).(c) The non-normalized distributions of q K for the replicated (in blue) and quenched (in orange) simulations.(d) The K-sites found from the replicated and quenched simulations, compared to the most similar amino acids from the PFAM database and the associated q K .

TABLE II :
Some of the sequences used in the calculations.1PGB, E1NUT2, K9EXL1A and K9EXL1B are natural sequences labeled by their pdb code.S1, S2 and S3 are artificial sequences obtained by quenching the temperatures in Monte Carlo samplings.In bold, the K-sites amino acids for each sequence.