DCA for genome-wide epistasis analysis: the statistical genetics perspective

Direct Coupling Analysis (DCA) is a now widely used method to leverage statistical information from many similar biological systems to draw meaningful conclusions on each system separately. DCA has been applied with great success to sequences of homologous proteins, and also more recently to whole-genome population-wide sequencing data. We here argue that the use of DCA on the genome scale is contingent on fundamental issues of population genetics. DCA can be expected to yield meaningful results when a population is in the Quasi-Linkage Equilibrium (QLE) phase studied by Kimura and others, but not, for instance, in a phase of Clonal Competition. We discuss how the exponential (Potts model) distributions emerge in QLE, and compare couplings to correlations obtained in a study of about 3,000 genomes of the human pathogen Streptococcus pneumoniae.


I. INTRODUCTION
Direct Coupling Analysis (DCA) is a collective term for a number of related techniques to learn the parameters in Ising/Potts models from data and to use these inferred parameters in biological data analysis [1].DCA has led to a breakthrough in identifying epistatically linked sites in proteins from protein sequence data [2][3][4][5][6], which in turn has been used to predict spatial contacts from the sequence data [7][8][9].DCA has also been used to identify nucleotide-nucleotide contacts of RNAs [10], multiplescale protein-protein interactions [11,12], amino acidnucleotide interaction in RNA-protein complexes [13] and synergistic effects not necessarily related to spatial contacts [14][15][16][17].
Skwark et al applied a version of DCA to wholegenome sequencing data of a population of Streptoccoccus pneumoniae [18], and were able to retrieve interactions between members of the Penicillin-Binding Protein (PBP) family of proteins as well as other predictions.S. pneumoniae (pneumococcus) is an important human pathogen where resistance to antibiotics in the β-lactam family of compounds are associated to alterations in their target enzymes, which are the PBPs [19].Further results were recently given in [20] showing robustness by using sequencing data from a pneumococcal population from another continent, and identifying a novel seasonal phenotype signal.Three of the authors of the current article additionally recently showed that DCA analysis on the bacterial genome scale does not need supercomputing resources, but can be carried out in a reasonable time (hours) on a standard desktop computer [21].
These advances raise the question why DCA works at all, and if one can identify from the outset when that is the case.As discussed by one of us "max-entropy" arguments sometimes evoked in the literature are not pertinent to this issue [22].Instead, we will here argue that at least for genome-scale data the answer lies in a very different direction.We will show that the Quasi-Linkage Equilibrium (QLE) of Kimura [23][24][25], as extended by Neher and Shraiman to statistical genetics on the genome scale [26,27], provides a natural and rational basis for DCA.According to this theory a population evolving with sufficient amount of exchange of genetic material (recombination, or any form of sex) will settle down to a dynamic equilibrium where the distribution of genotypes is of the form assumed by DCA.In the opposite case of little exchange of genetic material (little sex) the distribution over genotypes is different and dominated by clones, identical or very similar individuals descended from a common ancestors.In such a setting DCA is not an appropriate approach, and is likely to yield nonsense results.
We will also discuss the inference task of DCA in the context of QLE as realistically applied to biological data.We will first show that DCA can give a much more sparse representation of the data than correlations (covariances).This is in line with the intended meaning of the acronym DCA: the parameters in a Potts or an Ising model can be considered "direct couplings", and while these are typically reflected in correlations (covariances), the latter also includes combined effects, or "indirect couplings".Second, the authors of [26,27] assumed that a genotype can be described by a Boolean vector i.e. a string of 0's and 1's.This is almost never the case for population-wide whole-genome sequencing data due to varying gene content, which have to be represented as gaps.We have therefore generalized the theory to categorical data and to a model of bacterial recombination.Third, as surveyed in [1], DCA as a methodology has matured considerably over the last decade.For the mathematical task of inferring parameters in a Potts or Ising model from data which was generated from such a model, the Small-Interaction Expansion (SIE) used in [26,27] is inferior to many other inference methods.We will show that it is also inferior when applied to real data in the sense of yielding much less sparse results, and would also have specific problems when applied to simulation data.A conclusion of this work is hence that when QLE is combined with DCA on the genome scale, it should be combined with one of the modern and more powerful versions of DCA.
The paper is organized as follows.Sections II-V reformulate the theory of [27] in a way suitable to our presentation and for categorical data.Section II hence contains a non-mathematical overview, while Sections III and IV contain the specific changes needed for categorical data and our model of bacterial recombination.Section V formulates the dynamics of Potts model parameters in QLE phase, which is a central result of the theory.Section VI presents results for real sequence data and Section VII for simulation data.Section VIII contains discussion and outlook for future work.Technical details such as a derivation of SIE for categorical data, sequence and code availability are given in appendices.

II. STATISTICAL GENETICS AND QUASI-LINKAGE EQUILIBRIUM
We will here first present key concepts and results in a mostly non-mathematical manner.The driving forces of evolution are assumed to be genetic drift, mutations, recombination, and fitness variations.The first refers to the element of chance; in a finite population it is not certain which genotypes will reproduce and leave descendants in later generations.The three latter are deterministic, describing the expected success or failure of different genotypes.Mutations are hence random genome changes described by mean rates.
Recombination (or sex) is the mixing of genetic material between different individuals.In diploid organisms every individual inherits half of its genetic material from the mother, and half from the father.This material is further mixed up in the process called cross-over so that each chromosome of the child consists of segments alternately inherited from the mother and the father.By sequencing the parents and children in a single family the per generation mutation rate and number of crossover segments in human has been measured to be about 30 and 100 [28], numbers that are in line with previous estimates.By this measure recombination is hence in human about three times faster than mutations.In bacteria recombination happens by transformation, transduction, and conjugation.The ratio of recombination to mutations differ greatly between different bacterial species and can also differ between different strains and different environments of the same species.In this work we use data from S. pneumoniae where this ratio has been estimated from less than one to over forty, but with an average close to nine [29].Similarly to the analysis in [27] we will for the most part here assume that recombination is a faster and stronger effect than mutations.
Fitness means in statistical genetics a propensity for a given genotype to propagate its genomic material to the next generation.Like mutation and recombination fitness is hence here a rate, measured in units (time) −1 .Fitness variations refer to the variations of these rates.Consider then then the effects of recombination and fitness on correlated variations in a population, ignoring mutations and genetic drift.The correlation between alleles α and β at loci i and where f i (α) is the frequency of allele α at locus i and similarly f j (β), and where f ij (α, β) is the frequency of simultaneously finding alleles α and β at loci i and j.If there is recombination between i and j but are no fitness variations at all, then it is trivial to see that M ij (α, β) must decay to zero.This state is called Linkage Equilibrium (LE).
If now instead fitness variations are small but non-zero, then non-zero correlations may persist.We will assume that the fitness of genotype g which carries allele g i on locus i depends on single-locus variations and pair-wise co-variations, that is If so, the first central result of statistical genetics is that when recombination is sufficiently strong, the distribution will have the form The above distribution is also the Gibbs-Boltzmann distribution over variables g with energy terms h i and J ij , and where Z (the partition function) is the normalization.The second central result is that where r is an overall recombination rate and c ij is the probability that alleles at loci i and j are inherited from the same parent.For the most part we will in the following assume that c ij equals 1 2 , appropriate if recombination is sufficiently strong and loci i and j are sufficiently far apart on the genome.Note that the right-hand side of (3) is the ratio of two rates, and therefore dimensionless.For the distribution (2) the parameters J ij (α, β) carry the same information as the correlations M ij (α, β) but in a de-convoluted or "direct" manner.
Inferring J ij (α, β) from data is what the methods known as DCA achieve [1].From (3) this gives fitness parameters F ij (α, β) up to a proportionality (the overall rate r), and for pairs of loci sufficiently far apart on the genome so that c ij is approximately constant.Recombination does not change single-locus frequencies; in a stationary state parameters h i (α) instead result from a dynamic equilibrium between fitness and mutations.In the absence of mutations QLE in an infinite population is in fact only a long-lived transient while the h i (α) change slowly in time as the population drifts towards fixation.In a finite population both h i (α) and J ij (α, β) also fluctuate in time, and the prediction (3) does not apply directly.All these aspects have to be taken into account when applying DCA techniques to analyze a QLE phase.
The third central prediction of statistical genetics is that when fitness variations still have the form (1) but are not small compared to recombination, then the distributions will not be of the form (2). In that phase, in [26,27] called Clonal Competition (CC), the distribution is instead better described as where the sum goes over clones, µ c is the weight of clone c, and P c (g) is some distribution peaked around clone center g c .Statistical genetics hence predicts a parameter-dependent transition between the two canonical distribution families in high-dimensional statistics, namely the exponential model (( 2)) and the mixture model ((4)).A further difference between QLE and CC is that in QLE the joint distribution over more than one genotype approximately factorizes, P (g 1 , . . ., g N ) ≈ P (g 1 ) • • • P (g N ).In CC phase this is not so; genomes related by descent do not vary independently.A different analogy, discussed in [26,27] is that of equilibrium states in disordered systems [30,31]; (2) is like a high-temperature replica-symmetric para-magnetic phase, while ( 4) is like a low-temperature replica symmetry-breaking spin glass phase.

III. STATISTICAL GENETICS FOR CATEGORICAL DATA
In this section we summarize statistical genetics as formulated in [26,27] in a more technical manner, and generalize the theory to categorical data i.e. to when there can be more than two alleles per locus.Let there be M i alleles at locus i and let the allele be indicated by a variable l i that takes values 1, 2, . . ., M i .The frequency of allele α at locus i is These quantities satisfy Mi α=1 f i (α) = 1.The covariance matrix between loci i and j is The variance matrix at one locus is and satisfies Statistical genetics are evolution equations for the distributions over genotypes ) where the three terms on the right-hand side represent the changes due to mutations, fitness variations and recombination.The mechanisms of mutations and fitness are classical in population genetics, and known as Wright-Fisher models.
Single-locus mutations are hence modelled by matrices µ (i) α,β which give the rate at which allele α on locus i changes to allele β.Let F (i) α,β be the operator which if the allele at locus i is α changes it to β, and otherwise does nothing.In the dynamic equation for probability mutations hence enter as ) This gives contributions to the dynamic equations for the frequencies and correlations as In the simulations reported below all transition rates µ are the same.As discussed above it is often a reasonable assumption to take mutations a weaker effect than fitness variations and recombination.
Fitness variations, such as (1) above, act on the distributions over genotypes as where F = g F (g)P (g) is the instantaneous average of fitness over the population.Potts models are defined as As written the model over-parametrized since the same distribution is found by shifting all h i (α) by a constant c i or all J i,j (α, β) by a vector c ij (β), or a vector c ij (α).
In the DCA literature it is customary to go to the Ising gauge [2,32] given by

IV. BACTERIAL RECOMBINATION IN STATISTICAL GENETICS
Recombination (or sex) takes many different forms depending on if the organism is haploid or diploid and the type of recombination.The mechanism formulated in [26,27] is specifically for sexual reproduction in haploid yeast, where two parents each produce a mating body (copy of parent genome), and these two mating bodies merge and produce one new genome while the other half of the genetic material of the two mating bodies is discarded.As closer to our data we consider instead a form of bacterial recombination, for which however the evolution essentially turns out to be the same, modulo a Stosszahlansatz.
Recombination is thus (we assume) distinguished by two genomes merging and forming two new genomes.In an elementary step two genotypes are therefore lost (the parents) and two genotypes are gained (the offspring).Let E g1,g2→g 1 ,g 2 be the event that two individuals with genotypes g 1 and g 2 recombine and give two individuals g 1 and g 2 .To describe the kinetics of the individual process we assume that recombination between the two parents happen with rate rQ(g 1 , g 2 ) where r an overall rate of recombination and Q(g 1 , g 2 ) a relative rate.The two new genotypes g 1 and g 2 are specified by an indicator variable ξ: and this outcome of the recombination happens with probability C(ξ).The total rate of the individual event is hence rQ(g 1 , g 2 )C(ξ).The change of the distribution over genotypes due to recombination is given by In practice it is hard to use (17) without assuming that the pair probabilities factorize, as in gas collisions at low densities in statistical physics.We assume for simplicity also that Q depends only on the overlap q between the two genotypes g and g : Recombination as modelled above does not change the overlap.This can be seen as follows::

As the indicator variable takes values zero and one this gives 1 l (1)
i ,l (2) i = 1 li,l i .By this invariance of overlaps and the factorization assumption the right-hand side of (17) simplifies to: where we have used the abbreviations The first of these is the probability that two loci are inherited from the same parent and does not (for this model) depend on the genotype g.The last three averages on the other hand depend on g.However, if the function Q is not too sharply focused the dependence can be taken weak.In particular, we assume that Q is self-averaging, and essentially does not depend on g.In spin glass physics language we hence assume that Q , Q1 l i ,α and Q1 l i ,α 1 l j ,β are self-averaging in the "paramagnetic" phase where QLE is expected to hold.

V. EVOLUTION EQUATION FOR LOG-PROBABILITY IN QLE
In QLE the evolution equation can conveniently be written for the logarithmic probability and the various terms identified.Fitness enters (24) as ) and if there were higher-order terms in fitness (more than pair-wise dependencies) they would enter higher than quadratic terms in the QLE distribution in the same way.Ignoring mutations and genetic drift we have for the pairwise dependencies Ji,j (α, where the contribution from recombination is simply read off from (19).( 26) is a relaxation equation which pushes the Potts model parameter J i,j (α, β) to be the ratio of two rates, (3) above.When the data is from one population in a stationary state the average relative rate Q can be subsumed in the overall rate r.When gentoypes are Boolean vectors this gives the same result as Eq. 25 in [27].
As observed above recombination does not change single-locus frequencies, and without mutations the f i (α) will drift towards fixation (taking values 0 or 1).Once the population has reached fixation at locus i there can no longer be any non-zero correlation of Potts parameter involving i, and in such a setting QLE is therefore only a long-lived quasi-stationary state (for the correlations, and for the J i,j (α, β)'s).Note that by (19) recombination terms enter in the evolution equations of h i (α), in combination with the quantities J i,j (α, β).This is no contradiction, because when correlations are non-zero there is not a one-to-one relation between single-locus frequencies (f i (α)) and Potts model magnetization parameters (h i (α)); recombination can influence the latter but not the former.
The break-down of the relaxational equation ( 26) when the single-locus frequencies go to fixation can be understood as follows.In such a setting the J i,j (α, β)'s would first remain of order unity, while the h i (α)'s would tend to ±∞.When the h i (α)'s become large enough that the minor alleles in a finite-N population are likely to be present only in a few copies, a few random events can remove all of the remaining ones at once, which sets the correlation and the J i,j (α, β) to zero in one go.Alternatively the argument can be made starting from Eqs. 37-29 in [26], which are stochastic differential equations for the frequencies and pair-wise correlations (f i (α) and M ij (α, β) in our case).When translated to equations for h i (α) and J i,j (α, β) near fixation the noise will be large, which destabilizes (26).

VI. DCA FOR WHOLE-GENOME SEQUENCING DATA
A set of whole-genome sequences of the human pathogen S. pneumoniae obtained in the Maela collaboration (Materials & Methods) can be represented as about 3, 000 genotypes of about 100, 000 loci each.Correlations and Potts model terms obtained from this data have qualitatively different distributions, as shown in Fig 1 and Fig 2.
The number of correlations larger than a cut-off c grows quickly when c decreases below its maximum value, while the cumulative distribution of inferred Potts model  couplings have a much more pronounced tail.This implies that the set of largest DCA couplings is better separated than the largest correlations from the unavoidable background due to under-sampling [33,34].Correlations also generally have a more uniform distribution across genomic distance, while the representation as a Potts model is more sparse ( Fig 3 shows pair-wise scatter-plots of correlations and DCA terms obtained by PLM and SIE.In all three cases the scatter-plots are "clouds of points", indicating that DCA and correlations measure different properties of the data.The scatter-plot of PLM vs correlations shows a weak trend, such that larger PLM scores are associated to larger correlations.Such trends are absent in correlations-SIE and PLM-SIE, and SIE values are also numerically large.In fact, the correlation matrix is under-sampled, hence smaller correlations reflect sampling noise and this is also an issue for SIE, as well as the sensitive dependence on almost fixed alleles in this procedure.PLM scores and correlations were compared graphically for this data in [21], with a cut-off excluding short-range interactions.

VII. THE QLE PHASE IS OBSCURED BY GENETIC DRIFT
Genetic drift is the random changes from one generation to the next due to chance events.In a finite population statistical genetics as described above only holds on the average; when following one population in time fluctuations of order N − 1 2 appear for observables such as single-locus frequencies and pair-wise loci-loci correlations.Fig. 4 reports simulations using the FFPopSim software showing that these fluctuations can in practice be quite large, even for populations that are not small.
According to the theory developed in [27] (Appendix C) dynamics of correlations is relaxational and the curves of correlations vs time hence should fluctuate around an equilibrium value, which is the one given in (3) above.The fluctuations in Fig. 4 are however large compared to the pair-wise fitness values, and DCA inference from instantaneous values of the ensemble correlations are not good predictors for pairwise fitness (data not shown).The dynamics of frequencies is not relaxational, and one may hence observe large changes where the population at one locus changes from one allele to another.A further conclusion is that SIE should not here be an appropriate inference procedure also because fluctuations in the frequencies are large and have long time scales; flavors of DCA that rely only on correlations should exhibit better performance.

VIII. DISCUSSION
The main question addressed in this work is if and when DCA can be expected to work for genome-scale epistasis analysis.We have given an answer in the context of statistical genetics: for a population evolving under recombination, mutations and fitness variations this is so when recombination is sufficiently fast.The joint distribution of the population over genotypes then approximately factorizes into a product of identical Potts distributions (2).Treating a set of genomes as independent samples from such a distribution allows to infer fitness parameters (F ij ) from Potts model parameters (J ij ) by inverting (3), and this is essentially what using DCA on such data means.We now discuss limits to the analysis and further directions.
The first limitation is that DCA cannot be expected to yield meaningful results when recombination is weak.One example of such an effect was already given in [18] where also data from Streptococcus pyogenes was presented (Fig. 6 of [18]).Another example was recently given on Vibrio parahaemolyticus, a human gastrointestinal pathogen in panmixia i.e.where all strains are able to recombine, but having a very low overall recombination rate [35].The problem of inferring fitness from data on populations that are in the CC phase appears to be both conceptually and practically important; we hope to be able to return to such questions in future work.
A second limitation concerns finite populations, particularly simulated data, where the population has to be of moderate size.According to the theory developed for Ising genomes in [27] (Appendix C) and qualitatively confirmed above in Fig. 4, frequencies and correlations follow stochastic differential equations with noise strength scaling as N − 1 2 .In principle Potts model parameters (h i (α) and J ij (α, β)) for categorical data also follow stochastic differential equations, but of a more complicated form due to the inverse Ising/Potts relations.Applying the DCA procedure to finite-N data thus requires parameter inference from a high-dimensional stochastic time series with a complicated deterministic part.This may not be an easy task.
A third limitation is the neglect of spatial and environmental separation.Bacteria such as the human pathogen Helicobacter pylori readily recombine if they meet, but can only do so when their human host populations overlap [36].Allele frequencies may be different for different bacterial populations, reflecting differences in the host populations and environments.If data from the different populations is pooled, this will be a confounding factor for some flavors of DCA e.g. for PLM and SIE, These types of issues appear to merit further study.

FIG. 1 .
FIG. 1. Cumulative distributions of (a) correlations; (b) pseudo-likelihood maximization (PLM); (c) small-interaction expansion (SIE); Semi-logarithmic scale visualizing the distributions of of approximately 10 10 elements.The scalar value associated to each pair of loci i and j is the Frobenius norm of the 3 × 3-correlation matrix (case a) or the Frobenius norm of the inferred Potts model 3 × 3-matrix element (cases b and c).

FIG. 2 .
FIG. 2. Distributions as functions of genomic distance: (a) correlations; (b) pseudo-likelihood maximization (PLM).(c) small-interaction expansion (SIE).Same data and same norms as in Fig. 1.Figures show averages in windows of genomic distance.Blue: maximum, red: top-1%, yellow: top-5%, violet: top-10%, green: mean, all in window.The curves in (b) show a sharp initial decrease with genomic distance which generally much lower values beyond genomic distance 10 3 where recombination can be expected to act effectively.
Fig 2 (a) and (b)),.The first-order perturbative version of DCA employed in [26, 27] here called SIE gives in both instances results closer to correlations, see Fig 1 (c) and Fig 2 (c).

FIG. 3 .
FIG. 3. Pair-wise scatter plot of correlations, SIE and PLM: (a) PLM vs correlations; (b) SIE vs PLM; (c) SIE vs PLM.Same data and same norms as in Fig 1 and Fig 2. The numerical scale in each direction depends on the details of the norms and inference procedure, e.g P LM scores depend on L2 regularization parameters.Correlations and PLM scores are numerically similar while SIE scores are not, as discussed in text.

TABLE II .
4arameter settings for the simulations reported as Figs.4in main text.