GIFT: new method for the genetic analysis of small gene effects involving small sample sizes

Small gene effects involved in complex/omnigenic traits remain costly to analyse using current genome-wide association studies (GWAS) because of the number of individuals required to return meaningful association(s), a.k.a. study power. Inspired by field theory in physics, we provide a different method called genomic informational field theory (GIFT). In contrast to GWAS, GIFT assumes that the phenotype is measured precisely enough and/or the number of individuals in the population is too small to permit the creation of categories. To extract information, GIFT uses the information contained in the cumulative sums difference of gene microstates between two configurations: (i) when the individuals are taken at random without information on phenotype values, and (ii) when individuals are ranked as a function of their phenotypic value. The difference in the cumulative sum is then attributed to the emergence of phenotypic fields. We demonstrate that GIFT recovers GWAS, that is, Fisher’s theory, when the phenotypic fields are linear (first order). However, unlike GWAS, GIFT demonstrates how the variance of microstate distribution density functions can also be involved in genotype–phenotype associations when the phenotypic fields are quadratic (second order). Using genotype–phenotype simulations based on Fisher’s theory as a toy model, we illustrate the application of the method with a small sample size of 1000 individuals.


Introduction
Identifying the association between phenotypes and genotypes is the fundamental basis of genetic analyses. In the early days of genetic studies, beginning with Mendel's work at the end of the 19th century, genotypes were inferred by tracking the inheritance of phenotypes between individuals with known relationships (linkage analysis). In recent years, the development of molecular tools, culminating in high-density genotyping and whole genome sequencing, has enabled DNA variants to be directly identified and phenotypes to be associated with genotypes in large populations of unrelated individuals through association mapping. Genome-wide association studies (GWAS) have become the method of choice, largely replacing linkage analyses, because they are more powerful for mapping complex traits, that is, they can be used to detect smaller gene effects, and they provide a greater mapping precision as they depend on population-level linkage disequilibrium rather than close family relationships. For example, the 2021 NHGRI-EBI GWAS catalogue currently lists 316 782 associations identified in 5149 publications describing GWAS results [1]. Additionally, extensive data collection has been initiated through efforts such as the UK Biobank [2], Generation Scotland [3] and the NIH All of Us research programme (https://allofus.nih.gov/) with the expectation that large-scale GWAS will elucidate the basis of human health and disease and facilitate precision medicine.
While genomic technologies used to generate data have advanced rapidly within the last 20 years, the statistical models used in GWAS to analyse the data are still predominantly based on Fisher's method published more than 100 years ago [4,5]. Using probability density functions (PDFs) and in particular the normal distribution, Fisher's method partitions genotypic values by performing a linear regression of the phenotype on marker allelic dosage [6]. The regression coefficient estimates the average allele effect size, and the regression variance is the additive genetic variance due to the locus [7]. While Fisher's method has been improved, for example using conditional probability linked to potential prior knowledge of genetic systems (Bayes' method) [8,9], the overall determination of genotype-phenotype mapping is still grounded on PDFs. However, the use of PDFs becomes problematic in the case of complex/omnigenic traits, as they require large-scale studies or, equivalently, large sample sizes.
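The regression step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`simulate`, `regress`), the allele frequency, the effect size and the unit-variance Gaussian noise are all hypothetical choices, not values taken from the paper.

```python
import random

def simulate(n, p, a, seed=0):
    """Simulate n diploid individuals at one bi-allelic locus (toy model).

    p is the frequency of the counted allele, so the dosage is 0, 1 or 2.
    a is the assumed additive effect per allele copy, in units of the
    residual standard deviation; dominance is ignored.
    """
    rng = random.Random(seed)
    dosage = [(rng.random() < p) + (rng.random() < p) for _ in range(n)]
    phenotype = [a * g + rng.gauss(0.0, 1.0) for g in dosage]
    return dosage, phenotype

def regress(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

dosage, phenotype = simulate(n=1000, p=0.5, a=0.5)
slope, intercept = regress(dosage, phenotype)

# Variance explained by the regression: slope^2 times the dosage variance.
mean_dosage = sum(dosage) / len(dosage)
var_dosage = sum((g - mean_dosage) ** 2 for g in dosage) / len(dosage)
additive_variance = slope ** 2 * var_dosage
print(slope, additive_variance)
```

Here the slope plays the role of the average allele effect and `additive_variance` that of the additive genetic variance due to the locus; rerunning with a much smaller `a` shows how quickly the estimate is swamped by noise at this sample size.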
The results obtained by GWAS have demonstrated that complex traits are driven by a vast number of tiny-effect loci, namely a vast number of genes each with a tiny effect, and not by a handful of moderate-effect loci as initially thought. In turn, this has led to a re-conceptualisation of the genetic basis of complex traits from being polygenic (handful of loci/genes) to omnigenic (vast number of loci/genes) [10][11][12][13][14][15][16]. Although the omnigenic paradigm is central to furthering our understanding of biology, there is a practical issue concerning the extraction of information to relate genotype to phenotype in this case. Indeed, tiny-effect loci (i.e., very small gene effects) necessitate a remarkably large population to extract information. Figure 1 exemplifies the limit of GWAS with a restricted sample size of 1000 individuals. This issue regarding the need for large sample sizes was present, but dismissed, in Fisher's seminal work [4] as he assumed an 'infinite population' from the start to use the normal distribution density function in the continuum limit. This assumption allowed him to provide a method able to extract, in theory, the genetic information required to map any genotype to phenotype.
While one may assume an infinite population mathematically, in practice this comes at a huge cost. To give a 'real-life' example of the sample size needed to study complex traits, it is best to turn to the phenotype 'height' in humans. The phenotype height in humans is a classical quantitative trait that has been studied for over a century as a model for investigating the genetic basis of complex traits [4,17] and whose measured heritability is well known [4,18,19]. However, this phenotype has remained controversial [12] for a long time as current association methods have not been able to fully recover the heritability measured [21,22]. While different reasons were put forward to explain this discrepancy, including, for example, too restricted sample sizes, too stringent statistical tests or the involvement of the environment [6,23], this point seems to have been resolved only recently. Using a population containing a staggering 5.3 million individuals, a recent study claims to have captured nearly all of the common single nucleotide polymorphism (SNP)-based heritability [24].
This important study confirms that the precision of current quantitative genetic methods to determine omnigenic traits comes at an astronomical cost in line with the assumptions used, namely the need for a staggeringly large (near infinite) population. In this context one may wonder whether such a large-scale study will ever be replicated in any other species, and in particular those near extinction where small sample sizes need to be considered. Alternatively, one may try to understand where the need for large sample sizes comes from and determine whether it is possible to extract information linking genotype to phenotype in a different way.
There is a very good reason why large populations will always be required for omnigenic phenotypes when GWAS is used. As mentioned above, the reason is rooted in the fact that GWAS is mostly based on frequentist probabilities, a.k.a. PDFs. Indeed, GWAS is based on statistics and, by definition, statistics deals with the measurement of uncertainties [25]. To draw inferences from the comparison of large datasets, a method that requires some understanding of its accuracy, including ways of measuring the uncertainty in data values, is needed. In this context, statistics is the science of collecting, analysing, and interpreting data, while PDFs, defined through the notion of relative frequencies, are central to determining the validity of statistical inferences. In practice, the use of frequentist probabilities (or PDFs) and the resulting categorisation of data is justified when inaccuracy exists in experimental measurements. For example, measuring a continuous phenotype such as the height of individuals with a ruler with centimetre graduations, that is, to the nearest centimetre, warrants the use of frequentist probability (or PDFs). In this case, a frequency table of phenotypic values can be defined through 1 cm-width bins or categories, from which the PDFs of the phenotype height and of the genotypes can be deduced to address the statistical inferences. However, the precision available for the inferences will always be, at best, given by the width of the categories created and linked to the experimental precision achieved (1 cm in this case). Consequently, if instead of using a ruler with centimetre graduations one were using a ruler with millimetre graduations to increase the precision in inferences, a larger sample size would be required so as to match the new 1 mm width of categories and re-form the PDFs.
The trading between the sample size and the precision achieved by GWAS is known as 'study power' and its raison d'être is linked to the fact that the entire field of probability, and therefore the PDFs, has been conceptualised mathematically to represent the fact that information on a system is limited. It is for this reason that the normal distribution was known before as the 'error function' or 'law of errors', where the term 'error' is defined experimentally (see rulers above). Accordingly, the creation of categories implies that a sort of 'imprecision' is necessary.

Figure 1. Phenotype human height simulated using the real data from the GTEx project [28]. We recall that for diploid organisms, such as humans, and for a binary (bi-allelic, A or a) genetic marker, any microstate (genotype) can only take three values that we shall write as '+1', '0' and '−1' corresponding to genotypes aa (red), Aa (grey) and AA (blue), respectively. In the figures the letter p refers to the allele frequency and $\sigma_g$ to the genetic variance. Table 1 provides a relation between the two aforementioned variables. (A) Phenotype histograms for the three genotypes (microstates), showing the '+1' state in red, the '0' state in dark grey and the '−1' state in blue. The overall phenotype distribution is in light grey at the back. The mean genotypic (microstate) values are −a, d and +a. In genetics, a and d are defined as the gene effect and the dominance, respectively. (B) Example of limitations linked to current GWA methods when the gene effect, a, changes by one order of magnitude from a = 1 to a = 0.1 while the respective number of microstates does not. In this context, the three populations of genetic microstates collapse, and their separation becomes very difficult to characterise unless the size of the population is increased to reduce the width of categories. In this example, it is almost impossible to distinguish between a = 0.1 and a = 0.01. Note that in this case the gene effect has been normalised by the phenotypic standard deviation of the population and there is no dominance (d = 0). (C) Corresponding distributions of the different genetic microstates singled out based on the phenotype values they characterise. The method of averages, or method of moments, advocated by Fisher suggests plotting a straight line as best fit between the three distributions, the slope of which is then indicative of genotype-phenotype association. However, the method of averages discards the information that is available in the spreading of data for each genotype. The method we suggest will make use of this information to describe genotype-phenotype association. Note that dominance can occur (d ≠ 0), at which point, with the method of averages, a linear regression of the genotype means, weighted by their frequency, on the number of alleles needs to be performed to provide a new intercept and slope (see [6]).
While the notion of imprecision can be genuine (see rulers above), the act of creating categories to use PDFs when precision in experimental measurements is available is not fully justified and can be seen as an act of 'wilful ignorance'. This is so because information is lost by slotting different phenotypic values into the same category. To exemplify this point, imagine a species close to extinction (very small population, say 50 individuals) for which it is possible to measure phenotypic values with very high precision, for example, using highly advanced imaging techniques or biosensing technologies [26]. In this case, each measured individual could return a unique phenotypic value. Consequently, re-forming categories in order to use PDFs would mean embracing the relatively large width of the categories created, leading to the impression that a large imprecision is present, even though such imprecision did not exist in the first place. One may argue that PDFs, such as the normal distribution in Fisher's theory, are not required since averages and variances can be mathematically calculated directly from data without the need to recreate PDFs. However, this argument is not valid as averages and variances (and any other moments) cannot be dissociated from PDFs, since they are the ontological parameters that define PDFs, and PDFs are used to determine statistical inferences in the field of quantitative genetics. Consequently, thinking in terms of averages and variances necessitates conceptualising PDFs, i.e., categories, and holding them valid to describe any system. Therefore, there is a need to formulate new methods that use the full information generated through accurate and highly precise genotyping/phenotyping when sample sizes are small, and that do not require the categorisation of data.
In fact, this problem is equivalent to finding a way to resolve genotype-phenotype mapping by assuming a finite-size population with phenotype values measured precisely enough to rule out the possibility that two phenotype values are in the same category. Taking this challenge as the starting point, a new and relatively simple method for extracting information for genotype-phenotype mapping can be defined. While this method is remarkably simple when explained in lay terms, its theoretical framework requires the introduction of a new concept called 'phenotypic fields'. Phenotypic fields can also be defined within the context of Fisher's theory.
The remainder of this paper is organised as follows. In the first part, an intuitive approach to the method genomic informational field theory (GIFT) is presented in which one shall see that association between datasets (i.e., genotype and phenotype) can be analysed in a specific way that does not necessarily involve the use of means and variances, but phenotypic fields instead. This is followed by a second part stating and explaining the necessary ingredients from physics (entropy, energy and field) and how they must be combined to provide GIFT. Since GIFT is not too difficult to model, we have relegated the theoretical development of GIFT to appendix A. Then (third part) one will demonstrate how Fisher's seminal theory can be re-transcribed using GIFT. In particular one shall see that Fisher's seminal intuition corresponds to the simplest form of GIFT. Finally (fourth part), one will compare GIFT to GWAS using simulated genotype and phenotype to demonstrate that GIFT outperforms GWAS.

Position of the problem and heuristic presentation of GIFT as a method
The practical issue regarding genotype-phenotype mappings with current statistical methods concerns the sample size needed to provide accurate/precise information when complex/omnigenic traits are involved. As stated in the introduction, this issue stems from the creation of categories historically linked to the notions of 'imprecision' or 'error' in measurements. At the dawn of the 21st century we are getting more precise in our measurements, and one may wonder what sort of scientific/mathematical tool we should be using if one were able to attain any level of precision wanted in cases where the population size studied is limited.
We recall here that for diploid organisms, such as humans, and for a binary (bi-allelic, A or a) genetic marker, any microstate (genotype) can only take three values that we shall write as '+1', '0' and '−1' corresponding to genotypes aa (homozygote), Aa (heterozygote) and AA (homozygote), respectively.
One way to proceed to develop a method embracing precision is to start by looking at how density distribution functions are transformed when precision in phenotypic measurements increases. From figure 2(A), the conclusion is obvious: the bar charts are transformed into code bars, where each bar originates from a particular phenotype value representing one individual from the population studied. This result is expected since, when the width of categories decreases due to an increase in precision in measurements, there will be a point where there can only be one individual per category. To extract information from the code bars represented by the bottom right chart in figure 2(A), let us now wonder what it means to have information on the phenotype as opposed to having none.
To answer this question the best thing is to further simplify the problem by considering the coloured bars only and not their spacing. Imagine, therefore, that a set of individuals has been genotyped and that those individuals are picked at random. That is, there is no information on any phenotype. Imagine also that one decides to concentrate, for example, on the genome position 1 000 000 on chromosome 4 for all the individuals since this genome position happens to display a biallelic SNP across the set of individuals.
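The transition from bar charts to code bars can be illustrated numerically. The sketch below assumes a hypothetical population of 50 individuals with normally distributed heights (mean, spread and seed are arbitrary) and counts how many individuals share the most populated bin as the bin width shrinks; `max_bin_occupancy` is an illustrative helper, not a function from the paper.

```python
import random
from collections import Counter

def max_bin_occupancy(values, width):
    """Largest number of individuals sharing one bin of the given width."""
    bins = Counter(int(v // width) for v in values)
    return max(bins.values())

rng = random.Random(1)
# Hypothetical heights in cm for a small population of 50 individuals.
heights = [rng.gauss(170.0, 10.0) for _ in range(50)]

# Coarse bins hold several individuals; sufficiently fine bins hold one each,
# at which point the histogram has become a 'code bar'.
for width in (10.0, 1.0, 0.1, 1e-6):
    print(width, max_bin_occupancy(heights, width))
```

Once the occupancy reaches one individual per bin, the histogram carries no more information than the ordered list of individuals, which is the situation GIFT starts from.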
Thus, upon calling individuals randomly but sequentially, the genotypic information obtained in due course can be represented as a random string of genotypes including '+1', '0' and '−1' microstates (representing homozygote-aa, heterozygote-Aa and homozygote-AA). An example of such a random configuration is:

Note that the order in which the individuals were called is linked to the position in the string. Let us now repeat the same experiment using the same individuals in a context where accurate information on a chosen phenotype is available. That is, we call the individuals as a function of the magnitude of the phenotype considered. For example, if the phenotype is height, one starts by calling the smallest individual and all subsequent individuals through successive increments in their phenotype height. Note again that because each individual has a unique phenotype value there is no possibility for two individuals to be called at once.

Figure 2. The comparison of the two top charts in (A) and (B) demonstrates how genotypes are associated with the phenotype, as in this case any phenotype category can be decomposed using the underlying microstate categories. However, grouping data into categories is legitimate so long as the width of the category is justified. The width of categories is justified provided that imprecision exists in phenotype measurements. However, a method based on the notion of imprecision has limited value when precision is available, and new methods are required in this case. Indeed, by increasing the precision in phenotype measurements it is possible to envision, in the near future, dealing with genotype and phenotype in the form of 'code bars' ((A) and (B), bottom charts) as opposed to PDFs. The question is then, how can information be extracted from those 'code bars'? (C) To analyse the 'code bar' in the bottom charts of (A) and (B), we rewrite it as a string of microstates. The first thing to note is that the way the microstates '+1', '0' and '−1' appear in the string (a.k.a. configuration/ordering) is linked to the fact that: (i) the genome position considered is associated with the phenotype and, (ii) phenotypic values are ranked as a function of their magnitude. Accordingly, if the genome position considered is not associated with the phenotype, the configuration would appear random (top string) as opposed to being ordered if a genotype-phenotype association exists (bottom string). (D) Thus, to determine whether a genome position is associated with the phenotype, the best way is to plot the cumulative sum of microstates and to compare it to the plot of its 'scrambled'/random configuration. When the string is ordered, a curve will emerge as shown by $\theta(i)$. On the contrary, when the string is 'scrambled' a straight line is expected as shown by $\theta_0(i)$. The straight line resulting from scrambling of the string is expected as the presence probabilities of pulling '+1', '0' and '−1' from the scrambled string are constants in this case. The curve (ordered string) can be modelled by considering that the information about the phenotype acts like a 'field' to change the configuration of microstates from being disordered to being ordered. The two curves intersect at the last position in the string because the numbers of '+1', '0' and '−1' in either string (ordered or scrambled) are conserved. The difference between the ordered and disordered strings provides a precise measure of the genetic influence on the phenotype. In figure 4(A) we provide a plot of such a difference in the phenotypic space.
If the genome position 1 000 000 on chromosome 4 is involved in the formation of the phenotype, then one would expect a change in the configuration of the string of microstates based on the fact that homozygotes would be found at the extremities of the string and heterozygotes towards the middle (see figure 2). An example of such a string would be:

Thus, the only thing that changes between the random and the phenotype-ordered configurations is the way the genetic microstates are allocated to positions in the string. However, as the genome position 1 000 000 on chromosome 4 is the only one that has been considered, the two configurations contain the same number of '+1', '0' and '−1', since the same individuals were considered between the two configurations.
The ansatz is then to consider the cumulative sum of microstates as a function of the position in the string. Indeed, it is clear from the examples given above that if one starts by adding the microstates together, differences will be seen in the resulting cumulative sums. To give an example, let us consider the two strings above and denote by '$\theta_0(i)$' and '$\theta(i)$' the cumulative sums of microstates in the random and ordered configurations, respectively, where 'i' is the position in the string. Then adding the microstates starting from the left side of the strings one finds:

As a result, the difference '$\theta(i) - \theta_0(i)$' is expected to be indicative of the importance of the phenotypic information and how gene microstates are related to the phenotype. The fact that the same individuals were considered in both configurations also imposes a conservation relation in the form: $\theta(N) - \theta_0(N) = 0$. One shall call the cumulative sums 'genetic paths', whose mathematical definition will be made precise below. To conclude, it is the information on the phenotypic values that provides a change in the configuration of microstates and one can start developing the formulations of $\theta_0(i)$ and $\theta(i)$.
Let $N_{+1}$, $N_0$ and $N_{-1}$ denote the numbers of genetic microstates '+1', '0' and '−1', respectively, with $N = N_{+1} + N_0 + N_{-1}$. The genetic microstate frequencies for genome position 1 000 000 on chromosome 4 are defined by $\omega^0_+ = N_{+1}/N$, $\omega^0_0 = N_0/N$ and $\omega^0_- = N_{-1}/N$. When the positioning of the genetic microstates in the string is performed in a random fashion, the probabilities of finding '+1', '0' or '−1' as genetic microstate at any position are $\omega^0_+$, $\omega^0_0$ and $\omega^0_-$, respectively. The resulting cumulative sum is then $\theta_0(i) = (\omega^0_+ - \omega^0_-)\,i$, and is therefore a straight line. We shall call $\theta_0(i)$ the 'default genetic path' (figure 2(C)).
As a result, the signature of a gene interacting with the phenotype when considering the two aforementioned genetic paths is the difference $\theta(i) - \theta_0(i)$. One can then be a little bit more prescriptive by introducing the notion of phenotypic fields.
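The heuristic procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names (`genetic_paths`, `simulate_locus`), the toy phenotype model (microstate times an effect plus unit Gaussian noise) and all parameter values are assumptions made for the example.

```python
import random
from itertools import accumulate

def genetic_paths(microstates, phenotypes):
    """Return (theta, theta0, delta) for one genome position.

    theta  : cumulative sum of microstates after ranking by phenotype.
    theta0 : default path, the straight line expected when the same
             microstates occupy random positions in the string.
    delta  : theta - theta0 at each position i = 1..N.
    """
    n = len(microstates)
    order = sorted(range(n), key=lambda k: phenotypes[k])
    theta = list(accumulate(microstates[k] for k in order))
    # Expected microstate at any scrambled position: (N_plus - N_minus) / N.
    mean_state = sum(microstates) / n
    theta0 = [mean_state * i for i in range(1, n + 1)]
    delta = [t - t0 for t, t0 in zip(theta, theta0)]
    return theta, theta0, delta

def simulate_locus(n, p, a, seed=0):
    """Toy data: two allele draws per individual give a microstate of
    +1, 0 or -1; the phenotype is a * microstate plus unit Gaussian noise."""
    rng = random.Random(seed)
    states = [(rng.random() < p) + (rng.random() < p) - 1 for _ in range(n)]
    phenos = [a * s + rng.gauss(0.0, 1.0) for s in states]
    return states, phenos

states, phenos = simulate_locus(n=1000, p=0.5, a=1.0)
theta, theta0, delta = genetic_paths(states, phenos)

# Null comparison: same microstates, phenotype order destroyed by shuffling.
shuffled = phenos[:]
random.Random(1).shuffle(shuffled)
_, _, delta_null = genetic_paths(states, shuffled)

print(max(abs(d) for d in delta), max(abs(d) for d in delta_null))
```

By construction the last entry of `delta` vanishes, which is the conservation relation stated in the text, and the largest excursion of `delta` is the quantity that separates an associated position from a scrambled one in this toy setting.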

A physics-inspired model for GIFT: notion of 'phenotypic fields' and resulting difference between the phenotype-responding and default genetic paths
The difference, $\theta(i) - \theta_0(i)$, can be described using field theory. Indeed, as the only difference between the two configurations is the information linked to the phenotypic values, the phenotypic information can be thought of as an external field impacting the configuration of microstates. To provide a physics-inspired definition of genotype-phenotype mapping, let us reconsider the random string above and assume that the set of individuals in the string are particles and that the different genetic microstates '+1', '0' and '−1' are their physical properties. One can then assume that it is those properties that interact with the field. Note that contrary to physics where a single field is defined, one needs in our case to define one field per microstate. One shall denote by $u_+(\Omega)$, $u_0(\Omega)$ and $u_-(\Omega)$ the phenotypic fields acting on the microstates '+1', '0' and '−1', respectively. Note that the variable $\Omega$ represents the phenotypic values measured precisely. By assuming further that the particles cannot interact together and that, when they are not forced into a specific configuration by the fields, they can hop and exchange positions (similar to a diffusion/thermal process), one can then model the string of microstates as a closed system. Figure 2(C) provides an idea of how the 'phenotypic fields', when non-null, affect the configuration of microstates by segregating them. With those assumptions and using basic principles from statistical physics it is then possible to model the presence probability of microstates at any position in the string.
Thus, after re-expressing the genetic paths in the space of phenotypic values, since the fields are functions of the phenotypic values (appendix A.1), one can then construct functionals representing the entropy (appendix A.2) and the total interaction energy between the microstates and the subfields (appendix A.3). Finally, one can optimise a functional similar to the free energy to express how the fields are related to the asymmetry of states (appendix A.4). Consequently, one can demonstrate the familiar result concerning the presence probability of microstates expressed in the phenotypic space as,

where the hat '^' is added to insist on the fact that the presence probabilities of microstates are expressed in the space of phenotypic values (and not positions) and $\delta u_\pm$ denote the fields acting on the microstates '+1' and '−1' relative to the reference field $u_0$. Equations (1)-(3) are familiar to physicists when dealing with Boltzmann weights in statistical physics. Note that the default genetic path is defined when the fields are null. Denoting by $\Delta\theta(\Omega) \overset{\text{def}}{=} \hat{\theta}(\Omega) - \hat{\theta}_0(\Omega)$ the difference between the phenotype-responding and default genetic paths expressed in the phenotypic space, $\Delta\theta(\Omega)$ is therefore a function of the difference between equations (1) and (3). One can then make the symmetries of the problem more apparent by defining for the genetic microstates,

In this case, using hyperbolic functions one deduces (see appendix A.4 for development),

where $\tanh(\Delta u(\Omega_0)) \overset{\text{def}}{=} \Delta\omega^0/\omega^0$. The new variable '$\Omega_0$' is the phenotype value corresponding to the condition $\hat{\omega}_+(\Omega_0) \sim \hat{\omega}_-(\Omega_0)$ and the meaning of the constant '$\alpha_0$' can be related to the Hardy-Weinberg law from population genetics. The Hardy-Weinberg law, based on random mating in a population, provides a relationship between the genetic microstate frequencies in the form $p^2 + 2pq + q^2 = 1$, where $p^2$ and $q^2$ are the genotype frequencies of genetic microstates '+1' and '−1', i.e. homozygote genotypes aa and AA, respectively, and $2pq$ is the genotype frequency for genetic microstate '0', i.e. the heterozygote genotype Aa.
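The relation $\tanh(\Delta u(\Omega_0)) = \Delta\omega^0/\omega^0$ can be checked numerically under an explicit assumption. The sketch below assumes that the presence probabilities take the usual Boltzmann form, with weights $\omega^0 e^{-\delta u}$ and the '0' microstate as reference, and that $\Delta u = (\delta u_+ - \delta u_-)/2$, $\Delta\omega^0 = \omega^0_+ - \omega^0_-$ and $\omega^0 = \omega^0_+ + \omega^0_-$. These forms are consistent with the field definitions given later in Fisher's context, but they are assumptions of this example rather than a transcription of equations (1)-(3); `presence_probabilities` is a hypothetical helper.

```python
import math

def presence_probabilities(w_plus, w_zero, w_minus, du_plus, du_minus):
    """Presence probabilities of '+1', '0', '-1' under ASSUMED Boltzmann
    weights w * exp(-du), with '0' as the reference state (du = 0)."""
    b_plus = w_plus * math.exp(-du_plus)
    b_minus = w_minus * math.exp(-du_minus)
    z = b_plus + w_zero + b_minus  # normalisation
    return b_plus / z, w_zero / z, b_minus / z

# Hardy-Weinberg frequencies for an illustrative allele frequency p = 0.6.
p = 0.6
q = 1.0 - p
w_plus, w_zero, w_minus = p * p, 2 * p * q, q * q

# Null fields give back the default frequencies.
null = presence_probabilities(w_plus, w_zero, w_minus, 0.0, 0.0)

# Choose fields such that '+1' and '-1' are equally probable (the Omega_0
# condition); du_minus is arbitrary, du_plus follows from the balance.
du_minus = 0.3
du_plus = du_minus + math.log(w_plus / w_minus)
balanced = presence_probabilities(w_plus, w_zero, w_minus, du_plus, du_minus)

delta_u = 0.5 * (du_plus - du_minus)
lhs = math.tanh(delta_u)
rhs = (w_plus - w_minus) / (w_plus + w_minus)
print(balanced, lhs, rhs)
```

Under these assumptions the two sides agree for any choice of `du_minus`, which shows that the balance condition fixes only the antisymmetric field $\Delta u$ and not the symmetric one.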
In our case, this corresponds to replacing $p^2$, $q^2$ and $2pq$ with, respectively, $\omega^0_+$, $\omega^0_-$ and $\omega^0_0$. Consequently, the Hardy-Weinberg law imposes $\alpha_0 = 1$, with $\alpha_0 \neq 1$ corresponding to a deviation from the law. However, this term is expected to remain stable upon any changes of allele or genotype frequencies, suggesting therefore that, genetically, any changes in '$\Delta\omega^0$' are to some extent compensated by corresponding changes in '$\omega^0$'. Finally, using equation (4) one deduces the difference between the phenotype-responding and default genetic paths expressed in the phenotypic space, $\Delta\theta(\Omega)$, as (appendix A.5):

where $\Omega_{1/N}$ is the smallest phenotypic value measured and $\Delta(x)$ is the spacing between individuals in the code bar of figure 2(A), which can be related to the PDF of the phenotype when the population measured is dense (see appendix A.1 and SM1 in the supplementary materials). Finally, the conservation of genetic microstates needs to be added, that is, $\Delta\theta(\Omega_1) = 0$, expressed as,

where $\Omega_1$ is the largest phenotypic value measured. Therefore, as $\alpha_0$ is constant since a single genome position is considered, the genetic paths difference can be re-expressed integrally using two independent reduced phenotypic fields, i.e., $\Delta u(\Omega)$ and $u(\Omega)$, and equation (6) provides a coupling between those fields and $\Delta(\Omega)$. The advantage of using fields is the reduction of unknown parameters involved in the problem and the possibility of laying out genotype-phenotype associations based on the fields' symmetry. For example, and as a minimalist model, one may wonder what sort of expression the fields would take if the reference field $u_0(\Omega)$ were null and $\delta u_+$ and $\delta u_-$ were acting anti-symmetrically and linearly on the microstates '+1' and '−1'. This minimalist model can be developed (see SM2 in the supplementary materials) and is similar to Fisher's seminal intuition concerning genotype-phenotype associations (see below).
Our aim is now to demonstrate that the idea of genetic paths mediated by phenotypic fields already exists in Fisher's theory. This can be shown by coarse graining the paths.

Coarse graining GIFT
To derive a coarse-grained version of GIFT, that is, a genetic path difference for GWAS, we assume the existence of categories or bins and concentrate on the interval of phenotype values ranging between Ω and Ω + δΩ defining one particular category or bin.
Based on frequentist probability, and denoting by N the total number of individuals in the population, we can define $\delta N \sim N \cdot P_\Omega(\Omega)\,\delta\Omega$ as the number of individuals in the phenotype category concerned, namely with a phenotype value ranging between Ω and Ω + δΩ.
The design of categories generates two conservation relationships. The first one concerns the total number of individuals and microstates, namely, that for a given genome position, the sum of all possible microstates is also the sum of all individuals. This relationship is written as follows: $N = N_{+1} + N_0 + N_{-1}$. The second conservation relationship is linked to the category considered. The number of individuals in the category concerned is also the sum of the microstates in this category: $\delta N = \delta N_+ + \delta N_0 + \delta N_-$. Consequently, the conservation relation concerning the number of individuals and microstates in the concerned category can be rewritten as,

Using the PDFs of both the microstates and the phenotype defined above, the following is deduced:

$$P_\Omega(\Omega) = \omega^0_+ P_+(\Omega) + \omega^0_0 P_0(\Omega) + \omega^0_- P_-(\Omega) \qquad (8)$$

From equation (8) all the moments of microstate distributions are related to those of the phenotype distribution. Let us denote by $\langle\Omega\rangle$ and $a_+$, $a_0$, $a_-$ the average values of the phenotype and microstates '+1', '0', '−1' distribution functions, respectively; and by $\sigma^2$ and $\sigma^2_+$, $\sigma^2_0$, $\sigma^2_-$ the variances of the phenotype and microstates '+1', '0', '−1' distribution functions, respectively. From equation (8), one deduces the conservation relations for the first two moments in the form:

$$\langle\Omega\rangle = \omega^0_+ a_+ + \omega^0_0 a_0 + \omega^0_- a_- \qquad (9a)$$

$$\sigma^2 = \omega^0_+\sigma^2_+ + \omega^0_0\sigma^2_0 + \omega^0_-\sigma^2_- + \omega^0_+(a_+ - \langle\Omega\rangle)^2 + \omega^0_0(a_0 - \langle\Omega\rangle)^2 + \omega^0_-(a_- - \langle\Omega\rangle)^2 \qquad (9b)$$

The relations provided by equation (9) are valid by definition, namely, whatever PDFs are involved. While Fisher never formulated a conservation similar to the second from equation (9b), in his seminal paper [4] he used the notation $\alpha^2$ to denote the genetic variance in the form:

$$\alpha^2 = \omega^0_+(a_+ - \langle\Omega\rangle)^2 + \omega^0_0(a_0 - \langle\Omega\rangle)^2 + \omega^0_-(a_- - \langle\Omega\rangle)^2$$

Let us now define the coarse-grained version of equation (4) by denoting by $\delta\omega_+(\Omega) - \delta\omega_-(\Omega)$ the difference in the presence probability of microstates '+1' and '−1' for the category of interest; it is then deduced:

Note that equation (10) (8), we deduce using the continuum limit,

Direct mapping of fields can then be performed between GWAS and GIFT (SM3 in the supplementary materials).
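The moment relations that follow from writing the phenotype distribution as a frequency-weighted mixture of the three microstate distributions can be verified numerically. The sketch below uses hypothetical group sizes, means and spreads (all arbitrary), and checks the standard decomposition of the pooled variance into a within-microstate part and a between-microstate part, the latter playing the role of the genetic variance $\alpha^2$.

```python
import random
from statistics import fmean, pvariance

rng = random.Random(2)
# Hypothetical phenotype values grouped by microstate; sizes, means and
# spreads are arbitrary illustrative choices.
groups = {
    +1: [rng.gauss(1.0, 1.0) for _ in range(250)],
    0: [rng.gauss(0.2, 1.5) for _ in range(500)],
    -1: [rng.gauss(-1.0, 0.8) for _ in range(250)],
}
pooled = [x for values in groups.values() for x in values]
n = len(pooled)

weights = {k: len(v) / n for k, v in groups.items()}      # omega^0
means = {k: fmean(v) for k, v in groups.items()}          # a_+, a_0, a_-
variances = {k: pvariance(v) for k, v in groups.items()}  # sigma^2 per state

# First moment: pooled mean as the weighted sum of microstate means.
mean_from_groups = sum(weights[k] * means[k] for k in groups)

# Second moment: within-microstate variance plus between-microstate variance.
within = sum(weights[k] * variances[k] for k in groups)
between = sum(weights[k] * (means[k] - mean_from_groups) ** 2 for k in groups)

print(fmean(pooled), mean_from_groups)
print(pvariance(pooled), within + between)
```

Both identities hold for any data set, whatever the shape of the distributions, which mirrors the remark that the conservation relations are valid by definition.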
As a result, we can define the coarse-grained versions of the difference in the genetic paths using the continuum limit as: Equation (11) demonstrates that Δθ(Ω) is sensitive to the PDFs involved as a whole and not just to their average values. In other words, the variances of the microstates, as well as their average values, affect genotype-phenotype associations. Note that in equation (11) the integration interval is unchanged. However, the convergence in probability of distributions allows some freedom, for example changing the The fields can then be defined in Fisher's context by setting $-\delta u_+(\Omega) \overset{\text{def}}{=} \ln[P_+(\Omega)/P_0(\Omega)]$ and $-\delta u_-(\Omega) \overset{\text{def}}{=} \ln[P_-(\Omega)/P_0(\Omega)]$; from those relations it is deduced that $u(\Omega) = \ln\left[P_0(\Omega)/\sqrt{P_+(\Omega)\,P_-(\Omega)}\right]$ and $\Delta u(\Omega) = \tfrac{1}{2}\ln[P_-(\Omega)/P_+(\Omega)]$.
Finally using equation (5)'s notations one deduces: The significance of these fields can now be addressed. The field u(Ω) describes local deviations from the Hardy-Weinberg law, valid for each bin or category of phenotype values. For example, if a population was under no selection and random mating occurred, then the whole population would follow the Hardy-Weinberg equilibrium law, that is, $\omega_0^0 = 2\sqrt{\omega_+^0\,\omega_-^0}$. However, selecting a particular bin or category of phenotype values would demonstrate a local deviation from this law, given by the term $e^{u(\Omega)} = P_0(\Omega)/\sqrt{P_+(\Omega)\,P_-(\Omega)}$.
The significance of Δu(Ω) can be addressed using Fisher's approach.

Definition of fields using Fisher's theory
In his seminal paper [5], Fisher hypothesised that, in a context where the population is infinite so that the normal distribution can be used, the genetic variance $\alpha^2$ is much smaller than the phenotype variance, and that the variances of the microstate distribution density functions for each gene are similar to the variance of the phenotype. While his hypothesis can be understood intuitively when all distribution density functions nearly overlap, it can also be demonstrated using equation (9b). Indeed, assuming $\alpha^2 \ll \sigma^2$ implies $\sigma^2 - \alpha^2 \sim \sigma^2$ and therefore $\sigma^2 \sim \omega_+^0\sigma_+^2 + \omega_0^0\sigma_0^2 + \omega_-^0\sigma_-^2$. Assuming a single variance for all microstate and phenotype distributions, as Fisher did, is one valid solution. However, the relation $\sigma^2 \sim \omega_+^0\sigma_+^2 + \omega_0^0\sigma_0^2 + \omega_-^0\sigma_-^2$ admits other solutions. In equations (13a) and (13b) the term $\delta a_- - \delta a_+ = a_- - a_+ = 2a$ (see figure 1(A)) is known as the 'gene effect' in GWAS. In equation (13c), the term $2\delta a_0 - \delta a_+ - \delta a_-$ (see figure 1(A)) is the dominance as defined in GWAS. In his seminal paper, Fisher considered d ∼ 0.
Altogether, these results demonstrate that Fisher's theory can be described by phenotypic fields and genetic paths. As it turns out, Fisher's model corresponds to the aforementioned minimalist model (SM2 in the supplementary materials). Using these fields, it is also possible to determine a generic solution to equation (8) (see SM4 in the supplementary materials).
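
As a numerical illustration of the fields in Fisher's setting, the following Python sketch evaluates u(Ω) = ln[P_0/√(P_+ P_−)] and Δu(Ω) = ½ ln[P_−/P_+] for three normal microstate PDFs sharing one variance. The function names and the numerical values (mean, standard deviation, a/σ = 0.1) are hypothetical and chosen for illustration only. Under these assumptions and with zero dominance, Δu(Ω) should be linear in Ω with slope a/σ², and u(Ω) should reduce to the constant a²/(2σ²).

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fisher_fields(omega, a_plus, a_zero, a_minus, sd):
    """Return (u, delta_u) at phenotype value omega for equal-variance normal PDFs."""
    p_plus = normal_pdf(omega, a_plus, sd)
    p_zero = normal_pdf(omega, a_zero, sd)
    p_minus = normal_pdf(omega, a_minus, sd)
    u = math.log(p_zero / math.sqrt(p_plus * p_minus))
    delta_u = 0.5 * math.log(p_minus / p_plus)
    return u, delta_u


# Hypothetical values: phenotype mean, common standard deviation and gene effect.
MEAN, SD = 68.0, 4.0
A = 0.1 * SD  # gene effect a, with a = (a_minus - a_plus) / 2 and dominance d = 0

for omega in (60.0, 64.0, 68.0, 72.0, 76.0):
    u, delta_u = fisher_fields(omega, MEAN - A, MEAN, MEAN + A, SD)
    print(f"omega={omega:5.1f}  u={u:.6f}  delta_u={delta_u:+.6f}")
```

The printed Δu values change sign at the phenotype mean, which is the linear (first-order) field behaviour described above.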

Implication for small gene effects
Complex traits involve genes with very small effects that are difficult to characterise. The aim is to determine the resulting difference in the genetic paths in this case, that is, when the gene effect, $a = (a_- - a_+)/2$ (see definition above), tends towards zero: $a \to 0$. Because PDFs are used, the integration interval can be altered using the convergence property of the distributions. In this context, the conservation of genetic microstates (equation (6)) can be written using Fisher's fields as, Consider, as Fisher did in his seminal paper [4], a normally distributed phenotype, $P_\Omega(\Omega) = \exp[-(\Omega - \langle\Omega\rangle)^2/2\sigma^2]/\sqrt{2\pi\sigma^2}$, and rescale the phenotype values in the integral using a/σ as a scaling parameter as follows: By taking the limit $a \to 0$, the rescaled phenotype distribution becomes a Dirac distribution, dominating any convergences; thus, the left-hand side can be transformed as: Therefore, small gene effects imply $\delta\Omega_0/\sigma \ll 1$. Recalling that $\Delta\omega^0/\omega^0 = \tanh[(a/\sigma)(\delta\Omega_0/\sigma)]$, one also deduces $\lim_{a\to 0}\tanh[(a/\sigma)(\delta\Omega_0/\sigma)] \sim 0$, that is, small gene effects always involve common allele frequencies, namely $\Delta\omega^0/\omega^0 \to 0$. Using this result, the genetic paths difference can then be developed when gene effects are small; assuming that $P_\Omega(\Omega_{1/N}) \ll 1$ and that $P_\Omega(\Omega)$ is normally distributed, one obtains at the leading order: Equation (17) shows that in the context of Fisher's theory, a small gene effect corresponds to an overlapping symmetry between the genetic microstates and the phenotype distribution, with an amplitude proportional to the gene effect.

Fields linked to the variance of microstates
The involvement of variances in microstate distribution functions in genotype-phenotype associations is a highly debated matter (see [20] and references within). As mentioned above, the expression of the difference of the genetic paths considers the distribution density function as a whole, including the role of the microstate variances. In this context, we saw that equation (9b) provides a relation between variances in the form of an ellipse. Assuming a single variance for all microstate and phenotype distributions, as Fisher did, is plausible; but other solutions exist that would not violate equation (9b). In this context, let us imagine that the gene effect and dominance are null but that the distribution density functions of microstates '+1', '0', '−1' and of the phenotype have distinct variances, written $\sigma_+^2$, $\sigma_0^2$, $\sigma_-^2$ and $\sigma^2$, respectively. By noting $\lambda_+ = \sigma/\sigma_+$, $\lambda_0 = \sigma/\sigma_0$ and $\lambda_- = \sigma/\sigma_-$, the fields can be mapped under the form: Consequently, pseudo-gene effect and pseudo-dominance linked to the variances of genetic microstates can be defined, respectively, as: To conclude, as this new method does not concentrate only on average values, it captures more information as far as genotype-phenotype associations are concerned.
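
To make the role of the variances concrete, the sketch below evaluates Δu(Ω) = ½ ln[P_−(Ω)/P_+(Ω)] for normal microstate PDFs that share one mean (null gene effect and dominance) but have distinct standard deviations. All numerical values and function names are hypothetical. Expanding the two log-densities under these assumptions gives a field that is quadratic in Ω − ⟨Ω⟩, with curvature (λ_+² − λ_−²)/(4σ²); this is only meant to illustrate the second-order behaviour and is not a reproduction of the pseudo-gene effect formula.

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log-density of a normal distribution (avoids underflow in the tails)."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))


def delta_u(omega, mean, sd_plus, sd_minus):
    """0.5 * ln(P_minus / P_plus) for microstate PDFs sharing one mean."""
    return 0.5 * (log_normal_pdf(omega, mean, sd_minus)
                  - log_normal_pdf(omega, mean, sd_plus))


# Hypothetical values chosen for illustration only.
MEAN, SIGMA = 68.0, 4.0
SD_PLUS, SD_MINUS = 3.6, 4.4
LAMBDA_PLUS, LAMBDA_MINUS = SIGMA / SD_PLUS, SIGMA / SD_MINUS

# Curvature obtained by expanding the two normal log-densities.
CURVATURE = (LAMBDA_PLUS ** 2 - LAMBDA_MINUS ** 2) / (4.0 * SIGMA ** 2)

for offset in (0.0, 2.0, 4.0, 8.0):
    left = delta_u(MEAN - offset, MEAN, SD_PLUS, SD_MINUS)
    right = delta_u(MEAN + offset, MEAN, SD_PLUS, SD_MINUS)
    print(f"offset={offset:4.1f}  left={left:+.5f}  right={right:+.5f}")
```

The field is symmetric about the mean, so a test based only on a difference in averages would not detect it.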

Illustration of the application of GIFT using simulated data
We now illustrate how GIFT can be applied using simulated data and qualitatively assess its sensitivity in extracting information.

Data simulations
The codes used are provided in SM5 (see the supplementary materials). Data were simulated according to the quantitative genetic models defined by Falconer and Mackay (1996) [27]. A single bi-allelic quantitative trait locus associated with a continuous phenotype was modelled, with an additive allele effect, a, and allele frequencies, p and q, where p + q = 1. The simulation parameters were: the number of individuals sampled, N = 1000; the number of simulation replicates, n = 1000; the allele frequency, p; the additive allele effect, a; and the dominance, d. Note that the number of simulation replicates allows one to determine the best outcomes. While the theory provided in this paper is general, the simulation of data is restricted to individuals' genotypes allocated according to Hardy-Weinberg proportions. For N individuals, $Np^2$ had genotype AA (corresponding to microstate −1), $2pqN$ had genotype Aa (microstate 0), and $Nq^2$ had genotype aa (microstate +1). The allele effect, a, is defined as half the difference between the +1 and −1 genotype (microstate) means, and d is the position of the 0 genotype (microstate) mean (figure 1). Dominance is measured as the deviation of the mean of microstate 0 from the midpoint between the means of the +1 and −1 microstates. For the purposes of the simulation, the dominance, d, was 0, that is, the mean of microstate 0 was midway between the means of microstates +1 and −1.
The additive genetic variance due to the quantitative trait locus ($\sigma^2_{QTL}$) was defined as [27]: $\sigma^2_{QTL} = 2pq\,[a + d(q - p)]^2$. Each individual was assigned a genotypic value, depending on their microstate: −a for the +1 microstate, 0 for the 0 microstate, and +a for the −1 microstate. Individual phenotypes were generated by adding a random environmental effect to the genotypic value of each individual. The added environmental effect was a random variate drawn from a normal distribution with a mean of 0 and variance of $1 - \sigma^2_{QTL}$. The phenotype was then rescaled to a value representing a realistic dataset: phenotype = (simulated phenotype × standard deviation of real data) + mean of real data. In this case, the real dataset modelled was a summary of the genotype-tissue expression (GTEx) project [28]. In particular, the phenotype was height, with a mean of 68.17206 inches and a standard deviation of 4.03007 inches. For each simulated replicate of N individuals, the difference between the cumulative sum of microstates ordered by phenotype value and the cumulative sum of microstates in a randomised order with respect to phenotype was determined to obtain the genetic path difference. The maximum value of this difference was identified, and its position and phenotypic value in the ordered string of microstates were recorded. Where the maximum value extended over several positions, the mean position and phenotypic value were recorded. Finally, to simplify representation, the amplitudes of the genetic path differences were normalised by population size (N = 1000 in this case).
Note that the standard deviation(s) arising from genotype-phenotype simulations were not considered in the analysis that follows. Instead, we report a theoretical analysis of the convergence of the genetic path difference method, and its self-consistency, as well as its sensitivity to detect genotype-phenotype associations using simulations, in SM6 and SM7, respectively, see supplementary materials.
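
The simulation steps described above can be sketched in Python as follows. This is a minimal, standard-library illustration written from the description in the text, not the SM5 code: the function names, the seeds, the rounding of genotype counts and the use of d as the heterozygote genotypic value when d ≠ 0 are assumptions. Before rescaling, the phenotype has unit variance, so the parameter `a` plays the role of a/σ.

```python
import math
import random
from itertools import accumulate

PHENO_MEAN, PHENO_SD = 68.17206, 4.03007  # height in inches, as quoted in the text


def simulate_population(n, p, a, d=0.0, seed=0):
    """Simulate one replicate; return (microstates, phenotypes)."""
    rng = random.Random(seed)
    q = 1.0 - p
    n_minus = round(n * p * p)       # genotype AA -> microstate -1
    n_plus = round(n * q * q)        # genotype aa -> microstate +1
    n_zero = n - n_minus - n_plus    # genotype Aa -> microstate 0 (absorbs rounding)
    microstates = [-1] * n_minus + [0] * n_zero + [1] * n_plus
    var_qtl = 2.0 * p * q * (a + d * (q - p)) ** 2
    if not 0.0 <= var_qtl < 1.0:
        raise ValueError("QTL variance must lie in [0, 1) on the unit-variance scale")
    env_sd = math.sqrt(1.0 - var_qtl)
    # Genotypic values: -a for +1, +a for -1; d for 0 is an assumption (text uses d = 0).
    genotypic_value = {1: -a, 0: d, -1: a}
    phenotypes = [
        (genotypic_value[m] + rng.gauss(0.0, env_sd)) * PHENO_SD + PHENO_MEAN
        for m in microstates
    ]
    return microstates, phenotypes


def genetic_path_difference(microstates, phenotypes, seed=1):
    """Return (sorted phenotypes, path difference normalised by population size)."""
    n = len(microstates)
    order = sorted(range(n), key=phenotypes.__getitem__)
    ranked = [microstates[i] for i in order]
    shuffled = list(microstates)
    random.Random(seed).shuffle(shuffled)   # randomised order with respect to phenotype
    diff = [(t - t0) / n for t, t0 in zip(accumulate(ranked), accumulate(shuffled))]
    return [phenotypes[i] for i in order], diff


def peak(sorted_phenotypes, diff):
    """Maximum of the path difference with its mean rank and mean phenotype value."""
    top = max(diff)
    hits = [i for i, v in enumerate(diff) if v == top]
    position = sum(hits) / len(hits)
    phenotype = sum(sorted_phenotypes[i] for i in hits) / len(hits)
    return top, position, phenotype


micro, pheno = simulate_population(n=1000, p=0.5, a=0.1, seed=0)
omega_sorted, dtheta = genetic_path_difference(micro, pheno, seed=1)
top, position, omega_at_top = peak(omega_sorted, dtheta)
print(f"max path difference {top:.4f} at rank {position:.0f}, phenotype {omega_at_top:.2f} in")
```

A single replicate is noisy for small gene effects; repeating the call over many seeds, as done with the n = 1000 replicates in the text, is what allows the position of the maximum to be summarised.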

Analysis of simulated results
For information, table 1 shows how genetic variance, gene effect (i.e., a/σ) and allele frequency are numerically related using the GWAS method. Similarly, figures 3(A) and (B) represent, in the context of GWAS and for the allele frequencies p = 0.5 ($\Delta\omega^0 = 0$) and p = 0.8 ($\Delta\omega^0 = 0.6$) that will be used below as examples, the relationship between the power of the study, the gene effect and the sample size as described in [29]. Briefly, the power of a study is related to the concepts of type I and type II errors. A type I error is rejecting a true null hypothesis in favour of a false alternative hypothesis, and its probability is denoted α; a type II error is failing to reject a false null hypothesis when the alternative hypothesis is true, and its probability is denoted β. The power of a study is then the probability of avoiding a type II error, namely 1 − β, where 0 < β < 1. If the power is close to 1, i.e., β ∼ 0, the hypothesis test is very good at detecting a false null hypothesis. β is commonly set at 0.2, to provide a power of ∼0.8 (or 80%). Powers lower than 0.8, while not impossible, would typically be considered too low for GWAS. The four primary factors affecting power are the sample size, the significance level (or α), the variance/variability in the measured response and the magnitude of the effect of the variable. Only the first variable can be altered in a study since all the others are fixed by the genes. To conclude, power is increased when the sample size or the effect size (gene effect) is increased. Accordingly, figure 3 demonstrates that 1000 individuals would not allow 80% power to be achieved unless the gene effect is sufficiently large, that is, for a/σ ≳ 0.5.
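
The dependence of power on sample size, gene effect and allele frequency can be sketched with a normal approximation. The code below is not the calculation of [29] and will not reproduce figure 3 exactly: it assumes a two-sided, one-degree-of-freedom additive test with d = 0, a unit-variance phenotype, and a default significance level of 5 × 10⁻⁸ (a conventional genome-wide threshold, taken here as an assumption). The function name `gwas_power` is hypothetical.

```python
import math
from statistics import NormalDist


def gwas_power(n, effect, p, alpha=5e-8):
    """Approximate power of a two-sided 1-df additive association test.

    effect is a/sigma, p is the allele frequency, and d = 0 is assumed.
    """
    q = 1.0 - p
    h2 = 2.0 * p * q * effect ** 2          # share of phenotypic variance due to the locus
    if not 0.0 <= h2 < 1.0:
        raise ValueError("effect too large for a unit-variance phenotype")
    shift = math.sqrt(n * h2 / (1.0 - h2))  # expected z-score of the test statistic
    z = NormalDist()
    critical = z.inv_cdf(1.0 - alpha / 2.0)
    # Probability that the z-score falls beyond either critical value.
    return z.cdf(shift - critical) + z.cdf(-shift - critical)


for effect in (0.05, 0.1, 0.2, 0.5):
    power = gwas_power(1000, effect, 0.5)
    print(f"a/sigma={effect:4.2f}  power(N=1000, p=0.5)={power:.3g}")
```

Under these assumptions the power is 1 − β as defined above: it grows with N and with a/σ, and it is lower for unbalanced allele frequencies because 2pq is smaller.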
Using simulated data, we can now represent the genetic paths difference and its log transformation for the allele frequency p = 0.5 ($\Delta\omega^0 = 0$). As shown in figure 4(C), the profile of the phenotype distribution density function is recovered with an amplitude that decreases as a/σ decreases. The red vertical dashed line in figure 4(C) represents the mean phenotypic value. Using the natural logarithm to transform Δθ(Ω) (figure 4(C)) to ln Δθ(Ω) (figure 4(D)) demonstrates that a difference between genetic paths can be seen for small gene effects.
One may then assess how perceptible the associations are using the new method by comparing figures 1(B) and (C) (method of averages based on Fisher's theory) with figure 4(D) for identical allele frequencies and similar gene effects. Recall that GWAS rely on determining differences in averages (see figure 1(B) or figure 1(C)). However, the determination of a difference in the microstate averages relies on a strong gene effect (figure 1(B)) or a very large population (see figure 3), as otherwise the density functions of microstates collapse onto one. This is particularly visible when one compares the right-hand and left-hand graphs in figure 1(B) or figure 1(C). Thus, the results provided by figure 4 suggest that GIFT can be applied to 1000 individuals to return information regarding potential genotype-phenotype associations that would be impossible, or extremely difficult, to obtain with current association studies.
We now consider a different allele frequency, p = 0.8 ($\Delta\omega^0 = 0.6$), as an example. Figures 5(A) and (B) are representations of Δθ(Ω) and its natural-log transformation for log-scale values of a/σ.
Differences are clearly visible between Δω 0 = 0 and Δω 0 = 0.6, since the phenotype values for which Δθ(Ω)'s are maximal have been shifted from the average value of the phenotype (indicated by the vertical red dashed line). This is not surprising because the simulation only imposed a set of genetic variances, without any constraint on the conservation of the average phenotype value.
However, the shift of the phenotype value for which Δθ(Ω) is maximal is of interest. Indeed, equation (17) demonstrates that, for small gene effects, the genetic path difference should be proportional to the phenotype distribution, that is, the phenotype value for which Δθ(Ω) is extreme should be the average value of the phenotype.
Thus, to obtain a better visualisation of the impact of the gene effect on the positioning of the phenotype value $\Omega_{\Delta\theta_{\max}}$ for which the genetic path difference is maximal, a set of simulations was also performed over a range of allele frequencies (figure 5(C)). Note that p ∈ {0.5; 0.6; 0.7; 0.8; 0.9} can be deduced from the symmetry around the average value of the phenotype. While the standard deviations obtained for $\Omega_{\Delta\theta_{\max}}$ were not always negligible, typically between 0.5 and 1 phenotypic standard deviation for small gene effects, figure 5(A) demonstrated trends toward the average value of the phenotype with small gene effects. Indeed, below the simulated gene effect of $a/\sigma \sim 10^{-1}$, the average value of $\Omega_{\Delta\theta_{\max}}$ was remarkably similar to the average value of the phenotype, marked by the horizontal black dashed line.
To confirm this trend for small gene effects, that is, a/σ ≲ 0.1, we varied the population size from $N = 10^3$ to $N = 10^4$ to determine the presence of potential variations in $\Omega_{\Delta\theta_{\max}}$ linked to the simulations. Results summarised in figure 5(B) demonstrate that the only difference was a reduction in the standard deviations obtained for $\Omega_{\Delta\theta_{\max}}$ for the simulated gene effects between 0.01 and 0.1 (see the arrows in figures 5(C) and (D), which point to the different magnitudes of the standard deviations). Namely, the initial symmetry of the phenotype distribution density function reappears, as expected (equation (17)).
Finally, equation (17) suggests that for small gene effects Δθ(Ω) is proportional to the gene effect a/σ. Tables 2 and 3 provide the estimations for both $\langle\Omega\rangle$ and σ and, setting $K \sim 1/\sqrt{2\pi}$, figure 6 provides a comparison between the gene effect from the simulations, $(a/\sigma)_{sim}$, and the gene effect deduced from equation (17), $(a/\sigma)_{theo}$. Thus, recalling that the phenotype average and variance of the population modelled are, respectively, 68.17 inches and 16.24 inches², tables 2 and 3 demonstrate that fitting the genetic paths difference as a function of phenotype values with a quadratic curve recovers the magnitude of the average and variance of the phenotype used for the simulations for most log-scale values of the gene effect. Furthermore, the amplitude of Δθ(Ω) is also indicative of the gene effects involved.
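
One way to carry out such a quadratic fit is sketched below, under the assumption that the fit is applied to ln Δθ(Ω), so that a Gaussian-shaped path difference becomes a parabola whose coefficients give the mean, the variance and the peak amplitude. The function names, the noise-free synthetic curve and its amplitude of 0.02 are hypothetical; the text does not state the fitting procedure actually used for tables 2 and 3.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_log_quadratic(omegas, dtheta):
    """Least-squares fit of ln(dtheta) = c0 + c1*x + c2*x**2 with x = omega - centre.

    Returns (mean, variance, amplitude) of the equivalent Gaussian-shaped curve.
    """
    centre = sum(omegas) / len(omegas)          # centring improves conditioning
    xs = [w - centre for w in omegas]
    ys = [math.log(v) for v in dtheta]          # requires strictly positive values
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    matrix = [[s[i + j] for j in range(3)] for i in range(3)]
    det = det3(matrix)
    coeffs = []
    for col in range(3):                        # Cramer's rule on the normal equations
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = t[r]
        coeffs.append(det3(replaced) / det)
    c0, c1, c2 = coeffs
    if c2 >= 0.0:
        raise ValueError("curve is not concave; no Gaussian-shaped peak")
    variance = -1.0 / (2.0 * c2)
    mean = centre - c1 / (2.0 * c2)
    amplitude = math.exp(c0 - c1 ** 2 / (4.0 * c2))
    return mean, variance, amplitude


# Synthetic, noise-free curve; the amplitude is a hypothetical value.
TRUE_MEAN, TRUE_SD, TRUE_AMPLITUDE = 68.17206, 4.03007, 0.02
omegas = [60.0 + 0.5 * i for i in range(33)]
curve = [TRUE_AMPLITUDE * math.exp(-(w - TRUE_MEAN) ** 2 / (2.0 * TRUE_SD ** 2))
         for w in omegas]
mean, variance, amplitude = fit_log_quadratic(omegas, curve)
print(f"mean={mean:.3f} in, variance={variance:.3f} in^2, amplitude={amplitude:.4f}")
```

With simulated rather than synthetic data, only the positive part of Δθ(Ω) near its peak can be log-transformed, and the recovered values will scatter around the true ones.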

Discussion
In his seminal paper [4], Fisher provided a synthesis between the genetic inheritance of continuous traits and the Mendelian scheme of inheritance using statistics and probability. His theory has become a landmark in genetics and heredity and its conceptual framework is still used today. While statistics is a natural field to employ when dealing with large datasets, the interpretation of data as well as the inferences that can be drawn from it rely fundamentally on PDFs.
As the act of creating categories to work with density functions amounts to acknowledging imprecision, our aim was to devise a different method that rules out the need for categories. This new theoretical method, inspired by physics and named GIFT, uses the concept of phenotypic fields and concentrates on 'genetic paths' to extract information on genotype-phenotype mappings. It is then important to discuss the conceptual similarities and differences between GWAS and GIFT. In terms of conceptual similarities, we saw that the theory underpinning GIFT, when developed using Fisher's assumptions, recovers key concepts from quantitative genetics, including: (i) the Hardy-Weinberg coefficient locally, (ii) the Hardy-Weinberg coefficient at the population level, (iii) the gene effect, (iv) dominance, and (v) small gene effects involving common allele frequencies [30]. In this context, GIFT and GWAS are similar. Finally, applying GIFT to simulated data based on Fisher's assumptions demonstrated its sensitivity for extracting information on genotype-phenotype associations when sample sizes and gene effects were small. Dominance was not considered in the simulations because realistic GWAS have shown that, with small effect sizes/small gene effects (which are the main concern of the current paper), dominance effects are often too small to matter, and an additive model as suggested by Fisher works well enough [31].
In terms of conceptual differences, three essential points can be discussed. First of all, GIFT is more general than GWAS in the sense that the phenotypic fields can take any form, namely they do not have to be linear. The prescription of linear phenotypic fields in Fisher's context comes from the symmetry associated with using the normal distribution as a template for any distribution density function, together with the assumption that the phenotypic and microstate variances are identical [4]. When the constraint on the variances is released, the phenotypic fields become quadratic, involving the variances as well as the averages. In this context GIFT has enabled us to define new parameters linked to microstate variances, namely the pseudo-gene effect and the pseudo-dominance, which will probably help resolve controversies [20].
Secondly, in terms of genetics, what has been achieved so far is rather at odds with traditional ways of thinking about the notion of a gene. Indeed, by defining the difference in genetic paths, Δθ(Ω), one can say that it is the phenotype, i.e., phenotypic fields or information, that organises the configuration of genotypes and not the converse. In genetics the tradition is to think of genes as causing phenotypes. Here, a different way of thinking is suggested, since it is the variation in phenotype values, which makes a ranking process possible, that interacts with the microstates. Therefore, the phenotype is able to 'select' a set of genetic microstates. Recall that microstates 'respond' to, or interact with, the phenotypic fields only if they are associated with the phenotype. Consequently, this model suggests considering a genotype-phenotype 'loop', a.k.a. self-consistency. That is to say, if genes cause phenotypes (traditional view) and phenotypes select gene microstates (present view), then an equivalence exists between phenotype and genotype.
Supplementary materials contain more information and development concerning the convergence and self-consistency of GIFT (SM6).
Finally, Fisher's theory/GWAS has been built on the normal distribution. In general, 'real' density functions are never exactly normally distributed. Given that Fisher's theory gives biological meaning to the average and variance only, to define the 'gene effect' and the 'genetic/phenotypic variance' linked to heredity, respectively, there is no biological meaning attached to any other statistical/mathematical parameter describing real density functions, such as the 'skewness'. As GIFT uses curves, namely does not use the average and variance as central parameters, this issue does not arise with GIFT. Said differently, GIFT frees GWAS from any preconceived idea of what statistics and probability applied to biology should be.
Taken as a whole, the work presented here is a first step suggesting that GIFT can be considered as a potential method for genotype-phenotype mappings. Supplementary material SM7 contains more information concerning the signal-to-noise ratio when GIFT is used and SM8 (see supplementary materials) provides an initial illustration of the application of GIFT using real data based on GWAS results.
However, we acknowledge that more work needs to be done to compare GIFT with the vast literature concerning GWAS. For example, at present the model is quite simplistic in that, by construction, it does not allow the easy incorporation of covariates. Future work will incorporate covariates, such as age or sex, and, in the case of human populations, genetic principal components (to account for population structure). In addition, GIFT will be compared against well-known statistical tests used in GWAS (e.g., the z-test/chi-square test as parametric tests, or the Kolmogorov-Smirnov test as a non-parametric test).

Conclusion
A century ago, Fisher devised a statistical method to map genotypes and phenotypes, which was essentially based on the measure of uncertainty. We present here a method that takes as its paradigm the fact that certainty can exist when phenotype and genotype can be measured with very high precision. In an associated paper, we present a theoretical methodology based on Shannon's information that enables the significance of correlations in real genotype-phenotype data to be quantified [32]. To conclude, this new method (GIFT) opens a new way to analyse genotype-phenotype mapping.

Data availability statement
Code for the simulations presented in the manuscript can be found in the supplementary materials (see SM5). The genomic and phenotypic data analysed as part of this study in the supplementary materials (see SM8) are already published and freely available online. The genomic data are part of the well-known Arabidopsis thaliana '1001 genomes' project (https://doi.org/10.1186/gb-2009-10-5-107) and can be found online here: https://1001genomes.org/data-center.html. The phenotype data were published in The Plant Journal (DOI: 10.1111/tpj.15177) and are freely available online via Ion Explorer here: https://bitbucket.org/ADAC_UoN/dr000081-webservice-ionome-seed-and-leaf-map/. The code used for the real-data analysis in SM8 in the supplementary materials can be found on Dryad, DOI: https://doi.org/10.5061/dryad.vx0k6djtp.

A.1. Differential expression of the genetic path in the space of phenotypic values
We assume that the phenotype values are measured precisely enough such that each individual has a unique phenotype value, noted $\Omega_i$. As the population is composed of N individuals, one defines $\hat{i} \overset{\text{def}}{=} i/N$ and $\Omega_{\hat{i}}$ as the new position and its corresponding phenotype value, respectively. Thus $\Omega_{1/N}$ and $\Omega_1$ are the smallest and largest phenotype values, respectively.

A.2. Entropy of the string of microstates
Keeping the notations $\omega_+^0$, $\omega_0^0$ and $\omega_-^0$ for the genetic microstate frequencies of a given genome position across the population of individuals, we aim to determine the expressions of $\omega_+^{\hat{i}}$, $\omega_0^{\hat{i}}$ and $\omega_-^{\hat{i}}$ given the information obtained upon ordering the genotypes as a function of phenotype values along the $\hat{i}$-axis.
The default genetic path, θ 0 , as a straight line is defined by the absence of information on phenotype values, which is similar to an absence of association between the genetic microstates and the phenotype values, leading to an apparent disordering of genetic microstates. One way to measure this disordering is by using the 'entropy' of the string of genetic microstates for the genome position considered. In this context, the entropy is given by calculating When information about phenotype values and their ranking is given and when the genome position considered is associated with the phenotype, S 0 is transformed to S where: +ω 0 (Ω) ln(ω 0 (Ω)) +ω − (Ω) ln(ω − (Ω))) 1 Δ(Ω) dΩ. (A.3) As a result, the entropy difference, S − S 0 when non-null provides information on whether the genome position is associated with phenotype values. Thus the difference, S − S 0 , can be thought of as a 'transformation' in a physical/thermodynamic sense. That is, the difference in entropies must be balanced by a term that is linked to the association (or interaction) between the genetic microstates and the phenotype values.
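
To give a feel for this entropy argument, the sketch below assumes that the entropy of a string of microstates takes the Shannon form, −Σ ω ln ω over the microstate frequencies, and uses non-overlapping windows of ranked individuals as a crude stand-in for the local frequencies ω(Ω). Both choices, and the function names, are assumptions made for illustration and may differ from the definitions used in this appendix.

```python
import math
from collections import Counter


def shannon_entropy(states):
    """Sum of -w * ln(w) over the microstate frequencies w of a list of states."""
    n = len(states)
    return sum(-(c / n) * math.log(c / n) for c in Counter(states).values())


def mean_local_entropy(states, window):
    """Average entropy over consecutive, non-overlapping windows of equal size."""
    if len(states) % window != 0:
        raise ValueError("window must divide the number of individuals")
    chunks = [states[i:i + window] for i in range(0, len(states), window)]
    return sum(shannon_entropy(chunk) for chunk in chunks) / len(chunks)


# Hardy-Weinberg proportions for p = 0.5 and N = 1000 (hypothetical strings).
no_association = [1, 0, 0, -1] * 250                   # every window mirrors the population
full_association = [1] * 250 + [0] * 500 + [-1] * 250  # ranking sorts the microstates

s0 = shannon_entropy(no_association)
print(f"S0 = {s0:.4f}")
print(f"local entropy, no association:   {mean_local_entropy(no_association, 4):.4f}")
print(f"local entropy, full association: {mean_local_entropy(full_association, 250):.4f}")
```

In this toy setting the local entropy equals the population entropy when the ranking carries no information, and drops when ranking by phenotype orders the microstates, which is the kind of entropy difference discussed above.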

A.3. Interaction energy between microstates and subfields
As the difference, S − S 0 , is linked to the information gained from knowing phenotypic values and ranking them (appendix A.2), given the existence of three distinct genetic microstates, one can define three distinct functions, a.k.a. phenotypic fields, 'u + (Ω)', 'u 0 (Ω)' and 'u − (Ω)', that are fundamentally related to changes in the phenotype-associated genetic path. In this context, the entire genetic path can be defined with a function representing the sum of interactions between each of the genetic microstates and phenotypic fields under the form: In this context, one may consider that the set of microstates changes configuration because the fields are 'switched on'. This implies that for the genome positions that are not involved in the formation of the phenotype considered, the switch does not work, that is, the fields are null. In this context, one can consider the equivalence, S − S 0 ∼ E. As a result, the relationship to optimise is: − (Ω) 1 Δ(Ω) dΩ = ω 0 − together with the conservation of probability, ω + (Ω) + ω 0 (Ω) + ω − (Ω) = 1. Euler-Lagrange's method can then be used to determine the optimal configuration for ω + (Ω), ω 0 (Ω) and ω − (Ω) in a context where the phenotypic fields are imposed. By defining α + , α 0 and α − , the Lagrange multipliers for the conservation of genetic microstates, the relation to optimise with regard to the genetic microstate frequencies ω + (Ω), ω 0 (Ω) and ω − (Ω) is then, Using the conservation of genetic microstate frequencies, ω 0 (Ω) can be replaced by 1 − ω + (Ω) − ω − (Ω), and a variational calculus can be performed on the genetic microstate frequencies, leading to two conditions: The conservation of genetic microstates needs to be added regardless of the genetic path taken, that is, Δθ(Ω 1 ) = 0, expressed as: