Comparison study in statistical estimation of gene effects based on a real data set

Gene mapping is an important task in current biostatistics and the life sciences. Most biological traits are controlled mainly by genetic factors, i.e., the function of genes in the genome, and are also affected by environmental factors. Sound statistical theory and methods should be used in gene mapping studies, so that gene effects can be inferred precisely and reliable evidence can be provided to the practical domains of genetics and medicine. In this paper, we compare two statistical methods, the conventional least squares estimation (LSE) method and the Dantzig Selector (DS) method, for estimating gene effects in the framework of the linear model. The two methods are illustrated by applying them to a real data set. The results show that the Dantzig Selector method has some advantages, although the two methods give similar parameter estimates.


Introduction
Life science has made rapid progress in recent years, along with the development of statistics and computer science. Researchers can now study the genetic mechanisms of biology at a more microscopic level, so that more biological traits, including all kinds of complex diseases, can be investigated. Scientists have found many gene loci that affect human diseases. The main effects of these genes explain a large proportion of the heritability of the traits, which means that statistical inference of main effects is a key step in gene mapping studies. Of course, there is still additional heritability that needs to be detected [1,2].
Many statistical methods have been used or newly developed in the course of detecting causal genes or genetic variants. Using RFLP linkage maps, Lander and Botstein [3] considered the problem of mapping Mendelian factors underlying quantitative traits, which is a well-known early work in this area. Haley [4] applied a simple regression method to map quantitative traits in line crosses using information from flanking markers. Kwak et al. [5] mapped quantitative trait loci underlying function-valued phenotypes via a regression-based statistical method. To perform high-resolution mapping of quantitative traits, Lou et al. [6] proposed the idea of unifying interval and linkage disequilibrium mapping. In addition, for other kinds of traits, generalized linear mixed models can be considered in the statistical analysis [7].
In this paper, we consider the problem of estimating gene effects in the framework of the linear model, and use two statistical methods, the conventional LSE method and the Dantzig Selector (DS) method [8,9], to infer the gene effects that mainly control the quantitative trait. The analysis of a real data set of

Theory and methods
We consider a quantitative trait that is potentially controlled by multiple gene loci and one environmental variable.

Statistical model
The statistical model (1) measures the relationship between the trait value and the genotype values. Since we focus on the inference of main effects in this paper, all interaction effects in the model are set to 0.
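One way to write a linear model that matches this description is sketched below. The notation is assumed here for illustration only and is not taken from the original display: y_i is the trait value of individual i, x_{ij} its genotype value at locus j, z_i the environmental variable, beta_j the main effect of locus j, delta_{jk} a pairwise interaction effect, and epsilon_i a random error.

```latex
y_i = \mu + \sum_{j=1}^{M} x_{ij}\,\beta_j
      + \sum_{1 \le j < k \le M} x_{ij}\,x_{ik}\,\delta_{jk}
      + z_i\,\gamma + \varepsilon_i ,
\qquad i = 1, \dots, n .
```

Under this assumed form, setting every delta_{jk} to 0 leaves a model with only the main effects beta_1, ..., beta_M and the environmental effect gamma, which is the case studied here.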

Methods
Because both the trait values and the genotype values are observed, model (1) is a linear model in the effect parameters. When the number of loci M is moderate and smaller than the sample size n, the conventional LSE method can be used directly to estimate the effect parameters. If M is larger than n, however, the LSE method cannot be used, because the least squares solution is no longer unique, and statistical shrinkage methods (e.g., the DS method) should be considered instead. The DS method has optimal ℓ2 rate properties under a sparsity scenario, that is, when the number of non-zero components of the true parameter vector is small, and it remains computationally feasible when the number of parameters is much larger than the sample size. The DS method can be run with the Matlab program given by its authors [8].
When the number of loci M is smaller than the sample size, both methods can be used to estimate the parameters, but they may give different inference results: the DS method seeks the best subset of variables by solving a linear program, whereas the LSE method provides a closed-form solution for the parameters. Hence the results should be interpreted with care in practical data analysis.
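The two estimators can be sketched in a few lines of Python. This is not the Matlab program of [8]; it is a minimal illustration that assumes the standard formulation of the Dantzig Selector (minimize the ℓ1 norm of the coefficients subject to a bound lam on the largest absolute correlation between the residuals and the columns of X) and solves it as a linear program with SciPy. The function names, the synthetic data, and the choice of lam are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog


def lse(X, y):
    """Least squares estimate; unique only when X has full column rank."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def dantzig_selector(X, y, lam):
    """Solve min ||b||_1 subject to ||X^T (y - X b)||_inf <= lam as an LP."""
    M = X.shape[1]
    G = X.T @ X
    Xty = X.T @ y
    # Split b = u - v with u, v >= 0; the objective sum(u) + sum(v)
    # equals ||b||_1 at the optimum.
    c = np.ones(2 * M)
    # -lam <= Xty - G (u - v) <= lam, written as two blocks of "<=" rows.
    A_ub = np.vstack([np.hstack([G, -G]), np.hstack([-G, G])])
    b_ub = np.concatenate([Xty + lam, lam - Xty])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:M] - res.x[M:]


# Synthetic illustration only (NOT the barley data): n > M, sparse truth.
rng = np.random.default_rng(0)
n, M, sigma = 145, 20, 0.5
X = rng.choice([-1.0, 1.0], size=(n, M))       # hypothetical genotype coding
beta_true = np.zeros(M)
beta_true[[2, 7]] = [1.5, -1.0]
y = X @ beta_true + sigma * rng.standard_normal(n)

# Assumed tuning value; in practice lam has to be chosen by the analyst.
lam = sigma * np.sqrt(2 * n * np.log(M))
b_ls = lse(X, y)
b_ds = dantzig_selector(X, y, lam)
```

Because the least squares solution satisfies the constraint for any non-negative lam, the linear program is always feasible, and the DS estimate never has a larger ℓ1 norm than the LSE estimate.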

Real data analysis
To compare the performance of the LSE method and the DS method in estimating gene effects, we used the barley data set from the North American Genome Mapping Project (Tinker et al. [10]) for illustration. The barley data set contains 145 individuals with 7 traits, and 127 mapped markers were genotyped. In this section, we analyze the kernel weight trait.
First, we standardized the trait data and fitted the distribution of the kernel weight trait (see Figures 1 and 2). Both the histogram of the kernel weight trait and the fitted density support the assumption that the trait follows a normal distribution, so model (1) can be used here to infer the genetic parameters.
The results of the two methods in Figures 3 and 4 are clearly different. The estimates from the DS method are more concentrated, and the larger effects stand out more clearly, which reflects the better shrinkage property of the ℓ1 regularization problem. In contrast, the estimates from the LSE method are more uniform, and only the largest estimated effect lies outside the group of all effects. To further show the difference between the two sets of estimates, Table 1 lists the top 15 gene effects given by the DS and LSE methods, respectively. In this table, 6 main loci were detected by both methods (see the numbers in bold type in Table 1), and locus 102 showed the most significant main effect among all the gene loci considered.
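The standardization step and the comparison of the two top-15 lists can be sketched as follows. The helper names are hypothetical, and the effect vectors are random placeholders rather than the barley estimates; the sketch only shows how loci could be ranked by absolute estimated effect and how the loci shared by two rankings could be found.

```python
import numpy as np


def standardize(y):
    """Center to mean 0 and scale to unit sample standard deviation."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std(ddof=1)


def top_k_loci(beta, k=15):
    """1-based indices of the k largest absolute effects, largest first."""
    order = np.argsort(-np.abs(np.asarray(beta, dtype=float)), kind="stable")
    return [int(j) + 1 for j in order[:k]]


def common_loci(beta_a, beta_b, k=15):
    """Loci that appear in the top-k lists of both estimate vectors."""
    return sorted(set(top_k_loci(beta_a, k)) & set(top_k_loci(beta_b, k)))


# Placeholder estimates for 127 markers (synthetic, NOT the paper's results).
rng = np.random.default_rng(1)
beta_ls = rng.normal(size=127)
beta_ds = beta_ls * (np.abs(beta_ls) > 1.0)  # crude stand-in for a sparser fit
shared = common_loci(beta_ls, beta_ds, k=15)
```

Ranking by absolute value treats positive and negative effects alike, which matches the idea of listing the loci with the largest effects regardless of sign.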

Conclusion and discussion
The genotype main effects make the major contribution to the genetic variance, so effective statistical methods should be considered in real data analysis. In fact, estimating the main effects is the first step in gene loci mapping, and the interaction effect analysis [11] can be carried out on the basis of this first step, although the main effects and the interaction effects have no direct causal relationship. If interaction effects were also considered in our real data analysis, then only the DS method could be used, because the number of parameters would be much larger than the sample size.
The DS method estimates parameters from the viewpoint of model selection, and it has optimal mathematical properties under a sparsity scenario, which makes it significantly different from the conventional LSE method. However, if the number of parameters is smaller than the sample size and the sparsity condition on the parameters is not satisfied, the two methods may give similar results.
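As a rough count, assuming that only pairwise interactions among the 127 markers of the barley data set are added to the main effects (the form of the interaction terms is an assumption here), the number of effect parameters would be:

```latex
\underbrace{127}_{\text{main effects}}
+ \underbrace{\binom{127}{2}}_{\text{pairwise interactions}}
= 127 + \frac{127 \times 126}{2}
= 127 + 8001
= 8128 \;\gg\; n = 145 .
```

With far more parameters than individuals, the least squares solution is not unique, which is why a shrinkage method such as the DS method would be needed in that setting.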