A New Strategy using Data Integration for Identifying Interactions between Genes and Environment

The detection of gene-environment (GE) interactions is important for understanding complex diseases. Although several methods have been proposed to detect interactions between genes and the environment, they are limited in the case of rare variants. In this paper, we extend the existing ADA method and propose a combined strategy (i-ADA) to identify gene-environment interactions for rare variants. First, we use the Levene test to test the equality of variances at each variant site. Second, based on the first-step P-values and combined with the ADA method, we design a novel interaction test. Finally, we use the GAW17 dataset to illustrate the practicality of the proposed method.


Introduction
Genome-wide association studies (GWAS) have successfully detected a large number of common variants associated with many traits and human diseases. However, common variants explain only a small part of disease heritability [1][2], and many diseases are caused by rare variants; the analysis of rare variants is therefore an important and difficult problem. Many methods have been used to test the main effects of rare variants, for example the ADA method [3], the C-alpha method [4], and the sequence kernel association test (SKAT) [5]. However, when the given region includes neutral variants, the latter two methods lose power. Therefore, further development is needed for the study of complex diseases. Increasing attention has been paid to detecting interactions between genes and the environment for common and rare variants. For example, a set-based method for testing gene-environment interactions, applicable to both common and rare variants, was proposed in [6]; a set-based mixed effect model for gene-environment interactions was developed in [7]; Tow-GE was developed in [8]; and the GESAT method was given in [9]. Because most diseases are caused by rare variants, [10] developed a novel method (iSKAT) to test the G-E interaction. In this paper, we design a combined strategy (i-ADA) to detect the GE interaction for rare variants, and we illustrate the practicality of the proposed method using the GAW17 dataset.

Materials and Method
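Consider a chromosome region in which n unrelated individuals are sequenced at P genetic variants. We mainly detect the interaction effects of J rare variants and a certain environment under the model

$$y_i = \beta_0 + \sum_{j=1}^{J} \beta_j g_{ij} + \gamma E_i + \sum_{j=1}^{J} \delta_j g_{ij} E_i + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $y_i$ is the quantitative trait of individual $i$, $g_{ij}$ is the genotype of individual $i$ at variant $j$, $E_i$ is the environmental factor, the error term $\varepsilon_i$ has mean 0 and variance $\sigma^2$, and $\beta_j$, $\gamma$, and $\delta_j$ are the genotype effects, the environment effect, and the gene-environment interaction effects, respectively. We next design a combined strategy to detect interaction effects on quantitative traits.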
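Step I: Under the following local statistical model for the j-th variant,

$$y_i = \beta_0 + \beta_j g_{ij} + \gamma E_i + \delta_j g_{ij} E_i + \varepsilon_i,$$

the conditional variance of the trait given the genotype is $\mathrm{Var}(y_i \mid g_{ij}) = (\gamma + \delta_j g_{ij})^2 \mathrm{Var}(E_i) + \sigma^2$ when the environment is independent of the genotype and of the error, and it is the same for every genotype when $\delta_j = 0$. Thus, testing the interaction between one variant and one certain environment (or other variants) is equivalent to testing the equality of conditional variances. The F test and the Bartlett test are commonly used to test whether the variances of two or more sets of data are equal. However, both require that the data come from a normally distributed population, so they are of limited use when normality does not hold. Therefore, this paper considers a more robust method, the Levene test [11], which does not depend on the specific form of the underlying distribution and can test the homogeneity of variance for two or more sets of data. In Step I, our method therefore uses the Levene statistic to test the equality of variances at each gene site. The Levene test statistic with its null distribution is

$$W = \frac{n-k}{k-1} \cdot \frac{\sum_{a=1}^{k} n_a \left(\bar{Z}_{a\cdot} - \bar{Z}_{\cdot\cdot}\right)^2}{\sum_{a=1}^{k} \sum_{l=1}^{n_a} \left(Z_{al} - \bar{Z}_{a\cdot}\right)^2} \sim F(k-1,\, n-k),$$

where $k$ is the number of genotype groups, $n_a$ is the size of group $a$, $Z_{al} = |y_{al} - \bar{y}_{a\cdot}|$ is the absolute deviation of the $l$-th trait value in group $a$ from the group mean, $\bar{Z}_{a\cdot}$ is the mean of the $Z_{al}$ in group $a$, and $\bar{Z}_{\cdot\cdot}$ is the overall mean of the $Z_{al}$.

Step II: To better test the GE interaction for rare variants, we follow the ADA method [3] and consider M candidate truncation thresholds $\theta_1, \theta_2, \ldots, \theta_M = 0.10, 0.11, \ldots, 0.20$. For the m-th candidate truncation threshold, we obtain the significance scores of the harmful and beneficial variants through the following equations:

$$S_m^{+} = -\sum_{j=1}^{J} \xi_j^{+}\, w_j\, I(p_j < \theta_m) \log p_j, \qquad S_m^{-} = -\sum_{j=1}^{J} \xi_j^{-}\, w_j\, I(p_j < \theta_m) \log p_j,$$

where $p_j$ is the first-step P-value of the j-th site, $\xi_j^{+}$ ($\xi_j^{-}$) denotes an indicator variable coded as 1 if the j-th variant is classified as harmful (beneficial) and 0 otherwise, and $w_j$ is a weight given to the j-th site, with $w_j = \mathrm{Beta}(\mathrm{MAF}_j; 1, 25)$, where $\mathrm{MAF}_j$ denotes the minor allele frequency [12]. In this way, we obtain the significance scores of the harmful and beneficial variants, respectively, and the threshold statistic $S_m = \max(S_m^{+}, S_m^{-})$, $m = 1, 2, \ldots, M$. We adopt the idea of ADA [3] to combine the threshold statistics and obtain the P-value of the region by permutation.

In this paper, we design a novel combined strategy to detect the interactions between genes and the environment. Our method is called "i-ADA", since it is based on the ADA method and tests interactions between genes and the environment.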
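As a rough illustration of the two steps, the Python sketch below computes a Levene statistic from trait values grouped by genotype and then ADA-style truncated scores from per-site P-values. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`levene_w`, `beta_weight`, `threshold_scores`), the use of group means in the Levene deviations, the form of the score (a weighted sum of -log P over the sites whose P-value falls below the threshold), and the 0/1 indicator supplied by the caller are all assumptions made for illustration. Converting W into a P-value requires the upper tail of the F distribution with (k-1, n-k) degrees of freedom, which is left to a statistics library.

```python
import math


def levene_w(groups):
    """Levene statistic for trait values grouped by genotype.

    groups: list of lists, one list of trait values per genotype group.
    Deviations are taken from the group mean (an assumption; the median
    is a common alternative). Returns (W, df1, df2).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    # Absolute deviation of each observation from its own group mean.
    z = [[abs(y - sum(g) / len(g)) for y in g] for g in groups]
    z_means = [sum(zg) / len(zg) for zg in z]
    z_grand = sum(sum(zg) for zg in z) / n
    between = sum(len(zg) * (m - z_grand) ** 2 for zg, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zg, m in zip(z, z_means) for v in zg)
    return (n - k) / (k - 1) * between / within, k - 1, n - k


def beta_weight(maf, a=1.0, b=25.0):
    """Beta(a, b) density evaluated at the MAF (requires 0 < maf < 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(maf)
                    + (b - 1) * math.log(1 - maf))


def threshold_scores(pvals, mafs, flags, thresholds):
    """One truncated score per threshold.

    flags[j] is a 0/1 indicator for site j; how sites are classified is
    an assumption left to the caller.
    """
    scores = []
    for theta in thresholds:
        s = -sum(f * beta_weight(m) * math.log(p)
                 for p, m, f in zip(pvals, mafs, flags) if p < theta)
        scores.append(s)
    return scores


# Candidate truncation thresholds 0.10, 0.11, ..., 0.20.
THRESHOLDS = [round(0.10 + 0.01 * m, 2) for m in range(11)]

# Toy example: two genotype groups and three sites (all values hypothetical).
w_stat, df1, df2 = levene_w([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
scores = threshold_scores(pvals=[0.05, 0.15, 0.50], mafs=[0.01, 0.02, 0.04],
                          flags=[1, 1, 0], thresholds=THRESHOLDS)
print(w_stat, df1, df2)
print(scores)
```

Calling `threshold_scores` once with the indicator for harmful variants and once with the indicator for beneficial variants, then taking the larger value at each threshold, would give one threshold statistic per candidate threshold.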

Real Data Analysis
To evaluate the method further, in this section we apply the new i-ADA method to analyze gene-environment interactions for rare variants in the GAW17 dataset, which combines simulated and real data based on the 1000 Genomes Project (http://www.1000genomes.org). We use the quantitative trait Q1 and an environmental factor (smoking) in the GAW17 dataset [13]. To verify the effectiveness of the proposed method while reducing the computational burden, we use the genetic data of the first chromosome of the unrelated individuals in the GAW17 dataset. Following the view that "rare variants cause common diseases", we study the rare variants on the first chromosome, that is, the SNP sites with MAF ≤ 0.05. In summary, the data consist of 1713 rare variant sites, one environmental factor (smoking), and one quantitative trait (Q1), with a sample size of 697. Figure 1 shows the allele frequency of each rare variant; the MAF of the variants ranges from 0.007% to 5%. Most variants have a very low MAF, below 0.01 and close to 0. When the P-value is less than the significance level of 0.05, we conclude that a G-E interaction exists in the region. The results are shown in Table 1: i-ADA tested a total of 86 regions and detected two regions with interactions. Our results show that certain genes on chromosome 1 have a significant interaction with smoking. Many current methods for testing interactions between genes and the environment focus on a single gene and rarely test a region. In contrast, we propose a region-based interaction test in this paper. With this method, we can detect the regions with interactions and prepare for subsequent analysis.
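The rare-variant selection described in this section (SNP sites with MAF ≤ 0.05) can be sketched as follows. This is an illustrative sketch only: the genotype coding (0, 1, or 2 copies of one allele per individual), the function names, and the exclusion of monomorphic sites are assumptions, not details taken from the GAW17 analysis.

```python
def minor_allele_frequency(genotypes):
    """MAF of one site from genotypes coded as 0, 1 or 2 allele copies."""
    freq = sum(genotypes) / (2 * len(genotypes))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(freq, 1 - freq)


def rare_variant_indices(genotype_columns, max_maf=0.05):
    """Indices of polymorphic sites with 0 < MAF <= max_maf.

    genotype_columns: list of sites, each a list of per-individual genotypes.
    Monomorphic sites (MAF = 0) are dropped, which is an assumption.
    """
    return [j for j, column in enumerate(genotype_columns)
            if 0 < minor_allele_frequency(column) <= max_maf]


# Toy data: four sites observed in 20 hypothetical individuals.
sites = [
    [1] + [0] * 19,        # one minor allele copy
    [1] * 10 + [0] * 10,   # common variant
    [0] * 20,              # monomorphic site
    [2] * 19 + [1],        # coded allele is the major allele
]
print(rare_variant_indices(sites))
```

Taking the minimum of the coded-allele frequency and its complement keeps the filter independent of which allele happens to be counted in the genotype coding.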

Discussion
Although some methods have been proposed to detect gene-environment interactions, most of them have limitations, especially when dealing with rare variants. In our proposed method, we extend the existing ADA method and combine it with the Levene test to develop a novel two-step strategy, i-ADA, for detecting interactions between genes and the environment. We assess the performance of i-ADA on the GAW17 dataset by testing the interaction effect of each region and smoking status on the quantitative trait Q1. In the real data analysis, each P-value is obtained through 300 permutations, and a region is declared to show a G-E interaction when its P-value is below the significance level of 0.05. Of the 86 regions tested (Table 1), i-ADA detected two with interactions, which shows that certain genes on chromosome 1 have a significant interaction with smoking and that the proposed method is practical. However, the first step uses the Levene test, whose computational complexity leads to longer running times. In the future, we will consider optimizing the first step to increase the calculation speed. Our method currently targets the G-E interaction in a certain region. To better discover genes that interact with the environment, we will build on i-ADA to target specific genes and specific chromosomes. These will be our future research work.