SNPs Selection using Gravitational Search Algorithm and Exhaustive Search for Association Mapping

Single Nucleotide Polymorphisms (SNPs) are known having association to phenotipic variations. The study of linking SNPs to interest phenotype is refer to Association Mapping (AM), which is classified as a combinatorial problem. Exhaustive Search (ES) approach is able to be implemented to select targeted SNPs exactly since it evaluate all possible combinations of SNPs, but it is not efficient in terms of computer resources and computation time. Heuristic Search (HS) approach is an alternative to improve the performance of ES in those terms, but it still suffers high false positive SNPs in each combinations. Gravitational Search Algorithm (GSA) is a new HS algorithm that yields better performance than other nature inspired HS. This paper proposed a new method which combined GSA and ES to identify the most appropriate combination of SNPs linked to interest phenotype. Testing was conducted using dataset without epistasis and dataset with epistasis. Using dataset without epistasis with 7 targeted SNPs, the proposed method identified 7 SNPs - 6 True Positive (TP) SNPs and 1 False Positive (FP) SNP- with association value of 0.83. In addition, the proposed method could identified 3 SNPs- 2 TP SNP and 1 FP SNP with association value of 0.87 by using dataset with epistases and 5 targeted SNPs. The results showed that the method is robust in reducing redundant SNPs and identifying main markers.


Introduction
Association mapping is a study aiming to detect linkage between genetic polymorphisms and phenotypic variations in existing germplasm. One of the most popular marker to identify genetic polymorphisms is Single Nucleotide Polymorphisms (SNPs), since it allows generation of abundant information on genetic variability at DNA level. The analysis of AM has increased human understanding of phenotipic variations and heritability in human, animals and plants [1] [2].
A number of methods for analyzing AM have been proposed before. In the conventional methods, AM is conducted based on investigation of single locus and the interest phenotype [3]. Nevertheless, the methods are not suitable for complex phenotypes with epistases interaction among genes [3] [4]. The search for multi-locus association is potential to explain the phenomenon of complex phenotypes and genetic polymorphisms. Multi-locus association is classified as a combinatorial problem, it seems simple to be solved but hard to be implemented because the search space would increase exponentially according to the growth of data.
Multi-locus AM with 1000 SNPs require computer with high performance to evaluate 8.25×10 12 combinations for five SNPs. Reducing the redundant and uncorrelated SNPs in the search space is the important task. One of the reducing techniques could be applied is using heuristic search and followed by conducting exhaustive search to the remaining SNPs to find the most appropriate combination of SNPs. This approach was implemented using Ant Colony Optimization (ACO) and exhaustive search in the case of case-control phenotype [3].
Gravitational Search Algorithm (GSA) is one of the latest heuristic search that developed based on the law of gravity and mass interaction. It was first introduced by Rashedi et al in 2009 [5]. It becomes to continue growing and being popular in researchers. Many modified and hybrid GSA have already proposed by researchers. They showed that GSA has better performance in terms of computation time and convergence than the previous heuristic search [6]. In this study, we proposed a method combining GSA and exhaustive search to find the most appropriate combination of SNPs that has linkage to quantitative phenotype.

Material
The proposed method was tested using two types of simulated dataset which were used by Oliveira et al [7]. The first dataset only has main effects without interaction among SNPs, while the second one with epistasis among SNPs. The dataset were generated by the function of simulateSNPglm of the scrime package in the R software. No filters HWE and call rate in simulated dataset were used, since there were not any missing values and no markers in Hardy-Weinberg disequilibrium.

Simulated dataset with epistasis 2
Epistasis could be defined as a form of functional interaction among genes that caused not common phenomenon in phenotypic variations [4]. Illustration about epistatis is showed in Figure 1.

Method
The SNPs selection method proposed in this study consist of two stages, selection using heuristic search in the first stage and exhaustive search in the second one. As mention in introduction, finding the most appropriate combination of SNPs that has linkage to quantitave phenotype is classified as a combinatorial problem, for which the worst case time requirement grows exponentially with the number of SNPs. To address this problem, heuristic search was conducted to select a number of possible SNPs that have linkage to phenotype. In other words, the first stage was aimed to reduce the number of redundant SNPs, this stage was conducted two times. However, solution provided by heuristic seach is not necessary as the best solution for the given problem [6]. Therefore, exhaustive search was needed in the final stage to select targeted SNPs. The association between SNPs and phenotype was conducted using SVR [8] first, SVR generated predict phenotype based on the given combination of SNPs and the predict phenotype was associated to the phenotype in simulated dataset. The correlation between predicted and simulated phenotype was being the evaluation of the fitness function.  It was developed based on the law of gravity and mass interactions [5]. As the common heuristic search, GSA is comprised of parallel artificial agents and each agent represents a solution of the problem. Consider an artificial population consist of N agents, then solution of the th i agent is defined by The solution of the th i agent represents a combination of n SNPs, where d i x represents the th d SNP in a combination and n is the dimension, represent the desired number of SNPs for a combination. All agents attract each other by a gravity force, and this force causes a movement of all agents globally towards the agents with heavier mass. Mass of an agent correspond to the evaluation of the fitness function for the given solution. The attraction and the movement are iterative procedures which would be stopped at a predefined number of iterations, T . At a specific time t , the force acting on agen i from agen j is defined as the following: where j rand is a random number in the interval [0,1] and Kbest is the set of first K agents with the best fitness value and biggest mass.
Kbest is a function of time, initialized to 0 K at the beginning and decreasing with time.
According to the law of motion, the acceleration of the agent i , at time t , in the d th dimension, is given as follows: where ii M is the is the inertial mass of the i th agent. The next velocity of an agent is considered as a fraction of its current velocity added to its acceleration. Therefore, its velocity and its position could be calculated as follows: where i rand is an uniform random number in the interval [0,1]. The gravitational constant, G , is initialized at the begining and will be reduced with time to control the search accuracy. In other words, G is a function of the initial value ( 0 G ) and time ( The masses of the agents are calculated using fitness function. A heavier mass means a more efficient agent to solve the problem and moves more slowly. Supposing the equality of the gravitational and inertia mass, the values of masses is calculated using the map of fitness. The gravitational and inertial masses are updated by the following equations: The principle of GSA is shown in Figure 3.
SNPs of the best agent,

Exhaustive Search
The number of outcome SNPs of the previous stage were less than the number of outset SNPs,

Results and Discussions
The performance of the proposed method applied to both simulated dataset was compared to the method proposed by Oliveira et al [7] and generic GSA [5] itself. The comparison was conducted using two simulated dataset, without epistasis and with epistasis.

Simulated dataset without epistasis 1
Combination of SNPs generated by the proposed method and the other two methods using simulated dataset 1 were showed in Table 1. From the table, it was seen that the generic GSA was able to identify all of targeted SNPs but suffer of many false positive SNPs. The false positive SNPs may lead models' performance to predict phenotype become low, even though all of targeted SNPs consist in the generated combination. As could be seen in Tabel 1, correlation of the generic GSA was the lowest, 0.262, compared to the other methods.  1, 10, 20, 30, 50, 60, 72 ** 0.830 * )SNPs collection of best agen in every iteration ** )The bold SNPs is the targeted SNPs The method proposed by Oliveira et al [7] was able to reduce 977 of 993 redundant SNPs, but the given combination still suffer of many false positive SNPs, 16 false positive SNPs. Besides that, there were two other targeted SNPs, the SNP40 and the SNP50, unidentified by the method. The correlation was higher than the generic GSA but lower than the proposed method.
As mention before, the generic GSA [5] was able to identify all of targeted SNPs. Besides that, it was also able to reduce redundant SNPs, 823 of 993 redundant SNPs, although the result was not better than that of the Oliveira method. Because all of targeted SNPs were consisting in the combination generated by generic GSA and the number of redundant SNPs was lower than those of in the outset SNPs, it was possible to conduct exhaustive search to find the most appropriate combination of SNPs. From Table 1, it could be seen that the performance of the proposed method was better than the Oliveira method in term of identifying targeted SNPs. Besides that, the number of false positive SNPs in the proposed method was lowest than the two other methods. Moreover, the correlation yielded by the proposed method was the highest among others, it was 0.83.
There was one SNP, the SNP40, which was not able to be identified by the Oliveira method and the proposed method. Figure 4 showed the relationship between the SNP40 and the simulated phenotype 1. From the Equation 1, it was known that SNP40 would affect to the phenotype when its value was 3. The Figure 4 showed that the number of sample with the value of SNP40 is 3 was less than the number of sample that the value of SNP40 is not 3. Figure 5 showed the relationship between the SNP30 and the simulated phenotype 1, the same thing also happening to SNP30. From the Equation 1, it was known that SNP30 will affect to phenotype when its value is 3. Figure 5 showed that the number of sample that the value of SNP30 is 3 was less than the number of sample that the value of SNP30 is not 3 but the models were able to identified it. From the Equation 1, it could be seen that the beta coeficient of SNP30 was greater almost five folds than the other beta coeficients. It means that, SNP30 is the main marker for the given phenotype. Meanwhile, the SNP40 is not the main marker and only a few samples show effect to phenotype because of this marker. Therefore, the models could not identified the SNP40 except generic GSA. The comparison of models performance based on true positive rate, false positive rate and correlation was showed in Figure 6. Generally, the proposed method and the Oliveira method [7] were able to reduce redundant SNPs than the generic GSA [5]. The proposed method was better in identifying targeted SNPs than the Oliveira method, it could be seen from TPR and correlation metrics were higher than the Oliveira methods.

Simulated dataset with epistasis 2
The performance of the proposed method using simulated dataset 2 was showed in Table 2 generic GSA [5]. Nevertheles, the ability of all methods in identifying targeted SNPs was lower than before.
The generic GSA was able to identify two targeted SNPs, SNP3 and SNP4, while it suffer with 62 false positive SNPs and its correlation was 0.23. The Oliveira method [7] was able to identify one targeted SNP, the SNP3, with no false positive SNP and the correlation was 0.95. The proposed method was able to identify two targeted SNPs, SNP3 and SNP4, with one false positive SNP. The correlation was 0.87. From Table 2, it was seen that only SNP3 could be identified by the proposed methods.
In the simulated dataset 2, there was only one sample with condition (SNP4 != 2) and (SNP3 != 1) and there was no sample with condition (SNP12 != 1) and (SNP9 == 3). According to [7], the simulated dataset 2 was generated with high deviation and asymmetry data. It was done to examine the robustness of the method.
The comparison of models performance based on true positive rate, false positive rate and correlation using simulated dataset 2 was showed in Figure 7. From the figure, it could be seen that TPR of all methods were low and the correlations were high, except the generic GSA method [5]. It is mean that the proposed method and the Oliveira method [7] are robust in reducing the redundant SNPs and identifying main marker affected the simulated phenotype dataset 2.

Conclusion
The proposed method successfully combined GSA and exhaustive search to solve the association mapping problem. The method was tested using two types of simulated datasets, without epistasis and with epistasis. In both simulated datasets, the proposed method shows the robussnest in reducing the redundant SNPs and identifying the main marker SNP which affect the simulated phenotype. It is concluded that the proposed method has good potentiality to be applied in association mapping with real dataset.