SNN-SB: Combining Partial Alignment Using Modified SNN Algorithm with Segment-Based for Multiple Sequence Alignments

Multiple sequence alignment (MSA) is an essential tool in bioinformatics. Many MSA algorithms have been proposed over the last decade; however, there is still room for improvement in accuracy. Incorporating partial alignments into MSA has proved to be an effective way to improve the quality of the final multiple sequence alignment. This paper presents a novel algorithm, SNN-SB, which detects related residues of protein sequences and builds partial alignments using a modified Shared Near Neighbours algorithm together with segment-based alignments. The partial alignment is used as a guide for the DIALIGN-TX algorithm to build the final MSA. To evaluate the effectiveness of SNN-SB, we compared the final results with 10 leading MSA tools. SNN-SB achieved the highest mean Q score and mean SP score on the IRMBASE 2.0 benchmark. Moreover, it achieved an improvement of around 8% in Q score over DIALIGN-TX on the BAliBASE 3.0 benchmark.


Introduction
In bioinformatics, sequence alignment is an essential technique for sequence analysis tasks such as similarity search, phylogenetic tree estimation, and protein structure and function prediction [1,2]. In addition, it can help to find disease-related gene variants [3]. Sequence alignment is a technique for arranging biological sequences such as RNA, DNA or proteins. Its main purpose is to detect the similarity between different sequences in order to discover similar regions. There are two approaches to aligning sequences: pairwise alignment and multiple sequence alignment (MSA). In pairwise alignment, only two sequences are aligned to each other, while MSA aligns more than two sequences. MSA is known to be a very complex problem, both from the algorithmic viewpoint and with respect to the biological relevance of the output [4,5].
Extra information about the relationships among biological sequences can be embedded into existing MSA tools to increase the accuracy of the aligned sequences [5]. This information can be extracted from secondary structure predictions or expert knowledge [6,7]. It can also be obtained from other sources, such as database homology searches that discover local alignments [8], or from specific partial alignments supplied by an expert that indicate particular parts of the input sequences to be aligned and are used as a guide for building the final MSA. Including partial alignment information in MSA tools has been shown to be an effective technique. This approach has been applied in several methods such as DIALIGN [9], DIALIGN-T [10] and DIALIGN-TX [6], and it has also been combined with other MSA methods such as MUSCLE [11], T-Coffee [12] and Mafft [13].
In this study, we propose an improved method for detecting partial alignments by incorporating the mutual information for each pair of amino acids with row information from ungapped segments. The proposed method extends the work in [14]. The mutual information among the pairwise alignments is extracted using the Shared Near Neighbours (SNN) method [15,16]. SNN is a clustering algorithm in which the similarity of any two objects is based on the number of neighbours they share. This way of calculating similarity was adapted in [14] to discover strong similarities between alignments and to construct partial alignments. However, the SNN similarity measure applied in [14] depends only on the number of shared neighbours of two amino acids; the degree of match or mismatch between the two amino acids is not taken into consideration. Some studies have shown that the substitution matrix captures biological relationships among different amino acids, which can be used to obtain an improved similarity measure [11,17]. Thus, the neighbours of amino acids can contribute strongly to the similarity through their affinities. Therefore, a new similarity measure based on the degree of match or mismatch is proposed in this research and incorporated into SNN. The generated partial alignment is used as a guide to construct the final multiple alignment. The DIALIGN-TX method is used to evaluate the partial alignment because it has an explicit option to accept this kind of information.

Background and Related Work
Many current multiple sequence alignment methods are built around the progressive approach [19]. The basic idea behind conventional progressive alignment is to start by aligning the most similar sequences and then to incorporate new sequences in order of increasing distance. In many cases, the order in which new sequences are added to the MSA is determined by a guide tree. ClustalW [20] is the most common progressive method. Further improvements to the guide tree in progressive alignment have been presented in GLProbs-Random and GLProbs-Reference [21]. In the GLProbs-Random version, the guide tree is generated randomly, whereas in the GLProbs-Reference version it is generated using a maximum-likelihood algorithm. Another enhancement in MSA was obtained by adopting hidden Markov models, as presented in ProbCons [22]. ProbCons is an MSA tool based on probabilistic consistency. It uses a progressive technique and iterative refinement with a hidden Markov model (HMM). Probalign [23] follows an approach similar to ProbCons; however, Probalign obtains posterior probabilities from a partition function rather than from a pair hidden Markov model. On the other hand, ProbPFP [24] combines a particle swarm optimization algorithm with the partition function. In addition, T-Coffee combines a consistency-based approach with progressive alignment [25]. In the progressive alignment part of T-Coffee, position-specific scoring is used instead of a substitution matrix for aligning the sequences. A hybrid approach has been presented in MSAProbs [26], which uses a pair hidden Markov model and a partition function to calculate posterior probabilities among the aligned sequences.
DIALIGN-TX [6] takes another approach to solving MSA: it builds the alignment by composing local segments with high similarity. The most common weakness of this approach is that sequences with low homology can easily be missed, because they may not appear statistically significant in the pairwise alignment. To overcome this weakness, a new method called SNN-SB has been developed in this work, which combines the information of related amino acids with segment-based alignments to produce partial alignments from sequence similarities. To evaluate the effectiveness of the proposed method, the results are compared with 10 other leading MSA tools in terms of Q and TC scores. The proposed methods were comparatively analysed against Dialign-TX [27], Clustalw [20], Mafft [13], MSAprobs [26], Muscle [11], PicXaa [27], Probalign [28], Probcons [22], POA [29] and GLProbs [21], using the default values of the input parameters. In addition, we compare our results with the method presented in [14], which we call AP-SNN in this research.

Proposed Method
Protein sequences are represented using twenty amino acid letters. They are strings over the amino acid alphabet $\Sigma_{Protein}$ = {A, N, R, C, D, H, E, Q, G, I, L, M, K, F, S, P, W, U, T, Y}. For a group of sequences $S = (s_1, s_2, \ldots, s_i)$, each sequence $s_i$ is a finite series of characters from $\Sigma_{Protein}$, that is, $s_i \in \Sigma_{Protein}^{*}$. The multiple sequence alignment of S is a matrix containing i rows of symbols from $\Sigma_{Protein} \cup \{-\}$, where the number of protein sequences i > 2.
The proposed method first constructs all pairwise alignments, which are used as input. In this study, we used the Probcons algorithm [22] to generate the pairwise alignments. The modified SNN method is then applied to obtain partial alignment columns from the pairwise alignments by computing the similarity between any two points.
The original SNN defines the similarity between any pair of points based on their shared neighbours. The two points are considered to be related to the same group if they are nearest neighbours of each other and have more than a specific number of mutual neighbours.
The set of N sequences over the alphabet is denoted by S = {S1, S2, S3, …, SN}. After the pairwise alignment of all input sequences, each point in each sequence is aligned to at least N-1 points, where any point can be aligned to a point or to a gap {-}. Let S1 and S2 be two input sequences in S, with Xi ∈ S1 and Yj ∈ S2. LN denotes the length of the N-th sequence, so that 1 ≤ i ≤ L1 and 1 ≤ j ≤ L2. With the original SNN, the similarity between any pair of points (characters) Xi and Yj in S1 and S2 is given by the following equation:
$$\mathrm{similarity}(X_i, Y_j) = \mathrm{size}\big(NN(X_i) \cap NN(Y_j)\big) \qquad (1)$$
where NN(Xi) and NN(Yj) are the lists of points aligned to Xi and Yj, respectively, and size is the number of points in a list. This is the fundamental idea behind the use of the SNN algorithm for measuring the similarity between any two points. The SNN algorithm has the following steps:
1) The Nearest Neighbour List (NNL) of size k is created first. For each point, the N-1 points aligned to the current point are kept, where N is the number of sequences.
2) The shared nearest neighbour similarity between any two points is calculated based on (1).
3) The density of each aligned point is set. From the Nearest Neighbour List (NNL), the method detects the points whose shared nearest neighbour similarity is greater than the user-specified parameter Eps (Figure 1).
4) The core points are found. In this stage, any point with a density greater than the parameter MinPts (minimum points) is marked as a core point.
5) The partial alignment columns are created by aligning the core points to each other if their similarity is greater than Eps (Figure 2). In this step, only the points that have a strong link are aligned to each other, and these appear as partial columns.
The initialization of the SNN algorithm requires the following three parameters:
• k: the number of neighbours, which is equal to N-1, where N is the number of input sequences.
• Eps: the minimum similarity between any two points, which maintains the connections within a group of similar points and breaks the connections to dissimilar points.
• MinPts: the minimum density of a point, used to determine whether a point is a core point or not.
The adapted SNN can be used to detect the related points from the pairwise alignments and to construct the partial alignment columns using the similarity measure. However, the similarity measure in the original SNN algorithm depends only on the number of shared amino acids; the degree of match or mismatch between any two amino acids is not taken into consideration. Some studies have shown that the substitution matrix captures biological relationships among amino acids, which can be used to obtain an improved similarity measure [17]. Thus, the neighbours of amino acids can contribute strongly to the similarity through their affinities. Therefore, a new similarity measure based on the degree of match or mismatch can be incorporated into SNN.
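The five steps above can be illustrated with a minimal Python sketch. It assumes the usual SNN definition in which the similarity of two points is the number of points shared by their neighbour lists. The representation of a point as a (sequence index, residue position) pair, the function names, and the use of Eps and MinPts as absolute counts are assumptions made for illustration; the sketch also does not enforce that a column holds at most one residue per sequence.

```python
def snn_similarity(nn, p, q):
    """Number of points shared by the neighbour lists of p and q."""
    return len(nn.get(p, set()) & nn.get(q, set()))

def core_points(nn, eps, min_pts):
    """Steps 3 and 4: density of every point, then the set of core points."""
    density = {
        p: sum(1 for q in neigh if snn_similarity(nn, p, q) > eps)
        for p, neigh in nn.items()
    }
    core = {p for p, d in density.items() if d > min_pts}
    return density, core

def partial_columns(nn, eps, min_pts):
    """Step 5: group core points linked by a similarity greater than eps."""
    _, core = core_points(nn, eps, min_pts)
    parent = {p: p for p in core}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for p in core:
        for q in nn[p]:
            if q in core and snn_similarity(nn, p, q) > eps:
                parent[find(p)] = find(q)
    groups = {}
    for p in core:
        groups.setdefault(find(p), set()).add(p)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical toy input: four residues that are mutually aligned in all
# pairwise alignments, plus one weakly supported pair.
column = [(0, 5), (1, 5), (2, 5), (3, 5)]
nn = {p: set(column) - {p} for p in column}
nn[(0, 9)] = {(1, 9)}
nn[(1, 9)] = {(0, 9)}

cols = partial_columns(nn, eps=1, min_pts=1)
print(cols)
```

In this toy input, each of the four consistently aligned residues shares two neighbours with every other one, so all four become core points and form one partial column, while the weakly supported pair shares no neighbours and is left out.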
The proposed weighted similarity between any two amino acids X in sequence Si and Y in sequence Sj is calculated using equation (2), where Lx_i is the character aligned to X from its k Nearest Neighbour List (NNL), Ly_i is the character aligned to Y from its k Nearest Neighbour List, and the parameter n denotes the sequence number. The score between any two amino acids is taken from the BLOSUM62 substitution matrix [30]. BLOSUM62 was selected for this work because it is the most widely applied and accepted substitution matrix [31]. Only scores greater than or equal to zero are used to calculate the proposed weighted similarity, where there is a 50% chance of replacement occurring between any two amino acids [32].
The new similarity in (2) depends not only on the number of shared amino acids but also on the affinities of the shared aligned amino acids. In the SNN method, the values of the parameters Eps and MinPts can range between zero and the size of the Nearest Neighbour List (NNL). When the new similarity in (2) is added to SNN, a similarity threshold is additionally applied to accept or reject any point. The threshold ranges from 0 to the maximum value obtainable from (2), which can be calculated by replacing all the scores in (2) with the highest score in the BLOSUM62 matrix.
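The following Python sketch shows one possible reading of the weighted similarity and its threshold; it is not the exact form of equation (2). It assumes that the similarity is the sum, over the sequences present in both neighbour lists, of the BLOSUM62 score between the character aligned to X and the character aligned to Y, keeping only scores greater than or equal to zero and skipping gaps. The dictionary-based neighbour lists, the function names and the small matrix excerpt are illustrative assumptions; a real run would load the full BLOSUM62 matrix.

```python
# Excerpt of BLOSUM62 (symmetric); illustrative only, not the full matrix.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "L"): 2, ("D", "E"): 2,
    ("K", "R"): 2, ("W", "W"): 11, ("A", "W"): -3,
}
BLOSUM62_MAX = 11  # highest BLOSUM62 entry (W aligned with W)

def score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a symmetric substitution score."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]

def weighted_similarity(lx, ly):
    """Assumed form: sum of non-negative scores between the characters
    aligned to X (lx) and to Y (ly) from the same sequence n."""
    total = 0
    for n in lx.keys() & ly.keys():
        a, b = lx[n], ly[n]
        if a == "-" or b == "-":
            continue  # a gap contributes nothing
        s = score(a, b)
        if s >= 0:  # negative scores are not accepted
            total += s
    return total

def similarity_threshold(k, fraction):
    """Threshold as a fraction of the maximum: every score set to the top score."""
    return fraction * k * BLOSUM62_MAX

# Hypothetical neighbour lists: sequence number -> aligned character.
lx = {2: "L", 3: "A", 4: "K", 5: "-"}
ly = {2: "I", 3: "W", 4: "R", 5: "D"}

sim = weighted_similarity(lx, ly)
threshold = similarity_threshold(k=4, fraction=0.30)
print(sim, threshold, sim >= threshold)
```

Under these assumptions the L/I and K/R pairs each contribute 2, the A/W pair is discarded because its score is negative, and the gap is skipped, so the pair of points scores 4 against a threshold of about 13.2 and would be rejected.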
The pseudo-code for computing the weighted similarity between objects (here, amino acids) and their respective densities after the modified similarity is included in SNN is given in Figure 2. The pseudo-code for aligning the core points after including the weighted similarity is shown in Figure 3. This proposed method is referred to as initial_SNN-SB.

Including Segment-Based Alignment Approach
In multiple sequence alignment, the basic building blocks are ungapped pairwise local alignments of pairs of protein sequences [33]. These local alignments are called fragments, where a fragment consists of a pair of ungapped segments from two protein sequences [27]. The length of a fragment can be set to a specific value, so a fragment corresponds to a pair of ungapped substrings of equal length. Multiple sequence alignments can be composed of such fragments [33].
According to Rausch et al. [33], the segment-based technique is a useful way of identifying homologies shared between pairs of aligned sequences, which can be used to improve alignment accuracy. Segments can be defined using global pairwise and local alignments [34].
In initial_SNN-SB, the related amino acids are detected based on the mutual information for each pair of them; the row information of a pair of amino acids is not considered. Accordingly, inspired by the segment-based alignment approach, this study incorporates row information into the related points to further improve alignment accuracy. In this approach, the proposed method searches for groups of segments of the input protein sequences that are connected by related points. Each pair of ungapped segments must contain at least one shared pair of related points to be incorporated into the final partial alignment.
Let $S_1, \ldots, S_n \in \Sigma_{Protein}^{*}$ be n sequences over the alphabet $\Sigma_{Protein}$. A segment of sequence $S_i$ is a triple $(S_i, P_i, L_i)$, where $P_i$ is the position of the first amino acid of the segment in $S_i$ and $L_i$ is the segment length. A fragment f between two protein sequences $S_i$ and $S_j$ is then a gap-free alignment of two segments $(S_i, P_i, L_i)$ and $(S_j, P_j, L_j)$ of equal length ($L_i = L_j$). The two segments are accepted as part of the partial alignment only if they share at least one related point and the score of every aligned pair of amino acids is greater than 0, that is, $\mathrm{score}(S_i[P_i + k], S_j[P_j + k]) > 0$ for all $0 \le k < L_i$. Figure 4 shows an example of combining the segment approach with the related points method. Two sequences S1 and S2 are given with two related points shared between them, where the bold residues represent the related points. In the example, the segment length is 5. Two segments of length 5 can be detected in the two sequences: the first is {"RFLEG","RFLEG"} and the second is {"HELAS","HELAS"}. However, only the first segment contains related points, so only it is accepted as part of the partial alignment.
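The acceptance rule for fragments can be sketched in Python as follows. The two sequences, the positions of the related points and the scoring function are hypothetical: the sequences are chosen only so that they contain the RFLEG and HELAS segments of the example, and the toy score treats identical residues as positive (which holds on the BLOSUM62 diagonal) and all other pairs as non-positive, which is a simplification of a real substitution matrix.

```python
def identity_score(a, b):
    """Toy score: positive for identical residues, zero otherwise."""
    return 1 if a == b else 0

def accepted_fragments(s1, s2, related, length, score=identity_score,
                       require_related=True):
    """Return start positions (p1, p2) of ungapped fragments of the given
    length whose aligned pairs all score > 0 and, if require_related is set,
    that contain at least one related point (i, j)."""
    out = []
    for p1 in range(len(s1) - length + 1):
        for p2 in range(len(s2) - length + 1):
            all_positive = all(
                score(s1[p1 + k], s2[p2 + k]) > 0 for k in range(length)
            )
            has_related = any(
                (p1 + k, p2 + k) in related for k in range(length)
            )
            if all_positive and (has_related or not require_related):
                out.append((p1, p2))
    return out

# Hypothetical input containing the two segments of the example.
s1 = "MRFLEGKTHELAS"
s2 = "RFLEGAAHELASQ"
related = {(3, 2), (5, 4)}  # related points: L-L and G-G inside RFLEG

candidates = accepted_fragments(s1, s2, related, 5, require_related=False)
accepted = accepted_fragments(s1, s2, related, 5)
print(candidates)  # both RFLEG and HELAS fragments
print(accepted)    # only the fragment that holds related points
```

Both fragments pass the score condition, but only the RFLEG fragment starting at positions (1, 0) contains related points, so it is the only one kept for the partial alignment.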

Generating Final MSA from Partial Alignment Columns
A set of partial alignment columns is obtained from the proposed method. These columns were incorporated into the Dialign-TX method to evaluate the performance of the proposed method and to tune the input parameters. DIALIGN-TX combines a progressive strategy with a segment-based greedy approach. The MSA in DIALIGN-TX is constructed by searching for similar segments. The segments are detected from the pairwise alignments and are greedily inserted or rejected according to their consistency with each other. The main weakness of DIALIGN-TX is that weakly conserved homologies can easily be missed and the segments may be outweighed by spurious random similarities in other pairwise alignments. Thus, the related points from the proposed methods can be used to align the missed segments or to correct wrongly aligned segments.

Results and Discussion
The proposed method initial_SNN-SB was run on the BaliBase benchmark to test various settings of the new similarity threshold, with the other parameters Eps and MinPts both set at 30%. The performance of initial_SNN-SB was then evaluated by comparing its results with those of previous methods on the BaliBase and IRMBASE benchmarks. Table 1 and Table 2 show the effect of different values of the similarity threshold on initial_SNN-SB in terms of Q and TC scores on the BaliBase benchmark. The similarity threshold is expressed as a fraction of its maximum value, which can be calculated based on (2), and ranges from 10% to 60%.

Choosing an Appropriate Similarity Threshold
In Table 1 and Table 2, the results of initial_SNN-SB show low sensitivity to changes in the similarity threshold in terms of Q and TC scores. The highest mean Q score (86.62%) and the highest mean TC score (55.50%) were both achieved when the similarity threshold was set to 30%. Thus, a similarity threshold of 30% was selected for the initial_SNN-SB method in the rest of this research.

The effect of combining segment-base
To study the effect of including the segment-based alignment approach in SNN-SB for finding related residues (points) of protein sequences, SNN-SB was tested on the BaliBase benchmark with different segment lengths. In these tests, the other parameters (Eps, MinPts and similarity threshold) were set at 30%. The segment length was varied from 2 to 10, i.e. a segment can consist of 2 to 10 amino acid residues, and the average Q score was calculated for each experiment. Figure 5 plots the average Q score on the BaliBase benchmark for segment lengths within the chosen range. The best average Q score was obtained at a segment length of 5. Thus, a segment length of 5 was chosen for the SNN-SB method in this study.
Figure 5. Average Q score with different segment lengths using SNN-SB method on BaliBase 3.0

Comparing the Results of Proposed Methods with Commonly Used MSA Methods
This section compares the proposed methods with 10 commonly used methods in terms of Q and TC scores on the global alignment benchmark BaliBase 3.0 and the local alignment benchmark IRMBASE 2.0. The computation durations of the proposed method and the other methods are subsequently compared. Table 3 and Table 4 summarize the comparative analysis of the accuracy of the proposed methods (initial_SNN-SB and SNN-SB), AP-SNN and the 10 commonly used methods based on Q and TC scores. These comparisons were made on the global reference alignment benchmark (BaliBase 3.0), covering the six reference subsets and the overall dataset.
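The Q and TC scores are not defined in the text. The Python sketch below assumes the commonly used definitions: Q is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly in the test alignment. The function names and the toy alignments are hypothetical, and the sketch counts every reference column with at least two residues, whereas benchmark scoring tools may restrict the count to annotated core blocks.

```python
from itertools import combinations

def columns(alignment):
    """Turn gapped rows into columns of residue indices (None for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols

def aligned_pairs(cols):
    """Set of residue pairs ((seq, idx), (seq, idx)) placed in the same column."""
    pairs = set()
    for col in cols:
        present = [(s, i) for s, i in enumerate(col) if i is not None]
        pairs.update(combinations(present, 2))
    return pairs

def q_score(test, ref):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref_pairs = aligned_pairs(columns(ref))
    return len(ref_pairs & aligned_pairs(columns(test))) / len(ref_pairs)

def tc_score(test, ref):
    """Fraction of reference columns reproduced exactly by the test alignment."""
    ref_cols = [c for c in columns(ref) if sum(i is not None for i in c) >= 2]
    test_cols = set(columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)

# Hypothetical alignments of the same three sequences.
ref = ["AC-GT", "ACAGT", "AC-GT"]
test = ["ACG-T", "ACAGT", "AC-GT"]

print(q_score(test, ref), tc_score(test, ref))
```

In this toy case the test alignment misplaces one residue, so it recovers 10 of the 12 reference pairs and 3 of the 4 scored reference columns. Because a single misplaced residue invalidates a whole column, TC is the stricter of the two measures, which is consistent with TC values being lower than Q values throughout the results.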

Global alignment benchmark: BaliBase 3.0
The proposed method SNN-SB produced more accurate alignments in terms of Q and TC scores than Dialign-TX, Clustalw, Muscle and POA, with statistically significant differences (see Table 3). However, the results in Table 3 and Table 4 show that the MSAprobs method obtained the highest overall Q and TC scores (89.25 and 63.76, respectively) among all tested methods. The Dialign-TX method exhibited relatively low alignment accuracy compared with the other MSA tools in terms of Q and TC scores on the global alignment benchmark (79.51 and 45.53, respectively). The SNN-SB method improved on Dialign-TX by 9.17% and 22.84% in Q and TC scores, respectively, with statistically significant differences. The improvements by SNN-SB over Dialign-TX occurred in all six subsets. In the RV11 reference, SNN-SB obtained the largest improvement over Dialign-TX, 22.36% and 43.63% in terms of Q and TC scores, respectively.
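The text does not state how the improvement percentages are computed. Assuming they are relative differences with respect to the Dialign-TX score, the relation and the overall SNN-SB scores it implies can be written as follows; the implied values are approximate back-calculations from the figures quoted above, and Table 3 and Table 4 remain the authoritative source.

$$\Delta(\%) = \frac{\mathrm{score}_{\text{SNN-SB}} - \mathrm{score}_{\text{Dialign-TX}}}{\mathrm{score}_{\text{Dialign-TX}}} \times 100$$

$$Q_{\text{SNN-SB}} \approx 79.51 \times 1.0917 \approx 86.80, \qquad TC_{\text{SNN-SB}} \approx 45.53 \times 1.2284 \approx 55.93$$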
Further analysis of the results indicated that SNN-SB performed well on the RV20 subset, where it outperformed eight of the ten other commonly used MSA methods in terms of Q score, although the RV20 reference contains aligned families with a highly divergent "orphan" protein sequence.
Table 5 and Table 6 provide a brief comparative summary of the alignment accuracy of the proposed methods and the 10 other frequently used aligners based on Q and TC scores on the IRMBASE 2.0 dataset. The SNN-SB method achieved the highest Q and TC scores on the overall IRMBASE dataset, with statistically significant differences compared to the other methods. Within the subsets, SNN-SB achieved the highest Q and TC scores in Ref3 and Ref4. In the Ref1 subset, however, AP-SNN achieved the highest Q score and Mafft achieved the highest TC score, ahead of SNN-SB. In the Ref2 subset, Dialign-TX outperformed SNN-SB in terms of Q and TC scores, although the difference is not statistically significant.
As observed in the previous subsection, the MSAprobs method obtained the highest alignment scores on the global dataset BaliBase, where it yielded higher accuracy than SNN-SB, with percentage differences of 2.82% and 14% in terms of Q and TC scores, respectively (as shown in Table 3 and Table 4). However, SNN-SB outperforms MSAprobs on the local alignment dataset IRMBASE, with percentage differences of 11.40% and 29.98% in terms of Q and TC scores, respectively. The reason is that MSAprobs is designed for global alignment. ClustalW, run with default settings, obtained the lowest Q and TC scores on the local alignment dataset; similar results for ClustalW have been reported in [35]. ClustalW has no refinement process to correct errors made during any step of building the MSA, so these errors can propagate to the final alignment. Moreover, the quality of the MSA in ClustalW is affected by the way the guide tree is constructed.
Overall, the results of the SNN-SB method showed that incorporating the segment-based approach to find more related points improved alignment accuracy. The segments were valuable in determining the homology among the sequences and helped to improve alignment quality by finding many previously undetected similarities.

Conclusion
In this paper, we have developed initial_SNN-SB and SNN-SB, new approaches for building partial alignments for MSA. In initial_SNN-SB, the Shared Near Neighbours (SNN) clustering algorithm was adapted and modified to obtain partial alignment columns by detecting the mutual information for every pair of amino acids from pairwise alignments. A new similarity measure was included in the proposed method to take advantage of the degree of match or mismatch between any two amino acids. SNN-SB is an extended version of initial_SNN-SB in which a segment-based approach is combined with the partial alignment of initial_SNN-SB. The segment-based approach is used in SNN-SB to identify homology shared among portions of the sequences in order to improve alignment accuracy. The comparative analysis of the proposed methods is based on two benchmarks, BaliBase and IRMBASE. SNN-SB shows good accuracy compared to the other methods on both benchmarks. On BAliBASE, the Q and TC scores of SNN-SB are close to the highest ones in many cases. On the IRMBASE dataset, SNN-SB outperformed the 10 leading methods. These findings show that the accuracy of the final MSA can be improved considerably by combining the related points with segment alignments.