Implementation of Hierarchical Clustering Method in Analyzing Genetic Relationship on DNA SARS-CoV-2 Sequences

In mid-September 2020, WHO released data stating that more than 28 million people worldwide had contracted the coronavirus. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the full name of the coronavirus that causes COVID-19. This virus attacks the human respiratory system and can cause lung infection and even death. WHO noted that more than 900 thousand people worldwide have died after being infected with the coronavirus. In Indonesia, more than 210 thousand people have been infected, and more than 8.5 thousand of them have died. Based on these data, it is necessary to analyze the kinship of the coronavirus to help reduce its spread. This research uses the Euclidean distance to build the distance matrix and then applies the Hierarchical Clustering method to analyze the genetic relationship among SARS-CoV-2 DNA sequences. The samples are SARS-CoV-2 DNA sequences from 20 infected countries. The simulation results indicate that the ancestors of SARS-CoV-2 come from China. They also show that the SARS-CoV-2 DNA sequence from Indonesia has its closest ancestor in Pakistan.


Introduction
Bioinformatics combines two major research fields: computer science and biology. It involves the computer's primary functions, such as storing, collecting, manipulating, and distributing information related to biological macromolecules such as DNA, RNA, and protein [1]. Reference [2] also explains that Bioinformatics is a multidisciplinary area spanning biology, chemistry, physics, mathematics, and computational sciences. Bioinformatics also studies the evolution, structure, and function of genes, proteins, and entire genomes, and develops methods for the management and analysis of biological information arising from genomics and high-throughput experiments [3]. From this explanation, Bioinformatics is a combination of several scientific fields that studies the evolution of genes, DNA, RNA, and proteins. Currently, the world is experiencing the COVID-19 (Coronavirus Disease 2019) pandemic. Wuhan, a city in Hubei province, is known as the place where this virus originated, and the virus is now spreading rapidly throughout the world.
This virus emerged in Wuhan, in China's Hubei province, at the end of December 2019 and has now spread worldwide. The coronavirus poses a severe threat to human health because it can cause severe acute respiratory distress. Based on WHO data, by mid-September 2020 more than 28 million people worldwide had been infected with this coronavirus (SARS-CoV-2). The virus attacks the human respiratory system, which can cause lung infections and even death; WHO noted that more than 900 thousand people worldwide have died of disease associated with the coronavirus. In Indonesia, more than 210 thousand people have been infected, and more than 8.5 thousand of them have died. Therefore, to reduce the spread of this virus, it is necessary to study its kinship. The most common way to determine this kinship is to build a phylogenetic tree or to perform clustering. This motivates the present study, which implements the Hierarchical Clustering method to analyze the genetic relationship among SARS-CoV-2 DNA sequences.

Relevant Research
This study uses SARS-CoV-2 DNA sequence data and aims to support research activities during the COVID-19 pandemic, especially in Indonesia. The hierarchical clustering method is used to analyze kinship through the construction of phylogenetic trees. Several earlier studies are related to DNA sequence analysis. Reference [4] discussed the problem of automatic classification of isolates in HIV DNA. That study used n-gram-based classification and unsupervised Hierarchical Clustering. The results were positive and show that the proposed technique can solve different classification problems. Reference [5] used Bayesian, Maximum Parsimony, and UPGMA methods to analyze rRNA sequences. The results show that the UPGMA trees produced from the 16S rRNA sequences and the D-loops are identical. In reference [6], the researchers used K-Means Clustering to group 16 DNA sequences of the Hepatitis B virus (HBV) into 2 clusters. The first cluster contains 11 HBV sequences with more than 3,000 base pairs (bp), and the second cluster contains 5 HBV sequences with fewer than 2,000 bp.
Reference [7] also applied k-means clustering to gene expression microarray data that measure human gene expression in the four stages of erythropoiesis. The study identified eight groups, including a cluster of 450 genes (C4) that is more active towards the maturation stage and in the cell division and DNA replication processes. Another group of 234 genes (C7) is more involved in autophagy (cell consumption/destruction), which is known to be involved in enucleation (expulsion of the nucleus from the cell). Another study examined the host spread of the SARS virus. Aligning the host and human SARS-CoV viruses with SPA (Super Pairwise Alignment) showed that mutations occur with type I protein replacement at positions 77, 139, 147, 244, 344, 360, 472, 480, 487, 577, 609, 613, 665, 765, 778, and 1163. Reference [8] used biclustering algorithms to solve the Multiple Sequence Alignment (MSA) problem. That research developed the local MSA computer program Block MSA and combined it with biclustering.

SARS-CoV-2
Coronavirus disease 2019 (COVID-19) is caused by a new coronavirus, namely 2019-nCoV, also known as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) or human coronavirus 2019 (HCoV-2019) [9]. Reference [10] explained that SARS-CoV-2 is the source of the COVID-19 pandemic. SARS-CoV and the Middle East Respiratory Syndrome coronavirus (MERS-CoV) are two other highly pathogenic coronaviruses, and SARS-CoV caused a global epidemic that reached Indonesia in 2003. SARS-CoV-2 can be transmitted rapidly from human to human. At the time of writing, there is no vaccine or any form of therapy to prevent or treat SARS-CoV-2 [9]. Reference [11] also explained that COVID-19 can spread through droplets or direct contact with a patient. The incubation period is estimated at 6.4 days, and the basic reproduction number at 2.24-3.58. The main characteristic of a person infected with this virus (typically a patient with pneumonia caused by SARS-CoV-2) is a fever, which is then followed by coughing.

Sequence Alignment
Sequence alignment is an effective method for analyzing the positions and types of mutations hidden in biological sequences, and it allows precise comparisons to be made [12]. Sequence alignment is used to identify similarities between DNA sequences. In the physical sciences, the computational sciences, and bioinformatics, studying the similarity of DNA sequences is essential. Almost all studies that explore evolutionary relationships, the analysis of gene function, or the prediction of protein structure and sequencing need to calculate similarities [13].
Reference [14] explains the alignment of sequences. It assumes a linear gap model, where $s(-, a) = s(a, -) = -d$ for every letter $a$ of the alphabet and $d > 0$, so that a gap of length $L$ scores $-dL$, and presents the Needleman-Wunsch (NW) algorithm, which finds the optimal score of a global alignment. The main idea is to build an optimal alignment from optimal alignments of subsequences. Algorithms of this kind are called dynamic programming algorithms: they solve a problem by combining optimal solutions of smaller subproblems.

Suppose there are two sequences $x = x_1 x_2 \ldots x_i \ldots x_n$ and $y = y_1 y_2 \ldots y_j \ldots y_m$. We set up an $(n + 1) \times (m + 1)$ matrix $F$. The $(i, j)$-th element $F(i, j)$, for $i = 1, \ldots, n$ and $j = 1, \ldots, m$, is the optimal alignment score between $x_1 \ldots x_i$ and $y_1 \ldots y_j$. The element $F(i, 0)$, for $i = 1, \ldots, n$, is the score of aligning $x_1 \ldots x_i$ to a gap region of length $i$. Equivalently, the element $F(0, j)$, for $j = 1, \ldots, m$, is the score of aligning $y_1 \ldots y_j$ to a gap region of length $j$. Starting from $F(0, 0) = 0$, we fill $F$ recursively:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$

The three cases are the three possible ways to obtain the best value of $F(i, j)$: $x_i$ is aligned to $y_j$ (the 1st line), $x_i$ is aligned to a gap (the 2nd line), or $y_j$ is aligned to a gap (the 3rd line). When computing $F(i, j)$, we keep a pointer to the option from which $F(i, j)$ was produced. Once $F(n, m)$ is reached, the optimal alignment is recovered by tracing the pointers back.
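The matrix-filling and traceback steps described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation (the study itself uses R): the function name `needleman_wunsch`, the match/mismatch scores of +1/-1, and the gap penalty `d = 2` are all assumptions chosen for the example.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment of x and y with a linear gap penalty d.

    The scoring values are illustrative assumptions, not taken from the paper.
    Returns (optimal score, aligned x, aligned y).
    """
    n, m = len(x), len(y)
    # F[i][j] holds the best score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # ptr[i][j] remembers which of the three options produced F[i][j].
    ptr = [[None] * (m + 1) for _ in range(n + 1)]

    # First column and first row: a prefix aligned entirely to gaps.
    for i in range(1, n + 1):
        F[i][0] = -d * i
        ptr[i][0] = "up"
    for j in range(1, m + 1):
        F[0][j] = -d * j
        ptr[0][j] = "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            options = [
                (F[i - 1][j - 1] + s, "diag"),  # x_i aligned to y_j
                (F[i - 1][j] - d, "up"),        # x_i aligned to a gap
                (F[i][j - 1] - d, "left"),      # y_j aligned to a gap
            ]
            F[i][j], ptr[i][j] = max(options, key=lambda o: o[0])

    # Trace the pointers back from (n, m) to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

With these assumed scores, `needleman_wunsch("ACGT", "AGT")` returns `(1, "ACGT", "A-GT")`: three matches minus one gap penalty. When several alignments tie, the sketch returns only one of them.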

Hierarchical Clustering
Hierarchical methods, like other grouping algorithms, are an important computational way to find clusters. In general, it is not possible to check every possible grouping of a data set, especially a large one [15].
Hierarchical clustering trees are built from the distances or dissimilarities between the rows of a data matrix (Harrigan in [16]). The clustering results of hierarchical methods are more diverse than those of non-hierarchical methods because they provide a tree of nested partitions. Different grouping levels can be obtained by cutting the dendrogram produced by the hierarchical clustering algorithm at different heights. Agglomerative hierarchical clustering has a time complexity of $O(n^2)$ when it uses reciprocal nearest neighbors and reducibility, as is the case in the R hclust function. Hierarchical clustering starts from n single-point clusters and successively merges the most similar clusters to form larger ones. This step is repeated until the desired number of groups is reached. The algorithm is as follows [17].

Output:
Partition of clusters $C = \{C_1, \ldots, C_k\}$
1. First, form n clusters, each containing one data point, i.e., $C_i = \{x_i\}$, and set $k' = n$.
2. While $k' > k$, do:
3. Construct a similarity matrix S, where the element $S_{ij}$ is the similarity between the i-th and the j-th clusters (the linkage function d calculates the similarity $S_{ij}$).
4. Find the two closest clusters $C_i$ and $C_j$ and merge them into a new cluster $C_{i,j}$, i.e., $C_{i,j} = C_i \cup C_j$.
5. Let $k' = k' - 1$.
6. end while
7. Return $C = \{C_1, C_2, \ldots, C_k\}$.

In this study, the agglomerative hierarchical clustering variant used is [15]:
• The Complete Linkage Method, which takes the maximum distance between a point in A and a point in B:

$$d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$$

where the maximum is taken over all a in A and all b in B. At each step, join the two clusters with the smallest distance.
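The merging loop and the complete linkage rule above can be sketched as follows. This is a naive illustration under stated assumptions, not the study's R code: the function names `complete_linkage` and `agglomerative`, the use of a precomputed distance matrix `dist`, and the target number of clusters `k` are hypothetical choices for the example. It rescans all cluster pairs at every step, so it is slower than the nearest-neighbor approach used by optimized implementations.

```python
def complete_linkage(dist, a, b):
    """Largest pairwise distance between members of clusters a and b."""
    return max(dist[i][j] for i in a for j in b)


def agglomerative(dist, k):
    """Merge the two closest clusters until only k clusters remain.

    dist is a symmetric matrix of pairwise distances between data points;
    clusters are returned as sorted lists of point indices.
    """
    # Step 1: every data point starts in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    # Step 2: keep merging while there are more than k clusters.
    while len(clusters) > k:
        best = None
        # Steps 3-4: scan every pair and remember the closest one.
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = complete_linkage(dist, clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        merged = clusters[p] + clusters[q]
        # Step 5: replacing two clusters by their union lowers the count by one.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)

    return [sorted(c) for c in clusters]
```

As a usage example, four points at positions 0, 1, 5, and 6 on a line give the distance matrix `[[abs(a - b) for b in (0, 1, 5, 6)] for a in (0, 1, 5, 6)]`, and `agglomerative(dist, 2)` groups points 0 and 1 together and points 2 and 3 together. Stopping at a given `k` corresponds to cutting the dendrogram at the level where `k` clusters remain.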

Distance Matrix
For a matrix of continuous features, the Euclidean distance is probably the most popular choice. The Euclidean distance between two vectors $x$ and $y$ is computed with the following formula:

$$d_e(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2} \qquad (4)$$

where the index $i$ runs over all $p$ components of the vectors. The Euclidean distance measures the geometric distance between two vectors based on the magnitudes of their components. In some cases, two vectors that are highly correlated may not be represented well by the Euclidean distance $d_e$, because only the magnitudes of the differences matter [18].
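A short sketch of Equation (4) and of the caveat about correlated vectors follows. The helper names `euclidean`, `distance_matrix`, and `pearson`, as well as the three example vectors, are hypothetical and chosen only for illustration; how the study turns DNA sequences into numeric vectors is not specified in this section, so plain numeric vectors of equal length are assumed.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def distance_matrix(rows):
    """Symmetric matrix of pairwise Euclidean distances between rows."""
    return [[euclidean(a, b) for b in rows] for a in rows]


def pearson(x, y):
    """Pearson correlation, used here only to illustrate the caveat."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Illustrative vectors (not data from the study).
u = [1, 2, 3]
v = [10, 20, 30]  # same shape as u, but ten times larger
w = [3, 2, 1]     # opposite trend to u, but similar magnitude

D = distance_matrix([u, v, w])
```

Here `u` and `v` are perfectly correlated, yet `euclidean(u, v)` is much larger than `euclidean(u, w)`, even though `u` and `w` are negatively correlated. This is the situation in which the Euclidean distance does not reflect the similarity in shape between two vectors.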

Proposed Method
This study implements hierarchical clustering to analyze the genetic relationship among SARS-CoV-2 DNA sequences using the open-source programming language R. The workflow is summarized in the flowchart in Figure 1.

Research Data
This study analyzes the kinship of 20 SARS-CoV-2 DNA sequences from 20 countries, collected in 2020. The 20 SARS-CoV-2 DNA sequences were obtained from GenBank via http://www.ncbi.nlm.nih.gov/ and from https://www.gisaid.org/ (specifically for the Indonesian DNA data). Table 2 shows the grouping of the 20 SARS-CoV-2 DNA sequences produced by the Hierarchical Clustering algorithm as implemented in the open-source R software. The steps of this research are collecting the SARS-CoV-2 DNA sequences, computing the distance matrix, applying Hierarchical Clustering, and grouping kinship using phylogenetic trees.