Comparison of genomes of different species of coronaviruses using spectra of periodicities

In the genomes of different organisms there are periodicities, i.e. fragments of DNA (RNA) sequences formed by tandem repetition of a basic monomer (period). The spectra of periodicities with lengths exceeding the ‘noise’ threshold are quite compact and easy to survey even for complete genomes, which makes them a suitable tool for differentiating closely related objects. The objects of analysis in this work are the periodicities in the genomes of three species of coronavirus: MERS, SARS, and SARS-CoV-2. It is shown that there are markers in the form of periodicities that make it possible to distinguish between these species. None of the periodicities identified in the genomes of the MERS species (except for the poly-a tract in the 3'UTR) is found in the genomes of SARS and SARS-CoV-2, and vice versa. Periodicities common to SARS and SARS-CoV-2 were revealed, as well as periodicities inherent only in the genomes of one species. The number of periodicities in SARS and SARS-CoV-2 significantly exceeds the number of periodicities in random sequences. The periodicities found in almost all genomes of only ‘their’ species are of the greatest interest in terms of revealing the pathogenic potential of the virus.


Introduction
Periodicities or short tandem (simple) repeats are widespread molecular markers in genetic and genomic research [1]. They are used for DNA profiling in cancer diagnosis [2], in kinship analysis (especially paternity testing) [3] and in forensic identification [4]. They are also used in genetic linkage analysis to locate a gene or a mutation responsible for a given trait or disease. Microsatellites are also used in population genetics to measure levels of relatedness between subspecies, groups and individuals [5].
The choice of an algorithm for detecting tandem repeats depends essentially on the type of repeats under consideration: perfect [6,7], imperfect [8], or hidden (latent) [9].
For the classification of DNA (RNA) sequences, perfect periodicities are of the greatest interest. They are used as molecular markers in the intra- and interspecific classification of organisms [10-12].
By a local periodicity P we mean a fragment of a sequence S formed by the tandem k-fold repetition of a basic monomer M (the shortest of the admissible variants), with k ≥ 2. For clarity, we separate the monomers with the symbol '-'. For example, P = tcaagtcaagtcaag = tcaag-tcaag-tcaag. Here the monomer M = tcaag, its length p = |M| = 5 and the multiplicity k = |P| / p = 3. A periodicity can end with an incomplete monomer, in which case k is not an integer. So, for P = tcaag-tcaag-tcaag-tc, k = |P| / p = 17/5.
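The definition above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `minimal_period`, `describe` and `split_monomers` are hypothetical and are not taken from the software used in this work.

```python
from fractions import Fraction


def minimal_period(P: str) -> int:
    """Smallest p such that P[i] == P[i + p] for every valid i."""
    n = len(P)
    for p in range(1, n + 1):
        if all(P[i] == P[i + p] for i in range(n - p)):
            return p
    return n  # only reached for the empty string


def describe(P: str):
    """Return the monomer M, its length p and the multiplicity k = |P| / p."""
    p = minimal_period(P)
    return P[:p], p, Fraction(len(P), p)


def split_monomers(P: str) -> str:
    """Write P with '-' between monomers, as in the text."""
    p = minimal_period(P)
    return "-".join(P[i:i + p] for i in range(0, len(P), p))


print(describe("tcaagtcaagtcaag"))          # monomer tcaag, p = 5, k = 3
print(describe("tcaagtcaagtcaagtc"))        # monomer tcaag, p = 5, k = 17/5
print(split_monomers("tcaagtcaagtcaagtc"))  # tcaag-tcaag-tcaag-tc
```

Using `Fraction` keeps a non-integer multiplicity such as 17/5 exact instead of rounding it to 3.4.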
From the entire set of periodicities, we exclude the insignificant ('random') ones, whose length is below a given threshold r. The set of periodicities in the sequence S such that |P| ≥ r will be called the truncated spectrum of periodicities and denoted by r(S).
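To make the notion of a truncated spectrum concrete, the sketch below enumerates maximal perfect periodicities by brute force. It is not the complexity-decomposition algorithm used in this work; it assumes that the threshold is inclusive (|P| ≥ r), that k ≥ 2, and that positions are 0-based. The names `truncated_spectrum` and `max_period` are hypothetical.

```python
def minimal_period(P):
    """Smallest p such that P[i] == P[i + p] for every valid i."""
    n = len(P)
    return next(p for p in range(1, n + 1)
                if all(P[i] == P[i + p] for i in range(n - p)))


def truncated_spectrum(S, r=12, max_period=None):
    """Return sorted (start, length, period) triples of maximal periodicities
    with length >= r and multiplicity >= 2. Naive scan, for illustration only."""
    n = len(S)
    if max_period is None:
        max_period = n // 2
    found = []
    for p in range(1, max_period + 1):
        i = 0
        while i + p < n:
            # Extend the run of positions j with S[j] == S[j + p].
            j = i
            while j + p < n and S[j] == S[j + p]:
                j += 1
            length = (j - i) + p  # S[i:i + length] has period p
            if j > i and length >= max(r, 2 * p):
                # Report the fragment only under its minimal period.
                if minimal_period(S[i:i + length]) == p:
                    found.append((i, length, p))
            i = j + 1  # the next run can only start after the mismatch
    return sorted(found)


toy = "ggc" + "tcaagtcaagtcaagtc" + "gg"
print(truncated_spectrum(toy, r=12))  # [(3, 17, 5)]
```

The scan costs on the order of |S| operations per period length, so for a genome of about 30 thousand nucleotides it is reasonable to bound `max_period` by a small value instead of the default |S| / 2.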
The object of research in this article is the periodicities in the RNA sequences of the genomes of coronaviruses of the three best-known species: MERS [13], SARS [14] and SARS-CoV-2 [15,16]. The first confirmed case of Middle East respiratory syndrome-related coronavirus (MERS-CoV) was reported in Jeddah, Saudi Arabia in April 2012. By July 2015, MERS-CoV cases had been reported in over 21 countries, in Europe, North America and Asia as well as the Middle East. Severe acute respiratory syndrome coronavirus (SARS-CoV or SARS-CoV-1) caused the 2002-2004 SARS outbreak. Coronavirus disease 2019 (COVID-19) is a contagious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first known case was identified in Wuhan, China, in December 2019. The disease has since spread worldwide, leading to an ongoing pandemic.
The set of complete coronavirus genomes we are investigating (selected from those available in the GenBank and GISAID databases) contains 36 MERS RNA sequences, 24 SARS sequences and 42 SARS-CoV-2 sequences: 2 complete genomes for each of 21 countries. Lists of the genomes are provided in Appendices A, B and C.
Various methods are used to compare the genomes of these species [17,18]. The purpose of this work is to calculate and analyze the spectra of periodicities for three species of coronavirus, to identify their general and specific features (primarily related to SARS-CoV-2) and to experimentally evaluate the statistical significance of the results. It is assumed that the results obtained will be useful to physicians and biologists involved in a comprehensive study of coronavirus infection using diverse information, in this case, at the genomic level.

Description of experiments
To calculate the spectra of periodicities, we used algorithms and programs developed at the Sobolev Institute of Mathematics SB RAS for estimating the complexity of symbolic sequences in terms of the multi-type repetitions they contain [19]. The final product is a decomposition of the sequence into fragments, each corresponding to one or another type of repeat. The resulting 'complexity decomposition' of the sequence explicitly contains the fragments formed by tandem repeats.
The algorithm detects all periodicities. From the entire set, we then keep for consideration only those whose length is not less than a certain threshold value r, i.e. the set r(S). For coronavirus genomes about 30 thousand nucleotides long, the threshold r = 12 approximately corresponds to the border between random and nonrandom structures. The 5'UTR region in the genomes under consideration does not contain periodicities with lengths exceeding the specified threshold. Therefore, for ease of comparison of results, the numbering of positions in tables 1-5 starts from the beginning of the coding region (orf1ab gene).
Along with the set of 42 SARS-CoV-2 genomes used for comparison with the SARS and MERS genomes, an expanded set of 560 complete genomes was considered: 320 sequenced in Russia and 240 from other countries (12 strains from each of 20 countries). Since the results are consistent, only the more compact data obtained from the small set of 42 genomes are presented below.
Table 1 shows the periodicities (|P| ≥ 12) found in almost all genomes of the MERS species (36 RNA sequences from GenBank listed in Appendix A). Rare exceptions are indicated in the last column of the table. In addition, the periodicity tcaact-tcaact, whose period is symmetric, occurs at position 25488 (orf3) in 23 of the 36 genomes. The sixth column shows the amino acid sequence corresponding to the periodicity under consideration (in bold) with a left and right context of 6 amino acids. Note that a periodicity at the nucleotide level does not always correspond to a periodicity at the amino acid level.

Periodicities in the genomes of the MERS species
In 20 of the 36 genomes of the MERS species, as well as in 11 of the 24 SARS genomes and 13 of the 42 SARS-CoV-2 genomes, a poly-a tract is found in the 3'UTR region (with different repetition rates). This is the only periodicity that is common to all three species of coronaviruses.
Seven of the 36 genomes contain periodicities that are not found in other genomes. They are presented in table 2. Note that none of the periodicities found in MERS (except for the poly-a tract in the 3'UTR) is found in the genomes of SARS and SARS-CoV-2.
The dependence of the number of periodicities on the period length (in total for the entire set of 36 MERS genomes) is shown in figure 1. The maximum (278) falls on the period length 6, which is partly related to the choice of the periodicity length threshold r = 12. When this value decreases to r = 10, the peak shifts to p = 5. It is interesting to note that there are no periodicities with period lengths of 2, 4, 8 and larger.
When the threshold r is reduced to 11, the list of periodicities common to SARS and SARS-CoV-2 is supplemented with the a-agga-agga-ag periodicity with a symmetric period, found at positions 14394 (SARS) and 14463 (SARS-CoV-2), and the attacaattaca-a periodicity that terminates gene S (positions 24980 and 25104, respectively).
Table 4 shows the periodicities of length 12 and higher present in each of the 24 genomes of the SARS species but not found in the SARS-CoV-2 genomes. In one genome (... Appendix B), and only in it, the catgaa-catgaa periodicity (12/6) was found at position 700. The nature of the dependence of the number of periodicities on the length of the period (for the entire set of SARS genomes) is similar to that shown in figure 1. The maximum value is 499 at p = 6.
Table 5 shows the periodicities of length 12 and higher present in each of the 42 genomes of the SARS-CoV-2 species but not found in the SARS genomes. Apart from the periodicities presented in table 5, no periodicity occurring only in individual genomes of the set was found. In the expanded set of 560 SARS-CoV-2 genomes, only 3 strains were identified in which periodicities not listed in tables 3 and 5 are present: Poland/PL_P5/2020, pos. 25160: aacttt-aacttt (12/6); EPI_ISL_450249#Russia#Saint-Petersburg#2020-04-21, pos. 25104: attac-attac-at (12/5); and EPI_ISL_450251#Russia#Saint-Petersburg#2020-04-20, pos. 18995: taactt-taactt (12/6).

Comparison with pseudo-random sequences
The data for the experiment were generated by randomly shuffling the elements of each sequence from the original sets. 100 random copies were obtained for each genome. Thus, 3600 analogs of the genomes of the MERS species were obtained, 2400 of the SARS species, and 4200 of SARS-CoV-2. Note that the maximum length of periodicities in most of the original sequences is 14 for MERS and 15 for SARS and SARS-CoV-2. In some random sequences, periodicities of significantly greater length (26 − 28) can arise by chance; however, the average values of the maximum lengths (15.7 − 16.1) over the entire set of random texts are comparable to the values for the original genomes.
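The shuffling step can be sketched as follows. The random number generator, the seed and the function name `shuffled_copies` are assumptions made for illustration; the text only states that the elements of each sequence were shuffled at random and that 100 copies were produced per genome.

```python
import random
from collections import Counter


def shuffled_copies(seq, n_copies=100, seed=0):
    """Return n_copies random permutations of seq.

    Every copy has the same length and the same nucleotide composition
    as the original; only the order of the symbols is destroyed.
    """
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    letters = list(seq)
    copies = []
    for _ in range(n_copies):
        rng.shuffle(letters)   # in-place uniform shuffle
        copies.append("".join(letters))
    return copies


toy_genome = "tcaagtcaagtcaagtc" + "acgtt" * 20  # toy stand-in, not a real genome
copies = shuffled_copies(toy_genome, n_copies=100)
print(len(copies))                                          # 100
print(all(Counter(c) == Counter(toy_genome) for c in copies))  # True
```

Applied to sets of 36, 24 and 42 genomes, 100 copies per genome give the 3600, 2400 and 4200 random analogs mentioned above.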
The number of periodicities in most MERS genomes is 11 (table 1), which differs little from the average number of periodicities in random sequences. In the genomes of SARS, however, the number of periodicities is 25 − 27, which is significantly higher than the average number for random sequences and even higher than the maximum value. The number of periodicities in the genomes of SARS-CoV-2 (22 − 24) is slightly less than the maximum value in random sequences, but significantly (almost 2 times) exceeds the average. This may indicate that the periodicities present in the genomes of SARS and SARS-CoV-2 carry a significant functional load. This issue requires a separate study.
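A comparison of this kind can be summarized as in the sketch below. The empirical p-value is an addition for illustration and is not reported in this work; the function name `compare_with_null` is hypothetical, and the counts in the usage example are made-up toy numbers, not values from the study.

```python
from statistics import mean


def compare_with_null(observed, null_counts):
    """Compare the number of periodicities in a real genome with the
    counts obtained for its shuffled copies."""
    n_ge = sum(1 for c in null_counts if c >= observed)
    return {
        "null_mean": mean(null_counts),
        "null_max": max(null_counts),
        "ratio_to_mean": observed / mean(null_counts),
        # Fraction of shuffled copies with at least as many periodicities;
        # the +1 terms keep the estimate away from exactly zero.
        "empirical_p": (n_ge + 1) / (len(null_counts) + 1),
    }


# Toy numbers for illustration only.
toy_null = [2, 3, 3, 4, 2, 5, 3, 4, 3, 1]
print(compare_with_null(6, toy_null))
```

An observed count above `null_max` corresponds to the situation described for SARS, and a count below `null_max` but well above `null_mean` to the one described for SARS-CoV-2.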
The dependence of the number of periodicities on the length of the period, both in real genomes and in their random analogs, is shown in table 7. Note again that no periodicities with a monomer length of 2 or more than 8 were found in any of the coronavirus genomes, but periodicities with a period length of 4 were revealed in the SARS genomes. The maximum number of periodicities is detected at a period length of 6. When the threshold is decreased to r = 10, the number of periodicities increases approximately 3 times in the genomes of coronaviruses and 4 times in their random counterparts, and the maximum number of periodicities is reached at a period length of 5.

Conclusion
The analysis of a limited set of genomes of three species of coronaviruses (MERS, SARS and SARS-CoV-2) shows that:
• There are markers for the differentiation of coronavirus species at the level of RNA sequences. Tandemly repeating chains of symbols (periodicities) can be used as such markers.
• The number of periodicities with a length exceeding the threshold value r = 12 in genomes of the MERS species is relatively small and differs little from that in 'random' (shuffled) sequences of the same length and with the same composition of elements. However, in the genomes of the SARS and SARS-CoV-2 species, this number is more than twice as high.
• None of the periodicities identified in the genomes of the MERS species (except for the poly-a tract in the 3'UTR) is found in the genomes of SARS and SARS-CoV-2, and vice versa.
• The number of periodicities common to all genomes of the SARS and SARS-CoV-2 species is approximately 2 times less than the number of periodicities characteristic of only one of the species. It is the periodicities found in almost all genomes of only 'their' species that are of greatest interest in terms of identifying the pathogenic potential of the virus.
• Because the initial set of genomes is limited, the conclusions are preliminary and are subject to verification and refinement on more extensive material.