Increasing Compression Efficiency for Genomic Data by Manipulating Empirical Entropy

Sharing bioinformatics data is the key to constructing a mobile and effective telemedicine network, but it brings various difficulties. A crucial challenge posed by this tremendous amount of information is storing it reversibly and analysing terabytes of data. Robust compression algorithms achieve high compression ratios for text and images. However, the achievement of these advanced techniques has remained limited because, intrinsically, the entropy of the raw data primarily determines the efficiency of compression. To enhance the performance of a compression algorithm, the entropy of the raw data needs to be reduced before compression, which reveals more effective redundancy. In this study, we use reversible sorting techniques to reduce the entropy of raw genomic data, providing higher efficiency when integrated into a compression technique. To that end, permutation-based reversible sorting algorithms, such as the Burrows-Wheeler transform, serve as an entropy-reducing transform. The algorithm achieves a low-entropy sequence by reordering the raw data reversibly, with low complexity and at high speed. A quantitative analysis of the empirical entropy shows that a significant reduction of uncertainty is achieved.


Introduction
Genome technology plays an important role in human life, especially in the treatment of diseases. Although the cost of genome sequencing has decreased, the growth of the biological data associated with these sequences is extraordinary. Such a large amount of genomic data poses great challenges for biotechnology centers in data storage, backup, transportation and sharing. To overcome these challenges, compression of genomic data appears as an inevitable solution. General-purpose compression methods can also be applied to genomic data, and in recent years researchers have proposed many special-purpose genome compression methods [1,2]. Recent studies show that these greatly improve compression ratios compared to known general-purpose methods.
Over recent decades, demand for acquiring and analysing bioinformatics data has increased tremendously. The genomic data of a single species ranges from gigabytes to terabytes. Sharing this data is the key to constructing a mobile and effective telemedicine network, which brings various difficulties. A crucial challenge of this tremendous amount of information is storing it reversibly and analysing its terabytes of data [3].
Based on the Human Genome Project, a recent study [1] emphasizes that there exists a huge amount of unparalleled data that demands effective transmission, archiving and backup; its authors propose a Spark-based genome compression method. Recently, many compression techniques have provided high compression efficiency for text and images. Yet even these vigorous methods achieve only limited gains because, intrinsically, the entropy of the raw data primarily determines the achievable level of compression efficiency. Motivated by this fact, reducing the entropy of the raw data before compression emerges as a promising way to reveal more effective redundancy and improve the performance of any compression algorithm.
Since the frontiers of compression are determined by the entropy of the data being compressed, manipulating the entropy of the data is of primary significance for achieving efficient compression performance. Recent studies [4,5,6] emphasize the necessity of compression for effective archiving and transmission of genomic data sequences. Global and local characterization of DNA, RNA and proteins aims to estimate genome entropy for motif and region classification [7].
Estimating the entropy of genomic sequences may provide a useful perspective. Motivated by this fact [8], the study in [9] delves into the performance of compression algorithms. To manipulate entropy, reversible resorting algorithms and suffix-based methods such as the Burrows-Wheeler transform (BWT) [10] are among the most studied methods in the literature, and they fairly deserve this attention. The BWT is particularly useful for input data that does not contain many runs but does contain various repetitive patterns, e.g. genomic sequences, where it contributes significantly to entropy reduction.
This paper utilizes reversible rearranging techniques to reduce the entropy of raw genomic data. To achieve this, permutation-based reversible sorting algorithms, such as the Burrows-Wheeler transform, are geared towards genome data as an entropy-reducing transform. The raw data is transformed into a lower-entropy form by reversible reordering. Empirical entropy is exploited as a quantitative measure to show that a reduction of entropy has been achieved. The resulting reordered sequence has a higher potential for revealing symbol redundancy more efficiently with appropriate entropy encoders. Because the proposed method is reversible, compressed bit-streams are perfectly reconstructed by the decoder.

Methodology
To measure the uncertainty of a set of elements taken from a genomic dataset, let G denote the source of genome data drawn from a restricted alphabet A = {a_1, a_2, ..., a_{|A|}}. In particular, for a DNA sequence the alphabet A = {A, C, G, T} represents the polymer structured from the mononucleotides [13], and every dictionary element satisfies a_i ∈ A. With p(a_i) denoting the probability of symbol a_i, the entropy [10], i.e. the uncertainty of the genome data constructed from a certain DNA, can be measured using the Shannon entropy

H(G) = -\sum_{i=1}^{|A|} p(a_i) \log_2 p(a_i),    (1)

where the logarithms are taken to base 2. Since the symbol probabilities are unchanged by reordering, resorting a sequence intrinsically does not alter the total informational entropy, which measures the amount of uncertainty in the source data. In other words, resorting the genome elements cannot reduce the minimum average number of bits required to represent the source data.
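As a minimal illustration (the function and the toy sequence below are our own, not part of the study), equation 1 can be computed from symbol frequencies, and the computation makes the invariance explicit: any reversible reordering of the sequence leaves the order-0 entropy unchanged.

```python
from collections import Counter
from math import log2

def shannon_entropy(seq: str) -> float:
    """Order-0 Shannon entropy (bits per symbol) from empirical symbol frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

g = "ACGTACGTAACCGGTT"  # toy DNA-like sequence
# Reordering (here: sorting) does not change the order-0 entropy.
assert abs(shannon_entropy(g) - shannon_entropy("".join(sorted(g)))) < 1e-12
```

This is precisely why a block-sorting transform cannot help a memoryless (order-0) model; the gain must come from higher-order structure, as the empirical analysis below shows.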
However, a remedy for the high uncertainty of the raw data, which aggravates the burden on transmission bandwidth and archiving, comes with an empirical analysis of the raw data. Because the probability mass function cannot be estimated from previous observations of the genome data, the Shannon entropy equation cannot be worked out directly. The frequency of occurrence of each element of a genome sequence, however, can be used to estimate the probabilities [11]. Thus, empirical entropy is useful when the true uncertainty of genome data cannot be determined. The empirical entropy [10] takes the observation of the i-th symbol in the data sequence, with n_n(a_i) denoting the count of occurrences of symbol a_i over n observations, so that the empirical probabilities are n_n(a_i)/n and the n-th order entropy, denoted H(G_n), sets out the level of uncertainty of the n observations; in this study these are structured as subsets of G. The Burrows-Wheeler Transform (BWT) is a well-known block-sorting algorithm that enables more efficient compression by rearranging the input data to exploit repeated patterns and redundancy, making it more amenable to compression. The batches are windowed, resorted, sequenced sets acquired from the BWT [14], which is a permutation operation for rearranging the raw genome data. The transformation is applied to resort the DNA sequences, and the BWT-transformed sequence Ĝ is analysed using empirical entropy; on this basis, any proposed compression technique can reach higher compression efficiency compared to unsorted data.
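The block-sorting step can be sketched with a naive implementation (the `$` sentinel and all helper names are our own assumptions, not the study's code; production BWTs use suffix arrays instead of explicit rotation sorting). The round trip confirms that the transform is reversible, and counting runs of identical symbols shows how it clusters repeated contexts:

```python
def bwt(seq: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform: last column of the sorted cyclic rotations."""
    s = seq + sentinel  # sentinel assumed absent from the alphabet {A, C, G, T}
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Rebuild the rotation table column by column; the transform is lossless."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(sentinel))[:-1]

def runs(s: str) -> int:
    """Number of maximal runs of identical symbols."""
    return 1 + sum(s[i] != s[i - 1] for i in range(1, len(s)))

g = "ACGT" * 20                   # a highly repetitive DNA-like sequence
assert inverse_bwt(bwt(g)) == g   # fully reversible
assert runs(bwt(g)) < runs(g)     # BWT groups symbols into long runs
```

The fewer, longer runs in the transformed sequence are exactly the kind of local redundancy that a subsequent entropy coder exploits, even though the order-0 symbol counts are unchanged.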

Experiments and Results
Since the entropy of the raw data is a direct parameter affecting compression performance, the evaluation of both Shannon and empirical entropy analyses is an important criterion for achieving effective compression. To this end, various 8-bit matrices and informatics data with a limited dictionary structure (DNA or RNA genome structures with 4 elements) were analysed throughout this study. These analyses are performed by applying the reversible BWT operation to the raw data to extend/manipulate the lower frontiers of entropy.
The empirical and true entropy analysis results of the BWT process applied to an 8-bit bit-depth matrix (Ds-m) and genome datasets (Ds-genI, Ds-genII) are given in Table 1. One of the important results from Table 2 is that the entropy reduction for the 8-bit matrices is lower than for the 4-element genome data and, as a result, their empirical entropy decreases less. Considering the performance of any compression algorithm, it is concluded that the redundancy detection rate for genome data will be higher than for the 8-bit matrices.
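For concreteness, the two reporting metrics used in Tables 1 and 2 reduce to simple formulas. The sketch below (with made-up entropy values, not the measured ones from the tables) shows how a percentage reduction and a Euclidean distance between raw and transformed entropy values would be computed:

```python
from math import sqrt

def reduction_pct(h_raw: float, h_transformed: float) -> float:
    """Entropy reduction expressed as a percentage of the raw entropy."""
    return 100.0 * (h_raw - h_transformed) / h_raw

def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of entropy values."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical per-dataset entropies (bits/symbol), raw vs BWT-transformed.
h_raw = [2.00, 1.98]
h_bwt = [1.68, 1.70]
assert abs(reduction_pct(h_raw[0], h_bwt[0]) - 16.0) < 1e-9
```

These illustrative values are chosen only to show the arithmetic; the actual per-dataset figures are those reported in the tables.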

Conclusion
Since modern compression systems are structured from three main blocks, namely transformation, symbol coding and finally entropy coding, terminatively producing the code-stream, the proposed empirical analysis offers a promising perspective for increasing both the performance of symbol coding and that of the whole compression system. An image composed of a raster m-bit matrix, together with four-element DNA and RNA genomic data, was utilized for the entropic analysis.
The analyses show that an effective, low-computational-cost transformation such as the Burrows-Wheeler transform can extend the lower boundary of the entropy and, as a consequence, the compression rate of a method. The resulting reordered sequence has a higher potential for revealing symbol redundancy more efficiently with appropriate entropy encoders, such as Huffman and arithmetic coders, compared to unsorted data. Since the proposed method is reversible, compressed bit-streams are perfectly reconstructed by the decoder. As a consequence, over 15% higher entropy reduction is achieved compared to the raw data.

Table 1 .
Shannon and empirical entropy analysis for different datasets. For the dataset Ds-genI, the true entropy is reduced by 15.94%. For the remaining datasets, the amount of reduction is given in Table 2 in terms of percentage and a metric distance.

Table 2 .
Euclidean distance between raw and transformed data, and the percentage of entropy reduction, for different datasets.