PAI-SAE: Predicting Adenosine To Inosine Editing Sites Based On Hybrid Features By Using Sparse Auto-Encoder

Adenosine-to-inosine (A-to-I) RNA editing is an important post-transcriptional modification that converts adenosines to inosines in both coding and noncoding RNA transcripts and thereby diversifies the transcriptome. Accurately identifying A-to-I editing sites is essential for understanding their biological functions. Given an uncharacterized RNA sequence that contains many adenosine residues, can we identify which of them can be converted to inosine and which cannot? To meet the growing demand from experimental scientists working in drug development, we have developed a new predictor called PAI-SAE. It represents each sequence by hybrid features that combine dinucleotide-based auto-cross covariance (DACC), pseudo dinucleotide composition (PseDNC) and nucleotide density, and classifies the sequences with a sparse auto-encoder model. In a rigorous jackknife test, PAI-SAE outperformed the existing predictor PAI on all four reported metrics.


Introduction
RNA editing is a post-transcriptional process that selectively inserts or deletes single nucleotides, or converts one nucleotide to another [1]. There are two major types of RNA editing in mammals: C-to-U (cytidine to uridine) and the much more common A-to-I (adenosine to inosine) [2]. A-to-I editing is catalyzed by ADARs (adenosine deaminases that act on RNA), enzymes that bind double-stranded RNA (dsRNA) structures [3]. In this catalytic process, a targeted adenosine (A) within these structures is deaminated into inosine (I), which the cellular machinery recognizes as guanosine (G) because it behaves similarly to G [4]. Many biological processes, such as RNA stability, localization, splicing, miRNA function and translation, are affected by A-to-I editing. Therefore, accurately identifying A-to-I editing sites is important for understanding their biological functions.
With the progress of RNA sequencing technology, the identification of A-to-I editing sites has attracted increasing attention from researchers. For example, next-generation sequencing has been used successfully to identify hundreds of human A-to-I editing sites in non-Alu regions since 2009 [5], and A-to-I editing sites were accurately identified in H. sapiens by transcriptome sequencing in 2012 and 2014 [4,6,7]. Following these works, A-to-I editing sites were also detected in M. musculus using RNA-Seq [8].
Although great successes have been achieved in this regard, identifying A-to-I editing sites with standard laboratory methods is expensive and time-consuming. Given the explosive growth of RNA sequences discovered in the postgenomic age, computational approaches that help obtain this information are in high demand. In a pioneering study, St Laurent et al. [9] proposed an interesting method to identify A-to-I editing sites in D. melanogaster via an iterative feedback loop of computational prediction and experimental validation. However, no web server was provided for their method, so its practical value is quite limited. For this reason, in 2016 Chen et al. [10] proposed the prediction model "PAI", which uses a support vector machine with pseudo dinucleotide composition to identify A-to-I editing sites in D. melanogaster, and constructed a corresponding web server. The next year, the predictor "iRNA-AI" [11], also based on a support vector machine, was constructed using the chemical properties of nucleotides and nucleotide density. In view of the importance and urgency of the problem, it is worthwhile to further improve the prediction quality by introducing some novel approaches, as elaborated below.
In this study, we combined DACC [12], PseDNC [13,14] and nucleotide density [15] into new hybrid features and used a sparse auto-encoder [16,17] to develop a new predictor, "PAI-SAE", for identifying A-to-I RNA editing sites in D. melanogaster. The aim was to improve the Matthews correlation coefficient (MCC) and the accuracy (ACC), the two most important and most stringent metrics for a predictor.

Benchmark Dataset
St Laurent et al. [9] sequenced the RNAs of D. melanogaster in 2013 to carry out a genome-wide study of adenosine-to-inosine RNA editing with single-molecule sequencing. Based on these experimental data, and after removing redundant sequences with CD-HIT [18], Chen et al. [10] constructed the benchmark dataset $S$, which consists of a positive subset $S^{+}$ of 125 adenosine-to-inosine editing site sequences and a negative subset $S^{-}$ of 119 non-adenosine-to-inosine editing site samples. The benchmark dataset for the current study can be formulated as

$$S = S^{+} \cup S^{-} \tag{1}$$

where the symbol $\cup$ represents the union of the subsets.
Table 1.

Dinucleotide-Based Auto-Cross Covariance.
With the development of computer technology, many feature vectors used to represent sample sequences can be generated directly by web servers such as Pse-in-One [19], repRNA [20] and repDNA [21], without working through the underlying mathematical details. Open the web page at http://bioinformatics.hitsz.edu.cn/Pse-in-One/ and click the server button. Three feature extraction tools are offered, namely PseDAC-General, PseRAC-General and PseAAC-General; choose the second one for RNA sequences. After selecting the dinucleotide-based auto-cross covariance (DACC) mode and the eleven physicochemical properties mentioned above, set the required parameter 'lag' to obtain the desired features. Our experiments show that the best results are obtained when lag is set to 4.
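Generally, an RNA sequence R can be expressed as

$$R = R_1 R_2 R_3 \cdots R_L \tag{2}$$

where $L$ represents the length of the sequence R and $R_i$ denotes the nucleotide at position $i$. Following the above procedure with eleven properties and lag = 4, the sample sequence R can be represented by an 11 × 11 × 4 = 484-dimensional feature vector,

$$R = [\,\phi_1, \phi_2, \ldots, \phi_{484}\,]^{T} \tag{3}$$

where $\phi_k$ denotes the $k$-th DACC component.
The derivation of Eq. (3) is described in detail in references [12,22].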
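To make the 484-dimensional count concrete, the following is a minimal Python sketch of a DACC-style computation. It is not the Pse-in-One implementation: the property table is filled with made-up placeholder numbers (the eleven real physicochemical values are not reproduced here), the function names are hypothetical, and the ordering of the output components is an assumption. It only illustrates that centring each property along the dinucleotide chain and taking all property-pair covariances at lags 1 to 4 yields 11 × 11 × 4 values.

```python
import random
from itertools import product

NUCLEOTIDES = "ACGU"
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def toy_property_table(n_props, seed=0):
    """Placeholder values only: n_props made-up numbers per dinucleotide."""
    rng = random.Random(seed)
    return {d: [rng.uniform(-1.0, 1.0) for _ in range(n_props)]
            for d in DINUCLEOTIDES}


def dacc(seq, table, max_lag):
    """Auto- and cross-covariances of dinucleotide properties.

    Requires len(seq) > max_lag + 1 so that every lag has at least one term.
    """
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n_props = len(next(iter(table.values())))

    # One track per property along the sequence, centred on its own mean.
    tracks = []
    for u in range(n_props):
        values = [table[d][u] for d in dinucs]
        mean = sum(values) / len(values)
        tracks.append([v - mean for v in values])

    features = []
    for lag in range(1, max_lag + 1):
        n_terms = len(dinucs) - lag
        for u1 in range(n_props):
            for u2 in range(n_props):
                # u1 == u2 gives an auto-covariance, otherwise a cross-covariance.
                cov = sum(tracks[u1][i] * tracks[u2][i + lag]
                          for i in range(n_terms))
                features.append(cov / n_terms)
    return features


table = toy_property_table(n_props=11)
features = dacc("AUGGCUAACGUAGCUAGGCAU", table, max_lag=4)
print(len(features))  # 11 * 11 * 4 = 484
```

With real property values in place of the toy table, the same loop structure gives one feature per (lag, property, property) triple, which is where the 484 dimensions come from.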

Pseudo Dinucleotide Composition (Pse DNC).
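According to references [10,23], and based on the eleven physicochemical properties mentioned above, the sample sequence R can be defined as

$$R = [\,d_1, d_2, \ldots, d_{16}, d_{16+1}, \ldots, d_{16+\lambda}\,]^{T} \tag{4}$$

where $\lambda$, the number of sequence-order correlation factors, is an integer that must be smaller than the sequence length $L$, and

$$d_k = \begin{cases} \dfrac{f_k}{\sum_{i=1}^{16} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 1 \le k \le 16 \\[2ex] \dfrac{\omega\,\theta_{k-16}}{\sum_{i=1}^{16} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 17 \le k \le 16+\lambda \end{cases}$$

where $\omega$ is the weight factor, $\theta_j$ is the $j$-th-tier correlation factor, and $f_k$ is the normalized occurrence frequency of the $k$-th dinucleotide. The feature vector formulated by Eq. (4) can also be generated directly by the web server Pse-in-One.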
Here, only the last $\lambda$ components of the feature vector, which carry the sequence-order information, are adopted to represent the RNA sequence:

$$R = [\,d_{17}, d_{18}, \ldots, d_{16+\lambda}\,]^{T}$$

Experiments show that the best results are obtained when the parameters $\lambda$ and $\omega$ are set to 5 and 0.3, respectively, which yields a 5-dimensional feature vector.
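The PseDNC construction can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Pse-in-One code: the correlation factor is taken in its commonly used form (the mean squared difference of the property values of two dinucleotides j steps apart, averaged along the sequence), the two-property table is a made-up placeholder standing in for the eleven real properties, and the function name `psednc` is hypothetical.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=2)]

# Placeholder property table: two made-up values per dinucleotide.
TOY_TABLE = {d: [float(i % 3), float(i % 5)]
             for i, d in enumerate(DINUCLEOTIDES)}


def psednc(seq, table, lam, weight):
    """Return the (16 + lam)-dimensional PseDNC vector of an RNA sequence.

    Requires len(seq) - 1 > lam so that every correlation tier has terms.
    """
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n_props = len(next(iter(table.values())))

    # Normalized occurrence frequency f_k of each of the 16 dinucleotides.
    freqs = [dinucs.count(d) / len(dinucs) for d in DINUCLEOTIDES]

    # j-th tier correlation factor theta_j (assumed form, see lead-in).
    thetas = []
    for j in range(1, lam + 1):
        pairs = list(zip(dinucs, dinucs[j:]))
        total = sum(
            sum((table[a][u] - table[b][u]) ** 2 for u in range(n_props)) / n_props
            for a, b in pairs
        )
        thetas.append(total / len(pairs))

    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


vector = psednc("AUGGCUAACGUAGCUAGGCAU", TOY_TABLE, lam=5, weight=0.3)
tail = vector[16:]  # the lam sequence-order components kept as features
print(len(vector), len(tail))
```

Because every component shares the same denominator, the full vector sums to one; keeping only the tail discards the composition part and retains the sequence-order part.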

Nucleotide Density.
As described in references [11,15], the concept of nucleotide density was proposed to reflect the frequency of a nucleotide and its distribution in a given RNA sample sequence R formulated by Eq. (2). The corresponding feature vector can be expressed as

$$R = [\,P_1, P_2, \ldots, P_L\,]^{T}$$

where $P_i$ is the density of the nucleotide $R_i$ at position $i$ of a given RNA sample sequence with $L$ nucleotides, and

$$P_i = \frac{1}{i}\sum_{j=1}^{i} f(R_j), \qquad f(R_j) = \begin{cases} 1, & \text{if } R_j = R_i \\ 0, & \text{otherwise.} \end{cases}$$
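A short Python sketch may help here. It assumes the usual prefix-based definition of nucleotide density, in which the density at position i is the number of occurrences of the nucleotide found at position i within the first i positions, divided by i. The function name is hypothetical.

```python
def nucleotide_density(seq):
    """Return one density value per position of the sequence."""
    counts = {}
    density = []
    for i, nucleotide in enumerate(seq, start=1):
        # Occurrences of this nucleotide in the prefix seq[:i].
        counts[nucleotide] = counts.get(nucleotide, 0) + 1
        density.append(counts[nucleotide] / i)
    return density


# "AUAG": A -> 1/1, U -> 1/2, second A -> 2/3, G -> 1/4
print(nucleotide_density("AUAG"))
```

The resulting vector has the same length as the sequence, so it encodes both how often a nucleotide occurs and where along the sequence it accumulates.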

Feature Fusion.
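To better discriminate between RNA sequences and further improve the performance of the predictive model, the three feature extraction methods described above are fused into a single vector that represents the sample sequence formulated by Eq. (2):

$$R = R_{\mathrm{DACC}} \oplus R_{\mathrm{PseDNC}} \oplus R_{\mathrm{ND}}$$

where $\oplus$ denotes vector concatenation, $R_{\mathrm{DACC}}$ is the 484-dimensional DACC vector, $R_{\mathrm{PseDNC}}$ is the 5-dimensional PseDNC vector, and $R_{\mathrm{ND}}$ is the $L$-dimensional nucleotide density vector.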

Sparse Auto-Encoder.
As a popular classifier, the sparse auto-encoder has been successfully applied in the field of bioinformatics [24,25,26]. In this paper, we construct a sparse auto-encoder with two hidden layers to identify A-to-I sites. To do this efficiently, we use the deep learning software package that can be downloaded from https://github.com/rasmusbergpalm/DeepLearnToolbox. Using the SAE and NN modules of this package, we obtained the final model by tuning the network parameters. The predictor is called 'PAI-SAE', where 'P' stands for 'predicting', 'AI' for 'A-to-I editing sites' and 'SAE' for 'sparse auto-encoder'.
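The text does not state which sparsity constraint was used, so the following Python sketch is only an illustration of one common choice, not the configuration of PAI-SAE or the API of DeepLearnToolbox. It assumes sigmoid hidden units and a Kullback-Leibler penalty that pushes the mean activation of each hidden unit towards a small target value; the target 0.05 and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_activations(x, weights, biases):
    """Activations of one hidden layer for a single input vector x."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def kl_sparsity_penalty(activations, rho=0.05):
    """Sum over hidden units of KL(rho || mean activation of the unit).

    `activations` holds one hidden vector per training sample, with every
    value strictly between 0 and 1 (as produced by a sigmoid).
    """
    n_samples = len(activations)
    n_hidden = len(activations[0])
    penalty = 0.0
    for j in range(n_hidden):
        rho_hat = sum(a[j] for a in activations) / n_samples
        penalty += (rho * math.log(rho / rho_hat)
                    + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return penalty


# Two toy inputs passed through a 2-unit hidden layer with made-up weights.
weights = [[0.5, -0.2], [-0.3, 0.8]]
biases = [0.0, -1.0]
batch = [hidden_activations(x, weights, biases)
         for x in ([1.0, 0.0], [0.0, 1.0])]
print(kl_sparsity_penalty(batch))
```

In a sparse auto-encoder of this kind, the penalty is added (with a weighting coefficient) to the reconstruction error during training, so that only a few hidden units respond strongly to any given input.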

Prediction Quality Examination.
In general, four conventional metrics are widely used to examine the performance of a predictor in the field of bioinformatics: accuracy (ACC), sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC). They are formulated as

$$\begin{aligned} \mathrm{Sn} &= \frac{TP}{TP + FN} \\ \mathrm{Sp} &= \frac{TN}{TN + FP} \\ \mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN} \\ \mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{aligned}$$

where TP represents the number of A-to-I editing sample sequences correctly predicted as A-to-I editing sample sequences; TN, the number of non-A-to-I editing sample sequences correctly predicted as non-A-to-I editing sample sequences; FP, the number of non-A-to-I editing sample sequences incorrectly predicted as A-to-I editing sample sequences; and FN, the number of A-to-I editing sample sequences incorrectly predicted as non-A-to-I editing sample sequences.
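The four metrics follow directly from the confusion-matrix counts, as the Python sketch below shows. The counts in the example are hypothetical and chosen only to match the benchmark sizes (125 positive and 119 negative samples); they are not results from this study.

```python
import math


def metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts: 100 + 25 = 125 positives, 95 + 24 = 119 negatives.
print(metrics(tp=100, tn=95, fp=24, fn=25))
```

Unlike ACC, MCC uses all four counts in both numerator and denominator, which is why it remains informative when the two classes are unequal in size.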

Cross-Validation.
As is common practice in computational biology, we use a validation method to score the four metrics described above. Generally, there are three such methods, namely the independent dataset test, the K-fold cross-validation test and the jackknife test. Although the K-fold cross-validation test requires less computational time, the jackknife test involves no random partitioning and therefore always yields a unique outcome for a given benchmark dataset. For this reason, the jackknife test is adopted to examine the predictor's performance in this paper.
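The jackknife (leave-one-out) procedure can be sketched as follows in Python. Each sample is held out in turn, the model is trained on the remaining samples, and the held-out sample is predicted. The nearest-centroid classifier here is only a lightweight stand-in so that the example runs on its own; it is not the sparse auto-encoder used in this study, and all function names are hypothetical.

```python
def jackknife_predictions(samples, labels, fit, predict):
    """Leave each sample out once, train on the rest, predict the one left out."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def fit_centroids(xs, ys):
    """Stand-in classifier: one mean vector per class."""
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Assign x to the class whose centroid is closest in squared distance."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(x, centroids[lab])))


# Toy one-dimensional data with two well-separated classes.
samples = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
labels = [0, 0, 0, 1, 1, 1]
print(jackknife_predictions(samples, labels, fit_centroids, predict_centroid))
```

Since the loop visits every sample exactly once in a fixed order, repeating the test on the same dataset gives the same predictions, which is the uniqueness property referred to above. The cost is one model fit per sample, that is, 244 fits for the benchmark dataset.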

Results and Discussion
Table 2 lists the rates obtained by the current PAI-SAE predictor via the jackknife test on the benchmark dataset. To facilitate comparison, the table also lists the corresponding results obtained by PAI, the most powerful existing predictor based on a support vector machine for identifying A-to-I editing sites in D. melanogaster.
As shown in Table 2, the new predictor "PAI-SAE" scores higher than the predictor "PAI" on all four metrics used to quantitatively measure the quality of a single-label predictor. The ACC of "PAI-SAE" is higher by 2.46 per cent, the MCC by 4.14 per cent, the Sn by 1.60 per cent, and the Sp by 3.36 per cent. As pointed out in a comprehensive review, among the aforementioned four metrics, the most important are MCC and ACC. The high success rates shown in Table 2 indicate that the current predictor is not only a pioneering one in this area, but also holds very high potential to become a high-throughput tool for both basic research and drug development.

Conclusion
Identifying adenosine-to-inosine editing sites in RNA sequences is important for the in-depth study of RNA function and for the development of new medicines. In this paper, a new predictor called PAI-SAE was constructed from hybrid features that combine DACC, PseDNC and nucleotide density, using a sparse auto-encoder. The jackknife test results on the benchmark dataset show that PAI-SAE outperforms the existing predictor PAI. These results are promising enough for the approach to be considered as an analytic solution to further genomic problems.