COVID-19 mRNA Vaccine Degradation Prediction Using LR and LGBM Algorithms

The threatening coronavirus, which was declared a global pandemic, struck not only public health but also society, the economy and every walk of life. Measures have been taken to stifle the spread, and one of the best is to carry out precautions that prevent transmission of the SARS-CoV-2 virus to uninfected populations. Administering preventive vaccines is one such precaution under the grand blueprint. Among all vaccine types, the mRNA vaccine, which shows no side effects with marvellous effectiveness, is found to be the most preferable candidate. However, degradation has become its biggest drawback to implementation. This study is therefore carried out with the aim of developing prediction models specifically to predict the degradation rate of the mRNA vaccine for COVID-19. Two machine learning algorithms, Linear Regression (LR) and Light Gradient Boosting Machine (LGBM), are proposed for model development using the Python language. A dataset comprising thousands of RNA molecules with degradation rates at each position, obtained from the Eterna platform, is extracted, pre-processed and encoded with label encoding before being loaded into the algorithms. The results show that LGBM (0.2447) performs better than LR (0.3957) for this study when evaluated with the RMSE metric.


Introduction
SARS-CoV-2 is an airborne virus that has resulted in millions of deaths since the end of 2019, and a treatment targeting the disease it causes, COVID-19, is yet to be discovered [1]. Even effective preventive vaccines are still under study. Although mRNA vaccines are the candidates entrusted with the highest hopes among them, they have a drawback of rapid degradation.
According to research by Wadhwa et al. [2] in 2020, degradation occurring during in vitro transcription will greatly reduce the yield of mRNA. The same paper claims that under refrigerated cold-chain transport conditions, the mRNA vaccine may have a half-life of 900 days with a degradation rate of at least 2% every 30 days [2]. It is worth noting that raising the temperature to around 37°C, or a drift of merely 2 units in pKa value, is believed to dramatically reduce the half-life of a vaccine to 5 days and 10 days, respectively [2]. Moreover, in the presence of Mg2+ at a temperature of 37°C, customarily the condition for in vitro transcription, the vaccine half-life falls further to not more than 2 hours [2]. The result remains the same even after reducing the Mg2+ concentration, pH value or temperature to alleviate the hydrolysis to which mRNA is unstable, as degradation still occurs during the transcription process [2]. After vaccination into a human body, a 5-day half-life is estimated for the mRNA vaccine [2].
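As a rough consistency check (our own, not from [2]), the quoted rate and half-life can be related under an assumed first-order (exponential) decay model. A degradation of at least 2% every 30 days bounds the half-life as:

```latex
t_{1/2} \;=\; \frac{30 \,\ln 2}{-\ln(1 - 0.02)} \;\approx\; \frac{30 \times 0.693}{0.0202} \;\approx\; 1029 \text{ days}
```

which is an upper bound of the same order as the 900 days reported, so the two figures are mutually consistent under this simple model.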
Although Abbasi stated that researchers believe this drawback can be addressed with a second-dose regimen of the candidate vaccine [3], the degradation issue should not be overlooked, as the potency of a vaccine can never be restored nor regained once it is lost. Incidentally, since a final high-stability mRNA vaccine is yet to be developed, and it may take months to years for a positive result to be borne, what we can do for now is deploy the candidate vaccines in hand to control this pandemic. With this in mind, the stability of the vaccine should be characterized as precisely as possible; in other words, the degradation rate of the vaccine, which is easily altered by both extrinsic and intrinsic factors, should be clear and definite, and of course the same goes for any future successfully developed vaccine as well.
As stated, at this juncture the study of mRNA vaccine degradation is extremely crucial. Nevertheless, studies on predicting the degradation of mRNA, or of vaccines in general, are extremely limited, let alone for COVID-19 mRNA vaccines. The only research currently accessible on this topic is a study published by Ankit Singhal in late 2020 using LSTM, GRU and GCN algorithms, in which RMSE evaluation showed the GCN-based model (0.249) to be the finest [4].
Hence, this study focuses on developing models and design rules using machine learning algorithms to predict the degradation rate of the mRNA vaccine. The proposed model, trained on a subset of an Eterna dataset comprising 6034 RNA molecules with degradation rates at each position, will be used to predict the degradation rates at each base of an RNA molecule.

Methodology
The main purpose of this study is to develop a reliable model able to predict the degradation rate of the COVID-19 mRNA vaccine. In general, there are 3 main stages, namely, data pre-processing, model training and performance evaluation.
For this study, the BPPs NumPy file, which holds the probability of each RNA base being paired, is extracted together with the train and test datasets from the Eterna database platform [5], which consists of a number of RNA molecules with degradation rates at each position. Several features will be engineered from the BPPs dataset, and only those suitable for prediction will be selected from among them.
Pre-processing then takes place to eliminate noise and to organize the data for training and testing purposes. After processing the data and converting non-numerical data to numerical data, 2 algorithms, LR and LGBM, are proposed and trained on the train dataset for model development. The performance of the developed models on the test dataset is evaluated with the Root Mean Square Error, RMSE. From the resulting RMSE values, the best model is chosen. The general methodology flow chart is shown in Figure 1.

Dataset
A dataset is a collection of data, which may take the form of an array, a data structure whose elements share the same data type, or a database table that may hold different data types. The three most common data types in ML are numerical, categorical and ordinal. In a dataset, the rows are designated as instances, also known as the observations collected, while the columns represent either the features or the classes. Features are the independent characteristics, while classes are the dependent outputs that we intend to predict. The data can also take the form of strings, dates or more complex types. Datasets are usually divided into 2 sets, a training dataset and a testing dataset, each with a different purpose.

Train Dataset.
The train dataset of this study comprises 2400 instances with 19 features, including index and id. Fortunately, there are no missing values in the dataset for any instance or feature. Table 1 shows the list of features for this study.

Bpps Dataset.
Bpps is the abbreviation for base-pairing probabilities, meaning that the bpps symmetric square-matrix NumPy files accompanying both the train and test datasets hold the probability of forming a base pair for each base of the RNA. The attached bpps are NumPy arrays calculated with algorithms developed by [5]. A bpps matrix is prepared for every instance in both the train and test datasets, one per row, with one entry per base in the sequence.

Data Pre-processing
Data pre-processing is the preparatory step before analyzing the data, with a view to transforming raw noisy data into clean yet simpler data and minimizing degradation of the quality of the analysis, as erroneous, inadequate or irrelevant data will entail faulty predictions that give rise to dire performance [6].

Handling Missing Data.
Inadequate information, or missing data, is an inevitable challenge that customarily occurs in real data sources during analysis. This type of defect may arise from loss or omission [6]. Inapplicability of features to certain instances and disregarded feature values are additional reasons behind deficient feature values [6]. The missing-value issue must be addressed, as most algorithms cannot handle missing values in the dataset fed to them. Checking with the Python command '.isnull()' confirms that there are no missing values in the extracted datasets.
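The check above can be sketched as follows with pandas; the miniature DataFrame stands in for the real Eterna train dataset, and the column names are illustrative only.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Eterna train dataset.
train = pd.DataFrame({
    "id": ["id_001", "id_002", "id_003"],
    "sequence": ["GGAAA", "GGACA", "GGACU"],
    "signal_to_noise": [6.9, 0.2, 4.1],
})

# Count missing values per feature; a total of zero confirms no gaps.
missing_per_feature = train.isnull().sum()
print(missing_per_feature)
```

If any feature reported a non-zero count, imputation or row removal would be required before training.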

Data Cleaning.
It is worth noting that mRNA with noisy results ought not to be used for actual vaccine development. Therefore, to ensure only high-quality samples are fed to the model, the instances were filtered based on stipulated criteria referring to the corresponding signal_to_noise and SN_filter features mentioned in Table 1. As mentioned, if an instance passes all the criteria, its SN_filter is denoted as 1. Hence, to allow the models to perform as well as possible, the train dataset is filtered so that only instances passing the SN_filter are considered, reducing the number of instances in the train dataset from 2400 to 1589.
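A minimal sketch of this filtering step, assuming the SN_filter flag described in Table 1 (the toy values below are illustrative, not from the real dataset):

```python
import pandas as pd

# Toy train dataset with the SN_filter flag from Table 1.
train = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "signal_to_noise": [5.1, 0.3, 2.2, 8.0],
    "SN_filter": [1, 0, 1, 1],
})

# Keep only instances that passed all noise criteria (SN_filter == 1).
clean = train[train["SN_filter"] == 1].reset_index(drop=True)
print(len(train), "->", len(clean))  # in the study: 2400 -> 1589
```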

Label Encoding.
In ML, data can be categorised into 3 main explicit data type categories: numerical, categorical and ordinal. Although some models can handle diverse types of data, a considerable number of algorithms still cannot. Consequently, data are recommended to be converted from non-numerical to numerical datatypes for proper processing. Hence, label encoding, a simple yet splendid encoding technique with impressive performance [7], is proposed to encode the 3 non-numerical features: sequence, structure and predicted_loop_type. The arrays, either 1 × 107 or 1 × 130 characters long, are first split into single characters, increasing the number of instances from thousands to hundreds of thousands, together with their corresponding features, the 5 classes and the error columns. After producing the single-character instances, the characters are label-encoded individually as shown in Table 2. After label encoding is applied to the filtered train dataset and the test dataset, the instances increase dramatically from 1589 to 108052 for the train dataset and from 3634 to 457953 samples for the test dataset.

Feature Engineering
Feature engineering [8] is the process of preliminarily transforming raw data into a more presentable and manageable form of features compatible with the modelling algorithms, in order to improve their prediction performance [9]. High-quality features are believed to mitigate the storage load, save storage space and effectively cut down the required processing time [10]. In this prediction study, we apply feature engineering to the bpps matrices dataset to derive features in the form of aggregate functions, max, mean, nznbr, std, and sum, which will then be analyzed with data visualization techniques.
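A sketch of how these per-base aggregates can be derived from a bpps matrix; the 4 × 4 matrix is a toy stand-in (real bpps files are 107 × 107 or 130 × 130), and the interpretation of nznbr as a count of non-zero entries is our assumption.

```python
import numpy as np

# Toy symmetric base-pairing probability matrix for a 4-base molecule.
bpps = np.array([
    [0.0, 0.8, 0.0, 0.1],
    [0.8, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6],
    [0.1, 0.0, 0.6, 0.0],
])

# One aggregate value per base, computed over that base's matrix row.
features = {
    "bpps_max":   bpps.max(axis=1),
    "bpps_mean":  bpps.mean(axis=1),
    "bpps_std":   bpps.std(axis=1),
    "bpps_sum":   bpps.sum(axis=1),
    # nznbr: number of non-zero entries, i.e. candidate pairing partners.
    "bpps_nznbr": (bpps > 0).sum(axis=1),
}
for name, values in features.items():
    print(name, values)
```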

ML Algorithms
2 ML regression algorithms, LR and LGBM, are implemented to develop models for this regression-based supervised learning study. The models' prediction errors are evaluated with RMSE.

Linear regression, LR.
LR is a common and well-known supervised machine learning algorithm among both rookies and experts in the field of data science [11]. Its operation is perfectly simple and understandable, leaving no doubts to the user: it fits a regression line to the data and expounds the relation between the dependent and independent variables. LR is a traditional algorithm well known for good performance on regression problems, which is why it is proposed in this regression-based study. An off-the-shelf example can be seen in the research done by Bayrak and Ogul [12] in the bioinformatics field, predicting the true value of gene expression.

Light Gradient Boosting Machine, LGBM.
LGBM is an enhanced gradient boosting framework that utilizes tree-based learning algorithms built on decision trees (DT), and is therefore also termed a histogram-based DT algorithm [13].
LGBM has been proven to perform faster than DT [14]. At times, it may show superior accuracy and precision over DT too. Research done by Zhan et al. [13] proved that LGBM is reliable in bioinformatics regression prediction. On a different note, studies comparing the performance of linear regression with LGBM are scarce; the result of this study is therefore worth looking forward to.

Result and Discussion
With the intention of determining suitable algorithms for developing prediction models of COVID-19 mRNA vaccine degradation, the proposed algorithms need to be trained and then evaluated with a suitable and reliable performance metric. On the other hand, as mentioned in Section 2.3, a data visualization technique will be utilized to determine which of the induced bpps aggregate-function features are suitable, as the quality of the features will directly dictate the quality of the analysis once they are loaded into a model.

Data Visualization
As the sequence length varies between instances in the test dataset, the instances are categorized into 2 categories, namely the public test with seq_length valued at 107, and the private test with seq_length equal to 130. Note that only the first 68 bases (for seq_length = 107) and 91 bases (for seq_length = 130) with experimental data are considered, owing to experimental constraints [5].
Hazardous features may have arisen from the incongruity in sequence length between instances in the dataset, and such perilous features should be avoided, since we can never be sure whether they might induce overfitting and other undesired problems.
To determine whether the bpps aggregate-function features induced from the bpps NumPy files are suitable and safe to utilize, distribution curves are delineated. Treacherous features will show a distribution different from the others. To handle them, we may either normalize them fastidiously with extreme care if we decide to use them as input features to train the model, or simply neglect them. Referring to Figure 3, only bpps_max and bpps_sum appear innocuous to utilize; on the other hand, the bpps_nznbr, bpps_mean and bpps_std distribution curves show discordance between the train and private test data and are therefore excluded.
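The visual comparison of distribution curves can also be quantified. Below is a sketch, on synthetic data, of one simple way to score how much the train and test distributions of a feature overlap; the `hist_overlap` helper and the Gaussian stand-in features are our own illustration, not the study's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: a "safe" feature whose test distribution matches
# train, and a "treacherous" one whose test distribution is shifted.
train_feat = rng.normal(0.0, 1.0, 5000)
safe_test = rng.normal(0.0, 1.0, 5000)
shifted_test = rng.normal(0.8, 1.0, 5000)

def hist_overlap(a, b, bins=50):
    """Shared probability mass of two empirical distributions (0 to 1)."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    pa, _ = np.histogram(a, bins=bins, range=(lo, hi))
    pb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    return np.minimum(pa / pa.sum(), pb / pb.sum()).sum()

print(hist_overlap(train_feat, safe_test))     # close to 1.0
print(hist_overlap(train_feat, shifted_test))  # noticeably lower
```

A feature whose train/test overlap is low would be excluded, mirroring the decision made for bpps_nznbr, bpps_mean and bpps_std.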

RMSE Performance Metrics
RMSE is a performance metric for regression which measures the average magnitude of errors by taking the square root of the mean of the squared differences between the predictions and the ground truth [15]. The formula for the RMSE metric is shown in equation (1), where n represents the number of instances, y_i the ground truth and ŷ_i the prediction:

RMSE = √( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )    (1)
RMSE is a negative-oriented score, which implies that the lower the RMSE value, the better the performance of the model. The score ranges from zero to positive infinity, owing to the square applied to the difference between ground truth and predicted values. In addition, because squaring magnifies large differences, RMSE copes well with large error values, making it more sensitive to outliers.
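Equation (1) translates directly into a few lines of NumPy; the small worked example below uses made-up numbers purely to illustrate the computation.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error per equation (1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Worked example: errors of -1, +1, -1, +1 all square to 1, so RMSE = 1.0.
print(rmse([2, 4, 6, 8], [3, 3, 7, 7]))  # 1.0
```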

Prediction Performance
To reduce the computational time while maintaining accuracy [16], the filtered train dataset is split into 10 folds for cross-validation. The ML models' performance, evaluated with RMSE on each of the classes together with the overall result, is presented in Table 3. As shown in Table 3, the RMSE-evaluated performance of the LR and LGBM models on the datasets shows that LGBM's prediction surpassed that of LR across all the classes for the given dataset. The prediction error of LGBM is over 0.15 lower than that of LR. Therefore, we may deduce that LGBM is more suitable than LR for the assigned degradation prediction task.
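The 10-fold evaluation can be sketched as follows with scikit-learn's KFold; the synthetic data and the choice of LR as the example model are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Synthetic stand-in for the filtered, encoded train dataset.
X = rng.random((100, 4))
y = X.sum(axis=1) + rng.normal(0, 0.1, 100)

# 10-fold cross-validation: train on 9 folds, score RMSE on the held-out fold.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_rmse = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    fold_rmse.append(float(np.sqrt(np.mean((y[val_idx] - pred) ** 2))))

print(f"mean RMSE over 10 folds: {np.mean(fold_rmse):.4f}")
```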

Conclusion
Referring to the results obtained, we may conclude that for this study, LGBM is more suitable than LR for mRNA degradation prediction. The LGBM-based model's performance even surpasses the models developed by [4] with LSTM, GRU and GCN, each with k = 4.
Although the result is laudable, the evident constraints of this study should not be overlooked. One of the biggest limitations is the length of the RNA sequences studied. In practice, mRNA vaccines for COVID-19 tend to be in the range of 3000 to 4000 bases long [4], but this study uses only 107 to 130 bases. Improvements such as training the model on longer mRNA vaccine sequences and evaluating the resulting performance error would permit better utilization and reliability in predictions.
Besides, in future research we may consider applying 10-fold cross-validation to the DL algorithms, such as the GRU proposed by [4], observe whether there is any improvement in the predictions, and then compare the results between the DL algorithms and the proposed ML algorithms.