A Hybrid Framework for Effective Prediction of Online Streaming Data

In this paper, we present a hybrid model for training and testing a prediction model with online streaming data. Prediction of online streaming data is a time-critical task. The huge volume of data generated online must be ingested by a prediction model and used to train and test it dynamically, which improves the learning rate. Existing approaches for dynamic training and testing use local infrastructure or virtual machines from the cloud to increase the learning rate of the prediction model with streaming data. Recently, many applications prefer serverless cloud infrastructure to virtual machines. However, using serverless infrastructure for the entire prediction process involves time and space trade-offs due to its autonomic features. Hence, in this paper we propose a hybrid approach that uses three different environments, namely local infrastructure, virtual machines, and the serverless cloud, for different stages. A novel approach for selecting the suitable environment to train and test an LSTM-based air quality prediction model with streaming data is proposed, with an increased learning rate and reduced resource utilization.


Introduction
The capability of an application to process data while it is being generated is called stream processing. The input for such an application comes from various sources; in most cases, data is collected continuously from different real-time sensors. Because of these features, stream data requires parallel and distributed computing resources for quick processing. Selecting a suitable environment for processing stream data is also a big challenge. With the introduction of the cloud, most stream processing applications have moved to the cloud. However, selecting a suitable type of cloud service for different parts of the streaming application workflow can yield considerable cost and time benefits. In particular, using a serverless cloud computing architecture for streaming applications can be highly beneficial for users. Initially, online streaming was used in the stock market for forecasting the Sensex and Nifty using distributed processing. Recently, stream processing has become popular for real-time data processing. As it is an imminent and suitable technology for IoT sensors, there is rapid growth in the number of applications using these two technologies together.
As people are more concerned about the quality of the air they breathe every day, real-time air quality monitoring has become an important application. Due to the high cost, setting up air quality monitoring systems in all areas is not possible; they are implemented only in areas where high spatial variability is found. Effective methods for fast processing of air quality data at a lower cost are therefore the need of the day.
Hence, in this research we propose a novel air quality prediction model based on the LSTM technique. The novelty of this approach lies in selecting appropriate resources at every stage of the prediction model.

Serverless Cloud
Serverless cloud is the current buzzword among cloud users. The overheads of provisioning and releasing virtual machines are eliminated by serverless cloud computing. Users enjoy the freedom of invoking everything as a function and pay only for the period during which computing resources are utilized to complete the task. The configuration of the resources is decided based on the memory requirements of the application. The cost of serverless functions depends on the response time (rounded up to the nearest 100 ms), the function size, and the overhead per execution. Function completion time also depends on the input as well as the execution context. Because of these inherent features, the serverless cloud is well suited for scheduling workflows: the dynamically varying resource requirements of the tasks in a workflow can be fulfilled by the serverless cloud infrastructure.
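The billing model described above (duration rounded up to the nearest 100 ms, priced by configured memory, plus a per-execution overhead) can be sketched as follows. The per-GB-second rate and per-request overhead used here are illustrative assumptions, not the prices of any particular provider.

```python
import math

def serverless_cost(duration_ms, memory_mb,
                    rate_per_gb_s=0.0000166667,
                    overhead_per_request=0.0000002):
    """Bill duration rounded up to the nearest 100 ms, priced by memory.

    The rate and overhead defaults are illustrative assumptions.
    """
    billed_ms = math.ceil(duration_ms / 100) * 100          # round up to 100 ms
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000)    # memory-time product
    return gb_seconds * rate_per_gb_s + overhead_per_request

# A 230 ms invocation is billed as 300 ms, the same as a 300 ms one.
cost = serverless_cost(duration_ms=230, memory_mb=512)
```

Under this model, shaving a function from 230 ms to 210 ms saves nothing, while crossing a 100 ms boundary does, which is one reason the placement of work across environments matters.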
Many popular cloud service providers offer serverless computing services. AWS Fargate from Amazon allocates the right amount of compute resources, eliminating the need to select instances and scale capacity. Users pay only for the resources required to run their containers; there is no need to pay for additional servers, and both over-provisioning and under-provisioning are eliminated. Fargate runs each task in its own kernel, providing an isolated environment for each of them. The workload isolation and enhanced security features have attracted several customers to run even their mission-critical applications on Fargate. The architecture of serverless clouds using distributed containers proposed by Soltani et al. (2018) describes the benefits of serverless cloud architecture in a distributed environment. An optimized approach to reducing execution cost in serverless infrastructures using a container-based architecture (Pérez et al., 2018) is also a good step toward understanding the feasibility of serverless computing. Serverless computing is also suitable for processing big data applications: the experiment by Giménez-Alventosa et al. (2019) on executing MapReduce applications in AWS Lambda proved effective.

Review of Literature
There are abundant research contributions toward online stream processing. Barddal et al. (2020) used the Kolmogorov-Smirnov and Population Stability metrics to compare the performance of batch processing against online streaming. The ensemble-based algorithm proposed by Junior & Nicoletti (2019) uses an updating technique that enhances model flexibility through quick knowledge acquisition and reduces the error rate. Nguyen et al. (2019) propose a decay mechanism to identify the age of the incoming data and compare it with several benchmark algorithms.
Tavasoli et al. (2019) proposed a technique for processing complex data streams produced by non-stationary stochastic processes; they also explain a self-adjusting learning model that uses a multiplication-based update algorithm to update the data whenever a new element occurs. Din & Shao (2020) propose a new data stream classification technique that captures time-changing concepts and dynamically maintains and learns from a set of online micro-clusters. The work proposed by Sun et al. (2020) achieves better prediction performance than the state-of-the-art online structure learner for SPNs, while promising an order-of-magnitude speed-up.
A study of three batch classification algorithms, C4.5, k-nearest neighbour, and Support Vector Machine, was presented by Shi et al. (2019), which also introduced a decentralized directory service useful for connecting edges from multiple stakeholders. Recently, the use of application programming interfaces on the new service introduced by Watson Machine Learning has also been gaining momentum for transforming the huge quantities of data generated by organizations. The use of a distance algorithm to synthesize and refine correlation data using machine learning is another noteworthy technique.

Implementation
In this work, a novel approach to selecting the suitable environment for training and testing the air quality prediction system is proposed, which has huge potential for implementation. The prediction model is developed on a standalone desktop and trained and tested with a historical dataset. Once the expected learning rate is achieved, the model is deployed to a virtual machine if the ingested stream data is large, or to the serverless cloud if the stream data is of moderate size. With streaming data, it is not necessary to use the complete stream for training; instead, only the batches in which a pattern change occurs need be considered, because when the pattern of the stream does not change, the learning rate of the prediction model changes little. Hence, only the extracted streaming data is used for training the prediction model with microservices. As the serverless cloud is used for the training and testing process, time and cost are reduced to a reasonable extent, which increases the possibility of deploying the system to achieve a high learning rate. Algorithm 1 shows the rule followed in the proposed approach.
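The idea of training only on batches whose pattern has changed can be sketched as follows. The mean-shift criterion and the threshold value are assumptions made for illustration, since the paper does not fix a particular change-detection test.

```python
# Illustrative sketch: keep only batches whose mean shifts noticeably
# from the last batch that triggered (re)training. The threshold is an
# assumed value, not one taken from the paper.

def changed_batches(batches, threshold=0.2):
    """Return the subset of batches worth training on."""
    selected = []
    reference = None
    for batch in batches:
        mean = sum(batch) / len(batch)
        if reference is None or abs(mean - reference) > threshold:
            selected.append(batch)   # pattern change: use for training
            reference = mean
        # otherwise skip: the learning rate would barely change
    return selected

stream = [[1.0, 1.1], [1.05, 1.0], [2.0, 2.1], [2.05, 2.0]]
training_batches = changed_batches(stream)   # keeps the 1st and 3rd batches
```

Filtering the stream this way is what makes the serverless stage economical: only the selected batches are sent to the training microservice.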
Algorithm 1: Rule followed for selecting suitable environment

Input: Type of dataset
Output: Suitable resource for processing the dataset

if 'historical dataset' then
    select 'standalone desktop' for the existing dataset
else if 'high volume of streaming data' then
    select 'virtual machine'
else if 'moderate/low volume of streaming data' then
    select 'serverless cloud'
else if 'high volume of streaming data along with historical dataset' then
    select 'standalone desktop' for the existing dataset
    select 'virtual machine' for the high volume of streaming data
else if 'moderate/low volume of streaming data with historical dataset' then
    select 'standalone desktop' for the existing dataset
    select 'serverless cloud' for the moderate/low volume of streaming data

The rules are framed based on the rate of data flow. The overall architecture of the proposed system is given in Figure 1. The diagram shows the different stages of the air quality management workflow and how resources are selected at each stage; the differences in resource utilization can be monitored with the help of this architecture.
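Algorithm 1 can be expressed directly as a small selection function. The string labels mirror the rules above; the function signature itself is an illustrative assumption.

```python
# Sketch of Algorithm 1: map the type of dataset to the environment(s)
# used to process it. Labels are taken from the rules above.

def select_environment(has_historical, stream_volume):
    """stream_volume is 'none', 'high', or 'moderate/low'
    (the volume classes named in Algorithm 1)."""
    resources = []
    if has_historical:
        resources.append('standalone desktop')   # existing dataset
    if stream_volume == 'high':
        resources.append('virtual machine')      # high-volume stream
    elif stream_volume == 'moderate/low':
        resources.append('serverless cloud')     # moderate/low stream
    return resources

# e.g. a moderate stream arriving alongside a historical dataset:
envs = select_environment(True, 'moderate/low')
```

Because the historical-dataset check and the stream-volume check are independent, the two combined cases in Algorithm 1 fall out of the same two branches without extra rules.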

Figure 1. Overall architecture of the proposed prediction model
The prediction models are generated using linear regression and LSTM. These models are trained on the standalone desktop with the historical air quality dataset. The LSTM model performed better than the linear regression model, with improved accuracy. The performance and accuracy of the models are measured using the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE). The equations for RMSE and MAE are given in Equations (2) and (3), respectively.
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \tilde{y}_i)^2}    Equation (2)

MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \tilde{y}_i|    Equation (3)

where y_i is the actual value, \tilde{y}_i is the predicted value, and n is the total number of observations. The same models are deployed on a virtual machine as well as the serverless cloud using microservices, and the prediction accuracy is measured using the same RMSE and MAE metrics. The accuracy of both models is not compromised by the deployment environment, but resource utilization varies between environments for both models. The training and testing durations are recorded for the various prediction models, and resource utilization is monitored and logged. Analysis of these three parameters shows that resources are utilized more efficiently in the hybrid environment than in any individual environment.
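Equations (2) and (3) can be computed in a few lines of plain Python; the values below are invented for illustration.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error, Equation (2)."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean Absolute Error, Equation (3)."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

y_true = [30.0, 32.0, 28.0]   # actual pollutant readings (illustrative)
y_pred = [29.0, 33.0, 28.0]   # model predictions (illustrative)
```

Because RMSE squares each residual, it penalizes large errors more heavily than MAE, which is why the two metrics are reported together.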

Results and Discussion
The air quality dataset was collected from the Central Pollution Control Board, Delhi. The dataset contains one-hour air pollution data for three months, totalling 2650 records. The prediction models were developed using linear regression and LSTM. These models were trained and tested in various environments to check the time complexity and resource usage. The time taken to train and test the models varies in each environment, and the results are depicted in the table below.

Prediction Models
The prediction models are designed using both Linear Regression and Deep LSTM. Figure 2 shows the linear regression model for NO2 and Figure 3 shows the Deep LSTM model for the same pollutant. Higher prediction accuracy is achieved with the Deep LSTM model than with the Linear Regression model. These models were deployed in various environments; however, the environment has little influence on the accuracy of the models. The MAE and RMSE obtained in the various environments are shown in Figures 4 and 5, respectively. From Figure 4, it is observed that the mean absolute error is comparatively lower in the hybrid approach, and Figure 5 clearly shows that the root mean square error is also lower in the hybrid approach.
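As a minimal illustration of the linear-regression baseline, the sketch below fits a univariate least-squares line that predicts the next hourly NO2 reading from the current one. The sample readings are invented for illustration and are not the CPCB dataset, and the Deep LSTM model itself is not reproduced here.

```python
# Hedged sketch of the linear-regression baseline: one-step-ahead
# forecasting of an hourly pollutant series via ordinary least squares.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

readings = [20.0, 22.0, 21.0, 24.0, 26.0]    # hourly NO2 (illustrative)
xs, ys = readings[:-1], readings[1:]         # pair current with next hour
slope, intercept = fit_line(xs, ys)
next_no2 = slope * readings[-1] + intercept  # one-step-ahead forecast
```

An LSTM replaces this single lag with a learned memory over many past hours, which is why it captures the time-varying patterns that the linear baseline misses.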

Figure 5. Root mean square error in the various environments
The proposed approach provides clarity on implementing the prediction model in the various environments, as well as suggesting the suitable environment to reduce cost and improve resource utilization.

Conclusions and Future Enhancements
The hybrid framework proposed in this research uses different types of resources at every stage of the prediction process. Resources such as standalone computers, virtual machines, and the serverless cloud are selected based on the novel selection algorithm proposed in this paper. Air quality data with 2650 records is used to evaluate the proposed work. The implementation results show that the hybrid approach improves resource utilization and reduces the training and testing times by 5% to 35%. The model is also evaluated with both linear regression and deep-learning-based LSTM methods; the deep-learning-based method performs better than linear regression. Hence, the proposed hybrid framework with deep-learning-based LSTM is capable of predicting online streaming data faster and more efficiently than the other existing methods. In future, we would like to extend this work to prediction on video datasets.

References
[1] Barddal, J. P., Loezer, L., Enembreck, F., & Lanzuolo, R. (2020). Lessons learned from data stream classification applied to credit scoring. Expert Systems with Applications.