Auto-Scaling Cloud Resources using LSTM and Reinforcement Learning to Guarantee Service-Level Agreements and Reduce Resource Costs

Auto-scaling of cloud resources responds to application demands by automatically scaling compute resources at runtime to guarantee service-level agreements (SLAs) and reduce resource costs. Existing approaches often resort to predefined sets of rules that add or remove resources depending on application usage. However, optimal adaptation rules are difficult to devise and generalize. This paper proposes a proactive approach that auto-scales cloud resources in response to dynamic traffic changes: a Long Short-Term Memory (LSTM) network predicts the number of requests in the next time interval, and Reinforcement Learning (RL) obtains the optimal action to scale in or scale out virtual machines. To validate the proposal, experiments are conducted under two real-world workload traces, and the results show that the approach keeps virtual machines working steadily and reduces SLA violations by up to 10%-30% compared with other approaches.


Introduction
Hundreds of thousands of users enjoy various Internet services. The pay-per-use economic model behind cloud computing allows companies to rent sufficient resources for a certain period and access them via the Internet [1]. Cloud providers such as Amazon EC2 and Google App Engine offer cloud resources to application providers in the form of Virtual Machines (VMs), with a scalability feature and a pay-per-use charging model [2] [7]. In particular, the application provider tends to be aware of the application environment and the required Quality of Service (QoS) [3] [4]. On the one hand, over-provisioning occurs when the cloud resources allocated to a user exceed its demands; this leads to zero Service Level Agreement (SLA) violations, but resources are wasted and, as a result, the application provider's costs rise. On the other hand, under-provisioning occurs when the cloud resources cannot satisfy the required QoS, which results in interrupted or delayed responses to user requests or, even worse, a system crash. To find a trade-off between the SLA requirements of the application and the cost of renting resources, the resource scheduling method is vital. A desirable solution requires the ability to predict the incoming workload and allocate resources ahead of time, so that the application is ready to handle resource provisioning problems when they actually occur [5] [6]. The contributions of this paper are:
1) This paper customizes LSTM to effectively predict incoming workload from historical workload.
2) This paper applies RL as a decision-maker that uses historical resource utilization and the prediction results to obtain the optimal action to scale in or scale out VMs.
3) This paper conducts a series of experiments to evaluate the performance of the proposed approach under real-world workload traces, for different metrics, compared with other approaches.
The rest of this paper is structured as follows. Section 2 reviews prior work on auto-scaling. Section 3 provides details of the proposed approach. Section 4 presents the performance evaluation and discusses the experimental results, and Section 5 concludes the paper and presents future work.

Related Works
A great number of studies have been conducted on dynamic resource provisioning; this section gives an overview of existing research in the field of dynamic resource provisioning for cloud applications. IBM [8] proposed a reference model for autonomic computing called the MAPE control loop, comprising a monitoring phase (M), an analysis phase (A), a planning phase (P), and an execution phase (E). In the monitoring phase, a monitoring component collects information about the resources and the workload submitted by users; this information is used to estimate future resource utilization in the analysis phase. In the planning phase, a suitable action (e.g., scale in or scale out) and the number of VMs to adjust are determined. Finally, the chosen action is performed in the execution phase. The four phases are executed regularly at specific time intervals. Huang et al. [9] focused on proactive analysis of scaling indicators; they used a resource prediction model based on double exponential smoothing that considers the current state of resources and historical records. Mohamed et al. [10] adopted a purely reactive policy based on response time; because their method cannot cope with variations in demand without causing system instability, it faces the challenge that delays may occur until a resource is ready for use. Messias et al. [11] proposed an adaptive prediction method combining time-series prediction models with genetic algorithms; their method does not need large amounts of historical data for training, because it can adapt to various types of time series and adjust itself to the incoming workload.

Algorithmic Framework
Algorithm I gives an overview of the algorithm's operation. At the beginning, the algorithm initializes an appropriate number of VMs based on historical experience. Then the MAPE loop executes until no service is running in the VMs; specifically, the algorithm carries out the monitoring phase continuously every second, while the analysis, planning, and execution phases execute at specified intervals.
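The control structure described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the monitor, analyzer, planner, and executor objects are hypothetical interfaces standing in for the four MAPE components, and the tick duration is parameterized for clarity.

```python
import time

def mape_loop(monitor, analyzer, planner, executor, interval_ticks=60, tick_s=1.0):
    """Sketch of Algorithm I: the monitoring phase runs every tick, while the
    analysis, planning, and execution phases run once per interval."""
    history = []                          # shared knowledge base
    ticks = 0
    while monitor.services_running():
        history.append(monitor.sample())  # M: collect requests / CPU utilization
        time.sleep(tick_s)
        ticks += 1
        if ticks % interval_ticks == 0:
            predicted = analyzer.predict(history)        # A: forecast workload
            action = planner.decide(history, predicted)  # P: scale decision
            executor.apply(action)                       # E: adjust the VM pool
```

The loop terminates when the monitor reports no running services, matching the stopping condition stated above.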

Monitoring phase
In the monitoring phase, a monitor continuously collects information about the system, including the number of requests and the CPU utilization during the ∆t interval. This information is stored in a knowledge base and can be used by the other phases (see Algorithm II).
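A knowledge base of this kind might look like the following sketch. The `Sample` fields and the bounded buffer size are assumptions for illustration; the paper does not specify the storage layout.

```python
import time
from collections import deque
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float
    requests: int      # requests observed during the last tick
    cpu_util: float    # average CPU utilization of the VM pool, 0..1

class KnowledgeBase:
    """Stores monitoring samples so the analysis and planning phases can
    query the samples collected during the last delta-t interval."""
    def __init__(self, maxlen=10_000):
        self._samples = deque(maxlen=maxlen)   # bounded history buffer

    def record(self, requests, cpu_util, now=None):
        ts = now if now is not None else time.time()
        self._samples.append(Sample(ts, requests, cpu_util))

    def window(self, delta_t, now=None):
        """Return the samples that fall within the last delta_t seconds."""
        now = now if now is not None else time.time()
        return [s for s in self._samples if now - s.timestamp <= delta_t]
```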

Analysis phase
The previous phase has collected the historical number of requests; in this phase, a workload analyzer predicts the number of requests in the next interval using forecasting techniques. In this paper, Long Short-Term Memory (LSTM) is applied in order to deal with fluctuating requests. LSTM networks are a special kind of Recurrent Neural Network (RNN), capable of learning long-term dependencies.
The major feature of LSTM networks is their ability to retain and persist information over long sequences or periods of time; as a result, LSTM networks are well suited to time-series forecasting.
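To illustrate the gating mechanism that lets an LSTM retain information, here is a single LSTM cell step in pure Python for scalar inputs and state. In practice the prediction would be done with a deep-learning framework and trained weights; the weight dictionary `w` here is purely illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM cell step for scalar input/state (illustration only).
    w maps each gate name to (input weight, recurrent weight, bias)."""
    z = lambda k: w[k][0] * x + w[k][1] * h_prev + w[k][2]
    f = sigmoid(z("f"))        # forget gate: how much of c_prev to keep
    i = sigmoid(z("i"))        # input gate: how much new content to write
    g = math.tanh(z("g"))      # candidate cell content
    o = sigmoid(z("o"))        # output gate
    c = f * c_prev + i * g     # new cell state (the long-term memory)
    h = o * math.tanh(c)       # new hidden state (the cell's output)
    return h, c
```

With the forget gate saturated open and the input gate closed, the cell state passes through unchanged across steps, which is the mechanism behind the long-term retention described above.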

Planning phase
This paper applies RL as a decision-maker to obtain the optimal action. RL problems can generally be modelled as Markov Decision Processes (MDPs), a mathematical framework suited to modelling decision making under uncertainty. As shown in Table 1, the system state is considered under-utilization if the CPU utilization is less than the lower threshold, and over-utilization if the CPU utilization is greater than the upper threshold; otherwise, the system state is considered normal-utilization. This paper uses the Q-learning algorithm to implement the decision-maker. Q-learning is defined by the Q function, expressed by Eq. (1):

Q(s_t, a_t) = r(s_t, a_t) + γ max_a Q(s_{t+1}, a)    (1)

where s_t is the state of the system at time t and a_t is the agent's action at time t. Q(s, a) is the value function of the agent performing action a_t in the current state s_t, which plays the role of the agent's brain; r(s_t, a_t) is the reward after the agent applies a_t and the state transits from s_t to s_{t+1}. γ is a discount factor (0 ≤ γ ≤ 1); a value of γ close to 1 assigns greater weight to future rewards, whereas a value close to 0 considers only the most recent rewards. When selecting actions, the agent has a probability of selecting one at random rather than choosing the best one on the basis of current knowledge, so the reward function R(s, a) is defined as a probability value, as shown in Table 2.
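A tabular Q-learning decision-maker over the three utilization states can be sketched as follows. The state and action names follow the description above, but the learning rate, discount factor, exploration probability, and reward values are illustrative assumptions, not the paper's Table 2 values.

```python
import random

STATES = ["under", "normal", "over"]            # from the CPU thresholds (Table 1)
ACTIONS = ["scale_in", "no_op", "scale_out"]

class QLearner:
    """Tabular Q-learning with epsilon-greedy action selection (sketch)."""
    def __init__(self, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def choose(self, state):
        if self.rng.random() < self.epsilon:     # exploration: random action
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard Q-learning update toward r + gamma * max_a' Q(s', a')
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

After enough experience, the greedy action in the over-utilization state converges to scaling out, which is the behavior the planning phase relies on.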

Execution phase
In this phase, the VM manager applies the optimal action according to the result of the planning phase: it either creates new VMs and registers them with the load balancer according to the best-fit strategy, or releases the most recently initiated VM.
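A minimal sketch of such a VM manager is shown below. The load-balancer interface is hypothetical, and the best-fit dispatching logic is omitted; the sketch only shows the scale-out/scale-in bookkeeping, with scale-in releasing the most recently started VM.

```python
class VMManager:
    """Execution-phase sketch: scale out adds VMs and registers them with the
    load balancer; scale in releases the most recently initiated VM (LIFO)."""
    def __init__(self, load_balancer):
        self.vms = []                 # VM ids ordered by start time
        self.lb = load_balancer
        self._next_id = 0

    def scale_out(self, n=1):
        for _ in range(n):
            vm_id = f"vm-{self._next_id}"
            self._next_id += 1
            self.vms.append(vm_id)
            self.lb.register(vm_id)   # hand the new VM to the load balancer

    def scale_in(self, n=1):
        for _ in range(min(n, len(self.vms))):
            vm_id = self.vms.pop()    # release the latest-initiated VM
            self.lb.deregister(vm_id)
```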

Experimental Setup
To evaluate our approach, this paper uses the NASA traces [12] and the ClarkNet traces [13] to represent user requests. The NASA-HTTP trace contains two months' worth of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida. The ClarkNet-HTTP trace contains two weeks' worth of all HTTP requests to the ClarkNet WWW server. These workload traces represent realistic load variations over time, which makes the results and conclusions more reliable for use in real environments. In the experiment, the proposed approach is compared with three other strategies. The first strategy, labeled "Worst", adds or releases a VM when the system state is over-utilization or under-utilization. The second strategy, referred to as "Linear", uses a linear method to predict the workload and allocate resources ahead of time [14]. The third strategy, referred to as "LSTM", focuses on time-series forecasting of machine CPU usage with LSTM [6]. Finally, our approach, referred to as "LSTMQL", combines LSTM to predict the requests in the next interval with RL to obtain the optimal action.

Performance Metrics
To evaluate the performance of the whole system, the following three performance metrics are used to compare with the other algorithms: 1) SLA violation: the SLA is used to monitor quality of service, performance, and response time from the service point of view. This paper is especially concerned with SLA violations, and the delay time is used to approximate them. 2) VMs allocated: the number of VMs allocated to a cloud service S_i during the ∆t interval. This is a secondary indicator.
3) CPU utilization: this metric represents the utilization of resources. Table 3 shows the comparison of the four approaches. Firstly, the "Worst" approach and the "Linear" approach yield the worst results, leading to a bad user experience. The "Worst" approach relies solely on the current state and cannot allocate VMs ahead of time. Although the "Linear" approach uses historical data to predict the requests, its accuracy is not as good as that of the LSTM approach. Secondly, the LSTMQL approach also performs fewer scheduling operations. Of course, the LSTMQL approach needs more virtual machines, because it ensures enough machines to deal with workload spikes; the number of additional virtual machines is no more than 100 compared with the other approaches. Thirdly, the LSTM approach obtains better results than the remaining approaches. However, when requests suddenly spike, the system oscillates and the CPU utilization even reaches 120%. Although LSTM obtains more accurate prediction results, it does not use empirical data in the planning phase; the Linear approach has the same problem. The LSTMQL approach, in contrast, obtains stable results, because RL decides whether the system needs to schedule VMs, avoiding unnecessary resource scheduling.
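The SLA-violation metric above, approximated by delay time, can be computed as in the following sketch. The 500 ms threshold is an illustrative assumption; the paper does not state the SLA delay bound it uses.

```python
def sla_violation_rate(delays_ms, threshold_ms=500.0):
    """Fraction of requests whose delay exceeds the SLA threshold.
    The default 500 ms threshold is illustrative, not from the paper."""
    if not delays_ms:
        return 0.0
    violations = sum(1 for d in delays_ms if d > threshold_ms)
    return violations / len(delays_ms)
```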

Experimental results and Discussion
Overall, the proposed approach enables the virtual machines to work steadily and decreases SLA violations by up to 20%-30% compared with the "Worst" approach. The reasons behind this performance can be summed up as follows. First, the LSTM network is more accurate in predicting future requests, so the system can allocate appropriate resources ahead of time. Second, RL decides whether the system needs to schedule VMs, avoiding unnecessary resource scheduling; especially when requests suddenly spike, RL makes the right decision based on historical experience and the current system state.

Conclusion and Future Work
This paper proposes an efficient autonomic resource provisioning approach based on the MAPE control loop. The approach applies LSTM to predict the requests in the analysis phase and uses RL to obtain the optimal action to scale in or scale out virtual machines in the planning phase. This paper evaluates the performance of the proposed approach under real-world workload traces and compares the results with other approaches. Experimental results demonstrate that the approach enables virtual machines to work steadily in an unstable environment and reduces SLA violations by up to 20%-30% compared with other approaches. This work only considers homogeneous task scheduling; in future work, effective resource scheduling approaches for heterogeneous tasks and fork-join tasks in cloud systems will be explored.