Towards Cloud-based Asynchronous Elasticity for Iterative HPC Applications

Elasticity is one of the key features of cloud computing. It allows applications to dynamically scale computing and storage resources, avoiding over- and under-provisioning. In high performance computing (HPC), elasticity initiatives are normally modeled to handle bag-of-tasks or key-value applications through a load balancer and a loosely-coupled set of virtual machine (VM) instances. In the joint field of Message Passing Interface (MPI) and tightly-coupled HPC applications, we observe that addressing cloud elasticity requires rewriting source code, previous knowledge of the application and/or stop-reconfigure-and-go approaches. Moreover, profiting from elasticity in the HPC scope raises further problems, since in MPI 2.0 applications the programmers need to handle communicators by themselves, and the sudden consolidation of a VM, together with its process, can compromise the entire execution. To address these issues, we propose a PaaS-based elasticity model named AutoElastic. It acts as a middleware that allows iterative HPC applications to take advantage of dynamic resource provisioning of cloud infrastructures without any major modification. AutoElastic provides a new concept denoted here as asynchronous elasticity, i.e., a framework that allows applications to either increase or decrease their computing resources without blocking the current execution. The feasibility of AutoElastic is demonstrated through a prototype that runs a CPU-bound numerical integration application on top of the OpenNebula middleware. The results showed a saving of about 3 minutes at each scaling out operation, emphasizing the contribution of the new concept in contexts where seconds are precious.


Introduction
One of the key features of the cloud is elasticity: users can scale their resource consumption up or down at any moment, according to either the demand or the desired response time [1,2]. Considering the HPC landscape and a very long running parallel application, a user may want to increase the number of instances to try to reduce the completion time of the application. On the other hand, if an application is not scaling in a linear or close to linear way, and if the user is flexible with respect to the completion time, the number of instances can be reduced. This results in a lower nodes × hours index, and thus in lower cost and energy consumption. Despite these benefits to HPC systems, cloud elasticity has been more extensively explored on client-server Web architectures, such as video on demand and online stores.

Although pertinent for bag-of-tasks and key-value HPC applications, replication techniques and centralized load balancers are not useful by default to implement elasticity on tightly-coupled HPC applications, such as those modeled as Bulk-Synchronous Parallel (BSP), Divide-and-Conquer or pipeline [2,7]. This happens because any resource (de)allocation causes a process reorganization as well as an update of the whole communication topology, not only of the interaction between the load balancer and the target replicas. In addition, there is a problem related to virtual machine consolidation, which can result in the sudden termination of a process and its disconnection from the communication topology, and consequently in an application crash.

Most parallel applications have been developed using MPI 1.x, which offers no support for changing the number of processes during the execution, so applications cannot explore elasticity without appropriate support [8]. While this changed with MPI version 2.0, significant effort is still needed at the application level, both to manually change the process group and to redistribute the data to effectively use a different number of processes. Figure 2 (a) depicts a situation in which elasticity controls are implemented inside the application code using the cloud-supported API. This strategy requires user expertise on cloud monitoring, besides the selection of the appropriate points to insert the calls. Part (b) of Figure 2 explores the use of an elasticity controller outside the application, which is normally offered as an optional component in platforms such as Amazon and Windows Azure [9]. Resource monitoring, as well as allocation and deallocation of VMs, are tasks belonging to the controller, but users must both insert calls in their applications and handle the communication topology reorganization. The call to the elasticity() method represents the link between the application and the controller, so using a controller without it has no effect on load balancing, because the application is not able to detect and use the new resources [10]; a minimal sketch of this burden is given below. To bypass these limitations, some approaches impose code rewriting [2,11], previous configuration of elastic rules and actions [2,12,13,14], former knowledge of the application phases [2,12,13,14], or a stop-reconfigure-and-go [2] mechanism to obtain gains from resource reconfiguration.
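To make the strategy of Figure 2 (b) concrete, the sketch below shows how a user would have to instrument an iterative loop by hand. This is an illustration only: elasticity(), topology_t and the rebalancing helpers are hypothetical names, not part of any concrete cloud API.

/* Hand-instrumented iterative loop (the approach of Figure 2 (b)).
 * All elasticity concerns leak into the application code. */
typedef struct { int num_procs; } topology_t;       /* hypothetical */
enum { TOPOLOGY_UNCHANGED = 0, TOPOLOGY_CHANGED = 1 };

extern int  elasticity(topology_t *topo);           /* probe the controller */
extern void reconnect_process_group(const topology_t *topo);
extern void rebalance_data(const topology_t *topo);
extern void compute_iteration(int step);

void instrumented_loop(int num_steps)
{
    for (int step = 0; step < num_steps; step++) {
        topology_t topo;
        /* The user must poll the controller and react by hand. */
        if (elasticity(&topo) == TOPOLOGY_CHANGED) {
            reconnect_process_group(&topo);  /* user-managed communicators */
            rebalance_data(&topo);           /* user-managed redistribution */
        }
        compute_iteration(step);             /* the actual HPC work */
    }
}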
Aiming at providing cloud elasticity for HPC applications in an efficient and transparent way, this article presents AutoElastic. Although its design focuses on iterative master-slave applications, the model can also serve other HPC programming styles, such as pipeline and BSP. AutoElastic's contribution relies on the concept of asynchronous elasticity: transparent resource and process reorganization from the user's perspective, neither blocking nor stopping the application execution at any resource allocation or deallocation action. To accomplish this, AutoElastic provides a framework with a controller that transparently manages horizontal elasticity actions, i.e., without requiring any application modification or adaptation. Taking Figure 2 (b) as a starting point, our approach offers a framework that hides all the shaded boxes from the user. Although the standard use of a controller enables the setup of VMs in parallel to the application runtime, the benefits of the new resources are not transparent to the users. As discussed earlier, scaling in operations also appear as a problem in the standard utilization of a controller, since the consolidation of one or more VMs will suddenly terminate the processes residing on them, which can cause a premature application ending.

The proposed model assumes that the target HPC application is iterative by nature, i.e., it has a time-step loop. This is a reasonable assumption for most MPI programs [15,16], so it does not limit the applicability of our model. This article describes AutoElastic and a prototype developed with OpenNebula. Tests with a CPU-bound numeric integration application show gains of up to 26% when using AutoElastic in comparison with static provisioning. The remainder of this article first introduces the related work in Section 2, pointing out open issues and research opportunities. Section 3 is the main part of the article, describing AutoElastic's framework together with the asynchronous elasticity concept in detail. Section 4 describes a prototype implementation. The evaluation methodology and results are discussed in Sections 5 and 6. Finally, Section 7 emphasizes the scientific contribution of the work and notes several challenges that we can address in the future.

Related Work
Elasticity is one of the most attractive features of cloud computing because it allows users to scale resources on demand. There are different ways of using the elasticity provided by cloud infrastructures, such as manual setup [17,18,19] and pre-configuration of reactive elastic mechanisms [20,9]. While the former is not suitable for applications that need automatic and transparent elasticity, the latter entails rather complicated tasks for non-cloud-savvy users (e.g., defining thresholds and elasticity actions).
Middleware solutions for building elastic computing infrastructures, such as OpenStack (https://www.openstack.org), OpenNebula (http://opennebula.org), Eucalyptus (https://www.eucalyptus.com) and CloudStack (http://cloudstack.apache.org), commonly offer elasticity through manual mechanisms (e.g., command line and graphical tools that allow users to control virtual machines). Complementary solutions such as Elastack [21], which provides automated monitoring and adaptation functions, can be integrated with OpenStack-like systems to provide dynamic infrastructure elasticity. However, Elastack works only at the infrastructure level, i.e., applications have to be made aware that nodes can be started or shut down at any time. In other words, it is up to the developers to ensure any kind of consistency or failure tolerance in the applications.
More recently, different research initiatives started to look at how elasticity can be leveraged by HPC applications. As an example, ElasticMPI proposes an elasticity framework for MPI applications through the stop-reconfigure-and-go approach [2]. However, this approach can negatively impact the performance of applications, in particular those that do not have long execution times. A second drawback of ElasticMPI is that it requires applications to be modified. Another approach, named Auto-elasticity [22], considers a pre-defined auto-elasticity by adjusting the number of VM instances according to the application's input data (workload). In other words, as Auto-elasticity assumes that a program is modeled on a deadline basis, the number of VMs is pre-defined in order to meet the deadlines. Most of the existing solutions that provide cloud elasticity for high performance applications are built around the master-slave programming model [2,11,23]. In the case of iterative applications, which are the most common ones, this means that at each new loop the master redistributes the tasks to the slaves [2,11]. However, in most cases the elasticity of the system is provided in a reactive way at the IaaS level, i.e., without knowledge of on-the-fly information from the applications. Summing up, current approaches suffer from different issues, such as (i) the lack of a mechanism to verify whether the application has actually reached its peak load when a load balancing threshold value is hit [21,23]; (ii) extra complexity at the application level, i.e., the code needs to be instrumented and/or reorganized [2,11]; (iii) static elasticity defined by pre-execution information [2,14]; (iv) reconfiguration of the application's resources using a stop-and-relaunch approach [2]; and (v) the assumption that the communication latency between any two VMs is constant [24].
Considering the scope of MPI applications, Raveendran, Bicer and Agrawal [2] proposed one of the most advanced approaches to support the execution of such applications. Nevertheless, as mentioned above, their solution needs application data in advance to feed the elasticity middleware and requires the insertion of elasticity code in the MPI application, besides needing to stop and relaunch the whole application when elasticity takes place. Observing the initiatives described here, we propose AutoElastic as a first step towards addressing the aforementioned issues (i), (ii), (iii), and (iv). In other words, our solution does not add any extra code or complexity to existing HPC applications, allows dynamic (runtime) elasticity, and enables on-the-fly reconfiguration of resources without having to stop and relaunch the application.

AutoElastic Model
Traditionally, HPC applications are executed on clusters or even on grid architectures. In general, both have a fixed number of resources that must be maintained in terms of infrastructure configuration, scheduling (where tools such as PBS, OAR and OGS are usually employed for resource reservation and job scheduling) and energy consumption. In addition, tuning the number of processes to execute an HPC application can be a hard procedure: (i) both small and large values will fail to explore the distributed system in an efficient way; (ii) a fixed value cannot fit irregular applications, where the workload varies along the execution and/or sometimes is not predictable in advance. On the other hand, cloud elasticity abstracts the infrastructure configuration and technical details about resource scheduling from users, who pay for resources, and consequently for energy, in accordance with the application's demands. However, the main gaps between HPC and elasticity are application modeling and the overhead related to scaling out operations.

Aiming at addressing these gaps, we propose AutoElastic: a cloud elasticity model that operates at the PaaS level of a cloud, acting as a middleware that enables the transformation of a non-elastic parallel application into an elastic one. Thus, AutoElastic was proposed as a solution to three problem statements: (i) how to offer elasticity without requiring the programmer to write rules or actions in the application code; (ii) how to support parallel applications without prior knowledge of their behavior; (iii) what the user must provide to run an elastic application in the cloud. AutoElastic provides transparent horizontal and reactive elasticity for parallel applications, i.e., without requiring the intervention of the programmer (also named here the cloud user) to specify sets of rules, define actions, or modify the application's code. Figure 3 (a) illustrates the traditional approaches to providing cloud elasticity to HPC applications, while (b) highlights AutoElastic's idea. The approach proposed by AutoElastic allows users to submit a traditional, non-elastic-aware application to the cloud, while the framework takes care of resource reorganization through automatic VM allocation and consolidation procedures. As AutoElastic works at the granularity of virtual machines, it has to be aware of the VM instantiation overhead to provide seamless elasticity, i.e., in a non-prohibitive way for HPC applications.

Figure 3. General ideas on using elasticity: (a) standard approach adopted by Amazon AWS and Windows Azure, in which the user must pre-configure a set of elasticity rules and actions; (b) AutoElastic's idea, contemplating a manager that coordinates the elasticity actions and configurations on behalf of the user.

Architecture
AutoElastic is a middleware that operates at the PaaS (Platform as a Service) level, allowing non-elastic parallel applications to take advantage of cloud elasticity without any change. To provide elasticity, it works with scaling in and scaling out operations that consolidate or allocate virtual machine instances, respectively. Figure 4 depicts the AutoElastic architecture, presenting the framework components and the mapping of VMs. The framework includes a Manager, which can either be assigned to a virtual machine inside the cloud or act as a stand-alone program outside the cloud; this is possible by taking advantage of cloud-supported APIs. As HPC applications are commonly CPU-bound, we opted to create one process per VM and c working VMs per computing node, where c refers to the number of computational cores inside the node. This design decision has been previously investigated and validated as a way of exploring the efficiency of large computing nodes [25]. In addition, Figure 4 also presents the first ideas regarding the scope of HPC applications, presenting VMs that execute master and slave processes.

The AutoElastic Manager monitors the virtual machines, taking elasticity actions when it considers them pertinent for the current hardware and application behavior. The user can provide a file with an SLA (Service-Level Agreement) containing the minimum and the maximum number of VMs allowed to execute the application on the cloud. If no SLA is provided, the default upper bound on virtual machines is two times the number of VMs used when launching the application. Instead of offering application-sided elasticity, the use of a manager brings the benefit of reorganizing resources asynchronously from the application's perspective, not penalizing it with VM (de)allocation actions. However, this non-blocking operation raises the following question: how can we notify the application about the resource reconfiguration?
We can achieve this goal through a framework that implements the concept of asynchronous elasticity.
Asynchronous elasticity is a way of asynchronously notifying applications about changes on the underlying infrastructure, such as the number of computing instances. For instance, the application is notified as soon as a new computing VM instance is available in the system (scale out), without impairing its normal execution flow.
AutoElastic provides a framework that implements the concept of asynchronous elasticity. One of its key elements for providing asynchronous elasticity in a transparent fashion is a shared data area, which is used to provide interaction between the AutoElastic Manager and the VMs inside the cloud. Shared data areas are a common practice for sharing data between VM instances on cloud infrastructures [17,18,19]. They can be implemented by different means, such as network file systems, message-oriented middlewares, and tuple spaces. Thus, AutoElastic uses the shared data area as a means to combine the HPC application and cloud elasticity, providing the actions presented in Table 1.

The shared data area provides three types of notifications, as summarized in Table 1. Action 1 is an asynchronous notification sent by the AutoElastic Manager to the application announcing new, ready-to-use computing resources. Figure 5 illustrates the functioning of the AutoElastic Manager when creating a new slave and launching Action 1 afterwards (this flow is sketched below). Action 2 is required for two reasons: (i) to avoid abruptly finishing a running process, which might lead to data losses; (ii) to ensure that the application will not be aborted due to a sudden interruption of a process. The second rationale is particularly important for MPI applications that execute over TCP/IP networks, since they are usually aborted when a process abruptly disconnects. Finally, Action 3 is a decision taken by the master process that avoids an inconsistent global state during the application's execution. In other words, once Action 2 has been received, the master process does not dispatch any task to the slaves which belong to the node that will be consolidated. The shared data area plays a key role in this process since it keeps all processes updated regarding any resource reconfiguration, allowing a safe adaptation to the new network topology.

AutoElastic uses VM replication to provide cloud elasticity for HPC applications [26]. When scaling out, the Manager launches new virtual machines using a pre-defined VM template. If the current nodes are working at full capacity, the Manager first allocates a new computing node to host the new VMs. The bootstrap of a VM is a time-consuming procedure (e.g., boot time of the operating system) that finishes with the execution of a slave process. This slave automatically requests a connection to the master process, completing the asynchronous elasticity cycle. The master process includes the new slaves in the process group without any disruption or interruption of the application's execution. After that, the new slave processes normally receive tasks from the master. The consolidation (scale in) takes place at node granularity, and not at the VM or process level. This design decision seeks to explore efficiency and energy saving, never using the power of a computing node only partially. In fact, it has been claimed before that the number of VMs or processes inside a node is not the main factor for energy saving, but whether the node is turned on or off [27].
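The sketch below illustrates, under our reading of Figure 5 and Table 1, how the Manager side of the scale-out cycle could be written. The file layout (an action flag file plus a list of new slave IPs) and the launch_vm_from_template() helper are assumptions made for illustration; the paper only specifies that the shared data area (NFS in the prototype) carries the notifications.

#include <stdio.h>

#define ACTION_NEW_RESOURCES 1  /* Action 1 in Table 1 */

/* Hypothetical helper: blocks until the VM is booted and the
 * slave process is running, then returns its IP address. */
extern const char *launch_vm_from_template(const char *template_name);

/* Manager-side scale out: allocate VMs, then notify the
 * application asynchronously through the shared data area. */
void scale_out(const char *shared_dir, int new_vms)
{
    char path[256];

    /* 1. Launch the new slave VMs (time-consuming: VM boot). */
    snprintf(path, sizeof(path), "%s/new_slaves.txt", shared_dir);
    FILE *ips = fopen(path, "w");
    for (int i = 0; i < new_vms; i++)
        fprintf(ips, "%s\n", launch_vm_from_template("slave-template"));
    fclose(ips);

    /* 2. Only after the VMs are ready, publish Action 1 so the
     * master accepts the pending connections at its next iteration. */
    snprintf(path, sizeof(path), "%s/action.txt", shared_dir);
    FILE *flag = fopen(path, "w");
    fprintf(flag, "%d\n", ACTION_NEW_RESOURCES);
    fclose(flag);
}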
Similarly to previous work [20,28], AutoElastic performs resource monitoring periodically. At each monitoring interval, AutoElastic captures the CPU metric and computes a time series that is checked against the lower and upper thresholds [29]. Thresholds are largely used in the state of the art of cloud elasticity to drive resource reorganization for CPU-bound applications [1,2,4,28]. AutoElastic uses a moving average over a specific number of load observations to generate a single metric value, and elasticity actions are triggered in situations in which this metric violates one of the thresholds. To accomplish this, CPU data is combined by the function LP (Load Prediction), as presented in Equations 1 and 2. MA(i, j) informs the CPU load of virtual machine j at observation number i: it is a moving average over the last x observations of the load C, taking observation i as the starting point. Using this value, LP(i) computes an arithmetic average, thus establishing an average load for the system; here, n refers to the number of virtual machines in execution. Action 1 is triggered if LP is larger than the upper threshold, while Action 2 takes place when LP falls below the lower threshold.
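For reference, Equations 1 and 2 can be stated as follows. This is a reconstruction derived from the definitions above (a moving average over the last x load observations, averaged across the n running VMs), not a verbatim copy of the original formulas:

MA(i, j) = \frac{1}{x} \sum_{k=0}^{x-1} C(i-k,\, j) \quad (1)

LP(i) = \frac{1}{n} \sum_{j=1}^{n} MA(i, j) \quad (2)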

Model of Parallel Application
AutoElastic explores data parallelism on iterative message-passing applications modeled following the master-slave parallel programming model. This model is extensively used in genetic algorithms, the Monte Carlo technique, geometric transformations of 2D and 3D images, asymmetric cryptography and SETI@home-like applications [2]. It is worth emphasizing, however, that the framework allows the existing processes of the HPC application to know the identifiers of newly instantiated processes, i.e., it also enables an all-to-all communication topology. In other words, AutoElastic also supports applications such as BSP and Divide-and-Conquer.

For developing the communication framework, we investigated the semantics and syntax of both MPI 1.0 and 2.0. While the former statically creates all processes at launch time, the latter supports dynamic process creation and on-the-fly reconfiguration of the connection topology, which makes MPI 2.0 suitable for elastic environments. AutoElastic parallel applications follow the MPMD (Multiple Program Multiple Data) principle, where master and slave processes have different executable codes, and each type of binary is mapped to a different VM template. The idea is to offer application decoupling for processes with different purposes, enabling flexibility and making the implementation of elasticity easier.

Listing 1 presents a pseudocode of an AutoElastic-supported iterative application. The master code executes a series of tasks, capturing each one sequentially and parallelizing it to be processed on the slave processes; this behavior can be observed in the external loop (line 2). Currently, AutoElastic works with the following MPI 2.0-like communication directives: (i) publication of a connection port; (ii) lookup of a server, taking a connection port as the starting point; (iii) connection request; (iv) connection acceptance; and (v) disconnection request. Different from the approach in which a master launches processes using the so-called spawn() directive, AutoElastic acts in accordance with the second MPI 2.0 approach to dynamic process creation: Sockets-based point-to-point communication. The launching of a new VM automatically entails the execution of a slave process, which requests a connection to the master automatically, as presented in Listing 2 (a sketch of these directives is given below). Here, we emphasize that an AutoElastic-supported application does not necessarily need to rely on the MPI 2.0 API, but only to follow the semantics of the communication directives.

The master obtains the IP addresses of each process from the shared data area. Taking this information into account, the master knows the number of slaves and creates port names to receive connections from the slave processes. The communication happens asynchronously: the master sends data to the slaves in a non-blocking fashion but receives data from them synchronously. In fact, loop-based programs are convenient for implementing cloud elasticity because it is easier to reconfigure the number of resources at the beginning of each iteration without changing the application semantics. Moreover, the job distribution loop is where the global consistent state of the system is kept.
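The following fragment illustrates the MPI 2.0 connection semantics on which Listings 1 and 2 rely. It is our illustration rather than the paper's original code, and the exchange of the port name through a fixed file path in the shared area is an assumption:

#include <mpi.h>
#include <stdio.h>

/* Master side: publish a port and accept a dynamically created slave. */
void master_accept(MPI_Comm *slave_comm)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);

    /* In AutoElastic the port name would be exposed through the
     * shared data area; a plain file stands in for it here. */
    FILE *f = fopen("/shared/master_port.txt", "w");
    fprintf(f, "%s\n", port);
    fclose(f);

    /* Blocks until a new slave connects; the master only calls this
     * after Action 1 announced that a slave is ready. */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, slave_comm);
}

/* Slave side: started automatically when its VM boots. */
void slave_connect(MPI_Comm *master_comm)
{
    char port[MPI_MAX_PORT_NAME];
    FILE *f = fopen("/shared/master_port.txt", "r");
    fscanf(f, "%s", port);
    fclose(f);

    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, master_comm);
}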
The user does not need to insert any line related to cloud elasticity in the application code. The AutoElastic middleware manages the transformation of a non-elastic application into an elastic one at the PaaS level through one of the following strategies: (i) polymorphism can overload a method to provide elasticity in object-oriented implementations; (ii) a source-to-source translator can insert code between lines 1 and 2; (iii) a wrapper for the function in line 3 can be developed for procedural languages. Independent of the strategy, the code required for elasticity is simple, as shown in Listing 3. First, we verify whether there is a new action from the AutoElastic Manager in the shared data area. If Action 1 has been activated, the master process reads the information regarding the new slaves and knows that it must expect new connections from them. In the case of Action 2, the master removes from its group the processes that belong to the specific node; after doing that, it triggers Action 3.
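A possible rendering of the Listing 3 logic in C is given below. The function names and the shared-area layout are illustrative assumptions; only the action semantics (1: new resources, 2: consolidation request, 3: consolidation allowed) come from the text:

enum { NO_ACTION = 0, ACTION_NEW_RESOURCES = 1,
       ACTION_CONSOLIDATION_REQ = 2, ACTION_CONSOLIDATION_OK = 3 };

/* Hypothetical helpers on the master side. */
extern int  read_action(const char *shared_dir);
extern void accept_new_slaves(const char *shared_dir);   /* MPI_Comm_accept */
extern void drop_slaves_of_node(const char *shared_dir);
extern void write_action(const char *shared_dir, int action);

/* Called once per iteration, before distributing tasks. */
void check_elasticity(const char *shared_dir)
{
    switch (read_action(shared_dir)) {
    case ACTION_NEW_RESOURCES:           /* Action 1 */
        accept_new_slaves(shared_dir);   /* enlarge the process group */
        break;
    case ACTION_CONSOLIDATION_REQ:       /* Action 2 */
        drop_slaves_of_node(shared_dir); /* stop dispatching to that node */
        write_action(shared_dir, ACTION_CONSOLIDATION_OK); /* Action 3 */
        break;
    default:
        break;                           /* nothing to do this iteration */
    }
}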
Although the design of AutoElastic takes master-slave applications into account, the iterative modeling and the use of MPI 2.0-like directives make it easy to add and remove processes, as well as to establish completely new and arbitrary topologies. At the implementation level, it is possible to optimize connection and disconnection procedures when a particular slave process remains among the active ones in the process list. This improvement can benefit TCP-like connections that require a three-way handshake protocol, which might be expensive for some applications.

Implementation
We developed an AutoElastic prototype for OpenNebula-based private clouds. The OpenNebula Java API, which was used for developing the AutoElastic Manager, provides the resources required to control both resource monitoring and scaling in and out activities. Moreover, the API is also used to launch parallel applications in the cloud. To run the processes, we created two VM templates, one for the master and another for the slaves. In the following, we present some technical decisions of the prototype implementation (a sketch of the resulting threshold test appears after this list):
• We used the WS-Agreement XML standard to define an SLA, which specifies the minimum and maximum number of VMs for the tests;
• The shared data area was implemented through NFS, enabling all VMs inside the cloud infrastructure to access the files. The AutoElastic Manager, which can run outside of the cloud, uses the SSH protocol to access the shared data area on the front-end node;
• The load LP for monitoring observation number i, denoted LP(i), is computed using the moving average of the slave VMs, considering a window of 3 observations;
• The monitoring interval was 30 seconds;
• Based on the related work (see Section 2), we defined 40% and 80% as the lower and upper thresholds, respectively.
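Under the parameters above (window of 3 observations, thresholds of 40% and 80%), the Manager's decision step could look like the sketch below. This is an illustration of Equations 1 and 2 plus the threshold test, not code from the prototype (which uses the OpenNebula Java API):

#define WINDOW 3
#define LOWER  40.0
#define UPPER  80.0

/* C[i][j]: CPU load (%) of VM j at observation i (requires i >= WINDOW-1). */
double lp(double **C, int i, int n)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++) {
        double ma = 0.0;                 /* Equation 1: MA(i, j) */
        for (int k = 0; k < WINDOW; k++)
            ma += C[i - k][j];
        sum += ma / WINDOW;
    }
    return sum / n;                      /* Equation 2: LP(i) */
}

/* Returns 1 to scale out (Action 1), -1 to scale in (Action 2), 0 otherwise. */
int decide(double **C, int i, int n)
{
    double load = lp(C, i, n);
    if (load > UPPER) return 1;
    if (load < LOWER) return -1;
    return 0;
}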

Parallel Application and Evaluation Methodology
We developed a numeric integration application to evaluate the gains with and without asynchronous elasticity. The idea was to observe the benefits (e.g., gains in performance, such as reduced execution time) of cloud elasticity for HPC applications. The application computes the numerical integration of a function f(x) in a closed interval [a, b]. In the implementation, we used the Composite Trapezoidal rule from a Newton-Cotes postulation [30], expressed in Equation 5 (reconstructed here from the surrounding definitions):

\int_{a}^{b} f(x)\,dx \approx \frac{h}{2} \left[ f(x_0) + 2 \sum_{k=1}^{s-1} f(x_k) + f(x_s) \right], \quad h = \frac{b-a}{s}, \quad x_k = a + kh \quad (5)

where the values of x_0 and x_s are equal to a and b, respectively. In this context, s means the number of subintervals. Following this equation, there are s+1 f(x)-like simple evaluations for obtaining the final result of the numerical integration. The master process must distribute these s+1 evaluations among the slaves. Logically, some slaves can receive more work than others when s+1 is not fully divisible by the number of slaves. Thus, the number of subintervals s defines the computational load of each function.
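To make the distribution concrete, the sketch below shows one way the s+1 evaluations could be split into per-slave ranges; the chunking policy and the placeholder integrand are our illustration, since the text does not detail them:

/* The integrand; the actual polynomial used in the tests is not
 * specified, so a placeholder is used here. */
static double f(double x) { return x * x; }

/* Partial trapezoidal sum over evaluation points [first, last] out of
 * the s+1 points x_k = a + k*h; each slave computes one such range. */
double partial_trapezoid(double a, double b, long s, long first, long last)
{
    double h = (b - a) / (double)s;
    double sum = 0.0;
    for (long k = first; k <= last; k++) {
        double x = a + (double)k * h;
        /* Endpoints x_0 and x_s weigh 1/2; interior points weigh 1. */
        double w = (k == 0 || k == s) ? 0.5 : 1.0;
        sum += w * f(x);
    }
    return h * sum;  /* the master adds the partial sums of all slaves */
}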
Aiming at analyzing the parallel application under different input loads, we considered four patterns: Constant, Ascending, Descending and Wave. Table 2 and Figure 6 show the equation of each pattern and the template used in the tests. The iterations in this figure represent the number of functions that are generated, resulting in the same number of numerical integrations. Additionally, the particular polynomial selected for the tests does not matter here, because we focus on the load variations and not on the result of the numerical integration itself.

Table 2. Functions to express different load patterns. In load(x), x is the iteration index at application runtime.

Figure 7 shows a graphical representation of each pattern. The x axis in the graph expresses the number of functions (one function per iteration) being tested, while the y axis informs the respective load. The load means the number of subintervals s between the limits a and b, which in this experiment are 1 and 10, respectively. The larger the number of subintervals, the greater the computational load for generating the numerical integration of the function. For the sake of simplicity, the same function is employed in all tests, but the number of subintervals for the integration varies.

Considering the cloud infrastructure, OpenNebula is executed in a cluster with 10 nodes. Each node has two processors, which are exclusively dedicated to the cloud middleware. The AutoElastic Manager runs outside the cloud and uses the OpenNebula API to control and launch VMs. Our SLA was set up for a minimum of 2 nodes (4 VMs) and a maximum of 10 nodes (20 VMs).

Evaluation and Discussion of Results
We evaluated the numerical application using the four load patterns in two scenarios: with cloud elasticity enabled and disabled. In each execution, the initial configuration comprises 2 nodes, the first executing 2 VMs (2 slave processes) and the second executing 3 VMs (2 slave processes and the master). We collected two metrics: the time (in seconds) to execute the application and the number of load observations performed by AutoElastic during the execution. For each observation i, we have the number of VMs executing at that moment, as well as the result of LP(i). The results can be seen in Table 3; the last column shows the cost according to Equation 3. As can be seen in Table 3, when elasticity is enabled, the Ascending, Descending and Wave loads used different numbers of VMs during the application execution. The Constant load, on the other hand, used the same configuration in both scenarios, because LP(i) remained between the lower and upper thresholds, i.e., no elasticity operations were necessary. Moreover, the execution time and the number of observations are lower in the executions where resource reorganizations happened.
In the Ascending load with elasticity enabled, 47.7%, 40% and 12.3% of the observations were performed with the three VM configurations used along the run, respectively: as the load grows during the execution, the application needs progressively more resources, later varying between the configurations with 6 and 8 VMs. Figures 8 and 9 illustrate the execution time of the application and the total cost obtained in each scenario. The elastic execution outperforms the non-elastic execution in the Ascending, Descending and Wave patterns, presenting performance gains of 18%, 26% and 22%, respectively. This behavior was also perceived when observing the cost, where AutoElastic with elasticity support resulted in costs approximately 14%, 11% and 10% lower than those of the non-elastic execution for the same load patterns. Considering that we are allocating more resources on-the-fly to avoid bottlenecks in the application's execution, elasticity helped to reduce the execution times, as can be seen in Table 3. Although using more resources, the gain in the time metric is enough to provide lower cost values in favor of the elastic execution. In other words, when compared with the non-elastic execution, AutoElastic uses more resources, which is compensated in terms of execution time.

Figure 9. Cost obtained to execute the parallel application in the different scenarios and loads.

Figure 10 depicts a comparison of the history of resource allocation when combining load patterns and scenarios. We are not considering the Constant pattern because it does not cause elasticity actions. As expected, resources are allocated at specific moments in the Ascending pattern, while the Descending pattern shows allocation in the beginning and a single deallocation at the end of the application. We leave as future work a deeper analysis of the impact of variable thresholds.

In the testbed environment, the procedure of allocating new resources comprises the transfer of two VMs to a new node over a 100 Mbps network and the initialization of the VMs afterwards. Each VM is based on a template of 700 MBytes. During the whole phase of allocating new VMs, the application executes normally with the current resources; the resource reorganization is performed only after the new VMs are completely delivered. Table 4 presents the instants in time when new resources were allocated in the tests (see Figure 10). In this table, "VM allocation" represents the instant (including both the observation number and the application time at that instant) in which the LP(i) function violates a threshold, thus triggering a new resource allocation. The term "VM delivering" represents the moment in which previously allocated resources were delivered, i.e., attached to the application. The average time between the start of a resource allocation and its delivery to the application is about 214 seconds.

Conclusion
This article addressed cloud elasticity for iterative HPC applications through the proposition of the AutoElastic model. AutoElastic self-organizes the number of virtual machines without user intervention, bringing benefits both to the cloud administrator (better energy saving and resource sharing among users) and to the cloud users (who can profit from better performance and quicker application deployment in the cloud). Section 3 presented three problem statements that were addressed as follows: (i) AutoElastic acts at the PaaS level, not requiring the programmer to write elasticity actions and rules in the application code to obtain an elastic execution. It also offers asynchronous elasticity, which proved relevant to enable the use of HPC applications in the cloud computing environment. (ii) The current version of AutoElastic works with master-slave iterative applications, not needing prior information about their behavior. AutoElastic provides a framework fully compatible with tightly-coupled applications, so models such as BSP and Divide-and-Conquer can be adapted in the future to take advantage of cloud elasticity. Concerning the performance gains with cloud elasticity, the evaluation showed that it is possible to reduce the execution time of a numerical integration application by about 18% to 26%. (iii) We assume that the user developed an iterative application and provides VM templates both for the master and the slave processes. Moreover, the user has the option to submit an SLA when launching the application; if not provided, AutoElastic takes as default an upper bound of twice the number of VMs used at launch time.
Our approach to the application model is justified by the fact that HPC programs can be developed with the Sockets-like MPI 2.0 programming style. This style allows processes to connect and disconnect easily, providing an effective use of the available resources. AutoElastic offers reactive and horizontal elasticity, contradicting the claim of Spinner et al. [31], who affirm that only vertical scaling is suitable for HPC scenarios due to the inherent overhead of the complementary approach. Thus, we modeled a framework to provide the novel concept of asynchronous elasticity, which turned out to be a crucial feature for enabling automatic resource reorganization without prohibitive costs. The aforesaid performance results are emphasized when analyzed together with the consumed energy, showing that AutoElastic's elasticity does not present a forbidding cost.
As future work, we intend to explore the self-organization of the thresholds in accordance with application feedback. Finally, as explained earlier, we also plan to extend AutoElastic to contemplate other parallel programming models, including Divide-and-Conquer and BSP.