Experiments on cost/power and failure aware scheduling for clouds and grids
Jorge G. Barbosa, Altino M. Sampaio, Hamid Arabnejad
Universidade do Porto, Faculdade de Engenharia, LIACC, Porto, Portugal, jbarbosa@fe.up.pt
Outline
• Dynamic Power- and Failure-aware Cloud Resources Allocation for Sets of Independent Tasks
• A Budget Constrained Scheduling Algorithm for Workflow Applications on Heterogeneous Clusters
COST IC804 – IC805 Joint meeting, Tenerife, February 7-8 2013
Dynamic Power- and Failure-aware Cloud Resources Allocation for Sets of Independent Tasks
• Cloud computing paradigm
• Dynamic provisioning of computing services.
• Employs Virtual Machine (VM) technologies for consolidation and environment isolation.
• Node failures can occur due to hardware or software problems.
• Image source: http://www.commputation.kit.edu/92.php
Characteristics
• Dependability of the infrastructure
• Distributed systems continue to grow in scale and in complexity.
• Failures become the norm, and can lead to violation of the negotiated SLAs.
• The Mean Time Between Failures (MTBF) would be 1.25 h on a petaflop system (1).
• Energy consumption
• Energy consumption is determined mainly by the CPU, and dominates the operational costs.
[Figure: tasks (Task 1 … Task n) running in VMs (VM 1 … VM n) hosted via VMMs on physical machines PM 1 … PM m; PM – Physical Machine]
(1) S. Fu, "Failure-aware resource management for high-availability computing clusters with distributed virtual machines," Journal of Parallel and Distributed Computing, vol. 70, April 2010, pp. 384-393, doi:10.1016/j.jpdc.2010.01.002.
Related Work
• Dynamic allocation of VMs, considering PMs' reliability
• Based on a failure predictor tool with 76.5% accuracy
• Proposed architecture for reconfigurable distributed VMs (1)
(1) Optimistic Best-Fit (OBFIT) algorithm
- Selects the reliable PM with the minimum weighted available capacity and reliability.
(2) Pessimistic Best-Fit (PBFIT) algorithm
- Also selects unreliable PMs, in order to increase the job completion rate.
- Selects the unreliable PM p with capacity Cp such that Cavg + Cp yields the minimum required capacity, where Cavg is the average capacity of the reliable PMs.
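The OBFIT selection rule above can be sketched as follows. This is a rough illustration, not the paper's exact formulation: the dictionary field names and the 50/50 weighting of spare capacity against unreliability are assumptions.

```python
def obfit(pms, required_capacity):
    """Optimistic Best-Fit sketch: among reliable PMs with enough free
    capacity, pick the one minimizing a weighted combination of spare
    capacity (tightest fit) and unreliability. Weights are illustrative."""
    candidates = [p for p in pms
                  if p["reliable"] and p["free_capacity"] >= required_capacity]
    if not candidates:
        return None  # no reliable PM fits; PBFIT would then consider unreliable PMs
    return min(candidates,
               key=lambda p: 0.5 * (p["free_capacity"] - required_capacity)
                           + 0.5 * (1.0 - p["reliability"]))
```

With two reliable PMs of equal predicted reliability, the one leaving the least spare capacity after placement is chosen.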
Approach
• The goal: construct power- and failure-aware computing environments, in order to maximize the rate of jobs completed by their deadline.
• It is a best-effort approach, not an SLA-based approach;
• Virtual-to-physical resource mapping decisions must consider both the power efficiency and the reliability levels of compute nodes;
• Dynamic update of virtual-to-physical configurations (CPU usage and migration).
Approach
• Multi-objective scheduling algorithms are addressed in three ways:
1- Finding the Pareto-optimal solutions, and letting the user select the best solution.
2- Combining the two objective functions into a single objective function.
3- Bicriteria scheduling, in which the user specifies a limit for one criterion (e.g. a power or budget constraint), and the algorithm tries to optimize the other criterion under this constraint.
Approach
• Leverage virtualization tools
• Xen credit scheduler: dynamically update the cap parameter, while enforcing work-conserving mode.
• Stop & copy migration: faster VM migrations, preferable for proactive failure management.
[Figure: power consumption vs. CPU%; timeline of VMs on PM1–PM3 showing a failure, a stop & copy migration, and the failure-prediction accuracy window]
System Overview
• Cloud architecture
• Private cloud with homogeneous PMs
• A cluster coordinator manages users' jobs
• VMs are created and destroyed dynamically
• Users' jobs
• A job is a set of independent tasks
• A task runs in a single VM, whose CPU-intensive workload is known
• The number of tasks per job and the task deadlines are defined by the user
[Figure: private cloud management architecture]
Power Model
• Linear power model: P = p1 + p2 · CPU%
• Power efficiency: measures the quantity of useful work done (i.e. completed users' jobs) per unit of consumed power.
• Completion rate of users' jobs
• Working efficiency
[Figure: example of power-efficiency curve (p1 = 175 W, p2 = 75 W)]
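The linear model and the efficiency metric can be sketched directly. The slide gives the model and the example constants (p1 = 175 W, p2 = 75 W); the normalisation of power efficiency as utilisation per watt is an assumption, since the slide only names the metric.

```python
def power(cpu, p1=175.0, p2=75.0):
    """Linear power model from the slide: P = p1 + p2 * CPU%,
    with cpu in [0, 1]; p1 is idle power, p2 the dynamic range (watts)."""
    return p1 + p2 * cpu

def power_efficiency(cpu, p1=175.0, p2=75.0):
    """Useful capacity delivered per watt (one plausible normalisation)."""
    return cpu / power(cpu, p1, p2)
```

Because idle power p1 dominates, efficiency rises monotonically with utilisation, which is why consolidation onto fewer, busier PMs pays off.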
Proposed algorithms
• Minimum Time Task Execution (MTTE) algorithm
• Slack time to accomplish task t
• PM i capacity constraints
• Selects a PM if:
• It guarantees the maximum processing power required by the VM (task);
• It has higher reliability;
• And it increases CPU power efficiency.
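The three MTTE selection criteria above can be sketched as a filter-then-rank rule. This is an illustration under assumptions: field names, the tie-breaking order (reliability first, efficiency gain second), and the reuse of the slide's linear power model are all mine, not the paper's exact pseudocode.

```python
def efficiency_gain(pm, demand, p1=175.0, p2=75.0):
    """Change in power efficiency (utilisation per watt, linear model
    assumed) if the VM's demand were added to this PM."""
    u0, u1 = pm["cpu_used"], pm["cpu_used"] + demand
    return u1 / (p1 + p2 * u1) - u0 / (p1 + p2 * u0)

def mtte_select(pms, vm_demand):
    """MTTE-style sketch: among PMs that can grant the VM its maximum
    required capacity, prefer higher predicted reliability and, on ties,
    the larger power-efficiency gain. CPU quantities are core fractions."""
    feasible = [p for p in pms if 1.0 - p["cpu_used"] >= vm_demand]
    if not feasible:
        return None
    return max(feasible, key=lambda p: (p["reliability"],
                                        efficiency_gain(p, vm_demand)))
```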
Proposed algorithms
• Relaxed Time Task Execution (RTTE) algorithm
• The cap is set in the Xen credit scheduler
• Unlike MTTE, the RTTE algorithm always reserves for the VM the minimum amount of resources necessary to accomplish the task within its deadline.
[Figure: host CPU share (0–100%) allotted to the VM, bounded by the CAP]
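The minimum-reservation idea can be sketched as computing the smallest cap that still meets the deadline. Units and the 800 MFLOPS node rating follow the simulation setup described later; the rounding and the deadline-passed fallback are assumptions.

```python
import math

def rtte_cap(remaining_mflop, slack_s, pm_speed_mflops=800.0):
    """RTTE-style sketch: smallest Xen credit-scheduler cap (% of one
    core) that lets the VM finish its remaining work within its slack."""
    if slack_s <= 0:
        return 100  # deadline reached: grant everything available
    fraction = remaining_mflop / (pm_speed_mflops * slack_s)
    return min(100, math.ceil(fraction * 100))
```

For example, 400 MFLOP of remaining work and 10 s of slack on an 800 MFLOPS node needs only a 5% cap, leaving the rest of the core for consolidation.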
Performance Analysis
• Simulation setup
• 50 PMs, each modeled with one CPU core with performance equivalent to 800 MFLOPS;
• VM stop & copy migration overhead takes 12 seconds;
• 30 synthetic jobs, each consisting of 5 CPU-intensive tasks;
• Failed PMs stay unavailable for 60 seconds;
• The predicted occurrence time of a failure precedes the actual occurrence time;
• Failure instants, job arrival times, and task workload sizes follow a uniform distribution.
Performance Analysis
• Implementation considerations
• Stabilization to avoid multiple migrations
• Concurrency among cluster coordinators
• Algorithms compared to ours
• Common Best-Fit (CBFIT): selects the PM with the maximum power efficiency and does not consider resource reliability
• Optimistic Best-Fit (OBFIT)
• Pessimistic Best-Fit (PBFIT)
Performance Analysis
• Migrations occurring due to proactive failure management only:
• With a failure predictor tool of 76.5% accuracy, the RTTE algorithm presents the best results;
• Working efficiency, as well as the job completion rate, decreases with failure-prediction inaccuracy.
Performance Analysis
• Migrations occurring due to proactive failure management and power efficiency:
• Sliding window of 36 seconds, with a threshold of 65% (a migration starts if CPU usage falls below 65%);
• RTTE returns the best results for 76.5% failure-prediction accuracy;
• Compared to the earlier results, the rate of completed jobs diminishes, since the number of VM migrations increases.
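The sliding-window consolidation trigger described above can be sketched as follows. The 36 s window and 65% threshold come from the slide; the per-second sampling and the requirement that the window be full before firing are assumptions.

```python
from collections import deque

class ConsolidationTrigger:
    """Flag a PM for VM migration when its average CPU usage over a
    sliding window stays below the consolidation threshold."""
    def __init__(self, window_s=36, sample_period_s=1, threshold=0.65):
        self.samples = deque(maxlen=window_s // sample_period_s)
        self.threshold = threshold

    def observe(self, cpu_usage):
        """Record one usage sample (fraction in [0, 1]); return True
        when the full window's average falls below the threshold."""
        self.samples.append(cpu_usage)
        full = len(self.samples) == self.samples.maxlen
        return full and sum(self.samples) / len(self.samples) < self.threshold
```

Requiring a full window is one way to implement the slide's "stabilization to avoid multiple migrations".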
Performance Analysis
• Number of migrations occurring due to failure management and power efficiency
• RTTE and MTTE keep a stable number of migrations and respawns as failure-prediction accuracy varies
• Migrations occurring due to proactive failure management only (75% accuracy)
• RTTE and MTTE return the best working efficiency as the number of failures in the cloud infrastructure rises
Conclusions (1)
• Concluding remarks:
• Power- and failure-aware dynamic allocation improves the job completion rate;
• Dynamically adjusting the cap parameter of the Xen credit scheduler proves capable of obtaining a better job completion rate (RTTE);
• An excessive number of VM migrations to optimize power efficiency reduces the job completion rate.
• Future directions:
• Dynamic allocation considering workload characteristics;
• Data locality;
• Scalability;
• Compare/integrate the DVFS feature;
• Improve PM consolidation (why a 65% threshold?);
• Heterogeneous CPUs.
Outline
• Dynamic Power- and Failure-aware Cloud Resources Allocation for Sets of Independent Tasks
• A Budget Constrained Scheduling Algorithm for Workflow Applications on Heterogeneous Clusters
A Budget Constrained Scheduling Algorithm for Workflow Applications on Heterogeneous Clusters
• A job is represented by a workflow
• A workflow is a Directed Acyclic Graph (DAG): a node is an individual task; an edge represents an inter-task dependency
• Workflow scheduling
• Mapping tasks to resources (e.g. CPU1, CPU2, CPU3)
• The main goal is to obtain a lower finish time for the exit task
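A workflow DAG of this kind can be represented as a plain adjacency list, with any dependency-respecting task order obtained by a topological sort. The tiny four-task workflow below is illustrative, not one of the paper's benchmarks.

```python
# Edges point from a task to the tasks that depend on it.
dag = {"entry": ["t1", "t2"], "t1": ["exit"], "t2": ["exit"], "exit": []}

def topo_order(dag):
    """Dependency-respecting task order (Kahn's algorithm): a task is
    ready only once all of its predecessors have been emitted."""
    indeg = {t: 0 for t in dag}
    for deps in dag.values():
        for d in deps:
            indeg[d] += 1
    ready = [t for t, n in indeg.items() if n == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for d in dag[t]:
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    return order
```

List-scheduling heuristics such as HBCS refine this idea by ordering ready tasks with a priority (here, the upward rank) instead of arbitrarily.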
Introduction
• Target platform: utility Grids that are maintained and managed by a service provider.
• Based on user requirements, the provider finds a schedule that meets the user's constraints.
• In utility Grids, QoS attributes other than execution time, such as economic cost or deadline, may be considered. It is a multi-objective problem.
• Multi-objective scheduling algorithms are addressed in three ways:
1- Finding the Pareto-optimal solutions, and letting the user select the best solution;
2- Combining the two objective functions into a single objective function;
3- Bicriteria scheduling, in which the user specifies a limit for one criterion (power or budget constraints), and the algorithm tries to optimize the other criterion under this constraint.
Proposed Algorithm
Heterogeneous Budget Constrained Scheduling Algorithm (HBCS)
• HBCS has two phases:
• Task selection phase: the upward rank is used to assign priorities to the tasks in the DAG.
• Processor selection phase: both objective functions (cost and time) are combined in a single function; the processor that maximizes that function for the current task is selected.
Proposed Algorithm
Heterogeneous Budget Constrained Scheduling Algorithm (HBCS)
[Equation: objective function, with factor 0 ≤ k ≤ 1; details lost in extraction]
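The slide's equation did not survive extraction. As a hedged sketch only, one common shape for such a combined objective weights normalised cost against normalised time; the exact HBCS formula in the paper may differ in both normalisation and weighting.

```python
def combined_objective(time_norm, cost_norm, k):
    """Hedged sketch of a single cost/time objective. time_norm and
    cost_norm are in [0, 1] (lower is better); k in [0, 1] shifts the
    weight from time (k = 0) to cost (k = 1). Higher score is better;
    the processor maximizing it would be selected."""
    assert 0.0 <= k <= 1.0
    return k * (1.0 - cost_norm) + (1.0 - k) * (1.0 - time_norm)
```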
Experimental Results
• Workflow structure:
• Synthetic DAG generation (www.loria.fr/~suter/dags.html)
• Applications have between 30 and 50 tasks, generated randomly.
• The total number of DAGs in our simulation is 1000.
• Workflow budget: BUDGET = C_cheapest + k · (C_HEFT − C_cheapest), with 0 ≤ k ≤ 1
• Lowest budget (k = 0): cheapest scheduling, highest makespan
• Highest budget (k = 1): shortest makespan (HEFT scheduling)
• Performance metric: [equation lost in extraction]
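The budget formula above is a straight interpolation between the two extreme schedules and can be written directly (variable names are mine; the formula is the slide's):

```python
def budget(c_cheapest, c_heft, k):
    """Budget interpolation from the slide:
    BUDGET = C_cheapest + k * (C_HEFT - C_cheapest), 0 <= k <= 1.
    k = 0 yields the cheapest schedule's cost; k = 1 the HEFT cost."""
    assert 0.0 <= k <= 1.0
    return c_cheapest + k * (c_heft - c_cheapest)
```

Sweeping k over [0, 1] thus spans the whole cost/makespan trade-off between the cheapest schedule and the time-optimal HEFT schedule.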
Experimental Results
• Simulation platform:
• We use SimGrid, which allows a realistic description of the infrastructure parameters.
• We consider a bandwidth-sharing policy; only one processor can send data over one network link at a time.
• We consider nodes of clusters from the Grid'5000 platform.
Results
[Figures: makespan results for the Sophia, Rennes, and Grenoble clusters, and HBCS time complexity]
Conclusions (2)
• Concluding remarks
• We considered a realistic model of the infrastructure;
• The HBCS algorithm achieves better performance, in particular for lower budget values (makespan and time complexity);
• Future directions
• Compare other combinations of cost and time factors in the objective function;
• Data locality;
• Multiple DAG scheduling.
Thank you!