Overprovisioning for Performance Consistency in Grids
Nezih Yigitbasi and Dick Epema
Parallel and Distributed Systems Group, Delft University of Technology
http://guardg.st.ewi.tudelft.nl/
The Problem: Performance inconsistency in grids
• Inconsistent performance is common in grids
 • bursty workloads
 • variable background loads
 • high rate of failures
 • highly dynamic & heterogeneous environment
How can we provide consistent performance in grids?
[Figure: makespans of Bags-of-Tasks with 128 tasks submitted every 15 minutes vary by a factor of ~70]
Our goals GOAL-1 Realistic performance evaluation of static and dynamic overprovisioning strategies (system’s perspective) GOAL-2 Dynamically determine the overprovisioning factor (Κ) for user specified performance requirements (user’s perspective)
Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions
Overprovisioning (I)
• Increasing the system capacity to provide better, and in particular, consistent performance even under variable workloads and unexpected demands
Pros
 • simple
 • obviates the need for complex algorithms
 • easy to deploy & maintain
Cons
 • cost-ineffective
 • workloads may evolve (e.g., an increasing user base)
 • lowly-utilized systems
Overprovisioning (II)
• High overprovisioning factors (Κ) are common in modern systems
 • Google: ~450,000 servers (2005)
 • Microsoft: ~218,000 servers (mid-2008)
 • Facebook: 10,000+ servers (2009)
• Preferred way of providing performance guarantees
 • typical data center utilization is no more than 15-50%
 • telecommunication systems average ~30% utilization
L. A. Barroso and U. Hölzle, The Case for Energy-Proportional Computing, IEEE Computer, December 2007.
Overprovisioning strategies
1. Static
 • Largest / All / Number
 • Where should we deploy the resources? Does it make any difference?
2. Dynamic
 • a.k.a. auto-scaling
 • low/high thresholds for acquiring/releasing resources
• Given Κ, it is straightforward to determine the number of processors for a strategy
[Figure: capacity vs. demand over time; static capacity leaves waste above the demand curve, while dynamic capacity tracks demand]
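The dynamic (auto-scaling) strategy can be sketched as a simple threshold rule. This is a minimal illustration, not the simulator's exact logic; the function name, the step size, and the default thresholds (the 60%/70% values appear later under Methodology) are assumptions.

```python
def autoscale_step(busy, total, pending, low=0.60, high=0.70, step=16):
    """One control step of threshold-based auto-scaling.

    Acquire processors when utilization exceeds the high threshold and
    work is waiting; release idle ones when utilization drops below the
    low threshold. Returns the new total number of processors.
    """
    utilization = busy / total if total else 1.0
    if utilization > high and pending > 0:
        total += step                          # acquire more processors
    elif utilization < low and total > step:
        total -= min(step, total - busy)       # release only idle processors
    return total
```

Calling this periodically makes the provisioned capacity track demand, which is what distinguishes the dynamic strategy from the static ones in the figure above.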
Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions
System model
• DAS-3 multi-cluster grid
• Global Resource Manager (GRM) interacting with Local Resource Managers (LRMs)
[Figure: global jobs enter the global queue at the GRM; local jobs enter the local queues of the per-cluster LRMs]
Workload
• Realistic workloads consisting of Bags-of-Tasks (BoTs)
• Simulations using 10 workloads with 80% load
 • each workload has ~1650 BoTs and ~10K tasks
 • the duration of each workload is [1 day - 1 week]
• Real background load trace
 • DAS-3 trace of June '08 (http://gwa.ewi.tudelft.nl/)
(Distribution parameters are determined after a base-two log transformation)
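Since the distribution parameters are fit after a base-two log transformation, BoT sizes and task runtimes can be sampled from log2-normal distributions. A minimal sketch; the function name and all mu/sigma values are illustrative assumptions, not the fitted parameters from the paper.

```python
import random

def generate_bot(mu_size=3.0, sigma_size=1.5, mu_rt=7.0, sigma_rt=2.0,
                 rng=random):
    """Sample one Bag-of-Tasks: a size and per-task runtimes [s].

    Sampling 2**N(mu, sigma) gives a log2-normal variate, matching the
    base-two log transformation used when fitting the distributions.
    """
    size = max(1, round(2 ** rng.gauss(mu_size, sigma_size)))
    runtimes = [2 ** rng.gauss(mu_rt, sigma_rt) for _ in range(size)]
    return runtimes
```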
Scheduling model • We consider the following BoT scheduling policies • Static Scheduling • statically partitions tasks across clusters • Dynamic Scheduling • takes cluster load into account • Dynamic Per Task Scheduling • Dynamic Per BoT Scheduling • Prediction-based Scheduling • average of the last two runtimes for prediction • sends the task to the cluster which is predicted to lead to the earliest completion time (ECT)
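The prediction-based (ECT) policy described above can be sketched as follows. The cluster-state representation (per-cluster queue wait time and relative speed) is an assumption for illustration; the two-run average prediction is from the slide.

```python
def predict_runtime(history):
    """Predict a task's runtime as the average of its last two observed runtimes."""
    recent = history[-2:]
    return sum(recent) / len(recent)

def pick_cluster(clusters, history):
    """Send the task to the cluster with the earliest predicted completion time.

    `clusters` maps a cluster name to its current queue wait time [s]
    and a relative speed factor (assumed state, for illustration).
    """
    runtime = predict_runtime(history)
    return min(clusters,
               key=lambda c: clusters[c]["wait"] + runtime / clusters[c]["speed"])
```

For example, a cluster with a short queue can win even if it is slower, because the predicted completion time combines both waiting and execution.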
Methodology
• Compare the overprovisioned system with the initial system (NO)
• For Dynamic
 • 69/129 s and 18/23 s for min/max acquisition/release
 • 60%/70% for low/high thresholds
• Κ varies over time, so for a fair comparison we keep it within a ±10% range
Traditional performance metrics
Makespan of a BoT
 Difference between the earliest time of submission of any of its tasks and the latest time of completion of any of its tasks
Normalized Schedule Length (NSL) of a BoT
 Ratio of its makespan to the sum of the runtimes of its tasks on a reference processor (slowdown)
[Figure: timeline from first task submitted to last task done, spanning the makespan]
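The two metric definitions above translate directly into code. A minimal sketch assuming each task is recorded as a (submit_time, finish_time) pair; the function and parameter names are illustrative.

```python
def makespan(tasks):
    """Makespan of a BoT: latest completion time minus earliest submission time.

    Each task is a (submit_time, finish_time) pair.
    """
    return max(finish for _, finish in tasks) - min(submit for submit, _ in tasks)

def nsl(tasks, reference_runtimes):
    """Normalized Schedule Length (slowdown): makespan over the summed
    runtimes of the tasks on a reference processor."""
    return makespan(tasks) / sum(reference_runtimes)
```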
Consistency metrics
• We define two metrics to capture the notion of consistency across two dimensions
• The system gets more consistent as Cd gets closer to 1 and Cs gets closer to 0
• A tighter range of the NSL is a sign of better consistency
Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions
Performance of scheduling policies
• Dynamic Per Task is the best
• ECT is the worst
Performance of different strategies
Different Strategies / Different Overprovisioning Factors (Κ)
• Consistency obtained with overprovisioning is much better than in the initial system (NO)
• Static strategies provide similar performance (only Κ matters)
• All and Largest are viable alternatives to Number, as Number increases the administration, installation, and maintenance costs
• The Dynamic strategy has better performance than the static strategies
• Κ = 2.5 is the critical value
Cost of different strategies
• Use CPU-Hours
 • time a processor is used [h]
 • round up partial instance-hours to one hour, similar to the Amazon EC2 on-demand instance pricing model
• Significant reduction, as high as ~40%, in cost
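The CPU-Hours cost metric above, with partial hours rounded up per processor as in EC2's on-demand instance-hour billing, can be sketched as follows (the function name is illustrative).

```python
import math

def cpu_hours(usage_seconds):
    """Total cost in CPU-Hours: each processor's usage is rounded up
    to a whole hour, so even a brief acquisition costs one full hour."""
    return sum(math.ceil(seconds / 3600) for seconds in usage_seconds)
```

The rounding matters for the dynamic strategy: frequent short acquisitions can each incur a full billed hour, which is why the acquisition/release timing parameters affect cost.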
Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions
Determining Κ dynamically • So far system’s perspective, now user’s perspective • How can we dynamically determine Κ given the user performance requirements? • We use a simple feedback-control approach to deploy additional resources dynamically to meet user performance requirements
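One step of such a feedback controller can be sketched as below: grow Κ when the recently observed average makespan exceeds the user's target range, shrink it when makespans fall below the range. A minimal sketch; the step size, the clamping bounds, and the function name are assumptions, not the controller from the paper.

```python
def adjust_k(k, avg_makespan, target_low, target_high,
             step=0.25, k_min=1.0, k_max=5.0):
    """One feedback-control step for the overprovisioning factor K.

    Increase K when performance requirements are being missed (more
    capacity needed); decrease it when makespans are comfortably below
    the target range (capacity is being wasted).
    """
    if avg_makespan > target_high:
        k = min(k + step, k_max)
    elif avg_makespan < target_low:
        k = max(k - step, k_min)
    return k
```

Applied periodically against historical makespan data, this drives Κ toward the smallest value that keeps BoT makespans inside the requested range.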
Evaluation • Simulated DAS-3 without background load • ~1.5 month workload consisting of ~33K BoTs • Empirically show that the controller stabilizes • Average makespan for the workload in the initial system (without the controller) is ~3120 minutes • Three scenarios from tight to loose performance requirements • [250m-300m] • [700m-750m] • [1000m-1250m]
Results (I) • Significant improvement, as high as ~65%, when the performance requirements are tight • ~40%-50% improvement for loose performance requirements
Results (II)
[Figure: controller behavior for the three makespan-requirement scenarios: [250m-300m], [700m-750m], and [1000m-1250m]]
Conclusions GOAL-1: Realistic Performance Evaluation of Different Strategies • Overprovisioning improves performance consistency significantly • Static strategies provide similar performance (only K matters) • Dynamic strategy performs better than the static strategies • Need to determine the critical value to maximize the benefit of overprovisioning GOAL-2: Dynamically Determining Κ for Given User Performance Requirements • Feedback-controlled system tuning K dynamically using historical performance data and specified performance requirements • The number of BoTs meeting the performance requirements increases significantly, as high as 65%, compared to the initial system
Thank you! Questions? Comments?
M.N.Yigitbasi@tudelft.nl
http://www.st.ewi.tudelft.nl/~nezih/
More Information:
• Guard-g Project: http://guardg.st.ewi.tudelft.nl/
• PDS publication database: http://www.pds.twi.tudelft.nl