Overprovisioning for Performance Consistency in Grids
Nezih Yigitbasi and Dick Epema
Parallel and Distributed Systems Group, Delft University of Technology
http://guardg.st.ewi.tudelft.nl/
The Problem: Performance inconsistency in grids
• Inconsistent performance is common in grids
 • bursty workloads
 • variable background loads
 • high rate of failures
 • highly dynamic & heterogeneous environment
How can we provide consistent performance in grids?
[Figure: makespans of Bags-of-Tasks with 128 tasks submitted every 15 minutes vary by a factor of ~70]
Our goals GOAL-1 Realistic performance evaluation of static and dynamic overprovisioning strategies (system’s perspective) GOAL-2 Dynamically determine the overprovisioning factor (Κ) for user specified performance requirements (user’s perspective)
Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions
Overprovisioning (I)
• Increasing the system capacity to provide better, and in particular, consistent performance even under variable workloads and unexpected demands
Pros
 • simple
 • obviates the need for complex algorithms
 • easy to deploy & maintain
Cons
 • cost-ineffective
 • workloads may evolve (e.g., an increasing user base)
 • lowly-utilized systems
Overprovisioning (II)
• High overprovisioning factors (Κ) are common in modern systems
 • Google: ~450,000 servers (2005)
 • Microsoft: ~218,000 servers (mid-2008)
 • Facebook: 10,000+ servers (2009)
• Preferred way of providing performance guarantees
 • typical data center utilization is no more than 15-50%
 • telecommunication systems average ~30% utilization
L. A. Barroso and U. Hölzle, The Case for Energy-Proportional Computing, IEEE Computer, December 2007.
Overprovisioning strategies
1. Static
 • Largest / All / Number
 • Where should we deploy the resources? Does it make any difference?
2. Dynamic
 • a.k.a. auto-scaling
 • low/high thresholds for acquiring/releasing resources
• Given Κ, it is straightforward to determine the number of processors for a strategy
[Figure: capacity vs. demand over time; static capacity leaves waste above the demand curve, while dynamic capacity tracks demand]
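The dynamic (auto-scaling) strategy can be sketched as a simple threshold rule. This is a minimal illustration, not the simulator's exact logic; the function name, the step size, and the default thresholds (the 60%/70% values appear later under Methodology) are assumptions.

```python
def autoscale_step(busy, total, pending, low=0.60, high=0.70, step=16):
    """One control step of threshold-based auto-scaling.

    Acquire processors when utilization exceeds the high threshold and
    work is waiting; release idle ones when utilization drops below the
    low threshold. Returns the new total number of processors.
    """
    utilization = busy / total if total else 1.0
    if utilization > high and pending > 0:
        total += step                          # acquire more processors
    elif utilization < low and total > step:
        total -= min(step, total - busy)       # release only idle processors
    return total
```

Calling this periodically makes the provisioned capacity track demand, which is what distinguishes the dynamic strategy from the static ones in the figure above.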
Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions
System model
• DAS-3 multi-cluster grid
• Global Resource Manager (GRM) interacting with Local Resource Managers (LRMs)
[Figure: global jobs enter the global queue at the GRM; local jobs enter the local queues of the per-cluster LRMs]
Workload
• Realistic workloads consisting of Bags-of-Tasks (BoTs)
• Simulations using 10 workloads with 80% load
 • each workload has ~1650 BoTs and ~10K tasks
 • the duration of each workload is [1 day - 1 week]
• Real background load trace
 • DAS-3 trace of June '08 (http://gwa.ewi.tudelft.nl/)
(Distribution parameters are determined after a base-two log transformation)
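Since the distribution parameters are fit after a base-two log transformation, BoT sizes and task runtimes can be sampled from log2-normal distributions. A minimal sketch; the function name and all mu/sigma values are illustrative assumptions, not the fitted parameters from the paper.

```python
import random

def generate_bot(mu_size=3.0, sigma_size=1.5, mu_rt=7.0, sigma_rt=2.0,
                 rng=random):
    """Sample one Bag-of-Tasks: a size and per-task runtimes [s].

    Sampling 2**N(mu, sigma) gives a log2-normal variate, matching the
    base-two log transformation used when fitting the distributions.
    """
    size = max(1, round(2 ** rng.gauss(mu_size, sigma_size)))
    runtimes = [2 ** rng.gauss(mu_rt, sigma_rt) for _ in range(size)]
    return runtimes
```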
Scheduling model • We consider the following BoT scheduling policies • Static Scheduling • statically partitions tasks across clusters • Dynamic Scheduling • takes cluster load into account • Dynamic Per Task Scheduling • Dynamic Per BoT Scheduling • Prediction-based Scheduling • average of the last two runtimes for prediction • sends the task to the cluster which is predicted to lead to the earliest completion time (ECT)
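The prediction-based (ECT) policy described above can be sketched as follows. The cluster-state representation (per-cluster queue wait time and relative speed) is an assumption for illustration; the two-run average prediction is from the slide.

```python
def predict_runtime(history):
    """Predict a task's runtime as the average of its last two observed runtimes."""
    recent = history[-2:]
    return sum(recent) / len(recent)

def pick_cluster(clusters, history):
    """Send the task to the cluster with the earliest predicted completion time.

    `clusters` maps a cluster name to its current queue wait time [s]
    and a relative speed factor (assumed state, for illustration).
    """
    runtime = predict_runtime(history)
    return min(clusters,
               key=lambda c: clusters[c]["wait"] + runtime / clusters[c]["speed"])
```

For example, a cluster with a short queue can win even if it is slower, because the predicted completion time combines both waiting and execution.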
Methodology
• Compare the overprovisioned system with the initial system (NO)
• For Dynamic
 • 69/129 s and 18/23 s for min/max acquisition/release
 • 60%/70% for low/high thresholds
• Κ varies over time, so for a fair comparison we keep it within a ±10% range
Traditional performance metrics
Makespan of a BoT
 Difference between the earliest time of submission of any of its tasks and the latest time of completion of any of its tasks
Normalized Schedule Length (NSL) of a BoT
 Ratio of its makespan to the sum of the runtimes of its tasks on a reference processor (slowdown)
[Figure: timeline from first task submitted to last task done, spanning the makespan]
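The two metric definitions above translate directly into code. A minimal sketch assuming each task is recorded as a (submit_time, finish_time) pair; the function and parameter names are illustrative.

```python
def makespan(tasks):
    """Makespan of a BoT: latest completion time minus earliest submission time.

    Each task is a (submit_time, finish_time) pair.
    """
    return max(finish for _, finish in tasks) - min(submit for submit, _ in tasks)

def nsl(tasks, reference_runtimes):
    """Normalized Schedule Length (slowdown): makespan over the summed
    runtimes of the tasks on a reference processor."""
    return makespan(tasks) / sum(reference_runtimes)
```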
Consistency metrics
• We define two metrics to capture the notion of consistency across two dimensions
• The system gets more consistent as Cd gets closer to 1 and Cs gets closer to 0
• A tighter range of the NSL is a sign of better consistency
Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions
Performance of scheduling policies
• Dynamic Per Task is the best
• ECT is the worst
Performance of different strategies
Different Strategies / Different Overprovisioning Factors (Κ)
• Consistency obtained with overprovisioning is much better than in the initial system (NO)
• Static strategies provide similar performance (only Κ matters)
• All and Largest are viable alternatives to Number, as Number increases the administration, installation, and maintenance costs
• The Dynamic strategy has better performance than the static strategies
• Κ = 2.5 is the critical value
Cost of different strategies
• Use CPU-Hours
 • time a processor is used [h]
 • round up partial instance-hours to one hour, similar to the Amazon EC2 on-demand instance pricing model
• Significant reduction, as high as ~40%, in cost
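The CPU-Hours cost metric above, with partial hours rounded up per processor as in EC2's on-demand instance-hour billing, can be sketched as follows (the function name is illustrative).

```python
import math

def cpu_hours(usage_seconds):
    """Total cost in CPU-Hours: each processor's usage is rounded up
    to a whole hour, so even a brief acquisition costs one full hour."""
    return sum(math.ceil(seconds / 3600) for seconds in usage_seconds)
```

The rounding matters for the dynamic strategy: frequent short acquisitions can each incur a full billed hour, which is why the acquisition/release timing parameters affect cost.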
Outline Overprovisioning Strategies Experimental Setup Results Dynamically Determining Κ Conclusions
Determining Κ dynamically • So far system’s perspective, now user’s perspective • How can we dynamically determine Κ given the user performance requirements? • We use a simple feedback-control approach to deploy additional resources dynamically to meet user performance requirements
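One step of such a feedback controller can be sketched as below: grow Κ when the recently observed average makespan exceeds the user's target range, shrink it when makespans fall below the range. A minimal sketch; the step size, the clamping bounds, and the function name are assumptions, not the controller from the paper.

```python
def adjust_k(k, avg_makespan, target_low, target_high,
             step=0.25, k_min=1.0, k_max=5.0):
    """One feedback-control step for the overprovisioning factor K.

    Increase K when performance requirements are being missed (more
    capacity needed); decrease it when makespans are comfortably below
    the target range (capacity is being wasted).
    """
    if avg_makespan > target_high:
        k = min(k + step, k_max)
    elif avg_makespan < target_low:
        k = max(k - step, k_min)
    return k
```

Applied periodically against historical makespan data, this drives Κ toward the smallest value that keeps BoT makespans inside the requested range.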
Evaluation • Simulated DAS-3 without background load • ~1.5 month workload consisting of ~33K BoTs • Empirically show that the controller stabilizes • Average makespan for the workload in the initial system (without the controller) is ~3120 minutes • Three scenarios from tight to loose performance requirements • [250m-300m] • [700m-750m] • [1000m-1250m]
Results (I) • Significant improvement, as high as ~65%, when the performance requirements are tight • ~40%-50% improvement for loose performance requirements
Results (II)
[Figure: controller behavior for the three makespan-requirement scenarios: [250m-300m], [700m-750m], and [1000m-1250m]]
Conclusions GOAL-1: Realistic Performance Evaluation of Different Strategies • Overprovisioning improves performance consistency significantly • Static strategies provide similar performance (only K matters) • Dynamic strategy performs better than the static strategies • Need to determine the critical value to maximize the benefit of overprovisioning GOAL-2: Dynamically Determining Κ for Given User Performance Requirements • Feedback-controlled system tuning K dynamically using historical performance data and specified performance requirements • The number of BoTs meeting the performance requirements increases significantly, as high as 65%, compared to the initial system
Thank you! Questions? Comments?
M.N.Yigitbasi@tudelft.nl
http://www.st.ewi.tudelft.nl/~nezih/
More Information:
• Guard-g Project: http://guardg.st.ewi.tudelft.nl/
• PDS publication database: http://www.pds.twi.tudelft.nl