
The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems

The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems. Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema. Parallel and Distributed Systems Group, TU Delft. ACM/IEEE Int’l. Symposium on High Performance Distributed Computing.


Presentation Transcript


  1. The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems. Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema. Parallel and Distributed Systems Group, TU Delft. ACM/IEEE Int’l. Symposium on High Performance Distributed Computing.

  2. The VL-e project (natural gas price → $$ for grid computing) • A grid project in the Netherlands (2004-) • Natural gas money: VL-e 45 MEuro / 800 MEuro total research package • Overall aim: … to design and build a virtual lab for (digitally) enhanced science (e-science) experiments (no in-vivo or in-vitro, but in-silico experiments). • Goals: • create prototypes of application-specific e-science environments • design and develop re-usable ICT/grid components • validate with real-life applications in testbeds

  3. The VL-e project: application areas [Diagram: application areas (Data Intensive Science, Medical Diagnosis & Imaging, Bio-Diversity, Bio-Informatics, Food Informatics, Dutch Telescience), with Philips, IBM, and Unilever also shown, on top of three layers: Virtual Laboratory (VL): Application Oriented Services; Grid Services: Harness multi-domain distributed resources; Management of comm. & computing.]

  4. The VL-e project: application areas [Same diagram as slide 3, with a Bags-of-Tasks label added.]

  5. The VL-e project: application areas [Same diagram as slide 3, with a Bags-of-Tasks label added.]

  6. The Challenge • Complete scientific work better, … • User-oriented performance metrics (time a critical performance component) • Bags-of-tasks for ease-of-use • … in real systems • Workloads (now that real traces are available) • Information unavailability • What to do? • Hint: the next 10% improvement won’t cut it!

  7. The Challenge (cont’d.) • System model: What is a good model for the study of large-scale distributed computing systems that run bags-of-tasks? • Input model: What is a good model for bag-of-tasks workloads in large-scale distributed computing systems? • What is the best setup for such system/input? • How to find the best? • If a best is found, can there be another?

  8. The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems • Introduction and Motivation • Context: System Model • Workload Model • Design Space Exploration • Conclusion

  9. Context: System Model [1/4]: Overview • System Model • Clusters: execute jobs • Resource managers: coordinate job execution • Resource management architectures: route jobs among resource managers • Task selection policies: create the eligible set • Task scheduling policies: schedule the eligible set

  10. Context: System Model [2/4]: Resource Management Architectures (route jobs among resource managers) • Separated Clusters (sep-c) • Centralized (csp) • Decentralized (fcondor)

  11. Context: System Model [3/4]: Task Selection Policies (create the eligible set) • Age-based: • S-T: Select Tasks in the order of their arrival. • S-BoT: Select BoTs in the order of their arrival. • User priority based: • S-U-Prio: Select the tasks of the User with the highest Priority. • Based on fairness in resource consumption: • S-U-T: Select the Tasks of the User with the lowest resource consumption. • S-U-BoT: Select the BoTs of the User with the lowest resource consumption. • S-U-GRR: Select the User Round-Robin/all tasks for this user. • S-U-RR: Select the User Round-Robin/one task for this user.
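The selection policies on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Task` record, the function names, and the `consumed` mapping (resource consumption per user so far) are hypothetical, and each function simply returns tasks in the order in which they would become eligible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical task record; field names are assumptions for illustration.
    task_id: int
    bot_id: int
    user: str
    arrival: float  # submission time, in seconds


def select_s_t(queue):
    """S-T: order tasks by their own arrival time."""
    return sorted(queue, key=lambda t: (t.arrival, t.task_id))


def select_s_bot(queue):
    """S-BoT: order BoTs by arrival; tasks of the earliest BoT come first."""
    bot_arrival = {}
    for t in queue:
        # A BoT is assumed to arrive when its first task arrives.
        bot_arrival[t.bot_id] = min(bot_arrival.get(t.bot_id, t.arrival), t.arrival)
    return sorted(
        queue,
        key=lambda t: (bot_arrival[t.bot_id], t.bot_id, t.arrival, t.task_id),
    )


def select_s_u_t(queue, consumed):
    """S-U-T: tasks of the user with the lowest resource consumption so far."""
    users = {t.user for t in queue}
    if not users:
        return []
    # Ties between users are broken by name to keep the result deterministic.
    user = min(users, key=lambda u: (consumed.get(u, 0.0), u))
    return select_s_t([t for t in queue if t.user == user])
```

With a queue of three tasks from two BoTs, S-T interleaves the BoTs by task arrival, while S-BoT keeps the tasks of the earliest BoT together.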

  12. Context: System Model [4/4]: Task Scheduling Policies (schedule the eligible set) • Information availability: • Known (K) • Unknown (U) • Historical records (H) • Sample policies: • Earliest Completion Time (with Prediction of Runtimes) (ECT(-P)) • Fastest Processor First (FPF) • (Dynamic) Fastest Processor Largest Task ((D)FPLT) • Shortest Task First w/ Replication (STFR) • Work Queue w/ Replication (WQR) [Table: policies placed by Task Information (K, H, U) and Resource Information (K, H, U); entries shown: ECT, FPLT; ECT-P; FPF; DFPLT, MQD; RR, WQR; STFR.]
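As a rough illustration of how information availability shapes a policy, the sketch below contrasts an Earliest Completion Time style assignment, which needs known task and resource information, with a Fastest Processor First style choice, which needs only processor speeds. The function names and the representation (task work as a number, processor speeds as relative factors) are assumptions for this example, not details taken from the slides.

```python
def schedule_ect(task_work, speeds):
    """ECT-style sketch: place each task where it would complete earliest.

    task_work: work per task (known task information).
    speeds: relative speed per processor (known resource information).
    Returns a list of (task_index, processor_index, completion_time).
    """
    free_at = [0.0] * len(speeds)  # time at which each processor becomes free
    plan = []
    for i, work in enumerate(task_work):
        # Completion time on processor j = time it frees up + runtime there.
        p = min(range(len(speeds)), key=lambda j: free_at[j] + work / speeds[j])
        free_at[p] += work / speeds[p]
        plan.append((i, p, free_at[p]))
    return plan


def pick_fpf(idle, speeds):
    """FPF-style sketch: among idle processors, pick the fastest one.

    No task information is used, only processor speeds.
    """
    return max(idle, key=lambda j: speeds[j])
```

For three equal tasks on processors with speeds 1.0 and 2.0, the ECT sketch sends two tasks to the faster processor and one to the slower, so all work completes at time 10.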

  13. The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems • Introduction and Motivation • Context: System Model • Workload Model • Design Space Exploration • Conclusion

  14. Workload Modeling 101: What Matters • Job arrival process & job service time: • Self-similarity (burstiness) vs. Poisson [Leland & Ott ToN’94] • Job grouping: bags-of-tasks dominant application type in multi-cluster grids and cycle-scavenging systems (the e-Science infrastructure) [IosupJSE EuroPar’07] • Job size: almost always 1 CPU [IosupDELW Grid’06] [Figure: two plots of No. Packets/Time Unit over Time Units, for TimeUnit = 100 s and TimeUnit = 0.01 s; annotation: longer queues.]

  15. A Bag-of-Tasks Workload Model • Model: • Users, Bags-of-Tasks, Tasks • Heavy-tailed distributions for inter-arrival time, job service time → can model self-similar workloads • More details (e.g., parameter values): see article • Validation data: the Grid Workloads Archive • 7 long-term grid traces • >5 million tasks • >2500 users • >40k CPUs • Domains: HEP, graphics, AI, math, biomed, climate, finance, aero… http://gwa.ewi.tudelft.nl/
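A small generator can make the structure of such a model concrete. The sketch below is only an illustration: the slide states that heavy-tailed distributions are used but defers the specifics to the article, so the choice of Weibull and lognormal distributions and every parameter value here are assumptions, as are the function and field names.

```python
import random


def generate_bots(n_bots, users, seed=0):
    """Generate a synthetic bag-of-tasks workload (illustrative parameters only)."""
    rng = random.Random(seed)  # seeded for reproducible workloads
    now = 0.0
    bots = []
    for bot_id in range(n_bots):
        # Inter-arrival time: Weibull with shape < 1, which has a heavy tail.
        # random.weibullvariate(alpha, beta) takes scale alpha and shape beta.
        now += rng.weibullvariate(100.0, 0.5)
        # Number of tasks in the bag: at least one (assumed distribution).
        size = 1 + int(rng.weibullvariate(10.0, 0.7))
        # Task service times: lognormal, another heavy-tailed choice.
        runtimes = [rng.lognormvariate(5.0, 1.5) for _ in range(size)]
        bots.append(
            {
                "bot_id": bot_id,
                "user": rng.choice(users),
                "arrival": now,
                "runtimes": runtimes,
            }
        )
    return bots
```

Fitting the distributions and parameters to real traces, such as those in the Grid Workloads Archive, is what would turn a sketch like this into a validated model.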

  16. The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems • Introduction and Motivation • Context: System Model • Workload Model • Design Space Exploration • Conclusion

  17. Design Space Exploration [1/5]: Overview • Design space exploration: time to understand how our solutions fit into the complete system. • Study the impact of: • The Task Scheduling Policy (s policies) • The Workload Characteristics (P characteristics) • The Dynamic System Information (I levels) • The Task Selection Policy (S policies) • The Resource Management Architecture (A policies) • s x 7P x I x S x A x (environment) → >2M design points
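The multiplicative growth of the design space can be sketched by enumerating a Cartesian product. The example below covers only the three dimensions whose members are named on earlier slides (architectures, task selection policies, task scheduling policies), so it does not reproduce the >2M figure; the helper names are hypothetical.

```python
from itertools import product
from math import prod

# Dimension members as named on the earlier slides.
ARCHITECTURES = ["sep-c", "csp", "fcondor"]
SELECTION = ["S-T", "S-BoT", "S-U-Prio", "S-U-T", "S-U-BoT", "S-U-GRR", "S-U-RR"]
SCHEDULING = ["ECT", "ECT-P", "FPF", "FPLT", "DFPLT", "MQD", "RR", "WQR", "STFR"]


def design_points(*dimensions):
    """Lazily enumerate every combination of one choice per dimension."""
    return product(*dimensions)


def design_space_size(*dimensions):
    """Number of design points: the product of the dimension sizes."""
    return prod(len(d) for d in dimensions)
```

Even these three dimensions alone give 3 x 7 x 9 = 189 combinations; each further dimension (workload characteristics, information levels, environment) multiplies that count again.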

  18. Design Space Exploration [2/5]: Experimental Setup • Simulator: • DGSim [IosupETFL SC’07, IosupSE EuroPar’08] • System: • DAS + Grid’5000 [Cappello & Bal CCGrid’07] • >3,000 CPUs: relative perf. 1-1.75 • Metrics: • Makespan • Normalized Schedule Length ~ speed-up • Workloads: • Real: DAS + Grid’5000 • Realistic: system load 20-95% (from workload model)
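The two metrics can be written down in a few lines. The slide does not define them, so the definitions below are assumptions: makespan as the time from the earliest task submission to the latest task completion of a BoT, and Normalized Schedule Length as that makespan divided by the sum of the tasks' runtimes on a reference processor. The dictionary keys are hypothetical.

```python
def makespan(tasks):
    """Time from the first submission to the last completion in one BoT."""
    return max(t["finish"] for t in tasks) - min(t["submit"] for t in tasks)


def normalized_schedule_length(tasks):
    """Makespan divided by the total reference runtime (assumed definition)."""
    return makespan(tasks) / sum(t["ref_runtime"] for t in tasks)
```

Under these definitions, a lower value is better for both metrics, and a Normalized Schedule Length below 1 means the BoT finished faster than it would have on a single reference processor.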

  19. Design Space Exploration [3/5]: Selected Results A: Design Guidelines for Scheduling Policies • Influence of the information type: • (K,K): best balance between makespan (MS) and Normalized Schedule Length (NSL) • (*,U),(U,*): surprisingly good (FPF) to surprisingly poor (WQR4x) • (*,H),(H,*): poor. Simple runtime predictors don’t work (see article) • Where to invest time? • K -> H, K -> U: adapt for information type with lowest variation [Figure labels: WQR4x, FPF.]

  20. Design Space Exploration [4/5]: Selected Results B: Task Selection Only for Busy Systems • Not much difference until system load over 50%. • For DAS + Grid’5000 no change of task selection policy. [Figure labels: S-BoT, S-T; annotation: same performance.]

  21. Design Space Exploration [5/5]: Selected Results C: Resource Management Architecture • Centralized, separated, or distributed? • Centralized is best [Note: job overhead not considered.] • Distributed: good for system load below 50%; over 50% it does not finish all tasks.

  22. The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems • Introduction and Motivation • Context: System Model • Workload Model • Design Space Exploration • Conclusion

  23. Conclusion • System Model = Resource Management Architecture + Task Selection Policy + Task Scheduling Policy • Information availability framework • BoT workload model • Design space exploration: the performance of bags-of-tasks • Future Work: • Better predictors • (H,H) task scheduling policies [Table repeated from slide 12: policies placed by Task Information (K, H, U) and Resource Information (K, H, U); entries shown: ECT, FPLT; ECT-P; FPF; DFPLT, MQD; RR, WQR; STFR.]

  24. Thank you! Questions? Remarks? Observations? • Contact: A.Iosup@gmail.com [google “Iosup”] • Web sites: • http://www.vl-e.nl : VL-e project • http://www.pds.ewi.tudelft.nl : PDS group articles & software • Help building the Grid Workloads Archive: http://gwa.ewi.tudelft.nl
