
Designing Parallel Operating Systems using Modern Interconnects

Toward Realistic Evaluation of Job Scheduling Strategies. Eitan Frachtenberg, with Dror Feitelson, Fabrizio Petrini, and Juan Fernandez. Computer and Computational Sciences Division, Los Alamos National Laboratory.





  1. Designing Parallel Operating Systems using Modern Interconnects Toward Realistic Evaluation of Job Scheduling Strategies Eitan Frachtenberg With Dror Feitelson, Fabrizio Petrini, and Juan Fernandez Computer and Computational Sciences Division Los Alamos National Laboratory Ideas that change the world

  2. Outline • The challenges of parallel job scheduling evaluation • Emulation: rationale, strengths, and weaknesses • Experimental results and analysis • How do different algorithms react to increasing load? • Can knowing the future help? • What is the effect of multiprogramming? • What applications is it good for?

  3. Parallel Job Scheduling • The task: assign compute resources to parallel jobs • The computers (clusters and MPPs): • Range from hundreds of processors to more than 10,000 • Typically homogeneous and connected by a fast interconnect • Jobs arrive dynamically, with different sizes and runtimes, requiring online scheduling • Mostly fine-grained communication, lots of memory • Mix of serial, parallel, short, and long jobs

  4. Scheduling Taxonomy • “Rectangle packing” • Main dimensions: space sharing and time sharing • Additional queue dimension: backfilling, priorities

  5. Backfilling • Backfilling is a technique that moves jobs forward in the queue • Requires advance knowledge of run times (or reservations) • Reduces external fragmentation and improves utilization, responsiveness, and throughput • Has several variations
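The mechanics can be sketched roughly as follows. This is a simplified, conservative variant of EASY backfilling with made-up job fields, not the exact algorithm used in the study: jobs jump the queue only if they fit in the current hole and finish before the head job's reservation.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int          # processors requested
    est_runtime: float  # user-supplied runtime estimate (seconds)

def easy_backfill(queue, free_nodes, running, now):
    """Pick jobs to start now. `running` is a list of (finish_time, nodes)
    pairs for jobs currently executing. Returns the jobs to launch."""
    started = []
    queue = list(queue)
    # Start jobs from the head of the queue as long as they fit.
    while queue and queue[0].nodes <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job.nodes
        started.append(job)
    if not queue:
        return started
    # Reservation ("shadow time") for the blocked head job: the earliest
    # time at which enough nodes will have been freed by running jobs.
    head = queue.pop(0)
    shadow = float("inf")
    avail = free_nodes
    for finish, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            shadow = finish
            break
    # Backfill: later jobs may jump ahead if they fit in the current
    # hole and are predicted to finish before the reservation.
    for job in queue:
        fits_now = job.nodes <= free_nodes
        no_delay = now + job.est_runtime <= shadow
        if fits_now and no_delay:
            free_nodes -= job.nodes
            started.append(job)
    return started
```

Note that the quality of the result hinges on the user-supplied runtime estimates, which is exactly why backfilling "requires advance knowledge of run times."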

  6. Time Sharing • Does not require reservation times, but can be combined with backfilling • Higher reduction of external fragmentation, possibly even internal fragmentation, resulting in improved utilization, responsiveness and throughput • But also challenging: • Memory pressure • Context-switch overheads • Process synchronization tradeoffs: • Tightly-coupled processes must be coscheduled • Coordination can incur overhead and fragmentation

  7. Time Sharing Spectrum • Coordination ranges from none through implicit and hybrid to explicit: Local, DCS, ICS/SB, CC, PB, FCS, BCS, GS • No coordination: local UNIX scheduling • Explicit coordination: • Global clock (centralized) • Global context-switches to a known job • Implicit coordination: infer synchronization information at the sender side, receiver side, or both • Hybrid: global coordination with local autonomy

  8. Without Timesharing • Short processes wait for long periods in the queue • External fragmentation creates many “holes”

  9. Time Sharing - GS • Gang Scheduling multiprograms several jobs • Reduces response time and “fills holes” • Incurs more overhead and memory pressure (time quantum)
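Gang scheduling is commonly described with an Ousterhout matrix: one row per time slot, one column per processor, and every process of a job placed in the same row so the whole job is context-switched in and out together. A minimal sketch (illustrative names, not the actual STORM implementation):

```python
def place_job(matrix, job_id, size, num_procs):
    """Place `size` processes of `job_id` in the first row with enough
    free (None) columns; grow the matrix (raising the MPL) if none fits."""
    for row in matrix:
        free = [i for i, slot in enumerate(row) if slot is None]
        if len(free) >= size:
            for i in free[:size]:
                row[i] = job_id
            return matrix
    new_row = [None] * num_procs
    for i in range(size):
        new_row[i] = job_id
    matrix.append(new_row)
    return matrix

def schedule_round(matrix):
    """One scheduling cycle: yield, per time quantum, the set of jobs
    that run simultaneously (one matrix row at a time)."""
    for row in matrix:
        yield {job for job in row if job is not None}
```

The "holes" the slide mentions are the remaining None slots in each row, and the memory pressure comes from every row's jobs staying resident while other rows run.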

  10. Time Sharing - SB • Spin Block (ICS) is a sender-side coordination heuristic • Reduces overhead, increases scalability • Performs poorly with fine-grained communication
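A rough sketch of the two-phase spin-block idea: on a blocking communication call, spin for a short threshold in the hope that the peer is coscheduled and responds quickly, then yield the processor. The threshold and helper names here are invented for illustration:

```python
import time

def spin_block_wait(message_arrived,
                    spin_threshold=0.0005,
                    poll=lambda: time.sleep(0.0001)):
    """Return 'spin' if the message arrived while spinning, else block.
    `message_arrived` is a callable polling for completion."""
    deadline = time.monotonic() + spin_threshold
    # Phase 1: busy-wait, betting that the peer is currently scheduled.
    while time.monotonic() < deadline:
        if message_arrived():
            return "spin"   # coscheduled case: no context switch paid
    # Phase 2: give up the CPU so another job's process can run.
    while not message_arrived():
        poll()              # stands in for a real blocking wait
    return "block"
```

This shows why SB degrades with fine-grained communication: when messages arrive just past the threshold, every exchange pays a context switch.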

  11. Time Sharing - FCS • Combine global synchronization & local information • Rely on scalable primitives for global coordination and information exchange • Measure communication characteristics, such as granularity and wait times • Classify processes based on synchronization requirements • Schedule processes based on class • Preferential to short jobs

  12. FCS Classification • CS (fine granularity, short block times): always gang-scheduled • F (fine granularity, long block times): preferably gang-scheduled • DC (coarse granularity): locally scheduled
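The classification can be read as a simple decision rule over the measured communication characteristics. The threshold values below are invented for illustration, not the ones FCS actually uses:

```python
def classify(granularity_s, avg_block_s,
             fine_cutoff=0.001, long_block=0.01):
    """Map a process's measured behavior to an FCS class (hypothetical
    thresholds, in seconds)."""
    if granularity_s >= fine_cutoff:
        return "DC"   # coarse-grained: don't care, schedule locally
    if avg_block_s >= long_block:
        return "F"    # fine-grained but waiting long: preferably
                      # gang-scheduled
    return "CS"       # fine-grained and synchronizing well: always
                      # gang-scheduled
```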

  13. Evaluation Challenges • Theoretical analysis (queuing theory): • Not applicable to time sharing due to unknown parameters, application structure, and feedback effects • Simulation: • Many assumptions, not all known/reported • Hard to reproduce: many studies provide contradictory results, often showing theirs is “best” • Rarely factors in application characteristics • Experiments with real sites and workloads: • Largely impractical and irreproducible • Emulation

  14. Emulation Methodology • Framework for studying scheduling algorithms • Runs any MPI application in a cluster • Implemented several scheduling algorithms • Allows control over input parameters • Provides detailed logs and analysis tools • Testing in a repeatable dynamic environment • Dynamic job arrivals, with varying time and space requirements • Complex, longer and more realistic workloads

  15. Evaluation by Emulation Pros: • Real: no hidden assumptions or overheads • Configurable: choice of parameters and workloads • Repeatable: same experiment, same results • Portable: allows the isolation of HW factors Cons: • Slow • Requires more resources than analysis/simulation • GIGO – results are only as representative as input

  16. Experimental Environment • Implemented on top of STORM, a scalable resource management system for clusters • Algorithms: FCFS, GS, SB, FCS, using backfilling • MPI synthetic (BSP) and LANL applications • Different granularities and communication patterns • Flexible workload model, 1000 jobs • Time shrinking • Three clusters, using QsNet: • Pentium III, 32 nodes × 2 CPUs, 1 GB/node • Itanium II, 32 nodes × 2 CPUs, 2 GB/node • Alpha EV6, 64 nodes × 4 CPUs, 8 GB/node

  17. Experiments Overview • Use synthetic applications for basic insights • Effect of multiprogramming level • Effect of backfilling • Effect of time quantum • Effect of load • Use LANL’s Sage/Sweep3D for application study • Caveat emptor: • Only LANL applications • Does not follow input workload closely • Limited set of inputs • Different architecture (Alpha)

  18. Effect of MPL • Questions: • What is the effect of preemptive multiprogramming compared to FCFS (batch) scheduling? • Does higher MPL mean higher performance? • Parameters: • GS with MPL values 1–6 (MPL 1 == batch) • Input load: ~75% • Bounded slowdown, cutoff at 10s
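The bounded-slowdown metric used throughout the results is standard: response time divided by run time, with short jobs clamped to a cutoff so they cannot inflate the ratio, and the whole expression bounded below by 1. With the 10-second cutoff from the slide:

```python
def bounded_slowdown(wait_s, run_s, cutoff_s=10.0):
    """Bounded slowdown of a job: (wait + run) / max(run, cutoff),
    clamped to at least 1. Jobs shorter than `cutoff_s` are treated
    as if they ran for the cutoff, so a 1-second job that waited a
    minute does not register a slowdown of 60."""
    return max((wait_s + run_s) / max(run_s, cutoff_s), 1.0)
```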

  19. MPL – Response Time

  20. MPL – Bounded Slowdown

  21. Effect of Backfilling • Does adding backfilling (knowledge of “the future”) to GS/batch scheduling help?

  22. Backfilling – Response time • Backfilling helps short jobs, harms long jobs

  23. Effect of Time Quantum • Shorter time quantum pros: • System more responsive • Less external fragmentation • Longer time quantum pros: • Less cache/memory pressure • Less synchronization overhead Setup: • GS at ~75% load • Compare Pentium III to Itanium II

  24. Time Quantum – Response Time

  25. Effect of Load • Comparing FCFS, GS, SB, and FCS with backfilling • Varying offered load by increasing run times • Load values: ~40% to ~90% • No measurements after the saturation point
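Offered load here is the usual ratio of resource demand to capacity over the workload's span; a small helper makes the definition concrete (illustrative, assuming per-job node counts and run times are known):

```python
def offered_load(jobs, num_procs, span_s):
    """Offered load of a workload: total processor-seconds demanded,
    divided by the processor-seconds available over the span.
    `jobs` is an iterable of (nodes, runtime_s) pairs."""
    demand = sum(nodes * runtime for nodes, runtime in jobs)
    return demand / (num_procs * span_s)
```

Scaling every run time by a constant factor scales this ratio by the same factor, which is how the experiment sweeps load from ~40% to ~90% without changing the arrival pattern.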

  26. Load – Response Time

  27. Load - Bounded Slowdown

  28. Response Time Median

  29. Bounded Slowdown Median

  30. 500 Shortest jobs CDF

  31. 500 Longest jobs CDF

  32. Scientific Applications • Sage and Sweep3D • Hydrodynamics codes • approx. 50%-80% of LANL cycles • Memory-constrained • Mostly operating out of cache • Relatively load-balanced • Parameters: • MPL 2, 100ms time quantum • 1000 jobs, modeled arrival times, random run times • Realistic inputs, biased toward short runs

  33. Response Time

  34. Bounded Slowdown

  35. Conclusions - methodology • A more realistic evaluation of job scheduling • Repeatable experiments • Allows isolation of factors • Direct comparison of platforms on the applications you care most about

  36. Conclusions - experiments • Significant improvement over FCFS can be achieved with multiprogramming, even at MPL=2 • Backfilling can also make a difference • Batch scheduling discriminates against short jobs • Multiprogramming for scientific apps pays off, even with MPL 2 • FCS can outperform explicit/implicit coscheduling • For more information: eitanf@lanl.gov

  37. Time Quantum – Slowdown
