Designing Parallel Operating Systems using Modern Interconnects
Toward Realistic Evaluation of Job Scheduling Strategies
Eitan Frachtenberg, with Dror Feitelson, Fabrizio Petrini, and Juan Fernandez
Computer and Computational Sciences Division, Los Alamos National Laboratory
Outline
• The challenges of parallel job scheduling evaluation
• Emulation: rationale, strengths, and weaknesses
• Experimental results and analysis:
  • How do different algorithms react to increasing load?
  • Can knowing the future help?
  • What is the effect of multiprogramming?
  • What applications is it good for?
Parallel Job Scheduling
• The task: assign compute resources to parallel jobs
• The computers (clusters and MPPs):
  • Range from hundreds of processors to 10,000 and more
  • Typically homogeneous and connected by a fast interconnect
• Jobs arrive dynamically, with different sizes and run times, requiring online scheduling
• Mostly fine-grained communication, lots of memory
• Mix of serial, parallel, short, and long jobs
Scheduling Taxonomy
• Parallel job scheduling as "rectangle packing" (processors × time)
• Main dimensions: space sharing and time sharing
• Additional queue dimension: backfilling, priorities
Backfilling
• A technique to move jobs forward in the queue
• Requires advance knowledge of run times (or a reservation)
• Reduces external fragmentation and improves utilization, responsiveness, and throughput
• Has several variations (e.g., EASY and conservative backfilling)
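The backfilling idea can be sketched concretely. Below is a minimal, hypothetical EASY-style backfiller, not the deck's implementation: jobs start in arrival order until the head of the queue is blocked; the head then gets a reservation (the "shadow time"), and later jobs may jump ahead only if they fit in the free processors and cannot delay the head.

```python
def easy_backfill(running, queue, free_cpus, now):
    """EASY-backfilling sketch. Jobs are dicts with 'size' (CPUs) and
    'runtime' (the user's estimate); 'running' holds (end_time, size)
    pairs for jobs already on the machine. Returns the jobs started now."""
    started = []
    queue = list(queue)
    # Start jobs strictly in arrival order while they fit.
    while queue and queue[0]['size'] <= free_cpus:
        job = queue.pop(0)
        running.append((now + job['runtime'], job['size']))
        free_cpus -= job['size']
        started.append(job)
    if not queue:
        return started
    # Head job blocked: find its reservation (shadow time) by releasing
    # running jobs in end-time order until enough CPUs accumulate.
    head = queue[0]
    avail, shadow, extra = free_cpus, None, 0
    for end, size in sorted(running):
        avail += size
        if avail >= head['size']:
            shadow = end                  # earliest start for the head job
            extra = avail - head['size']  # CPUs left over at that time
            break
    # Backfill: a later job may start now if it fits in the free CPUs and
    # either ends by the shadow time or uses only the leftover CPUs.
    for job in queue[1:]:
        if job['size'] <= free_cpus and (
                (shadow is not None and now + job['runtime'] <= shadow)
                or job['size'] <= extra):
            running.append((now + job['runtime'], job['size']))
            free_cpus -= job['size']
            if job['size'] <= extra:
                extra -= job['size']
            started.append(job)
    return started
```

On a 4-CPU machine with 2 CPUs busy until t=10, a 4-CPU head job waits; a 2-CPU, 5-second job backfills (it finishes before t=10), while a 2-CPU, 20-second job does not.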
Time Sharing
• Does not require reservation times, but can be combined with backfilling
• Greater reduction of external fragmentation, possibly even internal fragmentation, resulting in improved utilization, responsiveness, and throughput
• But also challenging:
  • Memory pressure
  • Context-switch overheads
  • Process synchronization tradeoffs:
    • Tightly coupled processes must be coscheduled
    • Coordination can incur overhead and fragmentation
Time Sharing Spectrum
• A spectrum of coordination, from none through implicit and hybrid to explicit, spanning schemes such as Local, DCS, ICS/SB, CC, PB, FCS, BCS, and GS
• No coordination: local UNIX scheduling
• Explicit coordination:
  • Global clock (centralized)
  • Global context switches to a known job
• Implicit coordination: infer synchronization information at the sender side, receiver side, or both
• Hybrid: global coordination with local autonomy
Without Time Sharing
• Short processes wait for long periods in the queue
• External fragmentation creates many "holes"
Time Sharing - GS
• Gang Scheduling multiprograms several jobs
• Reduces response time and "fills holes"
• Incurs more overhead and memory pressure (depends on the time quantum)
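Gang scheduling is commonly described with an Ousterhout matrix: rows are time slots, columns are processors, and all of a job's processes occupy one row so they run, and are context-switched, together. A toy first-fit packing with hypothetical job sizes (real gang schedulers also rebalance rows and alternate-schedule idle slots):

```python
def pack_ousterhout(jobs, num_cpus, mpl):
    """Gang-scheduling sketch: place each job's processes contiguously in
    one row of an Ousterhout matrix. jobs: list of (name, size) pairs;
    mpl rows = multiprogramming level. Returns {name: (row, first_cpu)}."""
    used = [0] * mpl          # CPUs already occupied in each time slot
    placement = {}
    for name, size in jobs:
        for row in range(mpl):
            if used[row] + size <= num_cpus:   # first-fit by row
                placement[name] = (row, used[row])
                used[row] += size
                break
        else:
            raise RuntimeError(f"no slot for job {name} (MPL={mpl})")
    return placement
```

With 4 CPUs and MPL 2, a 3-CPU job fills most of slot 0, so two 2-CPU jobs share slot 1; the 1-CPU hole in slot 0 is the external fragmentation that time sharing tries to reduce.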
Time Sharing - SB
• Spin Block (ICS) is a sender-side coordination heuristic: spin briefly on a blocked communication, then yield the CPU
• Reduces overhead, increases scalability
• Performs poorly with fine-grained communication
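The spin-block heuristic itself is simple; here is a sketch, where the 50 µs spin threshold is an illustrative assumption (roughly a context-switch time), and `ready` stands in for polling the network for a message:

```python
import time

def spin_block_wait(ready, spin_us=50):
    """Spin-block sketch: on a blocking communication, poll for about one
    context-switch time; if the partner has not responded, yield the CPU
    so another job can run. 'ready' is any callable returning True once
    the awaited message has arrived."""
    deadline = time.monotonic() + spin_us / 1e6
    while time.monotonic() < deadline:      # spin phase
        if ready():
            return 'spin'                   # partner was coscheduled
    while not ready():                      # block phase: give up the CPU
        time.sleep(0)                       # stand-in for a real block
    return 'block'
```

This is why SB implicitly coschedules coarse-grained jobs (partners usually answer within the spin window) but degrades with fine-grained communication, where frequent blocking tears coscheduling apart.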
Time Sharing - FCS
• Combine global synchronization and local information
• Rely on scalable primitives for global coordination and information exchange
• Measure communication characteristics, such as granularity and wait times
• Classify processes based on synchronization requirements
• Schedule processes based on class
• Preferential to short jobs
FCS Classification
• Two measured axes: granularity (fine to coarse) and block times (short to long)
• CS (fine granularity, short block times): always gang-scheduled
• F (fine granularity, long block times): preferably gang-scheduled
• DC (coarse granularity): locally scheduled
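The classification above can be sketched as a simple threshold test. The cutoff values here are hypothetical placeholders; the real system derives its thresholds from measured quantities such as context-switch cost and the time quantum:

```python
def classify(granularity_ms, block_time_ms,
             fine_cutoff_ms=1.0, long_block_ms=10.0):
    """FCS-style classifier sketch (thresholds are illustrative).
    Returns 'CS' (always gang-schedule), 'F' (preferably gang-schedule),
    or 'DC' (schedule locally)."""
    if granularity_ms >= fine_cutoff_ms:
        return 'DC'     # coarse-grained: local scheduling suffices
    if block_time_ms <= long_block_ms:
        return 'CS'     # fine-grained and well-synchronized
    return 'F'          # fine-grained but spends long periods blocked
```

A process that communicates every 100 µs and blocks briefly classifies as CS; the same communication pattern with long block times (e.g., load imbalance) classifies as F.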
Evaluation Challenges
• Theoretical analysis (queuing theory):
  • Not applicable to time sharing due to unknown parameters, application structure, and feedback effects
• Simulation:
  • Many assumptions, not all known/reported
  • Hard to reproduce: many studies provide contradicting results, often showing theirs is "best"
  • Rarely factors in application characteristics
• Experiments with real sites and workloads:
  • Largely impractical and irreproducible
• Emulation (this work's approach)
Emulation Methodology
• Framework for studying scheduling algorithms:
  • Runs any MPI application on a cluster
  • Implements several scheduling algorithms
  • Allows control over input parameters
  • Provides detailed logs and analysis tools
• Testing in a repeatable dynamic environment:
  • Dynamic job arrivals, with varying time and space requirements
  • Complex, longer, and more realistic workloads
Evaluation by Emulation
• Pros:
  • Real: no hidden assumptions or overheads
  • Configurable: choice of parameters and workloads
  • Repeatable: same experiment, same results
  • Portable: allows the isolation of HW factors
• Cons:
  • Slow
  • Requires more resources than analysis/simulation
  • GIGO: results are only as representative as the input
Experimental Environment
• Implemented on top of STORM, a scalable resource management system for clusters
• Algorithms: FCFS, GS, SB, and FCS, using backfilling
• MPI synthetic (BSP) and LANL applications:
  • Different granularities and communication patterns
• Flexible workload model, 1000 jobs
• Time shrinking
• Three clusters, using QsNet:
  • Pentium III, 32 nodes × 2 CPUs, 1 GB/node
  • Itanium II, 32 nodes × 2 CPUs, 2 GB/node
  • Alpha EV6, 64 nodes × 4 CPUs, 8 GB/node
Experiments Overview
• Use synthetic applications for basic insights:
  • Effect of multiprogramming level
  • Effect of backfilling
  • Effect of time quantum
  • Effect of load
• Use LANL's Sage/Sweep3D for an application study
• Caveat emptor:
  • Only LANL applications
  • Does not follow the input workload closely
  • Limited set of inputs
  • Different architecture (Alpha)
Effect of MPL
• Questions:
  • What is the effect of preemptive multiprogramming compared to FCFS (batch) scheduling?
  • Does higher MPL mean higher performance?
• Parameters:
  • GS with MPL values 1-6 (MPL 1 == batch)
  • Input load: ~75%
  • Metric: bounded slowdown, with a cutoff at 10 s
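The bounded-slowdown metric used here divides a job's response time by its run time, clamping the denominator at the cutoff (10 s in these experiments) so that very short jobs cannot inflate the average arbitrarily:

```python
def bounded_slowdown(wait, runtime, cutoff=10.0):
    """Bounded slowdown: response time (wait + runtime) over run time,
    with the run time clamped to 'cutoff' seconds, and floored at 1."""
    return max((wait + runtime) / max(runtime, cutoff), 1.0)
```

A 5-second job that waited 95 seconds scores 10 rather than the unbounded slowdown of 20, while a long job that never waited scores the minimum of 1.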
Effect of Backfilling
• Does adding backfilling (knowledge of "the future") to GS/batch scheduling help?
Backfilling – Response Time
• Backfilling helps short jobs, harms long jobs
Effect of Time Quantum
• Shorter time quantum pros:
  • More responsive system
  • Less external fragmentation
• Longer time quantum pros:
  • Less cache/memory pressure
  • Less synchronization overhead
• Setup:
  • GS at ~75% load
  • Compare Pentium III to Itanium II
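The tradeoff listed above has a simple first-order model: every quantum pays a roughly fixed switching cost (cache refill, coordination), so the fraction of time spent on useful work grows with the quantum while responsiveness shrinks. A toy calculation with illustrative, not measured, numbers:

```python
def quantum_efficiency(quantum_ms, switch_cost_ms):
    """Toy model of the time-quantum tradeoff: each quantum pays a fixed
    per-switch cost, so useful-work fraction = quantum / (quantum + cost).
    The cost term stands in for cache/memory refill and synchronization."""
    return quantum_ms / (quantum_ms + switch_cost_ms)
```

With a 1 ms switch cost, a 100 ms quantum wastes about 1% of the machine while a 10 ms quantum wastes about 9%, which is why architectures with larger cache/memory footprints favor longer quanta.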
Effect of Load
• Comparing FCFS, GS, SB, and FCS, all with backfilling
• Varying the offered load by increasing run times
• Load values: ~40%-90%
• No measurements after the saturation point
Scientific Applications
• Sage and Sweep3D:
  • Hydrodynamics codes
  • Approx. 50%-80% of LANL cycles
  • Memory-constrained
  • Mostly operating out of cache
  • Relatively load-balanced
• Parameters:
  • MPL 2, 100 ms time quantum
  • 1000 jobs, modeled arrival times, random run times
  • Realistic inputs, biased toward short runs
Conclusions - Methodology
• A more realistic evaluation of job scheduling:
  • Repeatable experiments
  • Allows isolation of factors
  • Direct comparison of platforms on the applications you care most about
Conclusions - Experiments
• Significant improvement over FCFS can be achieved with multiprogramming, even at MPL 2
• Backfilling can also make a difference
• Batch scheduling discriminates against short jobs
• Multiprogramming pays off for scientific applications, even at MPL 2
• FCS can outperform explicit/implicit coscheduling

For more information: eitanf@lanl.gov