Parallel Computing Systems, Part III: Job Scheduling
Dror Feitelson, Hebrew University
Types of Scheduling
• Task scheduling
  – Application is partitioned into tasks
  – Tasks have precedence constraints
  – Need to map tasks to processors
  – Need to consider communication too
  – Part of creating an application
• Job scheduling
  – Scheduling competing jobs belonging to different users
  – Part of the operating system
Dimensions of Scheduling
• Space slicing
  – Partition the machine into disjoint parts
  – Jobs get exclusive use of a partition
• Time slicing
  – Multitasking on each processor
  – Similar to conventional systems
• Use both together
• Use none: batch scheduling on a dedicated machine
Feitelson, RC 19790, 1997
Space Slicing
• Fixed: predefined partitions
  – Used on the CM-5
• Variable: carve out the number requested
  – Used on most systems: Paragon, SP, …
  – Some restrictions may apply, e.g. on a torus
• Adaptive: modify the request size according to system considerations
  – Fewer nodes if more jobs are present (see the sketch below)
• Dynamic: modify the size at runtime too
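To make the adaptive policy concrete, here is a minimal sketch of one possible rule: grant the requested size under light load, but shrink toward an equal share as more jobs compete. The rule and all names are assumptions for illustration, not any specific system's policy.

```python
# Hypothetical adaptive partitioning rule (illustrative, not from a real system):
# grant the requested size under light load, but never more than an equal
# share of the machine when many jobs compete.
def adaptive_partition(requested, total_procs, num_jobs):
    fair_share = max(1, total_procs // max(1, num_jobs))
    return min(requested, fair_share)  # fewer nodes if more jobs are present

# Example: a request for 64 nodes on a 128-node machine gets all 64 nodes
# when running alone, but only 32 when four jobs compete.
assert adaptive_partition(64, 128, 1) == 64
assert adaptive_partition(64, 128, 4) == 32
```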
Time Slicing
• Uncoordinated: each PE schedules on its own
  – Local queue (processes allocated to PEs): requires load balancing
  – Global queue: provides automatic load sharing, but the queue may become a bottleneck
• Coordinated across multiple PEs
  – Explicit gang scheduling
  – Implicit co-scheduling
Scheduling Framework
[Diagram: arriving jobs, queue, allocation, terminating jobs]
• Partitioning with run-to-completion
• Order of taking jobs from the queue
• Re-definition of job size
Scheduling Framework
[Diagram: arriving jobs, preemption, terminating jobs]
• Time slicing with preemption
• Setting time quanta and priorities
• Can jobs migrate or change size when preempted?
Memory Considerations
• The processes of a parallel application typically communicate
• To make good progress, they should all run simultaneously
• A process that suffers a page fault is unavailable for communication
• Paging should therefore be avoided
Scheduling Framework
[Diagram: memory allocation and dispatching stages]
• Two stages of scheduling
• Or three stages, with swapping
Batch Systems
• Define a system of queues with different combinations of resource bounds (a hypothetical example follows)
• Schedule FCFS from these queues
• Different queues are active at prime vs. non-prime time
• Sophisticated/complex services provided
  – Accounting and limits on users/groups
  – Staging of data in and out of the machine
  – Political prioritization as needed
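A hypothetical queue definition in the spirit described above; all names and bounds are invented for illustration, not from any real installation.

```python
# Hypothetical batch queue table: each queue bounds runtime, processors,
# and memory, and jobs are taken FCFS from whichever queues are active.
QUEUES = [
    # name       max_hours  max_procs  max_mem_mb  active_period
    ("short",     1,         64,        16,        "prime"),
    ("long",     12,        256,        32,        "non-prime"),
    ("low-pri",  48,        256,        32,        "non-prime"),
]

def queue_for(job_hours, job_procs, job_mem_mb):
    """Route a job to the first queue whose bounds it fits (illustrative)."""
    for name, max_h, max_p, max_m, _ in QUEUES:
        if job_hours <= max_h and job_procs <= max_p and job_mem_mb <= max_m:
            return name
    return None  # job exceeds all bounds and is rejected
```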
Example – SDSC Paragon
[Figure: queue structure by time limit and node memory (16MB / 32MB), plus a low-priority queue]
Wan et al., JSSPP 1996
The Problem
• Fragmentation
  – If the first queued job needs more processors than are available, it must wait for more to be freed
  – Available processors remain idle during the wait
• FCFS (first come, first served)
  – Short jobs may be stuck behind long jobs in the queue
The Solution
• Out-of-order scheduling
  – Allows for better packing of jobs
  – Allows for prioritization according to desired considerations
Backfilling
• Allow jobs from the back of the queue to jump over previous jobs
• Make reservations for jobs at the head of the queue to prevent starvation
• Requires estimates of job runtimes
Lifka, JSSPP 1995
Example
[Figure: FCFS schedule of jobs 1–4 (processors vs. time)]
Example
[Figure: backfilled schedule of jobs 1–4, with a reservation for the first queued job (processors vs. time)]
Parameters
• Order for going over the queue
  – FCFS
  – Some prioritized order (Maui)
• How many reservations to make
  – Only one (EASY)
  – For all skipped jobs (conservative)
  – According to need
• Lookahead
  – Consider one job at a time
  – Look deeper into the queue
EASY Backfilling
Extensible Argonne Scheduling System (first large IBM SP installation)
• Definitions:
  – Shadow time: the time at which the first queued job can run
  – Extra processors: processors left over when the first job runs
• Backfill if:
  – The job will terminate by the shadow time, or
  – The job needs no more than the extra processors
Lifka, JSSPP 1995
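A minimal sketch of the EASY backfill test using the definitions above. The data structures (a job record with size and estimate fields, a list of running jobs with estimated finish times) are assumptions for illustration, not Lifka's implementation.

```python
from dataclasses import dataclass

@dataclass
class Job:               # illustrative job record
    size: int            # processors requested
    estimate: float      # user runtime estimate

def can_backfill(job, head, free, running, now):
    """May `job` start now without delaying the first queued job (`head`)?
    `running` is a list of (estimated_finish_time, procs) pairs."""
    if job.size > free:
        return False                 # does not fit right now
    # Shadow time: earliest time the head job can start, assuming running
    # jobs free their processors at their estimated finish times.
    avail, shadow = free, now
    for finish, procs in sorted(running):
        if avail >= head.size:
            break
        avail += procs
        shadow = finish
    extra = avail - head.size        # processors left over at the shadow time
    # Backfill if the job terminates by the shadow time,
    # or needs no more than the extra processors.
    return now + job.estimate <= shadow or job.size <= extra
```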
First Case
[Figure: a backfilled job that terminates by the shadow time (processors vs. time)]
Second Case
[Figure: a backfilled job that uses only the extra processors (processors vs. time)]
Properties
• Unbounded delay
  – Backfilled jobs will not delay the first queued job
  – But they may delay other queued jobs…
Mu’alem & Feitelson, IEEE TPDS 2001
Delay
[Figure: schedule of jobs 1–4 (processors vs. time)]
Delay
[Figure: the same schedule after backfilling, showing the delay imposed on a queued job]
Properties (continued)
• No starvation
  – The delay of the first queued job is bounded by the runtimes of the currently running jobs
  – When it runs, the second queued job becomes first
  – It is then immune to further delays
Mu’alem & Feitelson, IEEE TPDS 2001
User Runtime Estimates
• Short estimates allow a job to backfill and skip the queue
• Estimates that are too short risk the job being killed for exceeding its allotted time
• So estimates may be expected to be accurate
They Aren’t
[Figure: measured accuracy of user runtime estimates]
Mu’alem & Feitelson, IEEE TPDS 2001
Surprising Consequence
Performance is actually better if runtime estimates are inaccurate!
Experiment: replace user estimates by up to f times the actual runtime
[Figure: results for the KTH workload]
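A minimal sketch of the estimate-replacement step in this experiment. The exact distribution used in the original study is an assumption here (uniform over [1, f]).

```python
import random

# Replace a job's user estimate by its actual runtime multiplied by a random
# factor of up to f, as in the experiment above (uniform factor is an assumption).
def perturbed_estimate(runtime, f, rng=random.Random(42)):
    return runtime * rng.uniform(1.0, f)
```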
Exercise: Understand Why This Happens
• Run simulations of EASY backfilling with real workloads
• Insert instrumentation to record detailed behavior
• Try to find why f=10 is better than f=1
• Try to find why user estimates are so bad
Hint
• It may be beneficial to look at different job classes
• Example: EASY vs. conservative
  – EASY favors small long jobs: they can backfill despite delaying non-first jobs
  – This comes at the expense of large short jobs
  – This happens more with user estimates than with accurate estimates
Another Surprise
Possible to improve performance by multiplying user estimates by 2!
[Table: resulting performance improvement, as reduction in %]
The Maui Scheduler
Queue order depends on (a sketch of such a priority function follows):
• Waiting time in the queue
  – Promotes equitable service
• Fair-share status
• Political priority
• Job parameters
  – Favor small/large jobs, etc.
• Number of times skipped by backfilling
  – Prevents starvation
• Problem: conflicts are possible, and it is hard to figure out what will happen
Jackson et al., JSSPP 2001
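A hypothetical weighted-sum priority function in the spirit of Maui's multi-factor ordering; the factor names, fields, and weights are assumptions, not Maui's actual configuration keywords.

```python
# Hypothetical multi-factor priority in the spirit of Maui (not its real API).
def priority(job, now, w):
    factors = {
        "wait":      now - job.submit_time,   # promote equitable service
        "fairshare": job.fairshare_credit,    # below-target users get a boost
        "political": job.admin_priority,      # site policy
        "size":      job.size,                # favor large (or small) jobs
        "skipped":   job.times_skipped,       # prevent starvation by backfilling
    }
    return sum(w[k] * v for k, v in factors.items())

# Conflicts arise because the weighted factors can pull in opposite
# directions, which is why the combined behavior is hard to predict.
```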
Fair Share
• Actually unfair: strives for a specific share
• Based on comparison with historical data
• Parameters:
  – How long to keep information
  – How to decay old information (see the sketch below)
  – Specifying shares per user or group
  – Shares are upper bounds, lower bounds, or both
• Multiple resources handled by the maximal “PE equivalents” (usage out of the total available)
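A minimal sketch of decayed historical usage, assuming exponential decay per time window; both the window structure and the decay factor are assumptions.

```python
# Exponentially decayed usage: recent windows count more than old ones.
def decayed_usage(usage_per_window, decay=0.5):
    """usage_per_window: resource usage per past window, most recent first."""
    return sum(u * decay**i for i, u in enumerate(usage_per_window))

# Example: a user who consumed 100, 50, 0 PE-hours in the last three windows.
print(decayed_usage([100, 50, 0]))  # 125.0
```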
Lookahead
• EASY uses a greedy algorithm and considers jobs in one given order
• The alternative is to consider a set of jobs at once and try to derive an optimal packing
Dynamic Programming
• Outer loop: the number of jobs being considered
• Inner loop: the number of processors that are available
[Table: e.g. cell (2, 3) holds the utilization achievable on 3 processors using only the first 2 jobs]
Edi Shmueli, IBM Haifa
Cell Update
• If j.size > p, the job is too big to consider:
  – u[j, p] = u[j-1, p]; j is not selected
• Else, consider adding job j:
  – u' = u[j-1, p - j.size] + j.size
  – If u' > u[j-1, p], then u[j, p] = u'; j is selected
  – Else u[j, p] = u[j-1, p]; j is not selected
Preventing Starvation
• Option I: only use jobs that will terminate by the shadow time
• Option II: make a reservation for the first queued job (as in EASY)
  – Requires a 3D data structure:
    – Jobs being considered
    – Processors being used now
    – Extra processors used at the shadow time
Dynamic Programming (continued)
• In the end, the bottom-right cell contains the maximal achievable utilization
• The set of jobs to schedule is obtained from the path of selected jobs (a sketch follows)
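Putting the recurrence together, here is a minimal sketch of the basic 2D version (without the shadow-time reservation of Option II); the function and variable names are illustrative, not the original implementation.

```python
# Lookahead backfilling DP: u[j][p] = maximal utilization achievable on p
# processors using only the first j queued jobs (each job requests `size`
# processors). This is the cell-update rule from the slide above.
def max_utilization(sizes, procs):
    n = len(sizes)
    u = [[0] * (procs + 1) for _ in range(n + 1)]
    selected = [[False] * (procs + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        size = sizes[j - 1]
        for p in range(procs + 1):
            u[j][p] = u[j - 1][p]            # default: job j not selected
            if size <= p:
                cand = u[j - 1][p - size] + size
                if cand > u[j - 1][p]:       # selecting job j improves utilization
                    u[j][p] = cand
                    selected[j][p] = True
    # Trace back the path of selected jobs from the bottom-right cell
    chosen, p = [], procs
    for j in range(n, 0, -1):
        if selected[j][p]:
            chosen.append(j - 1)
            p -= sizes[j - 1]
    return u[n][procs], list(reversed(chosen))

# Example: 4 free processors, queued jobs asking for 3, 2, and 2 processors.
# The greedy order would take the 3-processor job and waste one processor;
# the DP picks the two 2-processor jobs for full utilization.
print(max_utilization([3, 2, 2], 4))  # (4, [1, 2])
```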
Performance
• Backfilling leads to significant performance gains relative to FCFS
• More reservations reduce performance somewhat (EASY is better than conservative)
• Lookahead improves performance somewhat
Two-Level Scheduling
• Bottom level: processor allocation
  – Done by the system
  – Balances requests with availability
  – Can change at runtime
• Top level: process scheduling
  – Done by the application
  – Uses knowledge about priorities, holding locks, etc.
Programming Model
Applications are required to handle arbitrary changes in their allocated processors:
• Workpile model
  – Easy to change the number of worker threads (a sketch follows)
• Scheduler activations
  – Any change causes an upcall into the application, which can reconsider what to run
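A minimal workpile sketch showing why the model tolerates changes in the allocated processors: workers pull tasks from a shared queue, so the worker count can simply be set to match however many processors are currently available. All names here are illustrative.

```python
import queue
import threading

# Workpile: independent workers pull tasks from a shared queue, so the
# number of workers can match the current processor allocation.
def run_workpile(tasks, num_workers, worker_fn):
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return               # no work left: the worker simply exits
            worker_fn(task)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

# Example: the same workload runs correctly with 2 or 8 workers.
run_workpile(range(10), 4, print)
```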
Equipartitioning
• Strive to give all applications equal numbers of processors
  – When a job arrives, take some processors from each running job
  – When a job terminates, give some to each remaining job
• Fair, and similar to processor sharing
• Caveats (see the sketch below):
  – Applications may have a maximal number of processors they can use efficiently
  – Applications may need a minimal number of processors due to memory constraints
  – Reconfigurations require many process migrations (not an issue for shared memory)
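A minimal sketch of equipartitioning under the caveats above, with a per-job minimum (memory constraint) and maximum (useful parallelism). All names are illustrative, and it assumes the minima fit within the machine.

```python
# Equipartition with per-job (min, max) processor constraints.
def equipartition(total_procs, jobs):
    """jobs: list of (min_procs, max_procs); returns an allocation per job."""
    alloc = [lo for lo, hi in jobs]      # first satisfy each job's minimum
    free = total_procs - sum(alloc)
    # Hand out remaining processors one at a time, most-starved job first,
    # never exceeding a job's maximal useful parallelism.
    while free > 0:
        candidates = [i for i, (lo, hi) in enumerate(jobs) if alloc[i] < hi]
        if not candidates:
            break                        # every job is at its maximum
        i = min(candidates, key=lambda i: alloc[i])
        alloc[i] += 1
        free -= 1
    return alloc

# Example: 16 processors, three jobs with (min, max) constraints.
print(equipartition(16, [(1, 8), (2, 4), (1, 16)]))  # [6, 4, 6]
```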
Folding
• Reduce processor preemptions by selecting a partition and dividing it in half
• All partition sizes are powers of 2
• Easier for applications: when halved, multitask two processes on each processor
McCann & Zahorjan, SIGMETRICS 1994
The Bottom Line
• Places restrictions on the programming model
  – OK for workpile, Cray autotasking
  – Not suitable for MPI
• Very efficient at the system level
  – No fragmentation
  – Load leads to smaller partitions and reduced overheads for parallelism
• Of academic interest only, in shared-memory architectures
Definition: Gang Scheduling
• Processes are mapped one-to-one onto processors
• Time slicing is used for multiprogramming
• Context switching is coordinated across processors (see the sketch below)
  – All processes are switched at the same time
  – Either all run or none do
• This applies to gangs, typically all the processes in a job
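A minimal sketch of coordinated context switching via an Ousterhout-style matrix: rows are time slots, columns are processors, and a job's whole gang occupies one row, so its processes always run (and are switched) together. The names and placement policy are illustrative assumptions.

```python
# Ousterhout-style matrix: rows = time slots, columns = processors. All of a
# job's processes sit in one row, so they are context-switched together.
def build_matrix(job_sizes, procs, slots):
    matrix = [[None] * procs for _ in range(slots)]
    for jid, size in enumerate(job_sizes):
        for row in matrix:
            free = [p for p in range(procs) if row[p] is None]
            if len(free) >= size:        # the whole gang fits in this slot
                for p in free[:size]:
                    row[p] = jid
                break                    # jobs that fit nowhere stay queued
    return matrix

# Each time quantum the system runs one row round-robin: either all of a
# job's processes execute simultaneously, or none do.
print(build_matrix([4, 2, 2], procs=4, slots=2))
# [[0, 0, 0, 0], [1, 1, 2, 2]]
```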
CoScheduling
• A variant in which an attempt is made to schedule all the processes, but subsets may also be scheduled
• Assumes a “process working set” that should run together to make progress
• Does this make sense?
  – All processes are active entities
  – Are some more important than others?
Ousterhout, ICDCS 1982