Motivation

Job Scheduling for the BlueGene/L System • Elie Krevat, Jose G. Castanos, Jose E. Moreira • Presented by Savitha Krishnamoorthy • CIS 888, The Ohio State University

Motivation Problems associated with toroidal interconnects: • Require rectangular, contiguous job partitions • Introduce fragmentation, which hurts utilization and wait time • Lead to job slowdown

Toroidal Interconnects • “Endless” connection • Simple, modular, scalable • Examples: Cray T3D and T3E machines • Problems: • Nodes are not fully connected and not equidistant • Spatial location of nodes is critical when allocating jobs • Fragmentation due to rectangular, contiguous partitions

A 2D Torus Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Node 8

Schemes analyzed • Space-sharing scheduling techniques • Backfilling • Moves a lower-priority job (later in FCFS order) ahead in the queue • No delay to the higher-priority job • Migration • On-the-fly defragmentation

FCFS Scheduler • Maximize largest free rectangular partition remaining in Torus • Invoked whenever job arrives/ terminates • Rectangles requiring prime number of nodes can’t be found • Simplest Algorithm

FCFS with Backfilling • Improves system utilization • Requires an estimate of job execution time • But we know that overestimating execution time doesn’t affect backfilling • Invoked when the waiting queue is not empty and the FCFS scheduler has halted
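The "no delay to the high priority job" condition can be sketched as a reservation check. This is a minimal illustration in the style of reservation-based backfilling, not necessarily the exact rule used in the paper; it ignores torus geometry entirely, and the names `Job`, `can_backfill`, `head_start`, and `spare_nodes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    size: int        # nodes requested
    estimate: float  # user-estimated execution time


def can_backfill(job, now, free_nodes, head_start, spare_nodes):
    """Return True if `job` may start now without delaying the queue head.

    head_start  -- earliest time the head-of-queue job is expected to start
    spare_nodes -- nodes that remain free even once the head job has started
    (Both are assumed to be supplied by the scheduler; geometry is ignored.)
    """
    if job.size > free_nodes:
        return False  # does not fit right now
    finishes_in_time = now + job.estimate <= head_start
    fits_beside_head = job.size <= spare_nodes
    return finishes_in_time or fits_beside_head
```

Because the check only uses the estimate as an upper bound, a job that finishes earlier than estimated cannot push the head job back.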

FCFS - With and Without Backfilling

FCFS+Migration • Rearrange running jobs to enlarge the contiguous rectangular free partition • Empty torus -> Reschedule • Decision metrics: • FNtor = ratio of free nodes to torus size • FNmax = fraction of free nodes in the maximal free partition

Backfilling + Migration • Schedule via FCFS first • Rearrange torus through migration to minimize fragmentation • Repeat FCFS • Finally backfill

The BG/L System • 32x32x64 3D torus of cells (nodes) • Each cell has a processor, memory, and links to 6 neighbors • Unit of job allocation: an 8x8x8 block of nodes • Each unit is a supernode • BG/L is therefore a 4x4x8 torus of supernodes
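The arithmetic on this slide can be checked with a short sketch, which also enumerates the rectangular shapes available to a job of a given size (in supernodes). The names `TORUS_DIMS` and `shapes_for` are hypothetical and only illustrate the geometry.

```python
from itertools import product

NODE_DIMS = (32, 32, 64)   # torus of nodes, from the slide
SUPERNODE_EDGE = 8         # a supernode is an 8x8x8 block of nodes
TORUS_DIMS = tuple(d // SUPERNODE_EDGE for d in NODE_DIMS)  # (4, 4, 8)


def shapes_for(size, dims=TORUS_DIMS):
    """All (a, b, c) box shapes with a*b*c == size that fit inside dims."""
    return [
        (a, b, c)
        for a, b, c in product(*(range(1, d + 1) for d in dims))
        if a * b * c == size
    ]


total_supernodes = TORUS_DIMS[0] * TORUS_DIMS[1] * TORUS_DIMS[2]  # 128
```

A job of 8 supernodes has several candidate shapes, for example (2, 2, 2) or (1, 1, 8), whereas a size such as 11 has no rectangular shape at all on a 4x4x8 torus.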

The Simulation Environment • Simulator input: job log (arrival time, execution time, size of job), type of scheduler (FCFS, B, M, B+M) • 4 primary events: • Arrival: when a job is first submitted and placed in the scheduler’s waiting queue • Schedule: when a job is allocated onto the torus • Start: when a job begins to run (? why 1 second) • Finish: when a job completes and is deallocated

Metrics • Torus size: N • Arrival time of job j: taj • Execution time: tej • Size of job: sj • Start time: tsj • Finish time: tfj

Parameters • Wait time: twj = tsj - taj • Response time: trj = tfj - taj • Bounded slowdown: • A bound is used because jobs with very short execution times would otherwise skew the slowdown
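The bounded slowdown formula did not survive on this slide. A minimal sketch using the commonly used definition (response time over execution time, with both clamped from below by a threshold) is given here; the 10-second default for `bound` is an assumption, not a value taken from the slide.

```python
def wait_time(t_arrive, t_start):
    """twj = tsj - taj"""
    return t_start - t_arrive


def response_time(t_arrive, t_finish):
    """trj = tfj - taj"""
    return t_finish - t_arrive


def bounded_slowdown(t_arrive, t_finish, t_exec, bound=10.0):
    """max(response, bound) / max(execution, bound).

    `bound` (seconds) is an assumed threshold; it keeps very short jobs
    from dominating the mean slowdown.
    """
    t_resp = response_time(t_arrive, t_finish)
    return max(t_resp, bound) / max(t_exec, bound)
```

For a 1-second job that waits 50 seconds, the raw slowdown would be 51, while the bounded form divides by the threshold instead of the 1-second execution time.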

Parameters contd… • System utilization: T is the makespan • Total unused capacity: • f(t) = free nodes at time t • q(t) = total number of nodes requested at time t • A measure of capacity left unused due to lack of jobs

Parameters contd… • The product T*N is the maximum capacity the system could deliver over the makespan • The balance of the system capacity is considered lost

Workload characteristics • Experiments performed on a 10000-job span of each of 2 job logs • NASA Ames 128-node iPSC/860 • SDSC 128-node machines
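The formulas for these three quantities are missing from the slides. The sketch below is one reading of the definitions given in the bullets, and it rests on stated assumptions: utilization is total job work divided by T*N, unused capacity integrates max(0, f(t) - q(t)) over time and divides by T*N, and lost capacity is whatever remains. All function names are hypothetical, and time is treated as piecewise constant.

```python
def utilization(jobs, makespan, torus_size):
    """jobs: iterable of (size, exec_time) pairs. Assumed form: sum(s*te)/(T*N)."""
    return sum(size * t_exec for size, t_exec in jobs) / (makespan * torus_size)


def unused_capacity(intervals, makespan, torus_size):
    """intervals: iterable of (duration, free_nodes, requested_nodes).

    Assumed form: only free nodes in excess of what waiting jobs request
    count as unused for lack of jobs.
    """
    idle = sum(dt * max(0, free - req) for dt, free, req in intervals)
    return idle / (makespan * torus_size)


def lost_capacity(rho, rho_unused):
    """Balance of capacity that was neither used nor idle for lack of jobs."""
    return 1.0 - rho - rho_unused


# Worked example on a 128-node torus with makespan T = 100.
rho = utilization([(64, 100), (64, 50)], makespan=100, torus_size=128)
rho_u = unused_capacity([(50, 0, 0), (50, 64, 32)], makespan=100, torus_size=128)
rho_lost = lost_capacity(rho, rho_u)
```

In the example, 64 nodes sit free for the second half of the run while 32 nodes' worth of work is waiting, so half of that idle capacity counts as unused for lack of jobs and the other half as lost.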

Work load Summary

Size Vs Workload

Wait time Vs Utilization

Mean job slowdown Vs Utilization

Comparing fully connected models

Performing Migration • Recall… • Parameters that determine whether to attempt a migration: FNtor and FNmax • FNtor = free nodes : size of torus • FNmax = free nodes in maximal free partition : free nodes • Migration attempted when: • FNtor >= 0.1; FNmax <= 0.7
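The decision rule can be sketched as a small predicate. It assumes both thresholds must hold at once (the slide separates them with a semicolon), and the function and parameter names are hypothetical.

```python
def should_attempt_migration(free_nodes, torus_size, max_free_partition,
                             fn_tor_min=0.1, fn_max_max=0.7):
    """Return True when enough nodes are free but they are too fragmented.

    fn_tor = free nodes / torus size
    fn_max = nodes in the largest free partition / free nodes
    Requiring both conditions is an assumption about the slide's intent.
    """
    if free_nodes == 0:
        return False  # nothing to defragment
    fn_tor = free_nodes / torus_size
    fn_max = max_free_partition / free_nodes
    return fn_tor >= fn_tor_min and fn_max <= fn_max_max
```

With 32 of 128 nodes free and a largest free partition of 16, FNtor is 0.25 and FNmax is 0.5, so a migration would be attempted; if all 32 free nodes already form one partition, FNmax is 1.0 and it would not.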

Migrations Vs Utilization

Average Time Between Migrations Vs Utilization

Comments…+,- • Compared the schedulers when applied to fully connected topologies • Studied the effect of fragmentation on utilization, wait time and slowdown • Showed how the scheduler affected utilization • Could have given average job wait time statistics for each scheduler • Fragmentation is the important distinction • Could have compared unused capacity, using the fully connected system as the ideal

Advantage of parameters • Frequency of migration attempts  • Average benefit of a successful migration  • Comparison of job wait times with: • Scheduler that uses the parameters • Scheduler that always migrates

Mean Job wait time Vs Utilization

Capacity Statistics

POP Algorithm • Projection of Partitions • Solves the problem of finding the largest free rectangular partition • Exhaustive search is O(M^9) for an MxMxM torus • POP is O(M^5)

Basic Algorithm • Given a base location (one of M^3), find the largest partition first in 1 dimension • Project the adjacent dimension, find the largest partition in 2D • Project adjacent 2D planes, find the largest partition in 3D
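The first step, finding the largest 1D partition from every base location, can be sketched as follows. This is only the 1D stage and not the full POP algorithm; it assumes a cubic MxMxM torus stored as a nested list of booleans, and `run_lengths_1d` is a hypothetical name.

```python
def run_lengths_1d(free, axis):
    """For each base (i, j, k), count consecutive free nodes along `axis`.

    free -- M x M x M nested list of bools (True means the node is free)
    axis -- 0, 1 or 2
    Runs wrap around the torus and are capped at M. Each of the M^3 bases
    walks at most M nodes, so one axis costs O(M^4).
    """
    m = len(free)
    step = [0, 0, 0]
    step[axis] = 1
    runs = [[[0] * m for _ in range(m)] for _ in range(m)]
    for i in range(m):
        for j in range(m):
            for k in range(m):
                length = 0
                x, y, z = i, j, k
                while length < m and free[x][y][z]:
                    length += 1
                    x = (x + step[0]) % m
                    y = (y + step[1]) % m
                    z = (z + step[2]) % m
                runs[i][j][k] = length
    return runs
```

The later 2D and 3D stages would combine these run lengths across adjacent rows and planes; that part is not shown here.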

The Algorithm • FREEPART = {<B,S> | B = base location (i,j,k); S = partition size (a,b,c), s.t. for all x,y,z with i<=x<(i+a), j<=y<(j+b), k<=z<(k+c), Node(x%M, y%M, z%M) is free} • Largest 1D partitions PFREEPART pre-computed for all 3 dimensions in O(M^4) time (every possible base location)
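The FREEPART membership condition translates directly into code. The sketch below pairs it with the exhaustive search that POP is meant to replace (M^3 bases, M^3 sizes, and up to M^3 nodes checked per candidate); it is a baseline for illustration, not the POP algorithm, and the function names are hypothetical.

```python
from itertools import product


def is_free_partition(free, base, size):
    """True if <base, size> is in FREEPART: every covered node is free.

    Coordinates wrap modulo M, matching Node(x%M, y%M, z%M) in the definition.
    """
    m = len(free)
    i, j, k = base
    a, b, c = size
    return all(
        free[x % m][y % m][z % m]
        for x in range(i, i + a)
        for y in range(j, j + b)
        for z in range(k, k + c)
    )


def largest_free_partition(free):
    """Exhaustive O(M^9) search. Returns (volume, base, size) or (0, None, None)."""
    m = len(free)
    best = (0, None, None)
    for base in product(range(m), repeat=3):
        for size in product(range(1, m + 1), repeat=3):
            volume = size[0] * size[1] * size[2]
            # Only test candidates that would improve on the best so far.
            if volume > best[0] and is_free_partition(free, base, size):
                best = (volume, base, size)
    return best
```

On a 3x3x3 torus with a single busy node, the largest free partition has 18 nodes: a 3x3x2 slab (in some orientation) that wraps around the torus to avoid the busy node.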

The Algorithm contd…

Future Work
