490 likes | 770 Views
Fair Share Scheduling . Ethan Bolker Mathematics & Computer Science UMass Boston eb@cs.umb.edu www.cs.umb.edu/~eb Queen’s University March 23, 2001. Acknowledgements. Yiping Ding Jeff Buzen Dan Keefe Oliver Chen Chris Thornley. Aaron Ball Tom Larard Anatoliy Rikun Liying Song.
E N D
Fair Share Scheduling Ethan Bolker Mathematics & Computer Science UMass Boston eb@cs.umb.edu www.cs.umb.edu/~eb Queen’s University March 23, 2001
Acknowledgements • Yiping Ding • Jeff Buzen • Dan Keefe • Oliver Chen • Chris Thornley • Aaron Ball • Tom Larard • Anatoliy Rikun • Liying Song References • www.bmc.com/patrol/fairshare • www/cs.umb.edu/~eb/goalmode
Coming Attractions • Queueing theory primer • Fair share semantics • Priority scheduling; conservation laws • Predicting response times from shares • analytic formula • experimental validation • applet simulation • Implementation geometry
Transaction Workload • Stream of jobs visiting a server (ATM, time shared CPU, printer, …) • Jobs queue when server is busy • Input: • Arrival rate: job/sec • Service demand: s sec/job • Performance metrics: • server utilization: u = s (must be 1) • response time: r = ??? sec/job (average) • degradation: d = r/s
Response time computations • r, d measure queueing delay r s (d 1), unless parallel processing possible • Randomness really matters r = s (d = 1) if arrivals scheduled (best case, no waiting) r >> s for bulk arrivals (worst case, maximum delays) • Theorem. If arrivals are Poisson and service is exponentially distributed (M/M/1) then d = 1/(1- u) r = s/(1- u) • Think: virtual server with speed 1-u
M/M/1 • Essential nonlinearity often counterintuitive • at u = 95% average degradation is 1/(1-0.95) = 20, • but 1 customer in 20 has no wait at all (5% idle time) • A useful guide even when hypotheses fail • accurate enough ( 30%) for real computer systems • d depends only on u: many small jobs have same impact as few large jobs • faster system smaller s smaller u r = s/(1-u) double win: less service, less wait • waiting costly, server cheap (telephones): want u 0 • server costly (doctors): want u 1 but scheduled
Scheduling for Performance • Customers want good response times • Decreasing u is expensive • High end Unix offerings from HP, IBM, Sun offer fair share scheduling packages that allow an administrator to allocate scarce resources (CPU, processes, bandwidth) among workloads • How do these packages behave? • Model as a black box, independent of internals • Limit study to CPU shares on a uniprocessor
Multiple Job Streams • Multiple workloads, utilizations u1, u2, … • U = ui < 1 • If no workload prioritization then all degradations are equal: di = 1/(1-U) • Share allocations are de facto prioritizations • Study degradation vector V = (d1, d2, …)
Share Semantics • Suppose workload w has CPU share fw • Normalize shares so that w fw = 1 • w gets fraction fw of CPU time slices when at least one of its jobs is ready for service • Can it use more if competing workloads idle? No : think share = cap Yes : think share = guarantee
Shares As Caps • Good for accounting (sell fraction of web server) • Available now from IBM, HP, soon from Sun • Straightforward (boring) - workloads are isolated • Each runs on a virtual processor with speed *= f share f dedicated system utilization u u/f need f > u ! response time r r(1 u)/(f u)
Shares As Guarantees • Good for performance + economy (use otherwise idle resources) • Shares make a difference only when there are multiple workloads • Large share resembles high priority: share may be less than utilization • Workload interaction is subtle, often unintuitive, hard to explain
OS Performance Goals response time report measure frequently update query workload Model complex scheduling software analytic algorithms fast computation Modeling
Modeling • Real system • Complex, dynamic, frequent state changes • Hard to tease out cause and effect • Model • Static snapshot, deals in averages and probabilities • Fast enlightening answers to “what if ” questions • Abstraction helps you understand real system • Start with a study of priority scheduling
Priority Scheduling • Priority state: order workloads by priority (ties OK) • two workloads, 3 states: 12, 21, [12] • three workloads, 13 states: • 123 (6 = 3! of these ordered states), • [12]3 (3 of these), • 1[23] (3 of these), • [123] (1 state with no priorities) • n wkls, f(n) states, n! ordered (simplex lock combos) • p(s) = prob( state = s ) = fraction of time in state s • V(s) = degradation vector when state = s (measure this, or compute it using queueing theory) • V = s p(s)V(s) (time avg is convex combination) • Achievable region is convex hull of vectors V(s)
Two workloads d1 = d2 d2 V(12) (wkl 1 high prio) V([12]) (no priorities) achievable region V(21) d1
Two workloads d1 = d2 d2 V(12) (wkl 1 high prio) V([12]) (no priorities) 0.5 V(12) + 0.5V(21) V([12]) V(21) d1
Two workloads d1 = d2 d2 V(12) (wkl 1 high prio) V([12]) (no priorities) note: u1 < u2 wkl 2 effect on wkl 1 large V(21) d1
Conservation • No Free Lunch Theorem. Weighted average degradation is constant, independent of priority scheduling scheme: i (ui /U) di = 1/(1-U) • Provable from some hypotheses • Observable in some real systems • Sometimes false: shortest job first minimizes average response time (printer queues, supermarket express checkout lines)
Conservation • For any proper set A of workloads Imagine giving those workloads top priority. Then can pretend other wkls don’t exist. In that case i A (ui /U(A)) di= 1/(1-U(A)) When wkls in A have lower priorities they have higher degradations, so in general i A (ui /U(A)) di 1/(1-U(A)) • These 2n -2 linear inequalities determine the convex achievable regionR • R is a permutahedron: only n! vertices
Two Workloads conservation law: (d1, d2 ) lies on the line d2 : workload 2 degradation u1d1 + u2d2 = 1/(1-U) d1 : workload 1 degradation
Two Workloads d2 : workload 2 degradation constraint resulting from workload 1 d1 1/(1- u1 ) d1 : workload 1 degradation
Two Workloads Workload 1 runs at high priority: V(1,2) = (1 /(1- u1 ), 1 /(1- u1 )(1-U) ) d2 : workload 2 degradation constraint resulting from workload 1 d1 1 /(1- u1 ) d1 : workload 1 degradation
Two Workloads d2 : workload 2 degradation u1d1 + u2d2 = 1/(1-U) d2 1 /(1- u2 ) V(2,1) d1 : workload 1 degradation
Two Workloads V(1,2) = (1 /(1- u1 ), 1 /(1- u1 )(1-U) ) d2 : workload 2 degradation achievable region R u1d1 + u2d2 = 1/(1-U) d2 1 /(1- u2 ) V(2,1) d1 1 /(1- u1 ) d1 : workload 1 degradation
Three Workloads • Degradation vector (d1,d2, d3) lies on plane u1 d1 + u2 d2 + u3dr3 = C • We know a constraint for each workload w: uw dw Cw • Conservation applies to each pair of wkls as well: u1 d1 + u2 d2 C12 • Achievable region has one vertex for each priority ordering of workloads: 3! = 6 in all • Hence its name: the permutahedron
Three Workload Permutahedron 3! = 6 vertices (priority orders) 23 - 2 = 6 edges (conservation constraints) u1 r1 + u2 d2 + u3 d3 = C d3 V(1,2,3) V(2,1,3) d2 d1
Four workload permutahedron 4! = 24 vertices (ordered states) 24 - 2 = 14 facets (proper subsets) (conservation constraints) 74 faces (states) Simplicial geometry and transportation polytopes, Trans. Amer. Math. Soc. 217 (1976) 138.
Map shares to degradations- two workloads - • Suppose f1 and f2 > 0 , f1 + f2 = 1 • Model: System operates in state • 12 with probability f1 • 21 with probability f2 (independent of who is on queue) • Average degradation vector: V = f1 V(12) + f2 V(21)
Predict Degradations From Shares(Two Workloads) • Reasonable modeling assumption: f1 = 1, f2 = 0 means workload 1 runs at high priority • For arbitrary shares: workload priority order is (1,2) with probability f1 (2,1) with probability f2 (probability = fraction of time) • Compute average workload degradation: d1 = f1 (wkl 1 degradation at high priority) + f2 (wkl 1 degradation at low priority ) Fair Share Scheduling
Map shares to degradations- three (n) workloads - f1 f2 f3 prob(123) = ------------------------------ (f1 + f2 +f3) (f2 +f3) (f3) • Theorem: These n! probabilities sum to 1 • interesting identity generalizing adding fractions • prove by induction, or by coupon collecting • V = ordered states s prob(s) V(s) • O(n!), (n!), good enough for n 9 (12)
The Fair Share Applet • Screen captures on next slides are from www.bmc.com/patrol/fairshare • Experiment with “what if” fair share modeling • Watch a simulation • Random virtual job generator for the simulation is the same one used to generate random real jobs for our benchmark studies
1 2 3 Three Transaction Workloads ??? • Three workloads, each with utilization 0.32 jobs/second 1.0 seconds/job = 0.32 = 32% • CPU 96% busy, so average (conserved) response time is 1.0/(10.96) = 25 seconds • Individual workload average response times depend on shares ??? ???
1 sum 80.0 32.0 2 48.0 3 20.0 Three Transaction Workloads • Normalized f3 = 0.20 means 20% of the time workload 3 (development) would be dispatched at highest priority • During that time, workload priority order is (3,1,2) for 32/80 of the time, (3,2,1) for 48/80 • Probability( priority order is 312 ) = 0.20(32/80) = 0.08
Three Transaction Workloads • Formulas on previous slide • Average predicted response time weighted by throughput 25 seconds (as expected) • Hard to understand intuitively • Software helps
note change from 32% Three Transaction Workloads
jobs currently on run queue Simulation
When the Model Fails • Real CPU uses round robin scheduling to deliver time slices • Short jobs never wait for long jobs to complete • That resembles shortest job first, so response time conservation law fails • At high utilization, simulation shows smaller response times than predicted by model • Response time conservation law yields conservative predictions
Scaling Degradation Predictions • V = ordered states s prob(s) V(s) • Each s is a permutation of (1,2, … , n) • Think of it as a vector in n-space • Those n! vectors lie on of a sphere • For n large they are pretty densely packed • Think of prob(s) as a discrete approximation to a probability distribution on the sphere • V is an integral
Monte Carlo • loop sampleSize times choose a permutation s at random from the distribution determined by the shares compute degradation vector V(s) accumulate V += prob(s)V(s) • sampleSize = 40000 works well independent of n!
Map shares to degradations(geometry) • Interpret shares as barycentric coordinates in the n-1 simplex • Study the geometry of the map from the simplex to the n-1 dimensional permutahedron • Easy when n=2: each is a line segment and map is linear
Mapping a triangle to a hexagon f3 = 1 f1 = 0 312 132 f1 = 1 M f3 = 0 321 123 wkl 1 high priority 213 231 wkl 1 low priority
Mapping a triangle to a hexagon f1 = 0 f1 = 1 {23}
What This Means • Add a strong statement that summarizes how you feel or think about this topic • Summarize key points you want your audience to remember