Stochastic Models and Analysis for Resource Management in Server Farms. Thesis Proposal VARUN GUPTA. SMART: Stochastic Models and Analysis for Resource managemenT in server farms.
Stochastic Models and Analysis for Resource Management in Server Farms Thesis Proposal VARUN GUPTA
SMART : Stochastic Models and Analysis for Resource managemenT in server farms Thesis Proposal VARUN GUPTA
SARCASM : Stochastic models and Analysis for ResourCe mAnagement in Server farMs Thesis Proposal VARUN GUPTA
Server farm popularity is on the rise
Examples: Supercomputers, Data center pods, Cloud computing, Array-of-Wimpy-Nodes, Multi-core chips
+ high compute capacity
+ incremental growth
+ fault-tolerance
+ efficient resource utilization
+ energy efficiency
+ high parallelism
A simple server farm “template”: a frontend dispatcher/router in front of backend servers
Design Choice 1: Which server to send a job to?
Design Choice 2: Scheduling policy for backend servers?
Design Choice 3: When to turn servers on/off for efficiency?
Design Choice 4: How many servers to buy? Of what capacity?
A simple server farm “template”: a frontend dispatcher/router in front of backend servers
Design Choice 1: Dispatching policy
Design Choice 2: Scheduling policy
Design Choice 3: Dynamic capacity scaling
Design Choice 4: Provisioning
Thesis Goal
Stochastic modeling and analysis to answer the questions faced by server farm designers/managers
Long history of stochastic modeling and analysis
• Erlang (1909): Operator provisioning in telephone exchanges
• Inventory/production management
• Call center staffing
Several gaps between traditional models and compute server farms
• New constraints
• New opportunities
• New metrics
Bridge these gaps by developing new models and analysis techniques relevant to requirements of today’s server farms
Application 1 : Web server farms/cluster computing
Setup: immediate dispatch to Processor Sharing (PS) servers
Q: Good load balancing dispatchers? How many servers?
Existing work limited to First-Come-First-Served servers or Exponential job size distribution
GAP : Processor Sharing servers + high-variance job sizes
Model : Dispatching policies for M/G/K-PS
K homogeneous PS servers, Poisson arrivals, dispatching policy to be chosen (???)
• Join-Shortest-Queue (JSQ) : most popular
• Balances load
• Greedy
Q: Is JSQ optimal for general job size distribution?
Bonomi [90] : Optimal for Exponential job size distribution when job sizes unknown
Q: Analysis of JSQ for general job size distribution?
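To make the JSQ versus RANDOM comparison concrete, here is a minimal simulation sketch. It is not the simulator behind the slides: it assumes Exponential job sizes only (so each busy PS server completes jobs at total rate mu and the system is a simple Markov chain), and the function name, parameter defaults, and seed are all hypothetical choices for illustration. Mean response time is obtained from the time-average number of jobs via Little's law.

```python
import random

def simulate_mmk_ps(policy, k=4, lam=3.2, mu=1.0, events=100_000, seed=1):
    """Estimate mean response time for K PS servers with Exponential job sizes.

    policy is "jsq" (join the shortest queue, ties broken at random)
    or anything else for uniformly random dispatch.
    """
    rng = random.Random(seed)
    queues = [0] * k          # number of jobs at each server
    area = 0.0                # integral of (jobs in system) over time
    clock = 0.0
    for _ in range(events):
        busy = [i for i, n in enumerate(queues) if n > 0]
        total_rate = lam + mu * len(busy)
        dt = rng.expovariate(total_rate)      # time until the next event
        area += sum(queues) * dt
        clock += dt
        if rng.random() < lam / total_rate:   # next event is an arrival
            if policy == "jsq":
                shortest = min(queues)
                target = rng.choice([i for i, n in enumerate(queues) if n == shortest])
            else:
                target = rng.randrange(k)
            queues[target] += 1
        else:                                 # departure from a uniformly chosen busy server
            queues[rng.choice(busy)] -= 1
    mean_jobs = area / clock
    return mean_jobs / lam                    # Little's law: E[T] = E[N] / lambda

jsq_time = simulate_mmk_ps("jsq")
random_time = simulate_mmk_ps("random")
print(f"JSQ: {jsq_time:.2f}  RANDOM: {random_time:.2f}")
```

Under RANDOM dispatch each server behaves like an independent M/M/1 queue, so the RANDOM estimate can be sanity-checked against 1/(mu - lam/k). Studying general job size distributions, which is the question the slide raises, would need a simulator that tracks remaining work per job.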
[Figure: Simulation Results. Mean Response Time with PS servers under RANDOM, Round-Robin, JSQ and OPT-0 dispatching, for job size distributions Det, Exp, Bim-1, Weib-2, Bim-2, Weib-1, ordered by increasing job-size variance (same mean)]
Model : Dispatching policies for M/G/K-PS
K homogeneous PS servers, Poisson arrivals, JSQ dispatching
Conjecture: JSQ is near-optimal (even among size-aware dispatching policies)
Performance of JSQ is “nearly-insensitive” to the job size distribution
Contribution 1: The Single-Queue-Approximation [Performance’07]
• Goal : Approx. for mean response time under Exponential job sizes
• Compensate for the effect of other queues via state-dependent arrival rates
• λ(n) easier to approximate (only need to worry about λ(1), λ(2))
• < 2% error in mean response time for up to 64 servers
M/M/K-JSQ/PS ≈ M_n/M/1/PS, where λ(n) = state-dependent arrival rate
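Once the rates λ(n) are known, the single queue with state-dependent arrivals is a birth-death chain and its mean response time follows from the stationary distribution and Little's law. The sketch below shows only that last step. How the slides' approximation chooses λ(n) is not given here, so the `arrival_rate` function, the truncation level `max_jobs`, and the example rates are hypothetical.

```python
def mean_response_time(arrival_rate, mu, max_jobs=2000):
    """Mean response time of a single PS queue with Exponential job sizes.

    arrival_rate(n) is the arrival rate when n jobs are present; mu is the
    total service rate. The state space is truncated at max_jobs (assumption).
    """
    # Unnormalized stationary weights: w(n+1) = w(n) * lambda(n) / mu
    weights = [1.0]
    for n in range(max_jobs):
        weights.append(weights[-1] * arrival_rate(n) / mu)
    total = sum(weights)
    pi = [w / total for w in weights]

    mean_jobs = sum(n * p for n, p in enumerate(pi))
    # Arrivals are only admitted in states below the truncation level
    throughput = sum(arrival_rate(n) * p for n, p in enumerate(pi[:-1]))
    return mean_jobs / throughput   # Little's law

# Sanity check: a constant rate reduces to M/M/1, where E[T] = 1 / (mu - lambda)
constant = mean_response_time(lambda n: 0.5, mu=1.0)

# Hypothetical decreasing rates, mimicking a dispatcher that avoids long queues
decreasing = mean_response_time(lambda n: 0.9 if n == 0 else 0.5 / n, mu=1.0)
print(constant, decreasing)
```

With a constant rate the function reproduces the M/M/1 value, which is a convenient check before plugging in any approximated λ(n).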
Contribution 2: Many-server heavy-traffic analysis (PROPOSED)
• Goal 1: Quantify the “near-insensitive” behavior
• Goal 2: Optimal dispatching policies for heterogeneous servers
• Hard to prove anything in general, must resort to limiting regimes
• The many-server heavy-traffic scaling
• Shows the effect of job size variability
• Intuition into behavior of JSQ
Scaling: K PS servers under JSQ, λ = K - constant, K → ∞
Application 2 : Energy-Performance trade-off in Data centers/Cloud computing
Q: When to turn servers ON/OFF to adapt to demand?
Existing work assumes zero setup delays, knowledge of future demand pattern
GAP : setup penalties non-zero + unpredictable demand patterns
Model : Dynamic capacity scaling in M/M/∞ with setup delays [Performance’10]
Poisson arrivals, First-In-First-Out buffer; server states: ON, SETUP, OFF
• Contribution: New traffic-oblivious policy DELAYEDOFF
• Servers turn off after idle for t_wait
• If arrival sees all servers busy, turn a new server ON
• Most-Recently-Busy (MRB) dispatching: send job to server which idled last
• Theorem: Under DELAYEDOFF, as the load [...], the number of ON servers is concentrated around [...]
DELAYEDOFF is asymptotically optimal
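The three DELAYEDOFF rules above can be sketched as a small bookkeeping class. This is an illustration under simplifying assumptions, not the analyzed model: setup delays and the FIFO buffer are ignored (a new server is treated as available immediately), each server holds one job, and the class and method names are hypothetical.

```python
class DelayedOffFarm:
    """Tracks which servers are ON under the DELAYEDOFF rules (setup delay ignored)."""

    def __init__(self, t_wait):
        self.t_wait = t_wait
        self.idle = {}        # server id -> time at which it last became idle
        self.busy = set()
        self.next_id = 0

    def _turn_off_expired(self, now):
        # Rule 1: a server turns off after being idle for t_wait
        for sid in [s for s, since in self.idle.items() if now - since >= self.t_wait]:
            del self.idle[sid]

    def arrive(self, now):
        """Dispatch an arriving job and return the id of the server that gets it."""
        self._turn_off_expired(now)
        if self.idle:
            # Rule 3 (MRB): pick the idle server that became idle most recently
            sid = max(self.idle, key=self.idle.get)
            del self.idle[sid]
        else:
            # Rule 2: all ON servers are busy, so turn a new server ON
            sid = self.next_id
            self.next_id += 1
        self.busy.add(sid)
        return sid

    def depart(self, sid, now):
        """Record that the job at server sid finished at time now."""
        self.busy.remove(sid)
        self.idle[sid] = now

    def servers_on(self, now):
        self._turn_off_expired(now)
        return len(self.busy) + len(self.idle)

farm = DelayedOffFarm(t_wait=5.0)
first = farm.arrive(0.0)      # no servers yet: server 0 is turned ON
second = farm.arrive(1.0)     # server 0 is busy: server 1 is turned ON
farm.depart(first, 2.0)
farm.depart(second, 3.0)
reused = farm.arrive(4.0)     # MRB picks server 1, which idled last
```

Sending work to the most recently busy server keeps the remaining idle servers idle for longer, so they reach the t_wait threshold and turn off.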
Simulation Results for DELAYEDOFF
PROPOSED: Refine DELAYEDOFF, prove performance guarantees
Application 3 : Fully replicated databases
Setup: servers fed from a First-In-First-Out buffer
Q: How many servers, and what speed?
No exact analysis, approximations good for low job size variance
GAP : Job sizes have very high variance
Model : M/G/K/FCFS
Poisson arrivals, First-In-First-Out buffer
C² = squared coeff. of variation of job size dist.; typically C² > 20
The Holy Grail of queueing theory (model for many other applications), yet no exact analysis!
Lee-Longton [1959] :
Contribution 1: Inapproximability results [QUESTA’10]
• Goal: Show that no approximation based only on the first 2 moments can be accurate
• Pick a subclass of distributions
• Analytically tractable
• Large enough to fix 2 moments, but wiggle room to prove gap
[Figure: E[Delay] against increasing 3rd moment for H2 distributions in {G | 2 moments}, compared with the Lee-Longton Approximation]
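The "wiggle room" idea can be illustrated numerically: a family of distributions that all share the same first two moments but have very different third moments. The sketch below assumes that H2 means a two-phase hyperexponential (with probability p the job size is Exponential with mean a, otherwise Exponential with mean b). The parameterization by p and the function names are hypothetical, not taken from the cited paper.

```python
import math

def h2_phase_means(mean, scv, p):
    """Phase means (a, b) of an H2 with the given mean and squared coeff. of variation.

    p is the probability of the short phase and acts as the free parameter.
    Requires scv > 1 and p large enough that the short phase mean stays positive.
    """
    v = (scv - 1.0) * mean ** 2 / 2.0
    if scv <= 1.0 or not (v / (v + mean ** 2) < p < 1.0):
        raise ValueError("need scv > 1 and (scv-1)/(scv+1) < p < 1")
    a = mean - math.sqrt(v * (1.0 - p) / p)
    b = mean + math.sqrt(v * p / (1.0 - p))
    return a, b

def h2_moment(k, p, a, b):
    """k-th raw moment of the H2: k! * (p * a^k + (1 - p) * b^k)."""
    return math.factorial(k) * (p * a ** k + (1.0 - p) * b ** k)

# Two H2 distributions with mean 1 and C^2 = 20, differing only in p
moments = {}
for p in (0.95, 0.99):
    a, b = h2_phase_means(mean=1.0, scv=20.0, p=p)
    moments[p] = [h2_moment(k, p, a, b) for k in (1, 2, 3)]
    print(p, moments[p])
```

Both choices of p give first moment 1 and second moment 21, while the third moment grows as p approaches 1. Any approximation that reads only the first two moments must return the same value for every member of this family.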
Contribution 2: Tight moment-based bounds
• Goal: Better approximation using n moments?
• [QUESTA ’10] : Conjectured extremal distributions
• M/G/K/FCFS under light-traffic
• Extremality should be invariant to load
• Verify conjectures for n = 2,3
• Also for other queueing systems with no exact analysis
[Figure: E[Delay] with tight bounds over {G | n moments}; labels: “?”, “4, 5, 6…”, “proposed work”]
Application 5 : Managing VMs in the cloud
[Figure: VMs of various sizes (250MB to 2GB) packed onto servers]
Q: Which server to start VM on? What capacity servers to buy?
Existing work: Assumption of permanent items
GAP : VMs depart + VM migration possible
Contribution : Stochastic bin packing model with job departure/migration
PROPOSED: develop packing/migration schemes for efficient packing
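To show what bin packing with departures looks like in code, here is a minimal sketch. It uses plain first-fit as a stand-in heuristic, which is an assumption: the packing and migration schemes are listed as proposed work and are not specified in the slides. Migration is not modeled, and the class name, method names, and sizes (in MB) are hypothetical.

```python
class FirstFitPacker:
    """Places VMs on equal-capacity servers with first-fit; VMs may later depart."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.servers = []     # one dict per server: vm id -> size
        self.location = {}    # vm id -> index of the server hosting it

    def place(self, vm_id, size):
        """Start a VM on the first server with enough free capacity."""
        if size > self.capacity:
            raise ValueError("VM is larger than a server")
        for index, vms in enumerate(self.servers):
            if sum(vms.values()) + size <= self.capacity:
                break
        else:
            # No existing server fits: open a new one
            self.servers.append({})
            index = len(self.servers) - 1
        self.servers[index][vm_id] = size
        self.location[vm_id] = index
        return index

    def depart(self, vm_id):
        """Remove a VM, freeing its capacity for later arrivals."""
        index = self.location.pop(vm_id)
        del self.servers[index][vm_id]

    def servers_in_use(self):
        return sum(1 for vms in self.servers if vms)

packer = FirstFitPacker(capacity=2048)
packer.place("a", 1024)              # server 0
packer.place("b", 1024)              # server 0 is now full
third = packer.place("c", 512)       # opens server 1
packer.depart("a")                   # frees 1024 MB on server 0
fourth = packer.place("d", 500)      # first-fit reuses the freed space on server 0
packer.depart("c")                   # server 1 is now empty
```

Departures leave holes that later arrivals may or may not fill, which is why permanent-item analyses do not carry over directly and why migration can help consolidate partially filled servers.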
Summary (Application / GAP / Status / Proposed Work)
Application: Web server farms. GAP: No analysis for PS server farms. Status: 70% Completed, Nov ’10 - Jan ’11. Proposed Work: • Optimal dispatch policies for heterogeneous servers • Characterizing “near-insensitivity”
Application: Energy management in Data centers. GAP: Setup penalties + unpredictable demands. Status: 80% Completed, Feb - Mar ’11. Proposed Work: • Refine the proposed DELAYEDOFF policy • Performance guarantees for traffic-oblivious capacity scaling
Application: Fully replicated DBs. GAP: High variance in job sizes. Status: Completed. Proposed Work: Verifying conjectures on tight moment-based bounds beyond the scope of the thesis
Application: Database Thrashing. Status: Completed.
Application: VM management. GAP: VM migration and departures. Status: 10% Completed, Oct ’10 - Jan ’11. Proposed Work: Develop and analyze heuristics for online stochastic bin packing with item departures and migrations
Expected Graduation: MAY 2011