
Stochastic Models and Analysis for Resource Management in Server Farms

Stochastic Models and Analysis for Resource Management in Server Farms. Thesis Proposal, VARUN GUPTA. SMART : Stochastic Models and Analysis for Resource managemenT in server farms.


Presentation Transcript


  1. Stochastic Models and Analysis for Resource Management in Server Farms Thesis Proposal VARUN GUPTA

  2. SMART : Stochastic Models and Analysis for Resource managemenT in server farms Thesis Proposal VARUN GUPTA

  3. SARCASM : Stochastic models and Analysis for ResourCe mAnagement in Server farMs Thesis Proposal VARUN GUPTA

  4. Server farm popularity is on the rise. Examples: supercomputers, data center pods, cloud computing, Array-of-Wimpy-Nodes, multi-core chips. Advantages: + high compute capacity + incremental growth + fault-tolerance + efficient resource utilization + energy efficiency + high parallelism

  5. A simple server farm “template”: frontend dispatcher/router and backend servers. Design Choice 1: Which server to send a job to? Design Choice 2: Scheduling policy for backend servers? Design Choice 3: When to turn servers on/off for efficiency? Design Choice 4: How many servers to buy? Of what capacity?

  6. A simple server farm “template”: frontend dispatcher/router and backend servers. Design Choice 1: Dispatching policy. Design Choice 2: Scheduling policy. Design Choice 3: Dynamic capacity scaling. Design Choice 4: Provisioning.

  7. Thesis Goal: Stochastic modeling and analysis to answer the questions faced by server farm designers/managers.
     Long history of stochastic modeling and analysis:
     • Erlang (1909): Operator provisioning in telephone exchanges
     • Inventory/production management
     • Call center staffing
     Several gaps between traditional models and compute server farms:
     • New constraints
     • New opportunities
     • New metrics
     Bridge these gaps by developing new models and analysis techniques relevant to the requirements of today’s server farms.

  8. A summary of explored gaps

  9. A glimpse of results

  10. A summary of explored gaps

  11. Application 1 : Web server farms/cluster computing Immediate dispatch

  12. Application 1 : Web server farms/cluster computing. Immediate dispatch to Processor Sharing (PS) servers. Q: Good load balancing dispatchers? How many servers? Existing work is limited to First-Come-First-Served servers or Exponential job size distributions. GAP : Processor Sharing servers + high-variance job sizes

  13. Model : Dispatching policies for M/G/K-PS. Poisson arrivals, K homogeneous PS servers, dispatching policy to be chosen (???).
     • Join-Shortest-Queue (JSQ) : most popular
     • Balances load
     • Greedy
     Q: Is JSQ optimal for general job size distribution? Bonomi [90] : Optimal for Exponential job size distribution when job sizes unknown.
     Q: Analysis of JSQ for general job size distribution?

  14. Simulation Results. [Figure: Mean Response Time for job size distributions Det, Exp, Bim-1, Weib-2, Bim-2, Weib-1, in order of increasing job-size variance (same mean). Policy shown: RANDOM.]

  15. Simulation Results. [Same figure. Policies shown: RANDOM, JSQ.]

  16. Simulation Results. [Same figure. Policies shown: RANDOM, Round-Robin, JSQ.]

  17. Simulation Results. [Same figure. Policies shown: RANDOM, Round-Robin, JSQ, OPT-0.]
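The comparison in slides 14 to 17 can be explored with a small event-driven simulation. The following is a minimal sketch, not the simulator behind the slides: the function name `simulate_ps_farm`, the parameter defaults (4 servers, load 0.8), and the bimodal mix are all assumptions chosen for illustration, and OPT-0 is not implemented because the transcript does not define it.

```python
import random

def det_size(rng):
    return 1.0

def exp_size(rng):
    return rng.expovariate(1.0)

def bimodal_size(rng):
    # Hypothetical high-variance mix with mean 0.99*0.5 + 0.01*50.5 = 1.
    return 0.5 if rng.random() < 0.99 else 50.5

def simulate_ps_farm(policy, sample_size, k=4, load=0.8, n_jobs=20000, seed=1):
    """Estimate mean response time for k unit-speed PS servers."""
    rng = random.Random(seed)
    lam = load * k                      # mean job size is 1
    servers = [[] for _ in range(k)]    # job = [remaining_work, arrival_time]
    now, next_arrival, rr_next = 0.0, rng.expovariate(lam), 0
    completed, total_response = 0, 0.0
    while completed < n_jobs:
        # Earliest completion: under PS, a job with remaining work r on a
        # server holding n jobs finishes after r * n time units.
        best_dt, best_srv, best_idx = float("inf"), -1, -1
        for i, jobs in enumerate(servers):
            if jobs:
                idx = min(range(len(jobs)), key=lambda m: jobs[m][0])
                if jobs[idx][0] * len(jobs) < best_dt:
                    best_dt, best_srv, best_idx = jobs[idx][0] * len(jobs), i, idx
        arrival_dt = next_arrival - now
        dt = max(0.0, min(best_dt, arrival_dt))
        for jobs in servers:            # advance every job by its PS share
            for job in jobs:
                job[0] -= dt / len(jobs)
        now += dt
        if best_dt < arrival_dt:        # completion event
            job = servers[best_srv].pop(best_idx)
            completed += 1
            total_response += now - job[1]
        else:                           # arrival event: dispatch immediately
            if policy == "random":
                target = rng.randrange(k)
            elif policy == "round_robin":
                target, rr_next = rr_next, (rr_next + 1) % k
            elif policy == "jsq":
                target = min(range(k), key=lambda i: len(servers[i]))
            else:
                raise ValueError(policy)
            servers[target].append([sample_size(rng), now])
            next_arrival = now + rng.expovariate(lam)
    return total_response / completed

if __name__ == "__main__":
    for name in ("random", "round_robin", "jsq"):
        print(name, round(simulate_ps_farm(name, exp_size, n_jobs=5000), 2))
```

Swapping `exp_size` for `det_size` or `bimodal_size` changes the job-size variance while keeping the mean at 1, which is the axis the slides vary. The estimate includes the start-up transient, so longer runs are needed for accurate numbers.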

  18. Model : Dispatching policies for M/G/K-PS. Poisson arrivals, K homogeneous PS servers, JSQ dispatching. Conjecture: JSQ is near-optimal (even among size-aware dispatching policies). Performance of JSQ is “nearly-insensitive” to the job size distribution.

  19. Contribution 1: The Single-Queue-Approximation [Performance’07]
     • Goal : Approx. for mean response time under Exponential job sizes
     • Compensate for the effect of other queues via state-dependent arrival rates
     • λ(n) easier to approximate (only need to worry about λ(1), λ(2))
     • < 2% error in mean response time for up to 64 servers
     M/M/K-JSQ/PS ≈ M_n/M/1/PS, where λ(n) = state-dependent arrival rate
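Once the state-dependent rates λ(n) are fixed, the single queue M_n/M/1/PS is a birth-death chain, and its mean response time follows from the stationary distribution and Little's law. The sketch below assumes that setup; the function name `sqa_mean_response_time`, the truncation level, and the example rates are hypothetical and are not the fitted λ(n) from the cited paper.

```python
def sqa_mean_response_time(arrival_rate, mu=1.0, max_n=2000):
    """Mean response time of a single queue with arrival rate arrival_rate(n)
    when n jobs are present and total service rate mu (PS or FCFS, exponential
    job sizes). The state space is truncated at max_n jobs."""
    # Detailed balance: pi(n + 1) * mu = pi(n) * arrival_rate(n).
    weights = [1.0]
    for n in range(max_n):
        weights.append(weights[-1] * arrival_rate(n) / mu)
    norm = sum(weights)
    pi = [w / norm for w in weights]
    mean_jobs = sum(n * p for n, p in enumerate(pi))
    throughput = sum(arrival_rate(n) * p for n, p in enumerate(pi))
    return mean_jobs / throughput       # Little's law

if __name__ == "__main__":
    # Illustrative rates only: arrivals slow down as this queue gets longer,
    # mimicking how JSQ steers jobs away from long queues.
    example = {0: 0.95, 1: 0.85, 2: 0.6}
    print(sqa_mean_response_time(lambda n: example.get(n, 0.3)))
```

With a constant rate, `arrival_rate = lambda n: 0.8`, the function reduces to the M/M/1 value 1/(μ − λ), which is a quick sanity check on the recursion.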

  20. Contribution 2: Many-server heavy-traffic analysis (PROPOSED)
     • Goal 1: Quantify the “near-insensitive” behavior
     • Goal 2: Optimal dispatching policies for heterogeneous servers
     • Hard to prove anything in general, must resort to limiting regimes
     • The many-server heavy-traffic scaling
     • Shows the effect of job size variability
     • Intuition into behavior of JSQ
     Scaling: λ = K - constant, K → ∞ (JSQ dispatching to K PS servers)

  21. A summary of explored gaps

  22. Application 2 : Energy-Performance trade-off in Data centers/Cloud computing. Q: When to turn servers ON/OFF to adapt to demand? Existing work assumes zero setup delays and knowledge of future demand patterns. GAP : non-zero setup penalties + unpredictable demand patterns

  23. Model : Dynamic capacity scaling in M/M/∞ with setup delays [Performance’10]. Poisson arrivals, First-In-First-Out buffer; server states: ON, SETUP, OFF. DELAYEDOFF is asymptotically optimal.
     • Contribution: New traffic-oblivious policy DELAYEDOFF
     • Servers turn off after being idle for t_wait
     • If an arrival sees all servers busy, turn a new server ON
     • Most-Recently-Busy (MRB) dispatching: send job to the server which idled last
     • Theorem: Under DELAYEDOFF, as the load [limit not captured in transcript], the number of ON servers is concentrated around [expression not captured in transcript]
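The three DELAYEDOFF rules above can be turned into a small discrete-event simulation. This is a sketch under stated assumptions rather than the model from the cited paper: setup time is taken as a deterministic `t_setup`, a job that finds no idle server waits in the FIFO buffer and triggers one setup, and a server that finishes setup with an empty buffer goes idle. The function name and all default values are hypothetical.

```python
import heapq
import random
from collections import deque

def simulate_delayedoff(lam, mu=1.0, t_wait=5.0, t_setup=1.0, horizon=500.0, seed=0):
    """Return (time-average number of ON servers, mean response time)."""
    rng = random.Random(seed)
    events, seq = [(rng.expovariate(lam), 0, "arrival", None)], 1
    idle = deque()       # idle-start times of idle servers, oldest on the left
    waiting = deque()    # arrival times of jobs in the FIFO buffer
    busy, now, on_area, responses = 0, 0.0, 0.0, []

    def schedule(delay, kind, payload=None):
        nonlocal seq
        heapq.heappush(events, (now + delay, seq, kind, payload))
        seq += 1

    def start_service(arrived_at):
        schedule(rng.expovariate(mu), "departure", arrived_at)

    def go_idle():
        idle.append(now)
        schedule(t_wait, "off", now)    # stale if the server is reused first

    while True:
        t, _, kind, payload = heapq.heappop(events)
        if t > horizon:
            break
        on_area += (busy + len(idle)) * (t - now)   # ON = busy or idle
        now = t
        if kind == "arrival":
            schedule(rng.expovariate(lam), "arrival")
            if idle:
                idle.pop()              # MRB: the server that idled last
                busy += 1
                start_service(now)
            else:
                waiting.append(now)
                schedule(t_setup, "setup_done")
        elif kind == "departure":
            responses.append(now - payload)
            if waiting:
                start_service(waiting.popleft())
            else:
                busy -= 1
                go_idle()
        elif kind == "setup_done":
            if waiting:
                busy += 1
                start_service(waiting.popleft())
            else:
                go_idle()
        elif kind == "off" and idle and idle[0] == payload:
            idle.popleft()              # idle for t_wait: turn OFF
    return on_area / now, sum(responses) / len(responses)

if __name__ == "__main__":
    print(simulate_delayedoff(lam=50.0))
```

Because MRB always reuses the most recently idled server, the idle servers behave like a stack, so a `deque` with pops from the right (dispatch) and the left (timeout) is enough and individual server identities are not needed.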

  24. Simulation Results for DELAYEDOFF. PROPOSED: Refine DELAYEDOFF, prove performance guarantees

  25. A summary of explored gaps

  26. Application 3 : Fully replicated databases

  27. Application 3 : Fully replicated databases. First-In-First-Out buffer. Q: How many servers, and what speed? No exact analysis; existing approximations are good only for low job size variance. GAP : Job sizes have very high variance

  28. Application 3 : Fully replicated databases Poisson arrivals First-In-First-Out buffer

  29. Model : M/G/K/FCFS. Poisson arrivals, First-In-First-Out buffer. C² = squared coeff. of variation of job size dist.; typically C² > 20. The Holy Grail of queueing theory (model for many other applications), yet no exact analysis! Lee-Longton [1959] : [approximation formula not captured in transcript]
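The formula on this slide did not survive extraction. As commonly stated in the queueing literature, the Lee-Longton two-moment approximation scales the M/M/K mean waiting time by (C² + 1)/2, that is, E[W^{M/G/K}] ≈ ((C² + 1)/2) · E[W^{M/M/K}]. The sketch below assumes that form and computes E[W^{M/M/K}] from the Erlang-C formula; the function names are hypothetical.

```python
from math import factorial

def erlang_c_wait(lam, mean_size, k):
    """Mean waiting time (excluding service) in M/M/K/FCFS."""
    a = lam * mean_size                 # offered load
    rho = a / k                         # per-server utilization
    if rho >= 1.0:
        raise ValueError("unstable: utilization must be below 1")
    tail = a ** k / (factorial(k) * (1.0 - rho))
    p_wait = tail / (sum(a ** n / factorial(n) for n in range(k)) + tail)
    return p_wait * mean_size / (k * (1.0 - rho))

def lee_longton_wait(lam, mean_size, scv, k):
    """Two-moment approximation: scale the M/M/K wait by (C^2 + 1) / 2."""
    return (scv + 1.0) / 2.0 * erlang_c_wait(lam, mean_size, k)

if __name__ == "__main__":
    # Same load, exponential job sizes (C^2 = 1) versus C^2 = 20.
    print(lee_longton_wait(lam=8.0, mean_size=1.0, scv=1.0, k=10))
    print(lee_longton_wait(lam=8.0, mean_size=1.0, scv=20.0, k=10))
```

For K = 1 the approximation coincides with the Pollaczek-Khinchine formula, so it is exact there; the slides that follow concern how far off it can be for K > 1 when only two moments are known.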

  30. Contribution 1: Inapproximability results [QUESTA’10]
     • Goal: Show that no approximation based only on the first 2 moments can be accurate
     • Pick a subclass of distributions
     • Analytically tractable
     • Large enough to fix 2 moments, but wiggle room to prove gap
     [Figure: E[Delay] over the class {G | 2 moments}, compared with the Lee-Longton Approximation; H2 distributions with increasing 3rd moment →]

  31. Contribution 2: Tight moment-based bounds
     • Goal: Better approximation using n moments?
     • [QUESTA ’10] : Conjectured extremal distributions
     • M/G/K/FCFS under light-traffic
     • Extremality should be invariant to load
     • Verify conjectures for n = 2, 3; n = 4, 5, 6, … is proposed work
     • Also for other queueing systems with no exact analysis
     [Figure: E[Delay] over the class {G | n moments}, with tight bounds given n moments]

  32. A summary of explored gaps

  33. Application 5 : Managing VMs in the cloud. [Figure: VMs of sizes 500MB, 1GB, 1GB, 2GB, 1.5GB, 1GB]

  34. Application 5 : Managing VMs in the cloud. [Figure: VMs of sizes 500M, 1G, 500M, 250M] Q: Which server to start VM on? What capacity servers to buy? Existing work: assumption of permanent items. GAP : VMs depart + VM migration possible. Contribution : Stochastic bin packing model with job departure/migration. PROPOSED: develop packing/migration schemes for efficient packing
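To make the bin packing view concrete, the sketch below places VMs on servers by memory size and lets them depart. It uses plain first-fit, a standard textbook heuristic swapped in for illustration; it is not the packing or migration scheme proposed in the thesis, migration is not modeled, and the class and method names are hypothetical.

```python
class Cluster:
    """Servers of equal memory capacity hosting VMs that arrive and depart."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.servers = []               # one dict per server: vm_id -> size_mb

    def place(self, vm_id, size_mb):
        """First-fit: use the first server with room, else open a new one."""
        if size_mb > self.capacity_mb:
            raise ValueError("VM larger than a whole server")
        for index, vms in enumerate(self.servers):
            if sum(vms.values()) + size_mb <= self.capacity_mb:
                vms[vm_id] = size_mb
                return index
        self.servers.append({vm_id: size_mb})
        return len(self.servers) - 1

    def depart(self, vm_id):
        """Remove a VM; the freed space can be reused by later arrivals."""
        for vms in self.servers:
            if vms.pop(vm_id, None) is not None:
                return

    def active_servers(self):
        return sum(1 for vms in self.servers if vms)

if __name__ == "__main__":
    cluster = Cluster(capacity_mb=2048)
    for vm_id, size in [("a", 500), ("b", 1024), ("c", 1024), ("d", 2048)]:
        print(vm_id, "-> server", cluster.place(vm_id, size))
    cluster.depart("b")
    print("active servers:", cluster.active_servers())
```

Departures leave holes that first-fit only fills when a later arrival happens to fit, which is the inefficiency that migration schemes are meant to address.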

  35. Time line for proposed work

  36. Columns: Application; GAP; Status; Proposed Work
     • Web server farms. GAP: No analysis for PS server farms. Status: 70% Completed (Nov ’10 - Jan ’11). Proposed Work: Optimal dispatch policies for heterogeneous servers; characterizing “near-insensitivity”.
     • Energy management in Data centers. GAP: Setup penalties + unpredictable demands. Status: 80% Completed (Feb - Mar ’11). Proposed Work: Refine the proposed DELAYEDOFF policy; performance guarantees for traffic-oblivious capacity scaling.
     • Fully replicated DBs. GAP: High variance in job sizes. Status: Completed. Proposed Work: Verifying conjectures on tight moment-based bounds beyond the scope of the thesis.
     • Database Thrashing. Status: Completed.
     • VM management. GAP: VM migration and departures. Status: 10% Completed (Oct ’10 - Jan ’11). Proposed Work: Develop and analyze heuristics for online stochastic bin packing with item departures and migrations.
     Expected Graduation: MAY 2011

  37. References Other Work
