
Energy Optimization and Stability in Green Data Centers

Energy Optimization and Stability in Green Data Centers. Tarek Abdelzaher Dept. of Computer Science University of Illinois at Urbana Champaign, USA On Sabbatical at the Department of Automatic Control Lund University, Sweden. Energy Management in Data Centers.


Presentation Transcript


  1. Energy Optimization and Stability in Green Data Centers Tarek Abdelzaher Dept. of Computer Science University of Illinois at Urbana Champaign, USA On Sabbatical at the Department of Automatic Control Lund University, Sweden

  2. Energy Management in Data Centers • Total consumption: 2% of energy spent in US (EPA estimate) • Energy bill is 20-50% of total profit • Energy expended on: • Computing (powering up racks of machines) • Sensors: Utilization, Delay, Throughput, … • Actuators: DVS, turning machines On/Off • Cooling • Sensors: Temperature, air flow, … • Actuators: Air-conditioning units, fans, …

  3. Current Status • Increased emphasis on energy control • More “manipulation knobs” are introduced to manage energy and performance • Challenge • Knobs may interact in unexpected ways • Different performance and energy management policies may interfere with one another • Uncoordinated interference of multiple knobs can lead to instability or poor efficiency

  4. Energy Saving: A Tale of Two Policies • DVS + On/Off: more energy consumption than DVS or On/Off alone! • DVS alone • On/Off alone Empirical measurements from a 30-machine 3-tier testbed of a shopping site

  5. Three Performance Management Challenges • Avoid the “avoidable” (bad) interactions • Manage the “unavoidable” interactions (so they do not lead to instability) • Troubleshoot remaining interaction problems

  6. Three Performance Management Challenges • Avoid the “avoidable” (bad) interactions • Manage the “unavoidable” interactions (so they do not lead to instability) • Troubleshoot remaining interaction problems

  7. Response Time Control Problem in VMs [Figure: response time plots for VM1 and VM2] • Goal: dynamically change CPU shares of VMs to meet the response time (RT) constraint • CPU has been a popular knob for controlling response time • With only CPU control, the response time constraint is severely violated. Why?

  8. Memory Utilization, Disk I/O, and CPU Consumption [Figure panels: CPU as a function of memory utilization; # of page faults as a function of memory utilization] • Page faults drastically increase after a certain threshold • Significant CPU overhead after the threshold • Increase in CPU usage mainly caused by extra paging activities

  9. Response Time and Memory Utilization • Sharp increase in response time after a certain threshold, say 90% • To achieve the desired performance, we need to avoid the “bad” region

  10. CPU and Memory Control [Figure: block diagram. Application SLOs feed a CPU Controller and a Memory Controller; these send CPU allocation and memory allocation to the CPU Scheduler and Memory Manager in the VMM; VM 1 (App 1) through VM n (App n) report resource usage and application-level performance back to the controllers] • CPU controller for controlling response time • Memory controller makes sure the memory utilization doesn’t go over 90%
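A minimal sketch of one control step for the two controllers on this slide may help make the division of labor concrete. Only the 90% memory threshold comes from the slides; the names VmState, cpu_controller, memory_controller, the proportional update rule, and the gain and bounds are assumptions for illustration, not the controllers used in the presented system.

```python
from dataclasses import dataclass

MEM_UTIL_LIMIT = 0.90  # threshold named on the slide


@dataclass
class VmState:
    cpu_share: float     # fraction of a CPU allocated to the VM (assumed unit)
    mem_alloc_mb: float  # memory allocated to the VM in MB (assumed unit)


def memory_controller(mem_used_mb, limit=MEM_UTIL_LIMIT):
    """Smallest allocation that keeps used/allocated at or below the limit."""
    return mem_used_mb / limit


def cpu_controller(cpu_share, rt_measured, rt_target, gain=0.5, lo=0.05, hi=1.0):
    """Grow the CPU share when response time exceeds the target, shrink it otherwise.

    The relative-error update and the gain are hypothetical choices.
    """
    error = (rt_measured - rt_target) / rt_target
    return min(hi, max(lo, cpu_share * (1.0 + gain * error)))


def control_step(vm, mem_used_mb, rt_measured, rt_target):
    """Run both controllers once and return the new allocations."""
    return VmState(
        cpu_share=cpu_controller(vm.cpu_share, rt_measured, rt_target),
        mem_alloc_mb=memory_controller(mem_used_mb),
    )
```

For example, a VM with a 0.4 CPU share, 950 MB in use, and a response time twice its target would be moved to a 0.6 share and about 1056 MB, which puts memory utilization back at the 90% boundary.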

  11. Performance of Joint Controllers with Synthetic Workload (Cont.) [Figure: plots for VM1 and VM2] • Without dynamic memory control, VMs cannot get enough memory when memory gets scarce • Joint controller gives just enough memory not to fall into the bad region • Efficiently utilizes physical memory

  12. Three Performance Management Challenges • Avoid the “avoidable” (bad) interactions • Manage the “unavoidable” interactions (so they do not lead to instability) • Troubleshoot remaining interaction problems

  13. DVS and On/Off Interactions in Energy Minimization DVS + On/Off DVS alone On/Off alone

  14. DVS and On/Off Interactions in Energy Minimization DVS + On/Off DVS alone On/Off alone The DVS and On-Off “knobs” must be controlled holistically in a coordinated manner as a solution to an optimization problem
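To illustrate what treating the two knobs as one optimization problem could look like, here is a small brute-force sketch that picks the number of powered-on machines and a common frequency level together. The power model (idle power plus a cubic term in normalized frequency), the constants, the capacity model (throughput proportional to frequency), and the function names are all assumptions chosen for illustration; they are not taken from the talk.

```python
from itertools import product


def machine_power(freq, p_idle=100.0, k=150.0):
    """Hypothetical per-machine power in watts at normalized frequency freq."""
    return p_idle + k * freq ** 3


def best_configuration(load, max_machines, freq_levels):
    """Search all (machines, frequency) pairs that can carry the load.

    Returns (machines, frequency, total_power) with the lowest total power,
    or None if no pair provides enough capacity. Capacity is assumed to be
    machines * frequency, in the same units as load.
    """
    best = None
    for machines, freq in product(range(1, max_machines + 1), freq_levels):
        if machines * freq < load:
            continue  # not enough capacity to serve the load
        total = machines * machine_power(freq)
        if best is None or total < best[2]:
            best = (machines, freq, total)
    return best


if __name__ == "__main__":
    print(best_configuration(4.0, 10, (0.4, 0.6, 0.8, 1.0)))
```

Under these assumed numbers the joint search settles on an intermediate point (five machines at 0.8) rather than either extreme of "all machines on at the lowest frequency" or "fewest machines at full speed", which is the kind of trade-off a per-knob policy cannot see on its own.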

  15. Results DVS + On/Off DVS alone On/Off alone Optimal

  16. Energy Saving: Measurements from a Machine Room [Figure: energy comparison of the policies Bottom-Up + Off, Bottom-Up, Even, and Optimal under three settings: fixed cooling set point, fixed number of machines, and holistic optimization]

  17. Three Performance Management Challenges • Avoid the “avoidable” (bad) interactions • Manage the “unavoidable” interactions (so they do not lead to instability) • Troubleshoot remaining interaction problems

  18. Help the Admin: Administrative Cost is Skyrocketing!

  19. Diagnostics • In software systems, key variables in adaptive actions are correlated • Monitor changes in correlations to diagnose performance problems • In mechanical systems, components are connected and correlated • If correlations are broken, the system may not perform as expected

  20. Diagnostics 1. Learning phase: learn the adaptation graph by calculating correlation coefficients 2. At run-time: periodically recalculate the sign of the edges in the adaptation graph 3. Check the sign (learned vs. estimated) [Figure: adaptation graphs over nodes AC, D, R, U with signed edges. Block diagram labels: Backup Policy, Adaptation Graph, Translate into causality assumptions, System workload, Automated detection, Control knob settings, Detect assumption violation, Performance Knobs (Actuators), Regulation Policy, Target System, Sensors, Target performance reference, Monitor the target system, System output]
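The learning phase and the run-time sign check on this slide can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the Pearson coefficient, the dead band used to ignore weak correlations, and the function and variable names are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def edge_sign(value, dead_band=0.2):
    """Map a coefficient to +1, -1, or 0 (too weak to call); dead_band is assumed."""
    if value > dead_band:
        return 1
    if value < -dead_band:
        return -1
    return 0


def learn_graph(samples, edges):
    """Learning phase: record the sign of the correlation on every edge."""
    return {(a, b): edge_sign(pearson(samples[a], samples[b])) for a, b in edges}


def violated_edges(learned, window):
    """Run time: re-estimate signs on a recent window and report edges that changed."""
    current = learn_graph(window, learned.keys())
    return [e for e, s in learned.items() if s != 0 and current[e] != s]
```

A caller would run learn_graph once on traces from normal operation, then call violated_edges periodically on a sliding window; a non-empty result is the signal to fall back to the backup policy.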

  21. Diagnostics • Stop the component causing the sign problem • Execute backup action: open loop action • Try several times [Figure: adaptation graph over nodes AC, D, R, U with signed edges. Block diagram labels: Backup Policy, Adaptation Graph, Translate into causality assumptions, System workload, Automated detection, Control knob settings, Detect assumption violation, Performance Knobs (Actuators), Regulation Policy, Target System, Sensors, Target performance reference, Monitor the target system, System output]

  22. Example • Increased workload switches interrupt handling to polling → utilization drops • Controller tries to accept more requests, aggravating the situation → most new requests are dropped by the kernel • No prioritization enforced • Unintended interaction between a utilization controller in a Web server and the kernel anti-livelock mechanism: • Admission control based on utilization • It drops lower-priority requests first [Figure: two adaptation graphs over nodes AC, Util, Pd, Req with signed edges]

  23. Diagnostics Example 1. Network processing is overloaded: switching from interrupt handling to polling 2. Utilization sharply drops due to decrease in the number of interrupts 3. Admission control policy tries to accept more requests, aggravating the situation [Figure: CPU utilization and # of network interrupts over time; phases: 1. Closed loop, 2. Closed loop - violation] The Req → Util correlation becomes broken

  24. More on Diagnostics • Correlations between continuous variables do not uncover problems due to sequences of discrete events • Focus on runtime events related to performance • E.g., turn on machines, decrease DVS, send a packet, etc. • Find a (cyclic) sequence of events that discriminates “good” and “bad” performance cases • Data mining technique: discriminative sequence analysis

  25. Main Idea • Log different events during runtime • Most of the time the system works • Occasionally it performs poorly • Generate the frequent sequences of events that occur when the system works correctly • Generate the frequent sequences of events that occur when the system exhibits undesirable behavior • Identify the “culprit” sequences of events that are found only in the latter case but not the former.
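The steps above can be sketched with a deliberately simple miner. It stands in for a real discriminative sequence mining algorithm: it only considers contiguous event n-grams, and the support threshold, function names, and sample logs are assumptions for illustration.

```python
from collections import Counter


def ngrams(log, n):
    """Set of contiguous length-n event sequences that appear in one log."""
    return {tuple(log[i:i + n]) for i in range(len(log) - n + 1)}


def frequent(logs, n, min_support):
    """Sequences that appear in at least min_support (a fraction) of the logs."""
    counts = Counter()
    for log in logs:
        counts.update(ngrams(log, n))  # each log counts a sequence at most once
    return {seq for seq, c in counts.items() if c / len(logs) >= min_support}


def culprits(good_logs, bad_logs, n=2, min_support=0.5):
    """Sequences frequent when the system misbehaves but not when it works."""
    return frequent(bad_logs, n, min_support) - frequent(good_logs, n, min_support)


if __name__ == "__main__":
    # Hypothetical event logs; the event names are placeholders.
    good = [["LoadUp", "WakeUpEvent", "LoadDown", "SleepEvent"],
            ["LoadUp", "WakeUpEvent"]]
    bad = [["SleepEvent", "WakeUpEvent", "SleepEvent", "WakeUpEvent"],
           ["LoadUp", "SleepEvent", "WakeUpEvent"]]
    print(culprits(good, bad, n=2, min_support=1.0))
```

Restricting the search to contiguous n-grams keeps the sketch short; a practical miner would also allow gaps between events and vary the sequence length.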

  26. A Case Study on a “Hot” Day: Throughput of a Server Farm Low Throughput

  27. Three Performance Control Policies • Thermal Management Policy • Puts machine to sleep if machine is overheated • Energy Aware Load Balancer • Distributes load based on average CPU utilization • Attempts to minimize the number of machines in use • Machine On/Off Policy • Turns off idle machines to save energy

  28. Regular Operating Condition Maximum temperature is well below 60 degrees

  29. Anomalous Condition Maximum temperature is above 60 degrees

  30. Anomalous Condition Maximum temperature is above 60 degrees Eventually, only the overheated machine remained on!

  31. Diagnostics Output: Reported Culprit Event Sequences • Cycle: • SleepEvent, WakeUpEvent • Cycle: • Temp: 65 - 70, Temp: 60 - 65

  32. Diagnostics Output: Reported Culprit Event Sequences • Cycle: • SleepEvent, WakeUpEvent • Cycle: • Temp: 65 - 70, Temp: 60 - 65 Oops: Utilization is computed based on a recent time average (including “sleep” time) → artificially low if the machine sleeps

  33. What was going on? • No matter how many tasks are assigned to the overheated machine, its utilization remains well below the threshold due to periodic sleeping • Load balancer keeps assigning more and more tasks to the overheated machine • On/Off policy keeps turning off other machines
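A small numeric sketch shows why averaging over sleep time hides the overload. The function name and the numbers (18 s busy, 20 s awake, 40 s asleep) are invented for illustration and are not measurements from the case study.

```python
def windowed_utilization(busy_s, awake_s, asleep_s, include_sleep=True):
    """CPU utilization over a time window.

    With include_sleep=True the sleep time is counted in the denominator,
    which is the averaging described in the diagnosis. With
    include_sleep=False only the time the machine was awake is counted.
    """
    window = awake_s + asleep_s if include_sleep else awake_s
    return busy_s / window


if __name__ == "__main__":
    # Hypothetical window: busy 18 s of the 20 s awake, asleep for 40 s.
    print(windowed_utilization(18.0, 20.0, 40.0))                       # 0.3
    print(windowed_utilization(18.0, 20.0, 40.0, include_sleep=False))  # 0.9
```

With these numbers the machine looks 30% utilized to a policy that averages over the whole window, although it is 90% busy whenever it is awake, so a utilization-based load balancer would keep sending it work.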

  34. Conclusions (the needs) • Must identify the right knobs to manipulate (e.g., the example with virtual machine memory allocation) • Must manage them in a jointly optimal manner to avoid instability or poor performance • Must develop automated self-diagnostic techniques to reduce administrator effort

  35. Conclusions (the tools) • Control theory of positive systems offers interesting insights into distributing the holistic management of interacting feedback control knobs in data centers • Advances in event-based control offer opportunities to significantly reduce actuation overhead (e.g., the number of times machines are turned on/off) without degrading performance • Advances in discriminative sequence mining offer opportunities for improving self-diagnostic capabilities in complex systems
