Energy Optimization and Stability in Green Data Centers

Tarek Abdelzaher
Dept. of Computer Science, University of Illinois at Urbana-Champaign, USA
On sabbatical at the Department of Automatic Control, Lund University, Sweden

Energy Management in Data Centers
• Total consumption: 2% of energy spent in the US (EPA estimate)
• Energy bill is 20-50% of total profit
• Energy expended on:
  • Computing (powering up racks of machines)
    • Sensors: utilization, delay, throughput, …
    • Actuators: DVS (dynamic voltage scaling), turning machines on/off
  • Cooling
    • Sensors: temperature, air flow, …
    • Actuators: air-conditioning units, fans, …

Current Status
• Increased emphasis on energy control
  • More “manipulation knobs” are introduced to manage energy and performance
• Challenge
  • Knobs may interact in unexpected ways
  • Different performance and energy management policies may interfere with one another
  • Uncoordinated interference of multiple knobs can lead to instability or poor efficiency

Energy Saving: A Tale of Two Policies
• DVS + On/Off: more energy consumption than DVS or On/Off alone!
• DVS alone
• On/Off alone
Empirical measurements from a 30-machine, 3-tier testbed of a shopping site

Three Performance Management Challenges
• Avoid the “avoidable” (bad) interactions
• Manage the “unavoidable” interactions (so they do not lead to instability)
• Troubleshoot remaining interaction problems

Response Time Control Problem in VMs
[Figure panels: VM1, VM2]
• Goal: dynamically change the CPU shares of VMs to meet a response time (RT) constraint
• CPU allocation has been a popular knob for controlling response time
• With only CPU control, the response time constraint is severely violated. Why?

Memory Utilization, Disk I/O, and CPU Consumption
[Figure panels: CPU as a function of memory utilization; number of page faults as a function of memory utilization]
• Page faults increase drastically after a certain threshold
• Significant CPU overhead after the threshold
  - The increase in CPU usage is mainly caused by extra paging activity

Response Time and Memory Utilization
• Response time increases sharply once memory utilization passes a certain threshold, say 90%
• To achieve the desired performance, we need to avoid the “bad” region

CPU and Memory Control
[Figure: block diagram of the VMM hosting VM 1 (App 1) through VM n (App n). Labels: application SLOs, application-level performance, resource usage, CPU controller, CPU allocation, CPU scheduler, memory controller, memory allocation, memory manager.]
• The CPU controller controls response time
• The memory controller makes sure that memory utilization does not go over 90%
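The two controllers on this slide can be illustrated with a minimal sketch. This is not the controller design from the talk: the integral-style CPU update, the gain, the share bounds, and the memory bounds are all assumptions chosen for illustration, and the function names are hypothetical. Only the 90% memory utilization target comes from the slide.

```python
def cpu_share_step(share, rt_measured, rt_target, gain=0.2, lo=0.05, hi=1.0):
    """One integral-style step: raise the CPU share when response time is too high.

    The gain and the [lo, hi] share bounds are illustrative assumptions.
    """
    error = (rt_measured - rt_target) / rt_target  # relative response time error
    return max(lo, min(hi, share + gain * error))


def memory_allocation(used_mb, target_util=0.90, min_mb=256.0, max_mb=8192.0):
    """Allocation that targets used/allocated == target_util, clamped to [min_mb, max_mb].

    Only the 90% target is taken from the slide; the bounds are assumed.
    """
    if not 0.0 < target_util <= 1.0:
        raise ValueError("target_util must be in (0, 1]")
    return max(min_mb, min(max_mb, used_mb / target_util))


if __name__ == "__main__":
    share = cpu_share_step(share=0.5, rt_measured=2.0, rt_target=1.0)
    alloc = memory_allocation(used_mb=900.0)
    print(f"new CPU share: {share:.2f}, memory allocation: {alloc:.0f} MB")
```

Running both updates in the same control period is one simple way to keep the VM out of the paging region while the CPU loop tracks the response time target.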

Performance of Joint Controllers with Synthetic Workload (Cont.)
[Figure panels: VM1, VM2]
• Without dynamic memory control, VMs cannot get enough memory when memory gets scarce
• The joint controller gives each VM just enough memory to stay out of the bad region
• Physical memory is utilized efficiently


DVS and On/Off Interactions in Energy Minimization
[Figure series: DVS + On/Off; DVS alone; On/Off alone]
The DVS and On/Off “knobs” must be controlled holistically, in a coordinated manner, as the solution to an optimization problem
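To make the idea of treating both knobs as one optimization problem concrete, here is a small brute-force sketch. Everything numeric in it is an assumption: the cubic dynamic-power model, the idle power, the three frequency levels, and the load expressed in units of full-speed machines are hypothetical and are not taken from the testbed. It does not reproduce the measured interference between uncoordinated policies; it only shows that a joint search over (machines on, frequency) can never do worse than fixing one knob.

```python
import math

# All constants below are illustrative assumptions, not testbed values.
FREQS = (0.6, 0.8, 1.0)  # normalized DVS frequency levels
P_IDLE = 100.0           # watts drawn by any powered-on machine
K_DYN = 100.0            # watts of dynamic power at full frequency
N_MAX = 30               # machines available


def power(n, f):
    """Total power of n machines running at normalized frequency f."""
    return n * (P_IDLE + K_DYN * f ** 3)


def feasible(n, f, load):
    """Capacity check; load is measured in full-speed machine equivalents."""
    return n * f >= load


def dvs_alone(load):
    """Keep every machine on and pick the lowest feasible frequency."""
    for f in FREQS:
        if feasible(N_MAX, f, load):
            return N_MAX, f
    raise ValueError("load exceeds cluster capacity")


def onoff_alone(load):
    """Run at full frequency and power on as few machines as possible."""
    n = max(1, math.ceil(load / FREQS[-1]))
    if n > N_MAX:
        raise ValueError("load exceeds cluster capacity")
    return n, FREQS[-1]


def joint_optimum(load):
    """Search both knobs together and return the cheapest feasible setting."""
    candidates = [(n, f) for n in range(1, N_MAX + 1)
                  for f in FREQS if feasible(n, f, load)]
    if not candidates:
        raise ValueError("load exceeds cluster capacity")
    return min(candidates, key=lambda nf: power(*nf))


if __name__ == "__main__":
    for name, policy in (("DVS alone", dvs_alone),
                         ("On/Off alone", onoff_alone),
                         ("Joint", joint_optimum)):
        n, f = policy(10.0)
        print(f"{name}: n={n}, f={f}, power={power(n, f):.1f} W")
```

Under these assumed constants the joint search settles on an intermediate frequency with a few extra machines, a setting that neither single-knob policy can reach.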

Results
[Figure series: DVS + On/Off; DVS alone; On/Off alone; Optimal]

Energy Saving: Measurements from a Machine Room
[Figure: comparison of Even, Bottom-Up, Bottom-Up + Off, and Optimal under three settings: fixed cooling set point, fixed number of machines, and holistic optimization]

Help the Admin: Administrative Cost is Skyrocketing!

Diagnostics
• In software systems, key variables in adaptive actions are correlated
  • Monitor changes in correlations to diagnose performance problems
• In mechanical systems, components are connected and correlated
  • If correlations are broken, the system may not perform as expected

Diagnostics
1. Learning phase: learn the adaptation graph by calculating correlation coefficients
2. At run-time: periodically recalculate the sign of the edges in the adaptation graph
3. Check the sign: compare the learned sign with the estimated sign
[Figure: adaptation graph over nodes AC, D, R, and U with signed (+) edges, learned vs. estimated. Block diagram labels: adaptation graph, backup policy, translate into causality assumptions, system workload, automated detection, control knob settings, detect assumption violation, performance knobs (actuators), regulation policy, target system, sensors, target performance reference, monitor the target system, system output.]
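The three steps above can be sketched as follows. This is a minimal illustration, not the tool described in the talk: the Pearson correlation, the 0.3 dead band used to decide a sign, and the function names are assumptions, and the sample data is made up. The node names Req and Util follow the labels used in the slides.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sample lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sample lists of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / math.sqrt(vx * vy)


def sign(r, threshold=0.3):
    """Map a coefficient to +1, -1, or 0; the dead band is an assumed value."""
    if r > threshold:
        return 1
    if r < -threshold:
        return -1
    return 0


def learn_signs(samples, edges):
    """Step 1: learn the sign of every edge from training samples."""
    return {(a, b): sign(correlation(samples[a], samples[b])) for a, b in edges}


def violated_edges(learned, window):
    """Steps 2 and 3: re-estimate signs on a recent window and compare."""
    current = learn_signs(window, learned.keys())
    return [edge for edge, s in learned.items() if s != 0 and current[edge] != s]


if __name__ == "__main__":
    training = {"Req": [1, 2, 3, 4, 5], "Util": [10, 20, 30, 40, 50]}
    runtime = {"Req": [5, 6, 7, 8, 9], "Util": [50, 40, 30, 20, 10]}
    learned = learn_signs(training, [("Req", "Util")])
    print("violated edges:", violated_edges(learned, runtime))
```

An edge whose learned sign is 0 is skipped during checking, since no causality assumption was established for it in the learning phase.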

Diagnostics
• Stop the component causing the sign problem
• Execute the backup action: an open-loop action
• Try several times
[Figure: adaptation graph over nodes AC, D, R, and U with signed (+) edges. Block diagram labels: backup policy, adaptation graph, translate into causality assumptions, system workload, automated detection, control knob settings, detect assumption violation, performance knobs (actuators), regulation policy, target system, sensors, target performance reference, monitor the target system, system output.]

Example
• Increased workload → switch from interrupt handling to polling → utilization drops
• Controller tries to accept more requests
  • Aggravates the situation → most new requests are dropped by the kernel
  • No prioritization enforced
• Unintended interaction between a utilization controller in a Web server and the kernel anti-livelock mechanism:
  • Admission control is based on utilization
  • It drops lower-priority requests first
[Figure: two adaptation graphs over nodes AC, Req, Util, and Pd with signed (+) edges]

Diagnostics: Example
1. Network processing is overloaded: switching from interrupt handling to polling
2. Utilization drops sharply due to the decrease in the number of interrupts
3. The admission control policy tries to accept more requests, aggravating the situation
[Figure panels: CPU utilization; number of network interrupts. Phases: 1. closed loop; 2. closed loop with violation]
The Req → Util correlation becomes broken

More on Diagnostics
• Correlations between continuous variables do not uncover problems due to sequences of discrete events
• Focus on runtime events related to performance
  • E.g., turn on machines, decrease DVS, send a packet, etc.
• Find a (cyclic) sequence of events that discriminates between “good” and “bad” performance cases
  • Data mining technique: discriminative sequence analysis

Main Idea
• Log different events during runtime
  • Most of the time the system works
  • Occasionally it performs poorly
• Generate the frequent sequences of events that occur when the system works correctly
• Generate the frequent sequences of events that occur when the system exhibits undesirable behavior
• Identify the “culprit” sequences of events that are found only in the latter case but not the former
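A heavily simplified sketch of this idea follows. It is not the discriminative sequence analysis algorithm used in the talk: it only counts contiguous event subsequences up to a fixed length, and the support threshold, the function names, and the example logs are assumptions made for illustration.

```python
from collections import Counter


def subsequences(log, max_len):
    """All distinct contiguous event subsequences of length 1..max_len in one log."""
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(log) - n + 1):
            found.add(tuple(log[i:i + n]))
    return found


def support(logs, max_len):
    """Fraction of logs in which each subsequence appears at least once."""
    if not logs:
        return {}
    counts = Counter()
    for log in logs:
        counts.update(subsequences(log, max_len))
    return {seq: c / len(logs) for seq, c in counts.items()}


def culprit_sequences(good_logs, bad_logs, max_len=2, min_support=0.6):
    """Sequences that are frequent in bad logs but not frequent in good logs."""
    good = support(good_logs, max_len)
    bad = support(bad_logs, max_len)
    return sorted(seq for seq, s in bad.items()
                  if s >= min_support and good.get(seq, 0.0) < min_support)


if __name__ == "__main__":
    # Made-up event logs for illustration only.
    good_logs = [["TurnOn", "LowerDVS"], ["LowerDVS", "TurnOn"]]
    bad_logs = [["SleepEvent", "WakeUpEvent", "SleepEvent"],
                ["TurnOn", "SleepEvent", "WakeUpEvent"]]
    for seq in culprit_sequences(good_logs, bad_logs):
        print(" -> ".join(seq))
```

A real miner would also handle gapped and cyclic patterns and rank results by how strongly they discriminate; this sketch keeps only the frequent-in-bad, infrequent-in-good filter.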

A Case Study on a “Hot” Day: Throughput of a Server Farm
[Figure annotation: low throughput]

Three Performance Control Policies
• Thermal Management Policy
  • Puts a machine to sleep if it is overheated
• Energy-Aware Load Balancer
  • Distributes load based on average CPU utilization
  • Attempts to minimize the number of machines in use
• Machine On/Off Policy
  • Turns off idle machines to save energy

Regular Operating Condition
Maximum temperature is well below 60 degrees


Anomalous Condition
Maximum temperature is above 60 degrees
Eventually, only the overheated machine remained on!


Diagnostics Output: Reported Culprit Event Sequences
• Cycle:
  • SleepEvent, WakeUpEvent
• Cycle:
  • Temp: 65 - 70, Temp: 60 - 65
Oops: utilization is computed based on a recent time average (including “sleep” time) → artificially low if the machine sleeps

What was going on?
• No matter how many tasks are assigned to the overheated machine, its utilization remains well below the threshold due to periodic sleeping
• The load balancer keeps assigning more and more tasks to the overheated machine
• The On/Off policy keeps turning off other machines
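The measurement problem behind this feedback loop can be shown with a few lines of arithmetic. The trace, the duty cycle, and the 0.7 load-balancer threshold below are made-up values for illustration; only the effect itself (averaging over sleep time makes a saturated machine look lightly loaded) comes from the slides.

```python
def utilization_including_sleep(trace):
    """Busy fraction over the whole window, counting sleep time as idle."""
    if not trace:
        raise ValueError("trace must not be empty")
    return sum(1 for state in trace if state == "busy") / len(trace)


def utilization_while_awake(trace):
    """Busy fraction over the awake part of the window only."""
    awake = [state for state in trace if state != "sleep"]
    if not awake:
        return 0.0
    return sum(1 for state in awake if state == "busy") / len(awake)


if __name__ == "__main__":
    # Hypothetical overheated machine: fully busy when awake, asleep 60% of the time.
    trace = ["busy", "busy", "sleep", "sleep", "sleep"] * 4
    threshold = 0.7  # assumed load-balancer threshold

    averaged = utilization_including_sleep(trace)
    awake_only = utilization_while_awake(trace)
    print(f"averaged over window: {averaged:.2f}, while awake: {awake_only:.2f}")
    print("load balancer sees spare capacity:", averaged < threshold)
```

With the sleep time included, the saturated machine reports a value below the threshold, so a utilization-based balancer would keep sending it work.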

Conclusions (the needs)
• Must identify the right knobs to manipulate (e.g., the example with virtual machine memory allocation)
• Must manage them in a jointly optimal manner to avoid instability or poor performance
• Must develop automated self-diagnostic techniques to reduce administrator effort

Conclusions (the tools)
• Control theory of positive systems offers interesting insights into distributing the holistic management of interacting feedback control knobs in data centers
• Advances in event-based control offer opportunities to significantly reduce actuation overhead (e.g., the number of times machines are turned on/off) without degrading performance
• Advances in discriminative sequence mining offer opportunities for improving self-diagnostic capabilities in complex systems
