No Power Struggles: Coordinated multi-level power management for the data center

Ramya (UCSB), Parthasarathy et al. (HP Labs)

Overview • Power delivery, consumption and cooling problems in a data center are currently tackled by several systems, each addressing a “separate” aspect of these problems, either locally or globally, in hardware or software. • When these systems are deployed simultaneously, the policies of one tend to interfere with those of the others.

Overview… • The lack of coordination amongst such systems leads to undesirable consequences. • This paper proposes a “Global Power Management Solution” that coordinates these individual solutions.

Classifying the existing power management solutions.. • Approach used: localized/distributed resource management, VMs • Power control: voltage scaling, power states, turning off machines • Implementation scope: server/cluster/data center level • Optimization requirements and constraints: accept performance loss? allow power budget violation?

In a nutshell.. • “Tracking” problem – optimize power consumption while delivering performance. • “Capping” problem – Optimize power provisioning and cooling so as not to violate the power budget. • “Optimization” problem – maximize power saving while minimizing performance loss. (ACPIs, VMs, etc)

Representative Power Management Solutions • Efficiency Controller (EC – tracking) – optimizes per-server average power consumption. Adjusts ACPI P-states based on past resource usage to match the “estimated” future demand. • Server Manager (SM – capping) – lowers the P-state of a server when its power budget is violated.

Representative solutions.. • Enclosure Manager (EM ) – thermal power capping at blade level • Group Manager (GM ) – at rack or data center level • These two monitor power usage on sets of machines and re-provision power to maintain group power budget (determined manually or mandated by higher level power managers)

Representative solutions.. • Virtual Machine Controller (VMC) – reduces average power usage across a set of machines by consolidating workloads, turning off idle machines, etc.

Power Struggles.. What happens if these solutions are deployed simultaneously ?

Power Struggles - examples • The EC and the SM both operate on the same knob/actuator (the P-state) but for different metrics. If uncoordinated, the EC can overwrite the P-state set by the SM, leading to power budget violations and eventual thermal failover! – A correctness issue.

Examples.. • If the VMC and group cappers are uncoordinated, the VMC can consolidate more capacity onto a collection of servers than allowed by the group power budget. • In addition to excessive performance violations (inefficiency), the VMC can potentially react to the lower utilization (because of power capping) and pack even more workloads onto the server, leading to a vicious cycle and system instability

Design Challenges of a Coordination System • Interaction between different controllers (EC, SM, EM, etc) must maintain “correctness, stability and efficiency”. • Global Awareness of the “presence” of other controllers while having minimal/zero knowledge of their properties. • Adaptability and Scalability – new controllers with same/different properties, new applications, etc.

Design Challenges - Sensitivity Issues. • Overlapping functionalities and policies of controllers – can they be mitigated ? • Is the Coordinated Management System agnostic to the deployed systems and applications (workloads) ?

The Design

The Design.. • Use of feedback control loops. • Measure the required “metric”, compare with the “reference” value and manipulate the actuator based on the error so that the output follows the reference.
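The loop described above can be sketched in a few lines. This is only an illustration of the measure/compare/adjust pattern: the function names, the gain value and the toy "plant" (output is twice the actuator setting) are assumptions made for the example, not part of the paper.

```python
def feedback_step(actuator, measured, reference, gain):
    """One loop iteration: move the actuator in proportion to the error."""
    error = reference - measured
    return actuator + gain * error


def run_loop(plant, actuator, reference, gain, steps):
    """Repeatedly measure the output, compare with the reference, adjust."""
    history = []
    for _ in range(steps):
        measured = plant(actuator)          # measure the metric
        history.append(measured)
        actuator = feedback_step(actuator, measured, reference, gain)
    return actuator, history


# Hypothetical plant: the measured output is simply twice the actuator value.
final_actuator, outputs = run_loop(lambda a: 2.0 * a, actuator=0.0,
                                   reference=10.0, gain=0.25, steps=50)
```

With these made-up numbers the output settles at the reference value of 10, which is the sense in which "the output follows the reference".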

Details.. • Diagram • Efficiency Controller EC: • Reference utilization r_ref • Actual utilization r_i • If r_i < r_ref, adjust actuator A (the P-state), i.e. move to a lower-power state, say from P0 to P4, resulting in higher utilization and lower power usage.

Details.. • Diagram • Server Manager SM: • Power capping by measuring per-server power consumption • If current consumption exceeds the “power budget”, the SM increases r_ref, thereby allowing the EC to move the machine to a lower-power P-state • In effect, the EC and SM use r_ref as a communication channel.
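A toy simulation may help show how the two loops cooperate through the reference utilization (called r_ref in the code). Everything numeric here is an assumption for illustration: the linear power model (100 W idle plus 150 W at full frequency), the gains lam and beta, the clamping ranges, and the use of a continuous normalized frequency in place of discrete P-states.

```python
def clamp(value, low, high):
    return max(low, min(high, value))


def simulate(cap_w, demand, steps=200, lam=0.8, beta=0.002):
    """Toy EC + SM loop for one server; every constant is hypothetical."""
    f = 1.0       # normalized clock frequency (1.0 = P0), the EC's actuator
    r_ref = 0.5   # reference utilization, the SM's actuator
    power_w = 0.0
    for _ in range(steps):
        # EC: measure utilization, then scale frequency with the error.
        r = min(1.0, demand / f)
        f = clamp(f - lam * f * (r_ref - r) / r_ref, 0.4, 1.0)
        # Hypothetical linear power model in watts.
        power_w = 100.0 + 150.0 * f
        # SM: raise r_ref while power exceeds the cap, relax it otherwise.
        r_ref = clamp(r_ref + beta * (power_w - cap_w), 0.5, 0.95)
    return f, r_ref, power_w


capped = simulate(cap_w=200.0, demand=0.4)     # budget is binding
uncapped = simulate(cap_w=300.0, demand=0.4)   # budget is never reached
```

In this sketch the SM never touches the frequency directly. It only nudges r_ref, and the EC then lowers the frequency until power sits at the cap, so the two controllers do not fight over the same knob.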

Design.. • EM & GM: • Same principle as the SM: compare current power usage against the reference power budget and assign new budgets to the next level down (EM -> SM, GM -> EM) based on some policy (FIFO, random, etc.). • Each lower-level controller picks the “minimum of the upper-level recommendation and its own local power budget”.
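The "minimum" rule is simple enough to write down directly. The sketch below assumes budgets are plain numbers in watts keyed by a made-up server name; a server with no recommendation from above just keeps its local budget.

```python
def effective_budgets(local_budgets_w, recommendations_w):
    """Each lower-level controller takes the stricter of the two budgets."""
    return {
        name: min(local_w, recommendations_w.get(name, float("inf")))
        for name, local_w in local_budgets_w.items()
    }


# Hypothetical enclosure with three blades.
local = {"blade1": 250, "blade2": 250, "blade3": 250}
from_enclosure = {"blade1": 200, "blade2": 300}
budgets = effective_budgets(local, from_enclosure)
```

Here blade1 is tightened to 200 W by the enclosure, while blade2 and blade3 stay at their local 250 W.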

Design.. • VMCs: • Use actual utilization instead of “apparent” utilization (100% at P0 is not the same as 100% at P3). • Supplied with data about the approximate power budgets at the various levels. • Also supplied with data about current power budget violations at the various levels (through CIM). • Together, these three enable the VMC to consolidate the right workloads while making sure that the consolidated servers neither violate their power budgets nor fall into the vicious cycle mentioned earlier.
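To make the "100% at P0 is not the same as 100% at P3" point concrete, here is a minimal sketch that rescales apparent utilization to full-speed capacity. It assumes capacity scales linearly with clock frequency, and the frequency table is invented for the example.

```python
# Hypothetical P-state frequencies in GHz; not taken from the paper.
P_STATE_FREQ_GHZ = {"P0": 2.6, "P1": 2.4, "P2": 2.2, "P3": 2.0}


def actual_utilization(apparent, p_state, table=P_STATE_FREQ_GHZ):
    """Convert utilization seen at p_state into a fraction of P0 capacity."""
    return apparent * table[p_state] / table["P0"]


busy_at_p0 = actual_utilization(1.0, "P0")   # fully busy at full speed
busy_at_p3 = actual_utilization(1.0, "P3")   # fully busy, but at a slower clock
```

Under these assumptions a server that looks 100% busy at P3 is using only about 77% of what it could deliver at P0, which is the number a consolidation decision should be based on.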

Summary of changes to the controllers

Modeling the Controllers • Power – Performance Model – run actual workloads on the hardware at different utilization levels and measure power and performance. • Through curve-fitting of the measured data, obtain linear models that represent the controller behavior.

Modeling.. • EC – the actuator setting is scaled up or down with gain λ (the change is proportional to the error in utilization). • r_ref is increased by the SM when the local power budget cap_loc is violated, resulting in the EC moving the machine to a lower-power state.

Modeling.. • SM: adjusts the EC's r_ref when the server's power consumption violates cap_loc, subject to a limit determined by the β_loc factor. • EM & GM – operate on a fair-share policy: the power allocated to a component is proportional to the power it consumed in the last interval.
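The proportional policy for the EM and GM can be sketched as follows. The function name, the dictionary layout and the equal-split fallback for an all-idle group are assumptions added for the example.

```python
def proportional_share(group_budget_w, last_power_w):
    """Split a group budget in proportion to last-interval consumption."""
    if not last_power_w:
        return {}
    total = sum(last_power_w.values())
    if total == 0:
        # Assumed fallback: nothing consumed power, so split evenly.
        equal = group_budget_w / len(last_power_w)
        return {name: equal for name in last_power_w}
    return {name: group_budget_w * used / total
            for name, used in last_power_w.items()}


# Hypothetical enclosure: 600 W to share among three blades.
shares = proportional_share(600, {"a": 100, "b": 200, "c": 100})
```

With these numbers blade b, which drew half of the group's power in the last interval, receives half of the new budget.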

Modeling.. • VMCs – Constrained Optimization Problem to map n VMs to m servers (decision variable matrix X). • Include total power consumption and migration overhead (αM ) in the calculation • Consider Server capacity constraints

Modeling VMCs.. • Consider local, enclosure and group level power budget constraints • The level of consolidation is tuned by tuning the power budget buffers based on the violations at different levels.

Modeling VMCs.. • Equations 1 to 6 describe a 0-1 integer optimization problem. • The authors use a greedy bin-packing algorithm that yields an approximately optimal solution for the placement of VMs.
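As a rough illustration of greedy placement, the sketch below uses first-fit decreasing, which is one common bin-packing heuristic and not necessarily the exact variant the authors use. It checks only a server capacity limit and a single per-server power budget under an invented linear power model; migration overhead and the enclosure and group budgets are left out.

```python
def greedy_place(vm_demands, n_servers, capacity=100,
                 idle_w=100, peak_w=250, budget_w=225):
    """First-fit-decreasing placement; demands are in percent of a server."""
    load = [0] * n_servers
    placement = {}
    ordered = sorted(vm_demands.items(), key=lambda kv: kv[1], reverse=True)
    for vm, demand in ordered:
        for server in range(n_servers):
            new_load = load[server] + demand
            # Hypothetical linear power model between idle and peak watts.
            power_w = idle_w + (peak_w - idle_w) * new_load / capacity
            if new_load <= capacity and power_w <= budget_w:
                load[server] = new_load
                placement[vm] = server
                break
        else:
            raise ValueError(f"no server can host {vm}")
    return placement, load


placement, load = greedy_place({"a": 50, "b": 40, "c": 30, "d": 20}, n_servers=4)
```

In this example the four VMs fit on two servers without exceeding the 225 W budget, leaving the other two servers empty and therefore candidates for being turned off.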

Evaluation • How? • Real-time deployment in a data center, or a full-system simulation? • Impractical: both limit the use-case scenarios that can be studied to those of the actual system being tested • Use trace-driven simulation instead: real-world traces of enterprise deployments enable detailed workload modeling and evaluation of tradeoffs at the policy and system levels. -?

Metrics used • Aggregate power saving, performance loss and power budget violation at the SM, EM and GM levels. • Peak power saving is not measured. • No workload queuing, i.e. if the workload exceeds capacity, there is performance loss due to power capping; demand is not carried over.

Experimentation • 180 workload traces (databases, web servers, remote desktops, e-commerce, etc). • Create different types of mixes (real & synthetic) from this set to exercise different utilization scenarios. • SUT – A low power Blade server A and an entry level 2U server B. • Experiment with different power budgets and also study the sensitivity of this architecture by varying the time constants.

Power – Performance models for Blade A and Server B

Results Baseline: No power management

Results.. • Base Results: • Coordinated – 64% reduction in power consumption, 3% performance degradation and 5% power budget violation • Uncoordinated – 12% performance loss and 7% budget violation. • Sensitivity to different systems: • Blade A – 5 P-states over a higher power range • Server B – 6 P-states over a lower power range. • Blade A’s absolute power saving > Server B’s. • This implies that the “range of power control is more important than its granularity”.

Results.. • Variation for different workloads • At low utilization, the VMC is the major contributor to savings (assuming idle machines are “turned off”). • As utilization increases, the benefits of the VMC decrease, while the combination of EC & VMC does better (i.e. a coordinated solution is better than a single one). • If idle machines are not switched off, savings drop “significantly”!
