Beyond DVFS: A First Look at Performance Under a Hardware-Enforced Power Bound. HPPAC 2012. Barry Rountree, Dong H. Ahn, Bronis R. de Supinski, David K. Lowenthal, Martin Schulz. Monday, May 21st. Computing under a power bound forces us to rethink performance. Traditional
Beyond DVFS: A First Look at Performance Under a Hardware-Enforced Power Bound
HPPAC 2012
Barry Rountree, Dong H. Ahn, Bronis R. de Supinski, David K. Lowenthal, Martin Schulz
Monday, May 21st
Computing under a power bound forces us to rethink performance
• Traditional
  • All components can operate at highest power level simultaneously
  • Power provisioned for “worst case”
  • Users are happily oblivious (about power)
  • Few if any applications limited by power
• Exascale (if not sooner)
  • Not all components can operate at highest power level simultaneously
  • Power provisioning is best effort
  • Users must tune power for performance
  • Nearly every application limited by power
Computing under a power bound forces us to rethink performance
• Traditional
  • Utilization measured in node-hours
  • Weak-scaling jobs perform best using as many nodes as possible
  • Running all components as fast as possible reliably leads to top performance
• Exascale (if not sooner)
  • Utilization measured in kilowatt-hours
  • Weak-scaling jobs may perform optimally with fewer, faster nodes
  • Running all components as fast as possible cannot be done
  • Running most components at identical speeds is suboptimal
An Unexpected Power Bound: Merlot cluster at LLNL
[Figure: Linpack + Intel Turbo Boost. Axes: Power (Watts), GHz from non-turbo (2.6 GHz) to max turbo (3.3 GHz), Processors. Average processor power bound levels marked for exascale (?), rzmerl (Early April), and rzmerl (Mid April). Region above the bound labeled "Lost performance".]
• Average processor power bound: the sum of processor power draw divided by processor count must be at or below this level
• Each processor uses some amount of power
• Total processor power divided by processor count should be less than the bound
• Short-term solution: Disable Turbo Boost globally
• Mid-term solution: Buy more power (this does not scale)
• Long-term solution: Schedule power to optimize performance
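The average-bound rule on this slide can be sketched in a few lines of Python. The function names and the wattage readings below are hypothetical and chosen only for illustration; they are not measurements from the Merlot cluster.

```python
def within_average_bound(processor_watts, bound_watts):
    """True if total processor power / processor count is at or below the bound."""
    if not processor_watts:
        raise ValueError("need at least one processor reading")
    return sum(processor_watts) / len(processor_watts) <= bound_watts


def headroom_watts(processor_watts, bound_watts):
    """Total watts still available under the bound (negative when over it)."""
    return bound_watts * len(processor_watts) - sum(processor_watts)


# Hypothetical per-processor readings in watts (not data from the slides).
readings = [78.0, 82.5, 90.0, 85.5]
print(within_average_bound(readings, 80.0))  # average is 84.0 W, so False
print(headroom_watts(readings, 80.0))        # 320.0 - 336.0 = -16.0
```

Because the bound applies to the average, one processor may exceed it as long as others draw less, which is what makes scheduling power across processors possible at all.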
Scheduling Power with Processor Hardware: Intel’s RAPL
• Running Average Power Limit (RAPL)
• Measures cumulative joules (power × time)
• Three separate power meters
• Clamping on package and DRAM power
• Turbo suppression
• Effective frequency
• libmsr currently under development
Domains and Features of Running Average Power Limit Technology
• Introduced on Sandy Bridge processors
• Onboard energy meters measure accumulated joules; divide by time to get average power
• Can place a user-specified limit on average power over a user-specified time window
Source: Intel 64 and IA-32 Software Developer’s Manual, Volume 3B
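A minimal sketch of the "divide accumulated joules by time" step, in Python. It assumes a 32-bit wrapping energy counter and an energy unit of 1/2^16 J; both are assumptions drawn from a reading of the Intel manual, and on real hardware the unit should be read from the processor instead of hard-coded. The function names and the raw counter values are hypothetical, and nothing here touches actual MSRs.

```python
ENERGY_COUNTER_BITS = 32  # assumption: the energy status counter wraps at 2**32


def energy_unit_joules(energy_status_units):
    """Joules per counter tick, assuming the unit is 1 / 2**ESU joules."""
    return 1.0 / (1 << energy_status_units)


def average_watts(raw_start, raw_end, elapsed_seconds, energy_status_units=16):
    """Average power between two raw counter samples.

    The modulo handles a single counter wraparound between the samples;
    sampling too rarely (more than one wrap) would go undetected.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta_ticks = (raw_end - raw_start) % (1 << ENERGY_COUNTER_BITS)
    return delta_ticks * energy_unit_joules(energy_status_units) / elapsed_seconds


# Hypothetical samples: 51 joules accumulated over one second.
print(average_watts(1000, 1000 + 51 * 65536, 1.0))  # 51.0

# Hypothetical samples that straddle a wraparound: 10 joules over half a second.
print(average_watts(2**32 - 65536, 9 * 65536, 0.5))  # 20.0
```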
Bounding Package Power with RAPL
• Setting LOCK fixes power limits until reboot
• Two windows allow tweaking peak and average power
  • Higher bound, smaller window for peak power
  • Lower bound, wider window for average power
• Limits are ignored until enable bits are set
• Power limit is enforced using average watts over a user-specified window
• Resolution: ~1 ms
• Max window: ~46 ms
• Watts granularity: 0.125 W
• Minimum power bound: 51 W
Source: Intel 64 and IA-32 Software Developer’s Manual, Volume 3B
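As a rough illustration of how a package power bound might be encoded, the Python sketch below converts watts to the 0.125 W granularity quoted on the slide and sets enable, clamp, and lock flags. The bit positions (limit in bits 14:0, enable at bit 15, clamp at bit 16, lock at bit 63) are an assumption based on a reading of the Intel manual and should be checked against it before use. The time-window field is deliberately left at zero, and the function names are hypothetical.

```python
import math

POWER_UNIT_WATTS = 0.125  # granularity quoted on the slide
MIN_BOUND_WATTS = 51.0    # minimum power bound quoted on the slide

# Assumed bit positions; verify against the manual for the target processor.
ENABLE_BIT = 15
CLAMP_BIT = 16
LOCK_BIT = 63
LIMIT_FIELD_BITS = 15


def watts_to_raw(watts, unit=POWER_UNIT_WATTS):
    """Round down so the encoded limit never exceeds the requested bound."""
    return math.floor(watts / unit)


def pack_limit(watts, enable=True, clamp=True, lock=False):
    """Build a 64-bit register value for the first power limit only."""
    if watts < MIN_BOUND_WATTS:
        raise ValueError("bound is below the minimum quoted on the slide")
    raw = watts_to_raw(watts)
    if raw >= (1 << LIMIT_FIELD_BITS):
        raise ValueError("bound does not fit in the limit field")
    value = raw
    value |= int(enable) << ENABLE_BIT
    value |= int(clamp) << CLAMP_BIT
    value |= int(lock) << LOCK_BIT  # locked limits stay fixed until reboot
    return value


print(hex(pack_limit(51.0)))  # 0x18198: 408 units, enable and clamp set
```

Leaving `lock` off by default matches the slide's warning: once LOCK is set, the limits cannot be changed again until the machine reboots.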
Bounding DRAM Power with RAPL
• Similar interface for DRAM power control
• Only one power limit supported
Source: Intel 64 and IA-32 Software Developer’s Manual, Volume 3B
Processors are Heterogeneous Under a Power Bound
[Figure: rzzin, mg.C.8, 64 processors, 34 power bounds]
• No power bound
  • Processors take similar time
  • Significant variation in power
  • Power variation expected and acceptable
• 51 W power bound
  • Processors require same amount of power
  • Individual processor efficiency has not changed
  • Efficiency variation manifests as performance variation
• Processors are heterogeneous under a power bound
  • Where should the hot processors go?
  • Is it worth paying a premium for efficient processors?
Wide Variation in Application Package Power Draw
[Figure: rzmerl, NPB C.8, 234 processors. Y-axis: Average Watts. X-axis: Processors ordered by cg.C.8 average PKG power.]
• Wide variation in power consumption across applications
• Provisioning power for the most power-hungry application leaves the remaining applications node-bound, not power-bound
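The node-bound versus power-bound distinction can be made concrete with a small Python sketch. The processor count of 234 comes from the slide; the power budgets and per-processor wattages are hypothetical, and the function name is illustrative only.

```python
def classify(power_budget_watts, watts_per_processor, processors_available):
    """Return (label, processors_used, unused_watts) for one application.

    An application is node-bound when the budget could power at least as many
    processors as exist; otherwise it runs out of power first.
    """
    if watts_per_processor <= 0:
        raise ValueError("watts_per_processor must be positive")
    affordable = int(power_budget_watts // watts_per_processor)
    used = min(affordable, processors_available)
    unused = power_budget_watts - used * watts_per_processor
    label = "node-bound" if affordable >= processors_available else "power-bound"
    return label, used, unused


# Budget sized for a hypothetical 80 W application on all 234 processors.
print(classify(234 * 80.0, 60.0, 234))  # ('node-bound', 234, 4680.0)

# Budget sized for a hypothetical 60 W application; the 80 W one no longer fits.
print(classify(234 * 60.0, 80.0, 234))  # ('power-bound', 175, 40.0)
```

In the first case 4680 W of provisioned power sits idle, which is the waste the slide attributes to worst-case provisioning.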
Wide Variation in Application DRAM Power Draw
[Figure: rzmerl, NPB C.8, 234 processors. Y-axis: Average Watts. X-axis: Processors ordered by cg.C.8 average PKG power.]
• Memory power substantially lower than package power
Exascale Is Not Only Bigger: Exascale Is Fundamentally Different
• Overprovision hardware
  • Processors are cheap and plentiful
  • Power is not
• Measure performance at max power consumption
  • May require turning off nodes
  • Running out of nodes before running out of power means application is not power-bound
• Expect heterogeneous processor performance
  • Put most-efficient nodes on the critical path if possible
  • Put least-efficient nodes where they will do the least harm
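One simple way to act on the last two bullets is a greedy pairing: give the heaviest ranks the fastest processors. The Python sketch below is an illustrative heuristic, not a method described in the talk. The rank names, work amounts, and relative processor speeds under a power bound are all hypothetical.

```python
def assign_ranks(rank_work, processor_speed):
    """Pair the heaviest rank with the fastest processor, and so on down."""
    if len(rank_work) != len(processor_speed):
        raise ValueError("need exactly one processor per rank")
    ranks = sorted(rank_work, key=rank_work.get, reverse=True)
    procs = sorted(processor_speed, key=processor_speed.get, reverse=True)
    return dict(zip(ranks, procs))


def makespan(assignment, rank_work, processor_speed):
    """Time for the slowest rank to finish, which sets the critical path."""
    return max(rank_work[r] / processor_speed[p] for r, p in assignment.items())


# Hypothetical work per rank and relative speed per processor under a bound.
work = {"r0": 10.0, "r1": 4.0, "r2": 6.0}
speed = {"pA": 2.0, "pB": 1.0, "pC": 1.5}

placement = assign_ranks(work, speed)
print(placement)                          # {'r0': 'pA', 'r2': 'pC', 'r1': 'pB'}
print(makespan(placement, work, speed))   # 5.0
```

Placing the heaviest rank on the slowest processor instead would give a finish time of 10.0 in this example, twice as long, which is the sense in which the least-efficient processors should go where they do the least harm.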