Hardware Architectures for Power and Energy Adaptation
Phillip Stanley-Marbell
Outline
• Motivation
• Related Research
• Architecture
• Experimental Evaluation
• Extensions
• Summary and Future Work
Motivation
• Power consumption is becoming a limiting factor as technology scales to smaller feature sizes
  • Mobile/battery-powered computing applications
  • Thermal issues in high-end servers
• Low-power design is not enough:
  • Power- and energy-aware design
  • Adapt to non-uniform application behavior
  • Use only as many resources as the application requires
• This talk: exploit the processor-memory performance gap to save power, with limited performance degradation
Related Research
• Reducing power dissipation in on-chip caches
  • Reducing instruction cache leakage power dissipation [Powell et al., TVLSI '01]
  • Reducing dynamic power in set-associative caches and on-chip buffer structures [Dropsho et al., PACT '02]
• Reducing power dissipation of the CPU core
  • Compiler-directed dynamic voltage scaling of the CPU core [Hsu, Kremer, Hsiao. ISLPED '01]
Target Application Class: Memory-Bound Applications
• Single-issue, in-order processors
  • Limited overlap of main-memory access and computation
• Memory-bound applications
  • Limited by memory-system performance
[Figure: execution timelines for CPU @ Vdd vs. CPU @ Vdd/2]
Power-Performance Tradeoff
• Detect memory-bound execution phases
  • Maintain sufficient information to determine the compute / stall time ratio
• Pros
  • Scaling down the CPU core voltage yields significant energy savings (Energy ∝ Vdd²)
• Cons
  • Performance hit (Delay ∝ 1/Vdd)
Power Adaptation Unit (PAU)
• Maintains information to determine the ratio of compute to stall time
• Entries allocated for instructions which cause CPU stalls
  • Intuitively, one table entry required per program loop
• Fields:
  • State (I, A, T, V)
  • # instrs. executed (NINSTR)
  • Distance between stalls (STRIDE)
  • Saturating 'quality' counter (Q)
[From S-M et al., PACS 2002]
Slowdown factor, ∂, for a target 1% performance degradation:

    ∂ = (0.01 · STRIDE + NINSTR) / NINSTR

[Figure: PAU table entry state machine — if the CPU is at speed, slow it down]
Example

    for (x = 100;;) {
        if (x-- > 0)
            a = i;
        b = *n;
        c = *p++;
    }

• PAU table entries created for each assignment
• After 100 iterations, the assignment to a stops
• Entries for b or c can take over immediately
Experimental Methodology
• Simulated PAU as part of a single-issue embedded processor
  • Used Myrmigki simulator [S-M et al., ISLPED 2001]
  • Models a Hitachi SH RISC embedded processor
  • 5-stage in-order pipeline
  • 8K unified L1, 100-cycle latency to main memory
  • Empirical instruction power model, from SH7708 device
  • Voltage scaling penalty of 1024 cycles, 14 µJ
• Investigated effect of PAU table size on performance and power
  • Intuitively, PAU table entries track program loops with repeated stalls
Effect of Table Size on Energy Savings
• A single-entry PAU table provides a 27% reduction in energy, on average
• Scaling up to a 64-entry PAU table provides only an additional 4%
Effect of Table Size on Performance
• A single-entry PAU table incurs 0.75% performance degradation, on average
• A larger PAU table leads to more aggressive behavior and an increased penalty
Overall Effect of Table Size: Energy-Delay Product
• Considering both performance and power, there is little benefit from larger PAU table sizes
Extending the PAU structure
• Multiprogramming environments
• Superscalar architectures
• Slowdown factor computation
PAU in Multiprogramming Environments
• Only a single entry necessary per application
  • Amortize memory-bound phase detection
  • Would be wasteful to flush the PAU at each context switch (~10 ms)
• Extend PAU entries with an ID field:
  • CURID and IDMASK fields written by the OS
PAU in Superscalar Architectures
• Dependent computations are 'stretched out'
• FUs with no dependent instructions are unduly slowed down
  • Maintain separate instruction counters per FU
• Drawback: requires the ability to run FUs in the core at different voltages
[Figure: execution timelines for CPU @ Vdd vs. CPU @ Vdd/2]
Slowdown factor computation
• Computation only performed on an application phase change
  • A hardware solution would be wasteful
• Solution: computation by a software ISR
  • Compute ∂, then look up a discrete Vdd/frequency pair by indexing into a lookup table
• A similar software handler solution was proposed in [Dropsho et al., 2002]
Summary & Future Work
• PAU: hardware identifies program regions (loops) with a compute / memory-stall mismatch
• Due to the nature of most programs, even a single-entry PAU is effective: it can achieve 27% energy savings with only 0.75% performance degradation
• Proposed extensions to the PAU architecture
• Future work
  • Evaluations with smaller miss penalties
  • Implementation of the proposed extensions
  • More extensive evaluation across a broader set of applications