Hardware Architectures for Power and Energy Adaptation
Phillip Stanley-Marbell
Outline
• Motivation
• Related Research
• Architecture
• Experimental Evaluation
• Extensions
• Summary and Future Work
Motivation
• Power consumption is becoming a limiting factor as technology scales to smaller feature sizes
  • Mobile/battery-powered computing applications
  • Thermal issues in high-end servers
• Low-power design is not enough:
  • Power- and energy-aware design
  • Adapt to non-uniform application behavior
  • Use only as many resources as the application requires
• This talk: exploit the processor-memory performance gap to save power, with limited performance degradation
Related Research
• Reducing power dissipation in on-chip caches
  • Reducing instruction cache leakage power dissipation [Powell et al., TVLSI '01]
  • Reducing dynamic power in set-associative caches and on-chip buffer structures [Dropsho et al., PACT '02]
• Reducing power dissipation of the CPU core
  • Compiler-directed dynamic voltage scaling of the CPU core [Hsu, Kremer, Hsiao. ISLPED '01]
Target Application Class: Memory-Bound Applications
• Single-issue, in-order processors
  • Limited overlap of main-memory access and computation
• Memory-bound applications
  • Limited by memory-system performance
[Figure: execution timelines for CPU @ Vdd vs. CPU @ Vdd/2]
Power-Performance Tradeoff
• Detect memory-bound execution phases
  • Maintain sufficient information to determine the compute / stall time ratio
• Pros
  • Scaling down the CPU core voltage yields significant energy savings (Energy ∝ Vdd²)
• Cons
  • Performance hit (Delay ∝ 1/Vdd)
Power Adaptation Unit (PAU)
• Maintains information to determine the ratio of compute to stall time
• Entries allocated for instructions which cause CPU stalls
  • Intuitively, one table entry required per program loop
• Fields:
  • State (I, A, T, V)
  • # instrs. executed (NINSTR)
  • Distance between stalls (STRIDE)
  • Saturating 'quality' counter (Q)
[From S-M et al., PACS 2002]
Slowdown factor, ∂, for a target 1% performance degradation:

    ∂ = (0.01 · STRIDE + NINSTR) / NINSTR

[Figure: PAU table entry state machine — if the CPU is at speed, slow it down]
Example

    for (x = 100;;) {
        if (x-- > 0)
            a = i;
        b = *n;
        c = *p++;
    }

• PAU table entries created for each assignment
• After 100 iterations, the assignment to a stops
• Entries for b or c can take over immediately
Experimental Methodology
• Simulated PAU as part of a single-issue embedded processor
  • Used Myrmigki simulator [S-M et al., ISLPED 2001]
  • Models a Hitachi SH RISC embedded processor
  • 5-stage in-order pipeline
  • 8K unified L1, 100-cycle latency to main memory
  • Empirical instruction power model, from SH7708 device
  • Voltage scaling penalty of 1024 cycles, 14 µJ
• Investigated effect of PAU table size on performance and power
  • Intuitively, PAU table entries track program loops with repeated stalls
Effect of Table Size on Energy Savings
• A single-entry PAU table provides a 27% reduction in energy, on average
• Scaling up to a 64-entry PAU table provides only an additional 4%
Effect of Table Size on Performance
• A single-entry PAU table incurs 0.75% performance degradation, on average
• A larger PAU table leads to more aggressive behavior and an increased penalty
Overall Effect of Table Size: Energy-Delay Product
• Considering both performance and power, there is little benefit from larger PAU table sizes
Extending the PAU structure
• Multiprogramming environments
• Superscalar architectures
• Slowdown factor computation
PAU in Multiprogramming Environments
• Only a single entry necessary per application
  • Amortize memory-bound phase detection
  • Would be wasteful to flush the PAU at each context switch (~10 ms)
• Extend PAU entries with an ID field:
  • CURID and IDMASK fields written by the OS
PAU in Superscalar Architectures
• Dependent computations are 'stretched out'
• FUs with no dependent instructions are unduly slowed down
  • Maintain separate instruction counters per FU
• Drawback: requires the ability to run FUs in the core at different voltages
[Figure: execution timelines for CPU @ Vdd vs. CPU @ Vdd/2]
Slowdown factor computation
• Computation only performed on an application phase change
  • A hardware solution would be wasteful
• Solution: computation by a software ISR
  • Compute ∂, then look up a discrete Vdd/frequency pair by indexing into a lookup table
• A similar software handler solution was proposed in [Dropsho et al., 2002]
Summary & Future Work
• PAU: hardware identifies program regions (loops) with a compute / memory-stall mismatch
• Due to the nature of most programs, even a single-entry PAU is effective: it can achieve 27% energy savings with only 0.75% performance degradation
• Proposed extensions to the PAU architecture
• Future work
  • Evaluations with smaller miss penalties
  • Implementation of the proposed extensions
  • More extensive evaluation across a broader set of applications