Energy Efficient Instruction Cache for Wide-issue Processors Alex Veidenbaum Information and Computer Science University of California, Irvine
Motivation • Power dissipation is a serious problem • It is important both for high-performance processors, e.g. Compaq Alpha, MIPS R10K, Intel x86, and for embedded processors, e.g. ARM, StrongARM, etc. • Our current research is focused on reducing the average energy consumption of high-performance embedded processors • MIPS R20K, IBM/Motorola PowerPC, Mobile Pentium, …
We want to address energy consumption via • architecture • compiler • Technology is, to first order, an orthogonal parameter and is NOT considered here • So the first question is: what are the major sources of energy dissipation? • This information is hard to find
Experimental Setup • We started with Wattch (Princeton) • architectural-level power simulator • based on SimpleScalar (sim-outorder) simulator • can specify technology, clock and voltage • computes power for major internal units of processor • parameterizable models for most of the components • mostly memory-based units… • ALU power is a constant (based on input from industry) • Modified it to match our needs
Need to account for energy correctly: • "Worst case": dynamic power P ≈ C·V²·f, assuming every unit switches on every cycle • No good • Activity-based: charge energy only to the units a program actually exercises each cycle • the right way • Basic organization:
Used a typical wide-issue processor, assuming • 600 MHz, 32-bit • 32K L1 instruction cache, 32K L1 data cache • 512K L2 unified cache • 2 integer ALUs, 1 FP adder, 1 FP multiplier • 3.3V • MIPS R10K-like, modified from the default SimpleScalar configuration • Major units to look at: • Instruction cache, data cache, ALU, branch predictor, register file, …
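To make the activity-based accounting concrete, here is a minimal sketch of the two accounting styles; the unit capacitances and the per-cycle activity trace are invented for illustration and are not values from Wattch.

```python
# Minimal sketch: "worst case" vs. activity-based energy accounting.
# Unit capacitances and the activity trace below are illustrative
# assumptions, not numbers taken from Wattch.

V = 3.3      # supply voltage (V)
F = 600e6    # clock frequency (Hz)

# effective switched capacitance per access, per unit (made-up values)
unit_cap = {"icache": 1.0e-9, "dcache": 1.0e-9, "alu": 0.3e-9, "bpred": 0.2e-9}

def worst_case_power():
    """Assume every unit switches every cycle: P = (sum of C) * V^2 * f."""
    return sum(unit_cap.values()) * V * V * F

def activity_based_energy(activity_trace):
    """Charge E = C * V^2 only to the units actually used in each cycle."""
    energy = 0.0
    for active_units in activity_trace:            # one set of units per cycle
        energy += sum(unit_cap[u] for u in active_units) * V * V
    return energy

# three example cycles in which only some units are active
trace = [{"icache", "alu"}, {"icache"}, {"icache", "dcache", "bpred"}]
print(worst_case_power())            # watts, upper bound
print(activity_based_energy(trace))  # joules over the three cycles
```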
Some typical results • Power distribution among major internal units
Motivation cont'd • Now we can attack specific important sources • The instruction cache is one such unit • Reason: • Every cycle, four 32-bit instruction words need to be fetched • Next we discuss a hardware mechanism for reducing instruction cache energy consumption
Previous Work • Hasegawa 1995 - phased cache • Examines the tag and data fields in two separate phases • Reduces power consumption by 70% • Increases the average cache-access time by 100% • Inoue 1999 - set-associative, way-prediction cache • Speculatively selects one way before starting a normal access • On a way-prediction hit, power is reduced by a factor of 4 • Increases the cache-access time on mispredictions • Lee 1999 - loop cache • Shuts down the main cache completely while executing tight program loops from the loop cache • Power savings vary with the application • No performance degradation
Our approach • Not all of the fetched instructions in a line are used • When a branch is taken, the words after the branch until the end of the line are not used • When a branch target falls within a line, the words from the beginning of the line up to the target are not used • Save energy by fetching only useful instructions • Design a hardware mechanism (fetch predictor) that predicts which instructions of a cache line are going to be used before that line is fetched • Selectively fetch only the predicted useful instructions in each fetch cycle
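As a concrete illustration of which words are useful, the sketch below marks the words of a line between the entry point (a branch target, or word 0 for sequential entry) and a taken branch; the 8-word line size and the function name are assumptions made for the example, not the paper's hardware.

```python
# Sketch: mark which instruction words of a cache line are actually useful.
# The line size and helper name are illustrative assumptions.

WORDS_PER_LINE = 8

def useful_word_mask(entry_word, taken_branch_word=None):
    """Return a 0/1 vector marking the useful words of one cache line.

    entry_word        -- word offset where execution enters the line
                         (0 for sequential entry, target offset otherwise)
    taken_branch_word -- offset of a taken branch in the line, or None
                         if execution runs to the end of the line
    """
    last = taken_branch_word if taken_branch_word is not None else WORDS_PER_LINE - 1
    return [1 if entry_word <= w <= last else 0 for w in range(WORDS_PER_LINE)]

# Entering a line at its target (word 2) with a taken branch at word 5:
# only words 2..5 need to be fetched.
print(useful_word_mask(entry_word=2, taken_branch_word=5))  # [0, 0, 1, 1, 1, 1, 0, 0]
```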
Organization • Need a cache with the ability to fetch any consecutive sequence of instructions from a line • This has been implemented before • Su 1995 - divide the cache into subbanks, activated individually by a control vector • RS/6000 - cache organized as 4 separate arrays, each of which could use a different row address • Generate a control vector with a bit for each "bank"
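One possible way to turn such a useful-word mask into per-subbank enables for a Su-style subbanked cache is sketched below; the subbank count and the word-to-bank mapping are assumptions for illustration.

```python
# Sketch: derive per-subbank enable bits from a useful-word mask, assuming
# an 8-word line split evenly across 4 subbanks (illustrative parameters).

WORDS_PER_LINE = 8
N_SUBBANKS = 4
WORDS_PER_BANK = WORDS_PER_LINE // N_SUBBANKS

def subbank_enables(word_mask):
    """A subbank is activated only if at least one of its words is useful."""
    return [
        1 if any(word_mask[b * WORDS_PER_BANK:(b + 1) * WORDS_PER_BANK]) else 0
        for b in range(N_SUBBANKS)
    ]

# Useful words 2..5 touch only subbanks 1 and 2.
print(subbank_enables([0, 0, 1, 1, 1, 1, 0, 0]))  # [0, 1, 1, 0]
```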
Fetch Predictor • General idea: • Rely on branch predictor to get PC of next instruction • Build a fetch predictor on top of branch predictor to decide which instructions to fetch • Use branch misprediction detection mechanism and branch predictor update phase to update the fetch predictor
Some specifics • Predict for the next line to be fetched • For a target in the next line, use the address in the BTB • Fetch from the target onward • For a branch in the next line, a separate predictor is needed • before the line is fetched • Update it when the branch predictor is updated • Initialize to fetch all words • Need to handle the case when both a branch and a target are in the same line • AND the control bit vectors
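A minimal sketch of how such a fetch-predictor table might look, including the ANDing of the target and branch bit vectors when both fall in the same line; the table structure, indexing, and naming are assumptions for illustration, not the paper's exact design.

```python
# Sketch of a fetch-predictor table indexed by cache-line address.
# Table structure, indexing, and update policy are illustrative assumptions.

WORDS_PER_LINE = 8
ALL_WORDS = [1] * WORDS_PER_LINE           # default prediction: fetch every word

fetch_pred = {}                             # line address -> predicted useful-word mask

def predict_fetch_mask(line_addr):
    """Before a line is fetched: look up its mask, or fetch all words if unknown."""
    return fetch_pred.get(line_addr, ALL_WORDS)

def update_fetch_pred(line_addr, target_word=None, taken_branch_word=None):
    """Called when the branch predictor is updated or a misprediction is resolved.

    target_word       -- entry offset into the line (None for sequential entry)
    taken_branch_word -- offset of a taken branch in the line (None if none)
    """
    entry = target_word if target_word is not None else 0
    last = taken_branch_word if taken_branch_word is not None else WORDS_PER_LINE - 1
    target_mask = [1 if w >= entry else 0 for w in range(WORDS_PER_LINE)]
    branch_mask = [1 if w <= last else 0 for w in range(WORDS_PER_LINE)]
    # branch and target in the same line: AND the two control bit vectors
    fetch_pred[line_addr] = [t & b for t, b in zip(target_mask, branch_mask)]

update_fetch_pred(0x1000, target_word=2, taken_branch_word=5)
print(predict_fetch_mask(0x1000))   # [0, 0, 1, 1, 1, 1, 0, 0]
print(predict_fetch_mask(0x2000))   # unknown line: fetch all 8 words
```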
Experimental Setup • SimpleScalar extended with a fetch predictor • A simple power model: • energy per cycle is proportional to the number of fetched instructions • Simulated a subset of SPEC95 • 3 billion instructions executed per benchmark • Direct-mapped cache, with 4 and 8 instructions per line • Bimodal branch predictor
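Under this simple model, the relative I-cache fetch energy follows directly from fetch counts; a minimal sketch of the savings calculation (the function and argument names are assumptions):

```python
# Sketch of the simple power model: I-cache fetch energy is assumed to be
# proportional to the number of instruction words actually fetched.

def icache_fetch_savings(words_fetched, line_fetches, words_per_line=8):
    """Fractional energy savings versus always fetching the full line."""
    baseline_words = line_fetches * words_per_line
    return 1.0 - words_fetched / baseline_words

# e.g. 6 million words fetched over 1 million 8-word line fetches -> 25% saved
print(icache_fetch_savings(6_000_000, 1_000_000))  # 0.25
```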
Summary of Results • Average power savings • Perfect predictor • 33%, ranging between 8% and 55% for an 8-instruction cache line • Fetch predictor • 25%, ranging between 5% and 41% for an 8-instruction cache line • Larger power savings for integer benchmarks than for floating-point ones
Conclusions • Contribution to power-aware hardware • a fetch predictor for fetching only useful instructions • Preliminary results • savings of about 5% of total processor power for an 8-instruction cache line (assuming the I-cache consumes 20% of the total) • Advantage: no performance penalty!