Energy Efficient Instruction Cache for Wide-issue Processors Alex Veidenbaum Information and Computer Science University of California, Irvine
Motivation • Power dissipation is a serious problem • It is important both for high-performance processors, e.g. Compaq Alpha, MIPS R10K, Intel x86, and for embedded processors, e.g. ARM, StrongARM, etc. • Our current research is focused on reducing the average energy consumption of high-performance embedded processors • MIPS R20K, IBM/Motorola PowerPC, Mobile Pentium, …
We want to address energy consumption via • architecture • compiler • Technology is, to first order, an orthogonal parameter and is NOT considered here • So the first question is: what are the major sources of energy dissipation? • This information is hard to find
Experimental Setup • We started with Wattch (Princeton) • architectural-level power simulator • based on SimpleScalar (sim-outorder) simulator • can specify technology, clock and voltage • computes power for major internal units of processor • parameterizable models for most of the components • mostly memory-based units… • ALU power is a constant (based on input from industry) • Modified it to match our needs
Need to account for energy correctly: • "Worst case": dynamic power P ≈ C·V²·f, assuming every unit switches on every cycle • No good • Activity-based: charge energy only to the units a program actually exercises each cycle • the right way • Basic organization:
Used a typical wide-issue processor, assuming • 600 MHz, 32-bit • 32K L1 instruction cache, 32K L1 data cache • 512K L2 unified cache • 2 integer ALUs, 1 FP adder, 1 FP multiplier • 3.3V • MIPS R10K-like, modified from the default SimpleScalar configuration • Major units to look at: • Instruction cache, data cache, ALU, branch predictor, register file, …
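To make the activity-based accounting concrete, here is a minimal sketch of the two accounting styles; the unit capacitances and the per-cycle activity trace are invented for illustration and are not values from Wattch.

```python
# Minimal sketch: "worst case" vs. activity-based energy accounting.
# Unit capacitances and the activity trace below are illustrative
# assumptions, not numbers taken from Wattch.

V = 3.3      # supply voltage (V)
F = 600e6    # clock frequency (Hz)

# effective switched capacitance per access, per unit (made-up values)
unit_cap = {"icache": 1.0e-9, "dcache": 1.0e-9, "alu": 0.3e-9, "bpred": 0.2e-9}

def worst_case_power():
    """Assume every unit switches every cycle: P = (sum of C) * V^2 * f."""
    return sum(unit_cap.values()) * V * V * F

def activity_based_energy(activity_trace):
    """Charge E = C * V^2 only to the units actually used in each cycle."""
    energy = 0.0
    for active_units in activity_trace:            # one set of units per cycle
        energy += sum(unit_cap[u] for u in active_units) * V * V
    return energy

# three example cycles in which only some units are active
trace = [{"icache", "alu"}, {"icache"}, {"icache", "dcache", "bpred"}]
print(worst_case_power())            # watts, upper bound
print(activity_based_energy(trace))  # joules over the three cycles
```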
Some typical results • Power distribution among major internal units
Motivation cont'd • Now we can attack specific important sources • The instruction cache is one such unit • Reason: • Every cycle, four 32-bit instruction words need to be fetched • Next we discuss a hardware mechanism for reducing instruction cache energy consumption
Previous Work • Hasegawa 1995 - phased cache • Examines the tag and data fields in two separate phases • Reduces power consumption by 70% • Increases the average cache-access time by 100% • Inoue 1999 - set-associative, way-prediction cache • Speculatively selects one way before starting a normal access • On a way-prediction hit, power is reduced by a factor of 4 • Increases the cache-access time on mispredictions • Lee 1999 - loop cache • Shuts down the main cache completely while executing tight program loops from the loop cache • Power savings vary with the application • No performance degradation
Our approach • Not all of the fetched instructions in a line are used • When a branch is taken, the words after the branch until the end of the line are not used • When a branch target falls within a line, the words from the beginning of the line up to the target are not used • Save energy by fetching only useful instructions • Design a hardware mechanism (fetch predictor) that predicts which instructions of a cache line are going to be used before that line is fetched • Selectively fetch only the predicted useful instructions in each fetch cycle
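As a concrete illustration of which words are useful, the sketch below marks the words of a line between the entry point (a branch target, or word 0 for sequential entry) and a taken branch; the 8-word line size and the function name are assumptions made for the example, not the paper's hardware.

```python
# Sketch: mark which instruction words of a cache line are actually useful.
# The line size and helper name are illustrative assumptions.

WORDS_PER_LINE = 8

def useful_word_mask(entry_word, taken_branch_word=None):
    """Return a 0/1 vector marking the useful words of one cache line.

    entry_word        -- word offset where execution enters the line
                         (0 for sequential entry, target offset otherwise)
    taken_branch_word -- offset of a taken branch in the line, or None
                         if execution runs to the end of the line
    """
    last = taken_branch_word if taken_branch_word is not None else WORDS_PER_LINE - 1
    return [1 if entry_word <= w <= last else 0 for w in range(WORDS_PER_LINE)]

# Entering a line at its target (word 2) with a taken branch at word 5:
# only words 2..5 need to be fetched.
print(useful_word_mask(entry_word=2, taken_branch_word=5))  # [0, 0, 1, 1, 1, 1, 0, 0]
```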
Organization • Need a cache with the ability to fetch any consecutive sequence of instructions from a line • This has been implemented before • Su 1995 - divide the cache into subbanks, activated individually by a control vector • RS/6000 - cache organized as 4 separate arrays, each of which could use a different row address • Generate a control vector with a bit for each "bank"
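One possible way to turn such a useful-word mask into per-subbank enables for a Su-style subbanked cache is sketched below; the subbank count and the word-to-bank mapping are assumptions for illustration.

```python
# Sketch: derive per-subbank enable bits from a useful-word mask, assuming
# an 8-word line split evenly across 4 subbanks (illustrative parameters).

WORDS_PER_LINE = 8
N_SUBBANKS = 4
WORDS_PER_BANK = WORDS_PER_LINE // N_SUBBANKS

def subbank_enables(word_mask):
    """A subbank is activated only if at least one of its words is useful."""
    return [
        1 if any(word_mask[b * WORDS_PER_BANK:(b + 1) * WORDS_PER_BANK]) else 0
        for b in range(N_SUBBANKS)
    ]

# Useful words 2..5 touch only subbanks 1 and 2.
print(subbank_enables([0, 0, 1, 1, 1, 1, 0, 0]))  # [0, 1, 1, 0]
```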
Fetch Predictor • General idea: • Rely on branch predictor to get PC of next instruction • Build a fetch predictor on top of branch predictor to decide which instructions to fetch • Use branch misprediction detection mechanism and branch predictor update phase to update the fetch predictor
Some specifics • Predict for the next line to be fetched • For a target in the next line, use the address in the BTB • Fetch from the target onward • For a branch in the next line, a separate predictor is needed • before the line is fetched • Update it when the branch predictor is updated • Initialize to fetch all words • Need to handle the case when both a branch and a target are in the same line • AND the control bit vectors
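A minimal sketch of how such a fetch-predictor table might look, including the ANDing of the target and branch bit vectors when both fall in the same line; the table structure, indexing, and naming are assumptions for illustration, not the paper's exact design.

```python
# Sketch of a fetch-predictor table indexed by cache-line address.
# Table structure, indexing, and update policy are illustrative assumptions.

WORDS_PER_LINE = 8
ALL_WORDS = [1] * WORDS_PER_LINE           # default prediction: fetch every word

fetch_pred = {}                             # line address -> predicted useful-word mask

def predict_fetch_mask(line_addr):
    """Before a line is fetched: look up its mask, or fetch all words if unknown."""
    return fetch_pred.get(line_addr, ALL_WORDS)

def update_fetch_pred(line_addr, target_word=None, taken_branch_word=None):
    """Called when the branch predictor is updated or a misprediction is resolved.

    target_word       -- entry offset into the line (None for sequential entry)
    taken_branch_word -- offset of a taken branch in the line (None if none)
    """
    entry = target_word if target_word is not None else 0
    last = taken_branch_word if taken_branch_word is not None else WORDS_PER_LINE - 1
    target_mask = [1 if w >= entry else 0 for w in range(WORDS_PER_LINE)]
    branch_mask = [1 if w <= last else 0 for w in range(WORDS_PER_LINE)]
    # branch and target in the same line: AND the two control bit vectors
    fetch_pred[line_addr] = [t & b for t, b in zip(target_mask, branch_mask)]

update_fetch_pred(0x1000, target_word=2, taken_branch_word=5)
print(predict_fetch_mask(0x1000))   # [0, 0, 1, 1, 1, 1, 0, 0]
print(predict_fetch_mask(0x2000))   # unknown line: fetch all 8 words
```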
Experimental Setup • SimpleScalar extended with a fetch predictor • A simple power model: • energy per cycle is proportional to the number of fetched instructions • Simulated a subset of SPEC95 • 3 billion instructions executed per benchmark • Direct-mapped cache, with 4 and 8 instructions per line • Bimodal branch predictor
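Under this simple model, the relative I-cache fetch energy follows directly from fetch counts; a minimal sketch of the savings calculation (the function and argument names are assumptions):

```python
# Sketch of the simple power model: I-cache fetch energy is assumed to be
# proportional to the number of instruction words actually fetched.

def icache_fetch_savings(words_fetched, line_fetches, words_per_line=8):
    """Fractional energy savings versus always fetching the full line."""
    baseline_words = line_fetches * words_per_line
    return 1.0 - words_fetched / baseline_words

# e.g. 6 million words fetched over 1 million 8-word line fetches -> 25% saved
print(icache_fetch_savings(6_000_000, 1_000_000))  # 0.25
```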
Summary of Results • Average power savings • Perfect predictor • 33%, ranging between 8% and 55% for an 8-instruction cache line • Fetch predictor • 25%, ranging between 5% and 41% for an 8-instruction cache line • Larger power savings for integer benchmarks than for floating-point ones
Conclusions • Contribution to power-aware hardware • a fetch predictor for fetching only useful instructions • Preliminary results • savings of about 5% of total processor power for an 8-instruction cache line (assuming the I-cache consumes 20% of the total) • Advantage: no performance penalty!