Variable Word Width Computation for Low Power • By Bret Victor and Sayf Alalusi
Motivation • A 32 bit architecture is required for most general purpose computing. • However, many applications don’t need a full 32 bit data word: video needs 24 bits, audio 16 bits, text 8 bits, and logic 1 bit. • How can we exploit this to save power?
Possibilities • Architecture that supports 32, 24, 16, 8, and 1 bit operations? Or some subset? • Switch processor between modes, or specify width for each instruction? Global or distributed control? • Gated clocks? Don’t drive unused outputs? Power down unused blocks?
Implementation • Based on MIPS architecture and ISA • Two widths: 16 bit and 32 bit • Width chosen on instruction-by-instruction basis. • Flag bit in instruction word selects width • Modified ISA: • arithmetic: add16, add32; mul16, mul32 • logical: and16, and32 • memory: lw16, lw32; sw16, sw32 • branch compare: beq16, beq32
Energy • Energy consumption occurs when a node transitions, and is proportional to the capacitance at that node. • Prevent nodes from transitioning unnecessarily. • Energy savings can be calculated by adding all the capacitance that is switching.
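The energy model above can be sketched in a few lines. This is a minimal illustration, not the authors' tool: the helper names (`toggled_bits`, `switching_energy`) and the uniform per-bit capacitance are assumptions chosen only to show how holding the high word still removes its contribution.

```python
# Illustrative switching-energy model: energy is charged only for nodes
# that change value, weighted by each node's capacitance.
# The capacitance values are placeholders (assumption), not measurements.

def toggled_bits(prev: int, curr: int, width: int = 32) -> list[int]:
    """Return the bit positions that differ between two bus values."""
    diff = (prev ^ curr) & ((1 << width) - 1)
    return [i for i in range(width) if (diff >> i) & 1]

def switching_energy(prev: int, curr: int, cap_per_bit: list[float]) -> float:
    """Sum the capacitance of every bit that transitions (arbitrary units)."""
    return sum(cap_per_bit[i] for i in toggled_bits(prev, curr, len(cap_per_bit)))

caps = [1.0] * 32  # hypothetical: every bus line has the same capacitance

# All 32 lines toggle versus only the low word toggling.
full = switching_energy(0x00000000, 0xFFFFFFFF, caps)
low_only = switching_energy(0x00000000, 0x0000FFFF, caps)
print(full, low_only)
```

With equal capacitance per line, keeping the high 16 lines from transitioning halves the switched capacitance in this toy case.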
Where We Save Energy • Our design saves energy over a traditional processor in three main areas: • Clock and control line energy • HWTE (High Word Transition Energy) • Memory control energy • We will see these three areas as we step through the pipeline.
Pipeline Overview • [Figure: pipeline datapath across the IF/ID, ID/EX, EX/MEM, and MEM/WB registers: PC with +4 incrementer and branch-address MUX, I$, register file (srcA, srcB, dest, data, outA, outB), branch-offset adder and equality comparator, ALU with reg A and reg B MUXes taking forwards from MEM and WB, and data memory (addr, wr data, rd data). Data and address paths are 32 bits, the immediate is 16 bits, and dest reg fields are 5 bits.]
IF Stage • Instruction words and addresses must be 32 bits. • Can’t modify much. • [Figure: IF stage with the branch address / PC + 4 MUX, PC, +4 incrementer, and I$ feeding the IF/ID register; all paths are 32 bits.]
ID Stage • We can: • gate the clocks of the pipeline register • only drive the high words out of the register file on a 32 bit operation • [Figure: ID stage with the register file (srcA, srcB, dest, data, outA, outB), branch-offset adder, and equality comparator between the IF/ID and ID/EX registers; branch address and PC + 4 are 32 bits, immed is 16 bits, dest reg is 5 bits.]
Pipeline Register (ID) • Fit gating into clock distribution network. • Little energy overhead and helps control skew. • On ID stage, gating reduces clock energy by: • 56% on 16-bit operations • 19% on 32-bit non-immediate operations • [Figure: ID pipeline register split into fields reg A high 16, reg A low 16, reg B high 16, reg B low 16, destReg (5 bits), and immed (16 bits). Clock is distributed as UngatedClock, WidthGatedClock (Clock gated by Width through a latch), and ImmedGatedClock (derived from the instruction word).]
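One way to read the two percentages above is as the fraction of register bits whose clock is gated off. The sketch below assumes that clock energy scales with the number of clocked bits, that the register holds only the fields named on the slide, and that the 16-bit figure refers to a non-immediate operation. These are assumptions for illustration; they happen to reproduce the quoted figures.

```python
# Field widths of the ID pipeline register, as listed on the slide.
FIELDS = {
    "regA_hi": 16, "regA_lo": 16,
    "regB_hi": 16, "regB_lo": 16,
    "destReg": 5,
    "immed": 16,
}

def clock_saving(gated: set[str]) -> float:
    """Fraction of clocked bits that are gated off (assumed proxy for clock energy)."""
    total = sum(FIELDS.values())
    return sum(FIELDS[name] for name in gated) / total

# 16-bit, non-immediate operation: both high words and the immediate are gated.
saving_16 = clock_saving({"regA_hi", "regB_hi", "immed"})
# 32-bit, non-immediate operation: only the immediate field is gated.
saving_32 = clock_saving({"immed"})

print(f"{saving_16:.0%} {saving_32:.0%}")
```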
Register File Read Port (ID) • Decoder selects register to drive output bus. • We add one AND gate per register. • Switching capacitance dominated by output bus. • 16 bit operation takes 50% less energy than 32 bit operation... • Not necessarily savings! • [Figure: decoder driving N registers, each split into a high 16 and a low 16 half; the Width signal is ANDed with each decoder output to enable the high half onto the 16-bit output bus.]
EX Stage • Modify the ALU to perform 16 bit operations. • Prevent the high word output of the MUXes from changing on 16 bit operations. • Gate the clock of the pipeline register: • Only latch high word of ALU result on 32 bit operations • Only latch reg B on “store word” operations • [Figure: EX stage with reg A and reg B MUXes selecting among register values, forwards from MEM and WB, and the immediate; the ALU; data for SW (32 bits) and dest reg (5 bits) passed to the next pipeline register.]
Logical Instructions (EX) • e.g. X AND Y • Just don’t let the unused bits (high 16) transition. • If they don’t transition, they will not drive the next stage either. • 50% less energy • [Figure: one gate per bit pair, X0/Y0, X1/Y1, ..., X31/Y31.]
Adder (EX) • The 4CLA blocks just get replicated for the number of bits, but the upper level CLA structure will grow with the number of bits. • 16 bits: 58% less energy • [Figure: inputs A0, B0 ... An, Bn feeding eight 4-bit CLA blocks covering bits 0..3, 4..7, 8..11, 12..15, 16..19, 20..23, 24..27, and 28..31, with an upper level CLA generation block and outputs S0 ... Sn.]
Multiplier (EX) • 32 bit multiply: 32 x 32-bit adds, 32 x 32-bit register writes, and 32 shifts, in 32 cycles. • 16 bit multiply: 16 x 16-bit adds, 16 x 16-bit register writes, and 16 shifts, in 16 cycles. • Multiply complexity grows as N², so a 16 bit multiply takes 77% less energy. • Even if the upper 16 bits = 0, a 32 bit multiply does 16 extra shifts.
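The cycle and adder counts above correspond to an iterative shift-and-add multiplier. The following is a minimal sketch under that assumption; `shift_add_multiply` and its `adder_bit_ops` counter are hypothetical names, and the counter is only a crude proxy for N² growth. It does not reproduce the 77% figure, which comes from the authors' own energy model.

```python
def shift_add_multiply(a: int, b: int, width: int) -> tuple[int, int, int]:
    """Unsigned shift-and-add multiply.

    Returns (product, cycles, adder_bit_ops), where adder_bit_ops counts
    width adder bits exercised per cycle (an assumed activity proxy).
    """
    mask = (1 << width) - 1
    multiplicand, multiplier = a & mask, b & mask
    product = 0
    cycles = 0
    adder_bit_ops = 0
    for _ in range(width):          # one iteration per cycle
        if multiplier & 1:          # add the multiplicand when the low bit is set
            product += multiplicand
        adder_bit_ops += width      # a width-bit adder is active every cycle
        multiplicand <<= 1          # one shift per cycle
        multiplier >>= 1
        cycles += 1
    return product, cycles, adder_bit_ops

result_16 = shift_add_multiply(300, 200, 16)
result_32 = shift_add_multiply(300, 200, 32)
print(result_16, result_32)
```

Both calls produce the same product, but the 32-bit version runs twice as many cycles on an adder twice as wide, so the proxy count is four times larger even though the operands fit in 16 bits.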
HWTE • Two types of data in 16 bit application: • Computational data (16-bit): high word = 0 • Pointers and addresses (32-bit): high word = C • Assume C “mostly constant” (memory accesses mostly in 64K block) • Traditional processor only consumes more datapath energy than our processor when transitioning between these data types. • HWTE = High Word Transition Energy
HWTE • With such a model, our processor effectively only executes “16 bit operations”. • Traditional processor executes “32 bit operations” only when transitioning between data types. • E32 = energy of 32 bit operation • E16 = energy of 16 bit operation • N = average number of consecutive instructions that use the same data type • HWTE = ( E32 - E16 ) / N
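The HWTE formula above can be evaluated directly once N is known. The sketch below estimates N from a made-up instruction trace in which "D" marks an instruction operating on 16-bit computational data and "P" one operating on a pointer or address. The trace, the energy values, and the function names are illustrative assumptions.

```python
from itertools import groupby

def average_run_length(trace: str) -> float:
    """Average number of consecutive instructions that use the same data type."""
    runs = [len(list(group)) for _, group in groupby(trace)]
    return sum(runs) / len(runs)

def hwte(e32: float, e16: float, n: float) -> float:
    """HWTE = (E32 - E16) / N, as defined on the slide."""
    if n <= 0:
        raise ValueError("N must be positive")
    return (e32 - e16) / n

trace = "DDDDPPDDDDDDPP"       # hypothetical trace: runs of length 4, 2, 6, 2
n = average_run_length(trace)  # (4 + 2 + 6 + 2) / 4
extra = hwte(e32=1.0, e16=0.5, n=n)  # energies in arbitrary units (assumption)
print(n, extra)
```

The longer the runs of same-type instructions, the smaller the per-instruction penalty a traditional processor pays, and so the smaller the datapath saving attributable to the narrower operation.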
Barrel Shifter (EX) • Big win will come from not driving the control lines to the upper 16 bits. • Save about 50% in energy • [Figure: barrel shifter array with data lines A0 to A3 and B0 to B3 and shift control lines SH0 to SH3.]
MEM Stage • This is a big, regular memory (SRAM) structure that can easily be segmented into blocks. • Exploit this fact. • [Figure: MEM stage with the data memory ports addr, wr data, and rd data; ALU result (32 bits) and dest reg (5 bits) are passed through.]
DCache (MEM) • 2-way set associative, write-back • Blocks are 2 x 32b or 4 x 16b, i.e. 16b data values are aligned on 16b boundaries and 32b values on 32b boundaries. • Only drive the word line that you need! • [Figure: cache array with word lines selected by Width and Block #.]
DCache (MEM) • Only drive the word lines that are needed. • Need a little bit of logic to figure out what the correct lines are, but large capacitance of WL dominates. • Block size is larger for 16 bit values, better exploits spatial locality • Associativity does not change from 16 bit to 32 bit word lengths • Energy savings: 50% • Control Line Savings, no HWTE!
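The "little bit of logic" for picking word lines can be sketched as follows. This assumes byte addressing, an 8-byte block (2 x 32b or 4 x 16b as stated earlier), and one word line per 16-bit segment of the block. The segment numbering and the function name `driven_segments` are hypothetical.

```python
BLOCK_BYTES = 8    # one block holds 2 x 32b or 4 x 16b values
SEGMENT_BYTES = 2  # assumed: one word line per 16-bit segment

def driven_segments(addr: int, width: int) -> list[int]:
    """Return the 16-bit segment indices (0..3) within the block to drive."""
    if width not in (16, 32):
        raise ValueError("width must be 16 or 32")
    nbytes = width // 8
    if addr % nbytes != 0:
        raise ValueError("access must be aligned to its own width")
    first = (addr % BLOCK_BYTES) // SEGMENT_BYTES
    return list(range(first, first + nbytes // SEGMENT_BYTES))

half = driven_segments(0x1006, 16)  # a 16-bit access drives one segment
word = driven_segments(0x1004, 32)  # a 32-bit access drives two segments
print(half, word)
```

Under these assumptions a 16-bit access drives half as many word lines as a 32-bit access, which is where the 50% figure would come from if word-line capacitance dominates.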
WB Stage • On a 16 bit operation, we can: • Only drive the low word out of the MUX • Capacitive load on register write port is large • Driving 16 bits out of the MUX consumes 50% less energy than driving 32 bits... HWTE formula applies. • Only latch the low word into the register? • [Figure: WB stage with a MUX selecting between Mem data (32 bits) and ALU result (32 bits) from the MEM/WB register, writing to the register file (srcA, srcB, dest, data, outA, outB); dest reg is 5 bits.]
Reg. File Write Port (WB) • We can add one AND gate for each register. • But 16 bit write uses same amount of clock energy as 32 bit write without modifications. • Little savings from not writing into the register, because the high word would not change in a 16 bit application. • Not worth it! • [Figure: decoder producing a Write signal per register; Write is ANDed with Width to form HiWrite. Each register is split into a high 16 half clocked by HiWrite and a low 16 half clocked by Write.]
Summary • Typical power distribution in core (non-memory), given as share of core power x fraction of energy remaining: • ALU: 34% x 66% • I-decode: 23% x 100% • Register file: 13% x 66% • Clock: 10% x 50% • Shifter: 11% x 50% • Pipeline: 9% x 74% • Core energy reduced by 29%.
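The 29% figure follows from a weighted sum if each entry is read as (share of core power) x (fraction of energy remaining). That reading is an assumption, though the arithmetic below is consistent with it.

```python
# (share of core power, fraction of energy remaining), taken from the slide.
core = {
    "ALU":           (0.34, 0.66),
    "I-decode":      (0.23, 1.00),
    "Register file": (0.13, 0.66),
    "Clock":         (0.10, 0.50),
    "Shifter":       (0.11, 0.50),
    "Pipeline":      (0.09, 0.74),
}

total_share = sum(share for share, _ in core.values())
core_remaining = sum(share * factor for share, factor in core.values())
core_reduction = 1.0 - core_remaining

print(f"shares sum to {total_share:.0%}, core energy reduced by {core_reduction:.0%}")
```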
Summary • Typical power distribution in memory, given as share x fraction of energy remaining: • Instruction cache: 60% x 100% • Data cache: 40% x 50% • Cache energy reduced by 20%. • Total processor power consumption: • Cache: 66% x 80% • Core: 33% x 71% • Total energy reduced by 24% when executing a 16 bit application.
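The cache and total figures can be checked the same way, again reading each entry as (share) x (fraction of energy remaining), which is an assumption. The variable names are illustrative.

```python
# Cache: (share of cache power, fraction of energy remaining), from the slide.
cache_remaining = 0.60 * 1.00 + 0.40 * 0.50
cache_reduction = 1.0 - cache_remaining

# Whole processor: cache and core shares with their remaining fractions.
total_remaining = 0.66 * 0.80 + 0.33 * 0.71
total_reduction = 1.0 - total_remaining

print(f"cache reduced by {cache_reduction:.0%}, total reduced by {total_reduction:.0%}")
```

The cache and core shares on the slide sum to 99% because of rounding. With exact shares of two thirds and one third, the same calculation gives a reduction of about 23%.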
Conclusions • Primary drawback is modification of ISA. • Energy savings are reasonable. • Our modifications are fairly easy to implement, and can be fit into existing processor designs with minimal area increase.
Where do we go from here? • More accurate capacitance models and SPICE simulation • More accurate models of instruction mix