Optimizing Multipliers for the CPU: A ROM based approach

Michael Moeng, Jason Wei
Electrical Engineering and Computer Science, University of California, Berkeley

Problem
• Many power-limited applications for CPUs
  • Media/graphics
  • Portable applications
• Investigating the impact of different multiplier designs on CPU power and performance:
  • SimpleScalar to model the CPU and benchmarks
  • Modify SimpleScalar multiplier cycle times to model different multiplier architectures

Array Multipliers
• AND gates multiply individual bits to form the partial products
• Critical path lies in the carry chain
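The array structure above can be sketched in software. This is a minimal illustration, not the authors' hardware model; the function name `array_multiply` and the `width` parameter are assumptions made for the example. Each row is the multiplicand ANDed with one multiplier bit, and the rows are summed with the appropriate shift.

```python
def array_multiply(a: int, b: int, width: int = 8) -> int:
    """Unsigned multiply built from AND-gated partial-product rows.

    Hypothetical helper for illustration; `width` is the operand size in bits.
    """
    mask = (1 << width) - 1
    a, b = a & mask, b & mask
    product = 0
    for i in range(width):              # one row per multiplier bit
        b_bit = (b >> i) & 1
        row = 0
        for j in range(width):          # one AND gate per multiplicand bit
            a_bit = (a >> j) & 1
            row |= (a_bit & b_bit) << j
        # In hardware each row feeds a chain of adders, which is where the
        # carry-chain critical path comes from. Here the carries are hidden
        # inside Python's integer addition.
        product += row << i
    return product
```

For example, `array_multiply(13, 11)` returns 143.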

Wallace Multipliers
• Critical path shortened
• Final adder still needed to combine the reduced partial products
• Power consumption approximately the same as the array multiplier
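A rough software sketch of the Wallace idea, under the assumption that rows are reduced three at a time with carry-save (3:2) compressors until two remain. The names `carry_save_add` and `wallace_multiply` are hypothetical, and the grouping order is a simplification of a real Wallace tree.

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compressor applied bitwise: x + y + z == s + c, with no carry rippling."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def wallace_multiply(a: int, b: int, width: int = 8) -> int:
    """Unsigned multiply: reduce partial-product rows in carry-save form."""
    # One shifted copy of `a` for every set bit of `b` (assumes 0 <= a, b < 2**width).
    rows = [a << i for i in range(width) if (b >> i) & 1]
    while len(rows) > 2:
        reduced = []
        full_groups = len(rows) // 3
        for g in range(full_groups):
            s, c = carry_save_add(*rows[3 * g:3 * g + 3])
            reduced += [s, c]
        reduced += rows[3 * full_groups:]   # leftover rows pass through unchanged
        rows = reduced
    # The final carry-propagate adder combines the last two rows.
    return sum(rows)
```

Each pass through the loop corresponds to one layer of compressors, so the number of layers grows slowly with the number of rows, while the last `sum` stands in for the final adder.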

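Modified Booth Representation
• 3 bits examined at a time ($y_{i+1}, y_i, y_{i-1}$), traversing even values of $i$
• Reduces the number of partial products by half
• However, overhead is required to generate the select signals and MUXes
• $y_{-1} = 0$
• Examples (recoded digits, most significant first):
  • 1 1 1 1 [0] -> 0, -1
  • 0 1 1 0 [0] -> 2, -2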
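The recoding can be checked with a short sketch. It assumes the standard radix-4 digit rule $d = y_{i-1} + y_i - 2y_{i+1}$ for even $i$, a two's-complement multiplier, and an even bit width; the function names are hypothetical.

```python
def booth_radix4_digits(y: int, width: int) -> list:
    """Recode a width-bit two's-complement value into radix-4 Booth digits.

    Digits are returned least significant first, each in {-2, -1, 0, 1, 2}.
    """
    if width % 2:
        raise ValueError("width must be even for radix-4 recoding")

    def bit(i: int) -> int:
        return 0 if i < 0 else (y >> i) & 1   # y_{-1} = 0

    return [bit(i - 1) + bit(i) - 2 * bit(i + 1) for i in range(0, width, 2)]


def booth_multiply(x: int, y: int, width: int) -> int:
    """Multiply x by the signed value of y using one partial product per digit."""
    digits = booth_radix4_digits(y, width)
    return sum(d * (x << (2 * k)) for k, d in enumerate(digits))


# The two examples from the slide, listed least significant digit first:
print(booth_radix4_digits(0b1111, 4))  # [-1, 0]  (1111 is -1 in 4-bit two's complement)
print(booth_radix4_digits(0b0110, 4))  # [-2, 2]  (2*4 - 2 = 6)
```

A 4-bit multiplier yields two digits instead of four rows, which is the halving of partial products noted above.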

Read Only Memory
• Desirable because of low power requirements
• Drawbacks stem from read delay and size
• 240 MHz -> 4.2 ns delay
• Consumes 3.24 mW at 100 MHz (10 ns delay)

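ROM-based multipliers
• ROM-based multipliers are attractive
• Issue of space
  • A 32-bit multiplier requires $2^{32} \times 2^{32} \times 64$ bits (unrealistic)
• Techniques to reduce table sizes
• Karatsuba Algorithm:
  • $A = A_{31:16} \cdot 2^{16} + A_{15:0}$, $B = B_{31:16} \cdot 2^{16} + B_{15:0}$
  • $A \cdot B = (A_{31:16}B_{31:16} \ll 32) + (A_{15:0}B_{31:16} \ll 16) + (A_{31:16}B_{15:0} \ll 16) + A_{15:0}B_{15:0}$
  • Reduces table size to $2^{16} \times 2^{16} \times 32$ bits, but requires 4 lookups and 3 additions. (As written this is the four-product split; Karatsuba's full method trades one product for extra additions.)
  • Even with multiple tables for parallel lookups, the total is still far fewer bits than the full lookup table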
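The split into half-width lookups can be sketched as follows. To keep the table small enough to build in memory, this example assumes 8-bit operands with 4-bit halves rather than the 32-bit/16-bit case on the slide; `HALF`, `ROM`, `split_multiply`, and `table_bits` are names made up for the illustration.

```python
HALF = 4                       # assumption: scaled down from the slide's 16-bit halves
MASK = (1 << HALF) - 1

# Product ROM indexed by two HALF-bit operands; each entry fits in 2*HALF bits.
ROM = [[a * b for b in range(1 << HALF)] for a in range(1 << HALF)]


def split_multiply(a: int, b: int) -> int:
    """Multiply two (2*HALF)-bit values with 4 lookups and 3 additions."""
    a_hi, a_lo = a >> HALF, a & MASK
    b_hi, b_lo = b >> HALF, b & MASK
    hh = ROM[a_hi][b_hi]
    lh = ROM[a_lo][b_hi]
    hl = ROM[a_hi][b_lo]
    ll = ROM[a_lo][b_lo]
    return (hh << (2 * HALF)) + (lh << HALF) + (hl << HALF) + ll


def table_bits(operand_bits: int) -> int:
    """Size in bits of a full product table for two operands of the given width."""
    return (1 << operand_bits) * (1 << operand_bits) * (2 * operand_bits)
```

With these definitions, `table_bits(32)` is $2^{70}$ bits and `table_bits(16)` is $2^{37}$ bits, so even four copies of the smaller table stay far below the full 32-bit table.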

ROM-based multipliers cont.
• Vinnakota’s approach: use tables of squares
  • Let $x = \lfloor (A + B)/2 \rfloor$ and $y = \lfloor (A - B)/2 \rfloor$
  • If $A_0 \oplus B_0 = 0$: $A \cdot B = x^2 - y^2$
  • If $A_0 \oplus B_0 = 1$: $A \cdot B = x^2 - y^2 + B$
  • Reduces table size to $2^{32} \times 64$ bits, further reducible with split tables (introduced later); requires 2 table lookups and 3 (or 4) additions
• Hybrid approach:
  • Use tables of squares to find the partial products for the Karatsuba algorithm
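The table-of-squares identity can be verified with a small sketch. It assumes unsigned 8-bit operands so the table is small, uses true floor division for $y$ (which may be negative), and looks up $y^2$ by absolute value; `N`, `SQUARES`, and `square_table_multiply` are hypothetical names.

```python
N = 8                                        # assumption: operand width for the demo
SQUARES = [v * v for v in range(1 << N)]     # 2^N entries, one square per index


def square_table_multiply(a: int, b: int) -> int:
    """Multiply via two square lookups: a*b = x^2 - y^2 (+ b if a+b is odd)."""
    x = (a + b) // 2             # floor((A + B) / 2)
    y = (a - b) // 2             # floor((A - B) / 2), negative when a < b
    result = SQUARES[x] - SQUARES[abs(y)]    # y^2 does not depend on the sign of y
    if (a ^ b) & 1:              # A_0 xor B_0 = 1: the floors dropped a term equal to B
        result += b
    return result
```

For example, with `a = 5` and `b = 2`: `x = 3`, `y = 1`, so `9 - 1 = 8`, and the odd-sum correction adds `b` to give 10.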

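Proposed Implementation
[Block diagram: inputs $A = A_1A_0$, $B = B_1B_0$ -> $x_{11}, y_{11}, \dots$ -> two $2^{16} \times 32$-bit ROMs -> $x_{11}^2, y_{11}^2, \dots$ -> $A_1 \cdot B_1$, $A_1 \cdot B_0$, …]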
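One way to read the data flow in the diagram is sketched below: each 16-bit partial product comes from a table of squares, and the four partial products are then shifted and added. This is an interpretation under stated assumptions, not the authors' design; a single Python list stands in for the ROM contents, and all names are hypothetical.

```python
HALF = 16
MASK = (1 << HALF) - 1

# Squares of every 16-bit value; the largest entry (65535^2) fits in 32 bits.
SQUARE_ROM = [v * v for v in range(1 << HALF)]


def partial_product(a: int, b: int) -> int:
    """16-bit x 16-bit product from two square lookups."""
    x = (a + b) // 2
    y = abs((a - b) // 2)        # floor first, then index by magnitude
    p = SQUARE_ROM[x] - SQUARE_ROM[y]
    return p + b if (a ^ b) & 1 else p


def hybrid_multiply(a: int, b: int) -> int:
    """32-bit x 32-bit unsigned multiply from four table-based partial products."""
    a1, a0 = a >> HALF, a & MASK
    b1, b0 = b >> HALF, b & MASK
    return ((partial_product(a1, b1) << 32)
            + (partial_product(a0, b1) << 16)
            + (partial_product(a1, b0) << 16)
            + partial_product(a0, b0))
```

In hardware the two lookups per partial product could go to separate ROM copies in parallel, which may be why the diagram shows two ROMs; the sketch performs them sequentially.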

Results
• Most of the SPEC2000 benchmarks exhibited little or no performance loss (< 0.5%) from extra multiplier cycles: art, bzip*, gcc, gzip*, ijpeg, li, mcf, mesa, parser*, vpr
  • : Significant
  • * : Possibly significant
• Applications that did experience a drop in performance from the extra cycles:
  • go.outorder (6.41%): Go-playing program
  • m88ksim (5.39%): chip simulator
  • perl (0.72%): Perl interpreter
  • vortex (2.33%): object-oriented database

Further Work
• Measurements:
  • Accurate power measurements
  • More specific benchmarks targeting multimedia
• Optimizations:
  • Tables: Vinnakota’s split-table work
    • If $A$ and $B$ share their lower $k$ bits, $A^2$ and $B^2$ share their lower $k+1$ bits.
    • Can change a $2^N \times N$ table into a $2^N \times (N-[k+1])$ table and a $2^k \times (k+1)$ table.
    • Gives somewhat faster lookups and lower memory requirements.
  • Adders:
    • Adders can be optimized; the final 64-bit additions are more like 48-bit additions.
  • Pipelining: multiplication operations can be pipelined in up to 3 stages.
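The split-table property can be illustrated with a small sketch: because values that agree in their low $k$ bits have squares that agree in their low $k+1$ bits, those bits can be stored once in a small table indexed by the low $k$ bits. The widths `N = 8` and `K = 3` and all names are assumptions chosen for the demo, and the entry widths here follow the squares themselves, not the table dimensions quoted on the slide.

```python
N, K = 8, 3                      # assumption: 8-bit inputs, 3 shared low bits (K >= 1)
LOW_MASK = (1 << K) - 1
SQ_LOW_MASK = (1 << (K + 1)) - 1

# Small table: low K+1 bits of the square, indexed by the low K bits of the input.
LOW_ROM = [(v * v) & SQ_LOW_MASK for v in range(1 << K)]

# Large table: the remaining high bits, so each entry is K+1 bits narrower.
HIGH_ROM = [(v * v) >> (K + 1) for v in range(1 << N)]


def split_square(v: int) -> int:
    """Reassemble v^2 from the narrowed large table and the small shared table."""
    return (HIGH_ROM[v] << (K + 1)) | LOW_ROM[v & LOW_MASK]
```

The saving is $K+1$ bits per entry across all $2^N$ entries of the large table, at the cost of a second table with only $2^K$ entries.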
