David Culler Electrical Engineering and Computer Sciences University of California, Berkeley

EECS 150 - Components and Design Techniques for Digital Systems Lec 16 – Arithmetic II (Multiplication) David Culler Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~culler http://inst.eecs.berkeley.edu/~cs150

Overview • Review of Addition • Overflow • Multiplication • Further adder optimizations for multiplication • CLA in the large – parallel prefix

Review • Circuit design for unsigned addition • Full adder per bit slice • Delay limited by Carry Propagation • Ripple is algorithmically slow, but wires are short • Carry select • Simple, resource-intensive • Excellent layout • Carry look-ahead • Excellent asymptotic behavior • Great at the board level, but wire length effects are significant on chip • Digital number systems • How to represent negative numbers • Simple operations • Clean algorithmic properties • 2s complement is most widely used • Circuit for unsigned arithmetic • Subtract by complement and carry in • Overflow when cin xor cout of sign-bit is 1

Computer Number Systems • Positional notation • $D_{n-1} D_{n-2} \ldots D_0$ represents $D_{n-1}B^{n-1} + D_{n-2}B^{n-2} + \ldots + D_0 B^0$ where $D_i \in \{0, \ldots, B-1\}$ • 2s Complement • $D_{n-1} D_{n-2} \ldots D_0$ represents $-D_{n-1}2^{n-1} + D_{n-2}2^{n-2} + \ldots + D_0 2^0$ • MSB has negative weight [Figure: 4-bit 2s complement number wheel, 0000 (+0) through 0111 (+7) and 1000 (-8) through 1111 (-1); label: -8 + 5]
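As an illustration of the two definitions above, here is a minimal Python sketch. The function names `positional_value` and `twos_complement_value` are hypothetical, chosen only for this example.

```python
def positional_value(digits, base):
    """Value of digits D_{n-1} ... D_0 (most significant first) in the given base."""
    value = 0
    for d in digits:
        if not 0 <= d < base:
            raise ValueError("digit out of range for base")
        value = value * base + d
    return value


def twos_complement_value(bits):
    """Value of a 2s complement bit string such as '1011' (MSB first)."""
    n = len(bits)
    msb = int(bits[0])
    rest = int(bits[1:], 2) if n > 1 else 0
    # The MSB carries weight -2^(n-1); all other bits have their usual weight.
    return -msb * 2 ** (n - 1) + rest


if __name__ == "__main__":
    for pattern in ("0111", "1000", "1111", "1011"):
        print(pattern, twos_complement_value(pattern))
```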

2s Complement Overflow How can you tell an overflow occurred? Add two positive numbers to get a negative number or two negative numbers to get a positive number • 5 + 3 = -8! • -7 - 2 = +7! [Figure: two 4-bit 2s complement number wheels, one per example]

2s comp. Overflow Detection • 5 + 3: 0101 + 0011 = 1000 (-8); carry in to sign 1, carry out 0: Overflow • -7 + -2: 1001 + 1110 = 0111 (7); carry in to sign 0, carry out 1: Overflow • 5 + 2: 0101 + 0010 = 0111 (7); carry in to sign 0, carry out 0: No overflow • -3 + -5: 1101 + 1011 = 1000 (-8); carry in to sign 1, carry out 1: No overflow • Overflow occurs when carry in to sign does not equal carry out
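The carry-in versus carry-out rule can be checked with a small bit-level simulation. This is a sketch only; the function name `add_with_overflow` and the default width of 4 bits are assumptions made for the example.

```python
def add_with_overflow(a, b, n=4):
    """Ripple-add two n-bit 2s complement values; return (result bits, overflow flag)."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    carry = 0
    carry_into_sign = 0
    result = 0
    for i in range(n):
        if i == n - 1:
            carry_into_sign = carry  # carry entering the sign-bit position
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (ai & carry) | (bi & carry)
    # Overflow exactly when carry into the sign bit differs from carry out of it.
    return result, bool(carry_into_sign ^ carry)


if __name__ == "__main__":
    for x, y in ((5, 3), (-7, -2), (5, 2), (-3, -5)):
        bits, ovf = add_with_overflow(x, y)
        print(x, y, format(bits, "04b"), "overflow" if ovf else "no overflow")
```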

2s Complement Adder/Subtractor $A - B = A + (-B) = A + \overline{B} + 1$
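A minimal sketch of the adder/subtractor idea: one adder, with B complemented and the carry in set to 1 when subtracting. The name `add_sub` and the `subtract` control input are assumptions for illustration.

```python
def add_sub(a, b, subtract, n=4):
    """n-bit adder/subtractor: computes A + B, or A + ~B + 1 when subtract is true."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    b_in = (b ^ mask) if subtract else b  # XOR every bit of B with the control line
    carry_in = 1 if subtract else 0       # the "+ 1" comes in through the carry input
    return (a + b_in + carry_in) & mask


if __name__ == "__main__":
    print(format(add_sub(5, 3, True), "04b"))   # 5 - 3
    print(format(add_sub(3, 5, True), "04b"))   # 3 - 5, as a 4-bit 2s complement pattern
```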

Adders on the Xilinx Virtex • Dedicated carry logic provides fast arithmetic carry capability for high-speed arithmetic functions. • The Virtex-E CLB supports two separate carry chains, one per Slice. • The height of the carry chains is two bits per CLB. • The arithmetic logic includes an XOR gate and AND gate that allows a 2-bit full adder to be implemented within a slice. • Cin to Cout delay = 0.1ns, versus 0.4ns for F to X delay. • How do we map a 2-bit adder to one slice?

Time / Space (resource) Trade-offs • Carry select and CLA utilize more silicon to reduce time. • Can we use more time to reduce silicon? • How few FAs does it take to do addition?

Bit-serial Adder • Addition of 2 n-bit numbers takes n clock cycles and uses 1 FF and 1 FA cell, plus registers • The bit streams may come from or go to other circuits, so the registers may be optional • A, B, and R are held in shift registers, which shift right once per clock cycle (lsb first) • Requires a controller; reset is asserted by the controller • What does the FSM look like? Implemented? Final carry out?
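The bit-serial adder can be simulated cycle by cycle. The sketch below assumes one full adder, one carry flip-flop cleared at reset, and shift registers modeled as Python integers; `bit_serial_add` is a hypothetical name.

```python
def bit_serial_add(a, b, n):
    """Simulate n clock cycles of a bit-serial adder; return (R, final carry)."""
    carry = 0  # the single flip-flop, cleared when the controller asserts reset
    r = 0      # result shift register
    for _ in range(n):
        ai = a & 1  # lsb of each operand register feeds the full adder
        bi = b & 1
        s = ai ^ bi ^ carry
        carry = (ai & bi) | (carry & (ai ^ bi))
        # All three registers shift right once per cycle;
        # the new sum bit enters R at its msb end.
        a >>= 1
        b >>= 1
        r = (r >> 1) | (s << (n - 1))
    return r, carry


if __name__ == "__main__":
    print(bit_serial_add(5, 3, 4))
    print(bit_serial_add(13, 11, 4))
```

The final carry stays in the flip-flop after the last cycle, so a controller would have to capture it separately if an (n+1)-bit result is needed.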

Discussion • What is sign extension and why does it work? • Where is addition used in the project? • Where might you want more powerful arithmetic operations?

Announcements • Reading: 5.8 (4 pages!) • Digital Design in the news – from UCB • UC Berkeley is among six universities to be part of the program started by IBM Corp. and Google Inc. on college campuses to promote computer-programming techniques for clusters of processors known as "clouds". Cloud computing allows computers in remote data centers to run parallel, increasing their processing power. Each company will spend between $20 million and $25 million for hardware, software and services that can be used by computer-science professors and students.

Basic concept of multiplication • product of 2 n-bit numbers is a 2n-bit number • sum of n n-bit partial products • unsigned • Example: multiplicand 1101 (13) * multiplier 1011 (11) • Partial products, each shifted one position further left: 1101, 1101, 0000, 1101 • Product: 10001111 (143)
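The partial products for the 13 * 11 example can be generated as follows. This is a minimal sketch; `partial_products` is a hypothetical helper name.

```python
def partial_products(multiplicand, multiplier, n):
    """One partial product per multiplier bit: the multiplicand shifted left, or 0."""
    return [(multiplicand << i) if (multiplier >> i) & 1 else 0 for i in range(n)]


if __name__ == "__main__":
    pps = partial_products(0b1101, 0b1011, 4)
    for i, pp in enumerate(pps):
        print(f"bit {i}: {pp:08b}")
    print(f"sum:   {sum(pps):08b}")
```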

Combinational Multiplier: accumulation of partial products • S0 column: A0 B0 • S1 column: A1 B0, A0 B1 • S2 column: A2 B0, A1 B1, A0 B2 • S3 column: A3 B0, A2 B1, A1 B2, A0 B3 • S4 column: A3 B1, A2 B2, A1 B3 • S5 column: A3 B2, A2 B3 • S6 column: A3 B3 • S7: carry out of the S6 column

Array Multiplier Generates all n partial products simultaneously. Each row: n-bit adder with AND gates What is the critical path?
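A gate-level sketch of the array multiplier: each row forms one partial product with AND gates and adds it with an n-bit ripple adder. The name `array_multiply` and the list-of-bits representation are assumptions made for this example, not the circuit from the figure.

```python
def array_multiply(a, b, n=4):
    """Unsigned n x n array multiplier built from AND gates and full adders."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> j) & 1 for j in range(n)]
    acc = [0] * (2 * n)  # running sum, least significant bit first
    for j in range(n):
        row = [a_bits[i] & b_bits[j] for i in range(n)]  # AND gates of row j
        carry = 0
        for i in range(n):  # n-bit ripple adder for this row
            x, y = acc[i + j], row[i]
            acc[i + j] = x ^ y ^ carry
            carry = (x & y) | (carry & (x ^ y))
        acc[j + n] = carry  # row carry out becomes the next higher sum bit
    return sum(bit << k for k, bit in enumerate(acc))


if __name__ == "__main__":
    print(format(array_multiply(0b1101, 0b1011), "08b"))
```

In this sketch each row ripples its carry, so the longest path runs along a row and then down the rows, which is one way to approach the critical-path question.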

“Shift and Add” Multiplier • Sums each partial product, one at a time. • In binary, each partial product is a shifted version of A or 0. • Control Algorithm: 1. P ← 0, A ← multiplicand, B ← multiplier 2. If LSB of B == 1 then add A to P else add 0 3. Shift [P][B] right 1 4. Repeat steps 2 and 3 n-1 times. 5. [P][B] has product. • Cost ∝ n, T = n clock cycles. • What is the critical path for determining the min clock period?
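The control algorithm can be traced in software. The sketch below models P and B as separate integers and keeps the adder's carry out in P before the shift; `shift_and_add_multiply` is a hypothetical name.

```python
def shift_and_add_multiply(multiplicand, multiplier, n):
    """Unsigned shift-and-add multiply; returns the 2n-bit product held in [P][B]."""
    mask = (1 << n) - 1
    a = multiplicand & mask
    b = multiplier & mask
    p = 0
    for _ in range(n):  # steps 2 and 3, performed n times in total
        if b & 1:
            p += a  # the sum may be n+1 bits wide (adder carry out)
        # Shift [P][B] right by one: the lsb of P moves into the msb of B.
        b = (b >> 1) | ((p & 1) << (n - 1))
        p >>= 1
    return (p << n) | b


if __name__ == "__main__":
    print(format(shift_and_add_multiply(0b1101, 0b1011, 4), "08b"))
```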

Carry-save Addition • Speeding up multiplication is a matter of speeding up the summing of the partial products. • “Carry-save” addition can help. • Carry-save addition passes (saves) the carries to the output, rather than propagating them. • Example: sum three numbers, $3_{10} = 0011$, $2_{10} = 0010$, $3_{10} = 0011$ • carry-save add: 0011 + 0010 gives c = 0100 = $4_{10}$, s = 0001 = $1_{10}$ • carry-save add: c, s, and 0011 give c = 0010 = $2_{10}$, s = 0110 = $6_{10}$ • carry-propagate add: 0010 + 0110 = 1000 = $8_{10}$ • In general, carry-save addition takes in 3 numbers and produces 2. • Whereas, carry-propagate takes 2 and produces 1. • With this technique, we can avoid carry propagation until final addition
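A carry-save adder is a row of independent full adders, which can be sketched with bitwise operations. The name `carry_save_add` is hypothetical; the two calls in the demo follow the 3 + 2 + 3 example, with 0 as the unused third input of the first stage.

```python
def carry_save_add(x, y, z):
    """3-to-2 reduction: returns (sum word, carry word) with x + y + z == s + c."""
    s = x ^ y ^ z                             # per-bit sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1    # per-bit carry, weighted one position up
    return s, c


if __name__ == "__main__":
    s1, c1 = carry_save_add(0b0011, 0b0010, 0)     # first carry-save add
    s2, c2 = carry_save_add(c1, s1, 0b0011)        # second carry-save add
    total = s2 + c2                                # final carry-propagate add
    print(format(s1, "04b"), format(c1, "04b"))
    print(format(s2, "04b"), format(c2, "04b"))
    print(format(total, "04b"))
```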

Carry-save Circuits • When adding sets of numbers, carry-save can be used on all but the final sum. • Standard adder (carry propagate) is used for final sum.

Array Mult. using Carry-save Addition [Figure: carry-save array multiplier whose final row is a fast carry-propagate adder]

Another Representation • Building block: full adder + AND gate • 4 x 4 array of building blocks, followed by a CPA

Carry-save Addition • CSA is associative and commutative. For example: $(((X_0 + X_1) + X_2) + X_3) = ((X_0 + X_1) + (X_2 + X_3))$ • A balanced tree can be used to reduce the logic delay. • This structure is the basis of the Wallace Tree Multiplier. • Partial products are summed with the CSA tree. Fast CPA (ex: CLA) is used for final sum. • Multiplier delay $\propto \log_{3/2} N + \log_2 N$
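The tree reduction can be sketched by repeatedly grouping operands in threes until two remain, then doing one carry-propagate add. This is an illustrative grouping under that assumption, not necessarily the exact wiring of a Wallace tree; `csa_tree_sum` is a hypothetical name, and the level count it returns grows roughly as $\log_{3/2} N$.

```python
def carry_save_add(x, y, z):
    """3-to-2 reduction: x + y + z == s + c."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def csa_tree_sum(operands):
    """Reduce operands with levels of carry-save adders; return (sum, CSA levels used)."""
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        full = len(ops) - len(ops) % 3  # operands that fit into complete groups of three
        nxt = []
        for i in range(0, full, 3):
            s, c = carry_save_add(ops[i], ops[i + 1], ops[i + 2])
            nxt += [s, c]
        nxt += ops[full:]  # leftover operands pass through to the next level
        ops = nxt
        levels += 1
    return sum(ops), levels  # sum() stands in for the final fast carry-propagate adder


if __name__ == "__main__":
    print(csa_tree_sum([13, 26, 0, 104]))
```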

Signed Multiplier • Signed Multiplication: Remember for 2’s complement numbers MSB has negative weight: ex: $-6 = 11010_2 = 0 \cdot 2^0 + 1 \cdot 2^1 + 0 \cdot 2^2 + 1 \cdot 2^3 - 1 \cdot 2^4 = 0 + 2 + 0 + 8 - 16 = -6$ • Therefore for multiplication: a) subtract final partial product b) sign-extend partial products • Modifications to shift & add circuit: a) adder/subtractor b) sign-extender on P shifter register

Signed multiplication • product of 2 n-bit numbers is a 2n-bit number • sum of n n-bit partial products • Example: multiplicand 1101 (-3) * multiplier 1011 (-5) • Partial products, sign-extended to 8 bits (note: 2s complement): +(-3) = 11111101, +(-6) = 11111010, +0 = 00000000 • Final partial product is subtracted: -(-24), where -24 = 11101000 • Product: 00001111 (15)
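The two rules (sign-extend each partial product, subtract the final one) can be sketched as follows. Python integers are unbounded, so shifting a negative value stands in for sign extension here; the names `to_signed` and `signed_multiply` are hypothetical.

```python
def to_signed(x, n):
    """Interpret the low n bits of x as a 2s complement value."""
    x &= (1 << n) - 1
    return x - (1 << n) if (x >> (n - 1)) & 1 else x


def signed_multiply(a, b, n):
    """Multiply two n-bit 2s complement patterns; return the 2n-bit product pattern."""
    a_val = to_signed(a, n)  # sign-extended multiplicand
    product = 0
    for i in range(n):
        if (b >> i) & 1:
            pp = a_val << i  # partial product, implicitly sign-extended
            if i == n - 1:
                product -= pp  # multiplier sign bit has negative weight: subtract
            else:
                product += pp
    return product & ((1 << (2 * n)) - 1)


if __name__ == "__main__":
    print(format(signed_multiply(0b1101, 0b1011, 4), "08b"))
```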

Signed Array Multiplier [Figure: 4 x 4 array with inputs a3..a0 and b3..b0 and outputs P7..P0; sign extension is implicit; cells marked "-" subtract]

“Shift and Add” Signed Multiplier • Sign-extend partial product at each stage • Final step is a subtract

Carry Look-ahead Adders • In general, for n-bit addition best we can achieve is delay $\propto \log(n)$ • How do we arrange this? (think trees) • First, reformulate basic adder stage (inputs $a$, $b$, $c_i$; outputs $s$, $c_{i+1}$): • carry “kill” $k_i = \overline{a_i}\,\overline{b_i}$ • carry “propagate” $p_i = a_i \oplus b_i$ • carry “generate” $g_i = a_i b_i$ • $c_{i+1} = g_i + p_i c_i$ • $s_i = p_i \oplus c_i$
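The reformulated adder stage can be checked against an ordinary full adder. A minimal sketch; `adder_stage` is a hypothetical name and all signals are single bits (0 or 1).

```python
def adder_stage(a, b, c):
    """One adder bit in kill/propagate/generate form; returns (s, c_next, (k, p, g))."""
    k = (1 - a) & (1 - b)   # kill: neither input is 1, so no carry can leave
    p = a ^ b               # propagate: an incoming carry passes through
    g = a & b               # generate: a carry is produced regardless of c
    c_next = g | (p & c)
    s = p ^ c
    return s, c_next, (k, p, g)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                print(a, b, c, adder_stage(a, b, c))
```

For every input combination exactly one of k, p, g is 1, which is what makes the reformulation well defined.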

Carry Look-ahead Adders – in blocks • “Group” propagate and generate signals: • P true if the group as a whole propagates a carry to cout • G true if the group as a whole generates a carry • Group P and G can be generated hierarchically. • $P = p_i p_{i+1} \ldots p_{i+k}$ • $G = g_{i+k} + p_{i+k} g_{i+k-1} + \ldots + (p_{i+1} p_{i+2} \ldots p_{i+k}) g_i$ • $C_{out} = G + P C_{in}$
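Group P and G for bits i through i+k can be computed with a short loop, and the group carry out then follows from Cout = G + P Cin. A sketch only; `group_pg` is a hypothetical name and the loop evaluates the G expression from its innermost term outward.

```python
def group_pg(a, b, i, k):
    """Group propagate and generate over bit positions i .. i+k of operands a and b."""
    P, G = 1, 0
    for j in range(i, i + k + 1):
        aj = (a >> j) & 1
        bj = (b >> j) & 1
        p, g = aj ^ bj, aj & bj
        G = g | (p & G)   # a lower generate survives only if this bit propagates
        P = P & p         # the group propagates only if every bit propagates
    return P, G


if __name__ == "__main__":
    P, G = group_pg(0b101, 0b011, 0, 2)
    cin = 0
    print(P, G, G | (P & cin))
```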

Carry Look-ahead Adders • 9-bit Example of hierarchically generated P and G signals: • Block a (inputs a0..a2, b0..b2) produces Pa, Ga; block b (a3..a5, b3..b5) produces Pb, Gb; block c (a6..a8, b6..b8) produces Pc, Gc • $c_3 = G_a + P_a c_0$ • $c_6 = G_b + P_b c_3$ • $P = P_a P_b P_c$ • $G = G_c + P_c G_b + P_b P_c G_a$ • $c_9 = G + P c_0$

Parallel Prefix (generalizing CLA) • Compute all the prefixes $F_i = F_{i-1} \;\mathrm{op}\; F_{i-2} \;\mathrm{op}\; \ldots \;\mathrm{op}\; F_0$ • Assume associative and commutative [Figure: prefix network over 8 positions for an A x B example; nodes are labeled with index ranges such as 1:0, 3:2, 5:4, 7:6, then 3:0, 7:4, then 7:0]
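One way to compute all prefixes in a logarithmic number of steps is a doubling scan. The sketch below uses a Kogge-Stone-style schedule, which is an assumption chosen for brevity and may differ from the network in the figure; `parallel_prefix` is a hypothetical name. Every position in one step could be evaluated in parallel in hardware.

```python
def parallel_prefix(values, op):
    """Return ([x_i op x_{i-1} op ... op x_0 for each i], number of steps)."""
    out = list(values)
    n = len(out)
    d = 1
    steps = 0
    while d < n:
        # Position i combines its own span with the span ending d positions below.
        out = [out[i] if i < d else op(out[i], out[i - d]) for i in range(n)]
        d *= 2
        steps += 1
    return out, steps


if __name__ == "__main__":
    print(parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8], lambda hi, lo: hi + lo))
```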

8-bit Carry Look-ahead Adder • Leaf cell (inputs $a_i$, $b_i$, $c_i$): $p = a \oplus b$, $g = ab$, $s = p \oplus c_i$, $c_{i+1} = g + c_i p$ • Internal node combining a lower block (Pa, Ga) and an upper block (Pb, Gb): $P = P_a P_b$, $G = G_b + G_a P_b$, $C_{out} = G + c_{in} P$ [Figure: tree of P,G nodes over inputs a0..a7, b0..b7 with carry in c0, producing sums s0..s7 and carries c1..c8]
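The leaf and internal-node equations can be put together into a small software model of the adder. This is a sketch: `combine` and `cla_add` are hypothetical names, and the recursion recomputes subtrees that real hardware would share between carries.

```python
def combine(upper, lower):
    """Merge (P, G) of an upper block b with a lower block a: P = PaPb, G = Gb + GaPb."""
    Pb, Gb = upper
    Pa, Ga = lower
    return Pa & Pb, Gb | (Ga & Pb)


def cla_add(a, b, c0=0, n=8):
    """n-bit add using tree-combined propagate/generate; returns (sum, carry out)."""
    pg = [(((a >> i) ^ (b >> i)) & 1, ((a >> i) & (b >> i)) & 1) for i in range(n)]

    def group(lo, hi):
        """(P, G) for bits lo..hi, built as a balanced tree of combine nodes."""
        if lo == hi:
            return pg[lo]
        mid = (lo + hi) // 2
        return combine(group(mid + 1, hi), group(lo, mid))

    carries = [c0]
    for i in range(n):
        P, G = group(0, i)
        carries.append(G | (P & c0))  # c_{i+1} = G + c0 P over bits 0..i
    s = 0
    for i in range(n):
        s |= (pg[i][0] ^ carries[i]) << i  # s_i = p_i xor c_i
    return s, carries[n]


if __name__ == "__main__":
    print(cla_add(0b10110101, 0b01101011))
```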

Summary • 2s complement number systems • Algebraic and corresponding bit manipulations • Overflow detection • Significance of “sign bit”: weight $-2^{n-1}$ • Carry look-ahead is a form of parallel prefix • Time / Space tradeoffs • Bit serial adder • Binary Multiplication algorithm • Array multiplier • Serial multiply (with bit parallel adder) • Signed multiplication • Sign extend multiplicand • Sign bit of multiplier treated as subtract
