CprE / ComS 583 Reconfigurable Computing

CprE / ComS 583Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #5 – FPGA Arithmetic

Quick Points • HW #1 due Thursday at 12:00pm • Any comments? CprE 583 – Reconfigurable Computing

Recap • Cluster size of N = [6-8] is good, K = [4-5] CprE 583 – Reconfigurable Computing

LUT Mapping Techniques CprE 583 – Reconfigurable Computing

LUT Mapping Techniques (cont.) CprE 583 – Reconfigurable Computing

Outline • Recap • Motivation • Carry / Cascade Logic • Addition • Ripple Carry • Carry Bypass • Carry Select • Carry Lookahead • Basic Multipliers CprE 583 – Reconfigurable Computing

(1, 2) (1) A B A B Op ALU 2-LUT Out Out Motivation • Traditional microprocessors, DSPs, etc. don’t use LUTs • Instead use a w-bit Arithmetic and Logic Unit (ALU) • Carry connections are hard-wired • No switches, no stubs, short wires (2) (1) AND2 OR2 XOR2 A B Cin 3-LUT 3-LUT Sum Cout / Cin A B (2) ADD SUB CMP 3-LUT 3-LUT Sum Cout CprE 583 – Reconfigurable Computing

Adder Delays • Assuming a ripple-carry adder: • 32-bit ALU delay – 6ns • 32-cascaded 4-LUTs – 32 x 2.5ns = 80ns • Motivates extra hardware to accelerate carry operations Compare: 32-bit ALU (0.6λ) 4-LUT delay 16 ns Area optimized 2.0 ns Logic delay 6 ns Delay optimized 0.5 ns Single channel delay 2.5 ns per bit CprE 583 – Reconfigurable Computing

Altera Flex 8000 Carry Chain CprE 583 – Reconfigurable Computing

Xilinx XC4000 Carry Chain CprE 583 – Reconfigurable Computing

Cascades • Large fanin operations (reductions): • Decoding • Matching • Completion detection • Many-to-one reductions • Combining logic is simple • Makes use of dedicated paths CprE 583 – Reconfigurable Computing

Altera Cascade Logic • LE delay – 2.4 ns • Cascade delay – 0.6 ns CprE 583 – Reconfigurable Computing

Why Look at Arithmetic? • Parallelization • Specialization • Architecture • Size • Inputs • Adder problem – delay grows linearly with bit width • Solutions for larger adders: • Pipelining • Carry bypass • Carry select CprE 583 – Reconfigurable Computing

Adder Pipelining • Not as practical in ASIC world (registers are expensive) • Registers essentially “free” in FPGA logic blocks A1 B1 A2 B2 A3 B3 + Cout S1 + Cout S2 + S3 Cout CprE 583 – Reconfigurable Computing

Carry Bypass Adders • If all the propagates are 1 (P0P1P2P3 = 1) then Cout3 = Cin • Pi = Ai xor Bi • Skip all the carry logic • Inexpensive A0 B0 A1 B1 A2 B2 A3 B3 Cout0 Cout1 Cout2 + + + + Cin Cout3 0 1 P0P1P2P3 CprE 583 – Reconfigurable Computing

Carry Bypass Performance • Small hardware cost: • 16-bit add – 4 CLB overhead • 32-bit add – 9 CLB overhead • Delay growth still linear, smaller slope Ripple Carry Carry Bypass Delay N-bit Adder CprE 583 – Reconfigurable Computing

Carry Select Adders • Precompute addition value for (Cin = 0) case and (Cin = 1) case • Use mux to select between two with actual Cin value • Cost of this approach? A0 B0 A1 B1 A2 B2 A3 B3 + + + + Cin A4 B4 A5 B5 A6 B6 A7 B7 + + + + 0 0 S4-7 1 + + + + 1 A4 B4 A5 B5 A6 B6 A7 B7 CprE 583 – Reconfigurable Computing

1 1 1 1 0 0 0 0 Linear Carry-Select • Adder delay = w, mux delay = 1 A31-24 A23-16 A15-8 A7-0 B31-24 B23-16 B15-8 B7-0 + + + + 0 0 0 0 + + + + 1 1 1 1 t8 t8 t8 t8 t8 t8 t8 t8 t11 t10 t9 t12 S31-24 S23-16 S15-8 S7-0 CprE 583 – Reconfigurable Computing

1 1 1 1 1 1 0 0 0 0 0 0 Square-Root Carry Select • Each carry arrives when the corresponding sum is ready A31-30 A29-22 A21-15 A14-9 A8-4 A3-0 B31-30 B29-22 B21-15 B14-9 B8-4 B3-0 + + + + + + 0 0 0 0 0 0 + + + + + + 1 1 1 1 1 1 t8 t8 t7 t7 t6 t6 t5 t5 t4 t4 t9 t8 t7 t6 t5 t10 S31-30 S29-22 S21-15 S14-9 S8-4 S3-0 CprE 583 – Reconfigurable Computing

A0 A2 A3 A1 HA HA C3 S0 S2 S3 S1 Constant Addition • If one operand is constant: • More speed? • Less hardware? A0 0 1 0 1 A1 A2 A3 HA FA FA FA C3 S0 S1 S2 S3 CprE 583 – Reconfigurable Computing

Multiplication • Shift and add operations • Need N bit adder, M cycles 42 = 101010 Multiplicand (N bits) x 10 = x 1010 Multiplier (M bits) 000000 101010 Partial products 000000 + 101010 420 = 111001110 Product (N+M bits) CprE 583 – Reconfigurable Computing

X3 X2 X1 X0 Array Multiplier Y0 • Area – N x M cells • Delay – O(N+M) Y1 X2 X1 X3 X0 + + + + Y2 X2 X1 X3 X0 + + + + Y3 X2 X1 X3 X0 + + + + Z2 Z1 Z0 CprE 583 – Reconfigurable Computing

X3 X2 X1 X0 Carry-Save Y0 Y1 X2 X1 X3 X0 + + + + Y2 + + + + Y3 + + + + Z2 Z1 Z0 CprE 583 – Reconfigurable Computing

Multiplier Pipelining • Register cost: • Multiplicand – (N bits/stage x M stages) • Multiplier – (M2 + M) / 2 bits • Early output values – (M2 + M) / 2 bits • Total – M x (N + M + 1) bits • Critical path = max: • DFF + FA + setup • Bottom-level adder CprE 583 – Reconfigurable Computing

Constant Multiplier Y0=0 • Can greatly reduce the number of adders • Removes all and gates Y1=1 X2 X1 X3 X0 Y2=0 Y3=1 X2 X1 X3 X0 + + + + Z2 Z1 Z0 CprE 583 – Reconfigurable Computing

LUT-based Constant Multipliers • k-LUT can perform constant multiply of k-bit number • Break operand into k-bit quantities • Example: 8-bit x 8-bit constant, k=4 10101011 x NNNNNNNN AAAAAAAAAAA (N * 1011 (LSN)) + BBBBBBBBBBBBBBB (N * 1010 (MSN)) SSSSSSSSSSSSSSS Product CprE 583 – Reconfigurable Computing

Summary • Latency overhead of programmable logic • Several approaches to reducing design latency: • Fast carry • Cascade • Hardwired connections • Multiplier optimization goals different from adder • Other techniques: • Logarithmic v. linear (Wallace Tree multiplier) • Data encoding (Booth’s multiplier) CprE 583 – Reconfigurable Computing

CprE / ComS 583 Reconfigurable Computing