This lecture recaps reconfigurable computing and LUT mapping techniques, then surveys adder and multiplier designs for FPGA arithmetic. It also discusses the performance and cost of carry bypass, carry select, pipelining, and related techniques.
CprE / ComS 583 Reconfigurable Computing, Prof. Joseph Zambreno, Department of Electrical and Computer Engineering, Iowa State University. Lecture #5 – FPGA Arithmetic
Quick Points • HW #1 due Thursday at 12:00pm • Any comments? CprE 583 – Reconfigurable Computing
Recap • Cluster size of N = 6-8 works well, with LUT size K = 4-5
LUT Mapping Techniques
LUT Mapping Techniques (cont.)
LUT Mapping Techniques (cont.)
Outline • Recap • Motivation • Carry / Cascade Logic • Addition • Ripple Carry • Carry Bypass • Carry Select • Carry Lookahead • Basic Multipliers
Motivation • Traditional microprocessors, DSPs, etc. don’t use LUTs • Instead they use a w-bit Arithmetic and Logic Unit (ALU) • Carry connections are hard-wired • No switches, no stubs, short wires [Figure: an ALU with inputs A, B, and Op compared with LUT mappings. (1) AND2, OR2, XOR2 fit in one 2-LUT per bit. (2) ADD, SUB, CMP take two 3-LUTs per bit, one producing Sum and one producing Cout from A, B, and Cin.]
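As a rough illustration of the second mapping on this slide, one bit of ADD can be held in two 3-LUTs, one for Sum and one for Cout. The sketch below models a 3-LUT as an 8-entry truth table. The names `make_lut3`, `lut3`, `SUM_LUT`, `COUT_LUT`, and `ripple_add` are made up for this example and do not come from the lecture.

```python
def make_lut3(func):
    """Build the 8-entry truth table of a 3-input Boolean function."""
    return [func((i >> 2) & 1, (i >> 1) & 1, i & 1) for i in range(8)]


def lut3(table, a, b, c):
    """Read a 3-LUT: the three input bits form the table address."""
    return table[(a << 2) | (b << 1) | c]


# Two LUT configurations per bit position: one for Sum, one for Cout.
SUM_LUT = make_lut3(lambda a, b, cin: a ^ b ^ cin)
COUT_LUT = make_lut3(lambda a, b, cin: (a & b) | (a & cin) | (b & cin))


def ripple_add(a, b, width, cin=0):
    """Ripple-carry add built only from LUT reads. Returns (sum, carry_out)."""
    carry = cin
    total = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        total |= lut3(SUM_LUT, ai, bi, carry) << i
        carry = lut3(COUT_LUT, ai, bi, carry)
    return total, carry
```

Each loop iteration stands for one bit slice, so the carry passes through `width` LUT reads in sequence. That serial path is the delay the later slides try to shorten.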
Adder Delays • Assuming a ripple-carry adder: • 32-bit ALU delay – 6 ns • 32 cascaded 4-LUTs – 32 x 2.5 ns = 80 ns • Motivates extra hardware to accelerate carry operations • Compare: 32-bit ALU (0.6λ): 16 ns area optimized, 6 ns delay optimized • 4-LUT: 2.0 ns logic delay + 0.5 ns single channel delay = 2.5 ns per bit
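The arithmetic behind this comparison can be checked with a few lines. This is a back-of-the-envelope sketch that uses only the figures on the slide. The constant and function names are invented for the example.

```python
# Figures taken from the slide; the names are illustrative only.
ALU_DELAY_NS = 6.0     # 32-bit ALU, delay optimized
LUT_LOGIC_NS = 2.0     # 4-LUT logic delay
CHANNEL_NS = 0.5       # single channel (routing) delay between LUTs


def lut_ripple_delay_ns(bits):
    """Ripple-carry delay when every bit costs one LUT plus one channel hop."""
    return bits * (LUT_LOGIC_NS + CHANNEL_NS)


def slowdown_vs_alu(bits=32):
    """How many times slower the LUT ripple adder is than the ALU."""
    return lut_ripple_delay_ns(bits) / ALU_DELAY_NS
```

For 32 bits this gives 80 ns, a bit more than 13 times the ALU delay.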
Altera Flex 8000 Carry Chain
Xilinx XC4000 Carry Chain
Cascades • Large fan-in operations (reductions): • Decoding • Matching • Completion detection • Many-to-one reductions • Combining logic is simple • Makes use of dedicated paths
Altera Cascade Logic • LE delay – 2.4 ns • Cascade delay – 0.6 ns
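To get a feel for these two numbers, here is a hedged estimate of the delay of a wide AND built on the cascade chain. It assumes 4-input logic elements, that the first LE costs the full LE delay, and that each further LE on the chain adds only the cascade delay. Those assumptions and the function name are not from the slide, and routing outside the chain is ignored.

```python
LE_DELAY_NS = 2.4        # from the slide
CASCADE_DELAY_NS = 0.6   # from the slide


def cascade_delay_ns(fan_in, lut_inputs=4):
    """Estimated delay of a fan_in-input reduction on a cascade chain.

    Assumption: the first LE pays LE_DELAY_NS and every extra LE on the
    chain pays CASCADE_DELAY_NS.
    """
    if fan_in < 1:
        raise ValueError("fan_in must be at least 1")
    les = -(-fan_in // lut_inputs)  # ceiling division
    return LE_DELAY_NS + (les - 1) * CASCADE_DELAY_NS
```

Under this model a 16-input AND uses four LEs and takes about 4.2 ns.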
Why Look at Arithmetic? • Parallelization • Specialization • Architecture • Size • Inputs • Adder problem – delay grows linearly with bit width • Solutions for larger adders: • Pipelining • Carry bypass • Carry select
Adder Pipelining • Not as practical in ASIC world (registers are expensive) • Registers essentially “free” in FPGA logic blocks [Figure: three adder stages computing S1 = A1 + B1, S2 = A2 + B2, and S3 = A3 + B3, with each stage's Cout feeding the next stage.]
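A minimal behavioral sketch of the pipelined adder follows. It assumes the operands are split into equal chunks and that a register holds the carry between stages. The function name and the equal-chunk restriction are choices made for this example, not details from the slide.

```python
def pipelined_add(a, b, width, stages):
    """Add two width-bit numbers one chunk per clock cycle.

    Returns (sum, carry_out, cycles). carry_reg models the register that
    holds Cout between one stage and the next.
    """
    if width % stages != 0:
        raise ValueError("width must be a multiple of stages")
    chunk = width // stages
    mask = (1 << chunk) - 1
    carry_reg = 0
    sum_reg = 0
    for stage in range(stages):
        shift = stage * chunk
        part = ((a >> shift) & mask) + ((b >> shift) & mask) + carry_reg
        sum_reg |= (part & mask) << shift
        carry_reg = part >> chunk
    return sum_reg, carry_reg, stages
```

The model tracks only the latency of one addition. In hardware each stage would hold a different addition on every cycle, so throughput stays at one result per clock while the clock period shrinks to the delay of a single chunk.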
Carry Bypass Adders • If all the propagates are 1 (P0P1P2P3 = 1) then Cout3 = Cin • Pi = Ai xor Bi • Skip all the carry logic • Inexpensive [Figure: four chained full adders (A0/B0 to A3/B3, carries Cout0 to Cout2) followed by a 2:1 mux, controlled by P0P1P2P3, that selects either the rippled Cout3 or Cin.]
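The bypass rule can be sketched as a small behavioral model. The function names and the LSB-first bit lists are conventions picked for this example.

```python
def full_adder(a, b, cin):
    """One-bit full adder. Returns (sum, cout)."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def bypass_block(a_bits, b_bits, cin):
    """One carry-bypass block; bits are given LSB first.

    Returns (sum_bits, cout, bypassed). When every propagate bit
    Pi = Ai xor Bi is 1, the output mux forwards Cin directly as the
    block carry-out instead of waiting for the ripple chain.
    """
    propagates = [a ^ b for a, b in zip(a_bits, b_bits)]
    carry = cin
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    bypassed = all(propagates)
    cout = cin if bypassed else carry
    return sum_bits, cout, bypassed
```

The mux never changes the result, because when all propagates are 1 the rippled carry-out equals Cin anyway. Only the time at which the carry becomes available improves.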
Carry Bypass Performance • Small hardware cost: • 16-bit add – 4 CLB overhead • 32-bit add – 9 CLB overhead • Delay growth still linear, smaller slope [Plot: delay versus N-bit adder width for ripple carry and carry bypass.]
Carry Select Adders • Precompute the sum for both the Cin = 0 case and the Cin = 1 case • Use a mux to select between the two with the actual Cin value • Cost of this approach? [Figure: 8-bit adder. Bits 0-3 are added with the real Cin; bits 4-7 are added twice, once with carry-in 0 and once with carry-in 1, and a mux driven by the lower block's carry-out selects S4-7.]
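A behavioral sketch of the 8-bit carry-select arrangement follows. The function name and the fixed 4 + 4 split are choices for this example.

```python
def carry_select_add8(a, b, cin=0):
    """8-bit carry-select add. Returns (sum, carry_out)."""
    lo_a, lo_b = a & 0xF, b & 0xF
    hi_a, hi_b = (a >> 4) & 0xF, (b >> 4) & 0xF

    # Lower block uses the real carry-in.
    lo = lo_a + lo_b + cin
    lo_sum, lo_cout = lo & 0xF, lo >> 4

    # Upper block is computed twice, in parallel with the lower block.
    hi_if_0 = hi_a + hi_b        # assumes carry-in 0
    hi_if_1 = hi_a + hi_b + 1    # assumes carry-in 1

    # The mux picks the precomputed result once lo_cout is known.
    hi = hi_if_1 if lo_cout else hi_if_0
    return ((hi & 0xF) << 4) | lo_sum, hi >> 4
```

As for the cost question on the slide: the upper block needs two adders instead of one, plus a mux on each sum bit and on the carry-out.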
Linear Carry-Select • Adder delay = w, mux delay = 1 [Figure: 32-bit carry-select adder in four 8-bit blocks (bits 7-0, 15-8, 23-16, 31-24), each with one adder for carry-in 0 and one for carry-in 1. All adders finish at t8; the select muxes settle at t9, t10, t11, and t12, from the least to the most significant block.]
Square-Root Carry Select • Each carry arrives when the corresponding sum is ready [Figure: 32-bit carry-select adder in six blocks covering bits 3-0, 8-4, 14-9, 21-15, 29-22, and 31-30. The 4-, 5-, 6-, 7-, and 8-bit adder pairs finish at t4, t5, t6, t7, and t8; the select muxes settle at t5 through t10, from the least to the most significant block.]
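The timing of the linear and square-root layouts can be compared with a simple unit-delay model. It takes the rule "adder delay = w, mux delay = 1" from the linear carry-select slide. It also assumes that every block is muxed, including the least significant one, which is inferred from the timing labels in the figures. The function name is invented for this example.

```python
def carry_select_times(widths):
    """Mux settle time of each block, least significant block first.

    A block's adders are done at time w (its width). Its mux can switch
    only after both the adders and the incoming carry are ready, and the
    mux itself costs 1 time unit.
    """
    times = []
    carry_ready = 0  # external carry-in is available at time 0
    for w in widths:
        mux_done = max(w, carry_ready) + 1
        times.append(mux_done)
        carry_ready = mux_done
    return times


LINEAR = [8, 8, 8, 8]            # four equal 8-bit blocks
SQUARE_ROOT = [4, 5, 6, 7, 8, 2]  # block widths growing by one bit
```

With these block widths the linear layout finishes at time 12 and the square-root layout at time 10. In the second layout each block's adders finish just as the carry from the previous block arrives, so neither the adders nor the carry chain wait on each other.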
Constant Addition • If one operand is constant: • More speed? • Less hardware? [Figure: a 4-bit adder for A3..A0 plus constant bits (0, 1, 0, 1 in the figure), built from one HA and three FAs with outputs S0-S3 and C3, next to a reduced version for that constant that needs only two HAs.]
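One way to see why a constant operand saves hardware is to specialize each bit slice on its constant bit. The sketch below does that in software. The function name and structure are illustrative and are not the circuit from the figure.

```python
def add_constant(a, const, width):
    """Add a compile-time constant using per-bit specialized logic.

    Where the constant bit is 0 the slice is a half adder:
        sum = a xor carry,      carry_out = a and carry
    Where the constant bit is 1 the slice reduces to:
        sum = not(a xor carry), carry_out = a or carry
    Returns (sum, carry_out).
    """
    carry = 0
    result = 0
    for i in range(width):
        ai = (a >> i) & 1
        if (const >> i) & 1:
            s = 1 ^ ai ^ carry
            carry = ai | carry
        else:
            s = ai ^ carry
            carry = ai & carry
        result |= s << i
    return result, carry
```

Every slice now has two inputs instead of three. Slices below the lowest set bit of the constant pass `a` straight through, since the carry into them is always 0.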
Multiplication • Shift and add operations • Need N bit adder, M cycles • Example: 42 = 101010 (multiplicand, N bits) x 10 = 1010 (multiplier, M bits) • Partial products, one per multiplier bit from LSB to MSB, each shifted one place further left: 000000, 101010, 000000, 101010 • Product (N+M bits): 420 = 110100100
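The shift-and-add procedure can be sketched in a few lines, with one loop iteration standing in for each of the M cycles. The function name is made up for this example.

```python
def shift_add_multiply(multiplicand, multiplier, m_bits):
    """Multiply by examining one multiplier bit per cycle, LSB first."""
    product = 0
    for cycle in range(m_bits):
        if (multiplier >> cycle) & 1:
            # Partial product: the multiplicand shifted into position.
            product += multiplicand << cycle
        # A 0 bit contributes an all-zero partial product.
    return product


result = shift_add_multiply(42, 10, 4)
print(result, format(result, "b"))  # 420 110100100
```

For 42 x 10 only the partial products for multiplier bits 1 and 3 are nonzero: 84 + 336 = 420.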
Array Multiplier • Area – N x M cells • Delay – O(N+M) [Figure: 4-bit array multiplier. Each multiplier bit Y0 to Y3 gates a copy of X3 X2 X1 X0, and rows of adders accumulate the partial products into the product bits (Z0, Z1, Z2 shown).]
Carry-Save [Figure: the same 4-bit multiplier array, inputs X3 X2 X1 X0 and Y0 to Y3, with the adder rows arranged in carry-save form; product bits Z0, Z1, Z2 shown.]
Multiplier Pipelining • Register cost: • Multiplicand – (N bits/stage x M stages) • Multiplier – (M^2 + M) / 2 bits • Early output values – (M^2 + M) / 2 bits • Total – M x (N + M + 1) bits • Critical path = max: • DFF + FA + setup • Bottom-level adder
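The register-cost total on this slide follows from adding its three terms: N·M + (M² + M)/2 + (M² + M)/2 = M·(N + M + 1). A short sketch that tabulates the terms follows. The function name and the dictionary keys are invented for this example.

```python
def pipeline_register_bits(n, m):
    """Register bits for a fully pipelined N x M multiplier.

    Uses the per-term costs given on the slide and checks them against
    the closed-form total M * (N + M + 1).
    """
    multiplicand = n * m                # N bits carried through M stages
    multiplier = (m * m + m) // 2       # multiplier bits still needed
    early_outputs = (m * m + m) // 2    # product bits already finished
    total = multiplicand + multiplier + early_outputs
    if total != m * (n + m + 1):
        raise AssertionError("terms do not match the closed form")
    return {
        "multiplicand": multiplicand,
        "multiplier": multiplier,
        "early_outputs": early_outputs,
        "total": total,
    }
```

For an 8 x 8 multiplier this comes to 64 + 36 + 36 = 136 register bits.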
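Constant Multiplier • Can greatly reduce the number of adders • Removes all AND gates [Figure: the array multiplier specialized for the constant Y3 Y2 Y1 Y0 = 1010. The rows for Y0 = 0 and Y2 = 0 disappear, leaving one adder row that combines the Y1 and Y3 copies of X3 X2 X1 X0; product bits Z0, Z1, Z2 shown.]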
LUT-based Constant Multipliers • k-LUT can perform constant multiply of k-bit number • Break operand into k-bit quantities • Example: 8-bit x 8-bit constant, k=4: operand 10101011 x constant N (NNNNNNNN) • A = N * 1011 (least significant nibble, LSN) • B = N * 1010 (most significant nibble, MSN) • Product S = A + B, with B aligned 4 bits to the left
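The nibble-wise scheme in this example can be sketched with a 16-entry table that stands in for the bank of 4-LUTs. The function names and the constant 0xC5 in the usage line are arbitrary choices for illustration.

```python
def build_nibble_table(constant):
    """Precompute constant * nibble for all 16 nibble values.

    In hardware each output bit of this table would be one 4-LUT.
    """
    return [constant * nibble for nibble in range(16)]


def lut_constant_multiply(x, table):
    """Multiply an 8-bit operand by the constant using two table reads."""
    a = table[x & 0xF]          # constant * least significant nibble
    b = table[(x >> 4) & 0xF]   # constant * most significant nibble
    return a + (b << 4)         # align B four bits left, then add


table = build_nibble_table(0xC5)
product = lut_constant_multiply(0b10101011, table)
```

The two lookups are independent, so they can run in parallel. Only one adder remains on the critical path.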
Summary • Latency overhead of programmable logic • Several approaches to reducing design latency: • Fast carry • Cascade • Hardwired connections • Multiplier optimization goals different from adder • Other techniques: • Logarithmic vs. linear (Wallace Tree multiplier) • Data encoding (Booth’s multiplier)