160 likes | 375 Views
Distributed Arithmetic. Dr Sumam David S. Dept. of E&C, NITK Surathkal Courtesy for slides – Xilinx Professor’s Workshop Resources. Objective. Distributed arithmetic What ? Where ? How ?. What is DA?. Multiplication using LUT Used to implement multipliers in LUT rich FPGAs.
E N D
Distributed Arithmetic Dr Sumam David S. Dept. of E&C, NITK Surathkal Courtesy for slides – Xilinx Professor’s Workshop Resources
Objective • Distributed arithmetic • What ? • Where ? • How ?
What is DA? • Multiplication using LUT • Used to implement multipliers in LUT rich FPGAs
Twos Complement Multiplication One bit at a time:
SDA 1-Tap FIR Filter 1 Z-1 A0 00000...0 0 C0 1 LUT contains two locations N BITS WIDE SAMPLE DATA Partial Product ROM A0 +/- X0 Parallel to serial converter Scaling Accumulator
Partial products of equal weight are added together before being summed to next higher partial product weight Create look-up table of summed partial products Distributed Arithmeticfor a 2-Tap Filter -23 22 21 20 -23 22 21 20 C0 = 1 0 0 1 (-7) C1 = 0 1 1 0 ( 6) X X0 = 0 1 1 1 ( 7) X X1 = 0 1 0 1 ( 5) + + + + ( 1 0 0 1 ( 1 0 0 1 ( 1 0 0 1 (0 0 0 0 1 1 0 0 1 1 1 1 0 1 1 0) 0 0 0 0 ) 0 1 1 0 ) 0 0 0 0 ) 0 0 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 0 = 1 1 1 0 1 1 0 1 (-1) (-14) (-4) (0) (-19) (-49) ( 30) (Serial-Data / Tap-Parallel Multiply) = Sign Extension
SDA 2-Tap FIR Filter 0000...0 C0 Z-1 C1 C0 + C1 1 00 01 10 11 LUT contains all possible sums of the partial products N BITS WIDE SAMPLE DATA Partial Product ROM A0 X0 +/- A1 X1 Scaling Accumulator
SDA 4-Tap FIR Filter Z-1 +/- Scaling Accumulator N BITS WIDE SAMPLE DATA Partial Product ROM A0 0000...0 X0 C0 1 + A1 0000...0 X1 C1 1 + A2 0000...0 X2 C2 1 + A3 0000...0 X3 C3
SDA 8-Tap FIR Filter 1 1 1 1 1 1 1 Z-1 N BITS WIDE SAMPLE DATA A0 Partial Product ROM X0 A1 X1 A2 Pre-Adder X2 A3 X3 + +/- A0 X4 Partial Product ROM Scaling Accumulator A1 X5 A2 X6 4 -input LUT contains all possible sums of the partial products A3 X7
60 Single MAC DA FIR B=8 50 DA FIR B=12 40 DA FIR B=16 Sample Rate (MSPS) 30 Serial FPGA FIR 20 10 0 0 50 100 150 200 250 Xilinx DA FIR Performance 6000 Dual MAC DA FIR B=8 5000 DA FIR B=12 4000 DA FIR B=16 3000 Performance (MMACs/s) Serial FPGA FIR 2000 1000 0 0 50 100 150 200 250 Filter Length (Taps) Filter Length (Taps) fclk = 200 MHz for both processor and FPGA B = data sample precision for FPGA
Trade Clock Cycles for Logic Area Trade Clock Cycles for Logic Area Multi bits per clock cycle 160Ms/s 20Ms/s b7 b7 b7 Serial-DA Parallel-DA b4 b3 b0 Hardware Over-sampling = 4 b0 Hardware Over-sampling = 8 Hardware Over-sampling = 2 b0 b0 b7 b3 Hardware Over-sampling = 1 b4 b0 The sample is serialized and processed 1 bit per clock cycle. 8 clock cycles are thus required to process the whole sample The sample is serialized and processed 2 bits per clock cycle. 4 clock cycles are thus required to process the whole sample The sample is processed in parallel 8 bits per clock cycle The sample is serialized and processed 4 bits per clock cycle b0
Conclusion • Efficiency of computation • Slow as its bit serial • Memory requirements
References • The role of Distributed Arithmetic in FPGA based signal processing, www.xilinx.com