Distributed Arithmetic

Dr Sumam David S., Dept. of E&C, NITK Surathkal
Slides courtesy of Xilinx Professor’s Workshop Resources

Objective
• Distributed arithmetic
• What?
• Where?
• How?

What is DA?
• Multiplication using a look-up table (LUT)
• Used to implement multipliers in LUT-rich FPGAs

Two's Complement Multiplication: one bit at a time
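The slide's equation did not survive extraction. As a minimal sketch of the idea, assuming the sample is a `bits`-wide two's complement number whose most significant bit carries negative weight, the product can be accumulated one sample bit at a time. The function name `bit_serial_multiply` is illustrative and does not come from the slides.

```python
def bit_serial_multiply(coeff: int, sample: int, bits: int) -> int:
    """Multiply coeff by a two's complement sample, one sample bit at a time."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= sample <= high:
        raise ValueError("sample does not fit in the given width")

    word = sample & ((1 << bits) - 1)  # two's complement bit pattern
    acc = 0
    for j in range(bits):
        bit = (word >> j) & 1
        partial = (coeff * bit) << j   # partial product for bit j: 0 or coeff * 2^j
        if j == bits - 1:
            acc -= partial             # the sign bit has weight -2^(bits-1)
        else:
            acc += partial
    return acc


print(bit_serial_multiply(-7, 7, 4))  # -49
print(bit_serial_multiply(6, 5, 4))   # 30
```

Each loop iteration corresponds to one clock cycle of a bit-serial datapath: the current sample bit selects either 0 or the coefficient, and the accumulator adds (or, for the sign bit, subtracts) the weighted result.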

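SDA 1-Tap FIR Filter
[Figure: the N-bit-wide sample data X0 passes through a parallel-to-serial converter and drives address input A0 of a partial product ROM, whose output feeds a +/- scaling accumulator with a Z-1 feedback register.]
• LUT contains two locations: 00000...0 at address 0 and C0 at address 1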

Distributed Arithmetic for a 2-Tap Filter (Serial-Data / Tap-Parallel Multiply)
• Partial products of equal weight are added together before being summed to the next higher partial product weight
• Create a look-up table of summed partial products

Bit weights: -2^3  2^2  2^1  2^0
C0 = 1001 (-7)    X0 = 0111 (7)    C0 × X0 = -49
C1 = 0110 ( 6)    X1 = 0101 (5)    C1 × X1 =  30

Weight   X0 bit   X1 bit   Summed partial product   Weighted, sign-extended
 2^0       1        1       C0 + C1 = -1             11111111  (-1)
 2^1       1        0       C0      = -7             11110010  (-14)
 2^2       1        1       C0 + C1 = -1             11111100  (-4)
-2^3       0        0       0                        00000000  (0)
                                             Total:  11101101  (-19)

Check: -49 + 30 = -19
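The worked example above can be reproduced with a short sketch. It assumes 4-bit two's complement samples and a LUT addressed by one bit from each sample (bit k of the address comes from sample k); the helper names `build_lut`, `da_terms`, and `da_dot` are illustrative, not taken from the slides.

```python
def build_lut(coeffs):
    """LUT[address] = sum of the coefficients whose address bit is 1."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_terms(coeffs, samples, bits):
    """Return the weighted LUT output for each bit position, LSB first."""
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]  # two's complement bit patterns
    terms = []
    for j in range(bits):
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> j) & 1) << k  # bit j of sample k -> address bit k
        term = lut[addr] << j            # scale by 2^j
        terms.append(-term if j == bits - 1 else term)  # sign bit subtracts
    return terms


def da_dot(coeffs, samples, bits):
    """Inner product sum(coeffs[k] * samples[k]) computed without multipliers."""
    return sum(da_terms(coeffs, samples, bits))


print(build_lut([-7, 6]))                # [0, -7, 6, -1]
print(da_terms([-7, 6], [7, 5], 4))      # [-1, -14, -4, 0]
print(da_dot([-7, 6], [7, 5], 4))        # -19
```

The per-bit terms match the (-1), (-14), (-4), (0) rows of the slide, and their sum matches -49 + 30 = -19.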

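SDA 2-Tap FIR Filter
[Figure: N-bit-wide sample data X0 and X1 drive address inputs A0 and A1 of a partial product ROM, whose output feeds a +/- scaling accumulator with a Z-1 feedback register.]
• LUT contains all possible sums of the partial products: 0000...0, C0, C1, C0 + C1 (addresses 00, 01, 10, 11)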

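SDA 4-Tap FIR Filter
[Figure: N-bit-wide sample data X0 to X3 drive partial product ROM inputs A0 to A3. Each input bit selects either 0000...0 or its coefficient (C0 to C3); the selected values are added and the sum feeds a +/- scaling accumulator with a Z-1 feedback register.]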

SDA 8-Tap FIR Filter
[Figure: N-bit-wide sample data X0 to X3 address one partial product ROM (inputs A0 to A3) and X4 to X7 address a second one (inputs A0 to A3). A pre-adder sums the two ROM outputs and feeds a +/- scaling accumulator with a Z-1 feedback register.]
• Each 4-input LUT contains all possible sums of the partial products
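The 8-tap structure can be sketched as two 4-input LUTs whose outputs are added before the scaling accumulator. This is an illustrative model, not the Xilinx implementation: the names `build_lut` and `da_dot_split`, the `group` parameter, and the coefficient and sample values are all assumptions chosen for the example.

```python
def build_lut(coeffs):
    """LUT[address] = sum of the coefficients whose address bit is 1."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_dot_split(coeffs, samples, bits, group=4):
    """DA inner product using one small LUT per group of `group` taps."""
    luts = [build_lut(coeffs[i:i + group]) for i in range(0, len(coeffs), group)]
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]  # two's complement bit patterns
    acc = 0
    for j in range(bits):                # one sample bit per clock cycle
        pre_add = 0
        for g, lut in enumerate(luts):
            addr = 0
            for k, w in enumerate(words[g * group:(g + 1) * group]):
                addr |= ((w >> j) & 1) << k
            pre_add += lut[addr]         # pre-adder combines the LUT outputs
        term = pre_add << j              # scaling by 2^j
        acc += -term if j == bits - 1 else term  # sign bit subtracts
    return acc


coeffs = [3, -1, 4, 1, -5, 9, 2, -6]            # hypothetical coefficients
samples = [10, -20, 30, -40, 50, -60, 70, -80]  # hypothetical 8-bit samples
print(da_dot_split(coeffs, samples, bits=8))
print(sum(c * s for c, s in zip(coeffs, samples)))  # same value, computed directly
```

Splitting keeps the table small: two 4-input LUTs hold 2 × 16 = 32 entries, whereas a single 8-input LUT would need 256.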

Xilinx DA FIR Performance
[Figure: two plots against filter length (taps, 0 to 250). The first shows sample rate (MSPS, 0 to 60) for Single MAC, DA FIR B=8, DA FIR B=12, DA FIR B=16, and Serial FPGA FIR. The second shows performance (MMACs/s, 0 to 6000) for Dual MAC, DA FIR B=8, DA FIR B=12, DA FIR B=16, and Serial FPGA FIR.]
• fclk = 200 MHz for both processor and FPGA
• B = data sample precision for FPGA

Trade Clock Cycles for Logic Area
Multiple bits per clock cycle, from Serial-DA (20 Ms/s) to Parallel-DA (160 Ms/s)
• Hardware over-sampling = 8: the sample (b7...b0) is serialized and processed 1 bit per clock cycle. 8 clock cycles are thus required to process the whole sample.
• Hardware over-sampling = 4: the sample is serialized and processed 2 bits per clock cycle. 4 clock cycles are thus required to process the whole sample.
• Hardware over-sampling = 2: the sample is serialized and processed 4 bits per clock cycle.
• Hardware over-sampling = 1: the sample is processed in parallel, 8 bits per clock cycle.
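The trade-off can be sketched by letting a DA model consume several sample bits per clock cycle and counting the cycles. This is a behavioural illustration under the assumption that each extra bit per cycle costs one more LUT lookup (in hardware, one more LUT copy and adder); `build_lut` and `da_dot_multibit` are hypothetical names, and the coefficients and samples are made up.

```python
def build_lut(coeffs):
    """LUT[address] = sum of the coefficients whose address bit is 1."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_dot_multibit(coeffs, samples, bits, bits_per_clock):
    """Return (inner product, clock cycles) when processing several bits per cycle."""
    if bits % bits_per_clock:
        raise ValueError("bits must be a multiple of bits_per_clock")
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]  # two's complement bit patterns
    acc, cycles = 0, 0
    for base in range(0, bits, bits_per_clock):
        cycles += 1                      # one clock cycle per group of bits
        for j in range(base, base + bits_per_clock):  # parallel lookups in hardware
            addr = 0
            for k, w in enumerate(words):
                addr |= ((w >> j) & 1) << k
            term = lut[addr] << j
            acc += -term if j == bits - 1 else term  # sign bit subtracts
    return acc, cycles


coeffs = [3, -1, 4, 1]          # hypothetical coefficients
samples = [100, -50, 25, -128]  # hypothetical 8-bit samples
for bpc in (1, 2, 4, 8):
    result, cycles = da_dot_multibit(coeffs, samples, bits=8, bits_per_clock=bpc)
    print(bpc, cycles, result)  # cycles: 8, 4, 2, 1; result is the same each time
```

The result is identical in every configuration; only the cycle count, and with it the amount of logic needed per cycle, changes.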

Conclusion
• Efficiency of computation
• Slow, as it is bit-serial
• Memory requirements

References
• The Role of Distributed Arithmetic in FPGA-based Signal Processing, www.xilinx.com
