Implementation of Digital Filters in FPGA’s

Ayaz Hasan

References • Chi-Jui Chou, Satish Mohanakrishnan, Joseph B. Evans, “FPGA Implementation of Digital Filters,” ICSPAT 1993 • Uwe Meyer-Baese, “Digital Signal Processing with Field Programmable Gate Arrays,” 2003

Outline • Digital Filtering • Programmable Signal Processors vs. FPGA’s • Multiply Accumulate Units • Multipliers • Adders • Xilinx XC4000 implementations • FIR Filters • Pipelined MAC units

Digital Filters • Modification of signal attributes in frequency or time domain • Linear Time-Invariant Filters • FIR Filters • Finite sum per output sample instant • IIR Filters • Infinite sum

FIR Filters • Transfer Function • Lth order filter • Tapped Delay Structure • One of the multiplicands is an FIR coefficient • Non-recursive • No feedback • Finite Response
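
The tapped delay structure can be sketched in a few lines of Python. This is a minimal illustration, assuming the convention used in the Meyer-Baese reference, where a filter with L coefficients f[0..L-1] computes y[n] = sum over k of f[k]·x[n-k]. The names `fir_filter`, `coeffs` and `delay_line` are illustrative and do not come from the slides.

```python
from collections import deque


def fir_filter(coeffs, samples):
    """Direct-form FIR: y[n] = sum_k coeffs[k] * x[n-k], zero initial state."""
    # The delay line holds the most recent len(coeffs) inputs, newest first.
    delay_line = deque([0] * len(coeffs), maxlen=len(coeffs))
    outputs = []
    for x in samples:
        delay_line.appendleft(x)  # shift in the new sample, oldest drops out
        # One multiply per tap, then a sum. No past output is fed back.
        outputs.append(sum(c * d for c, d in zip(coeffs, delay_line)))
    return outputs


impulse_response = fir_filter([1, 2, 3], [1, 0, 0, 0, 0])
print(impulse_response)
```

Feeding a unit impulse returns the coefficients followed by zeros, which is the "finite response" property listed above.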

IIR Filters • Transfer Function • Recursive Filter • Feedback • Canonical Filter • Has both recursive and non-recursive parts merged
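
For comparison with the FIR case, here is a hedged sketch of a canonical (direct form II) IIR filter, in which one shared delay line serves both the recursive and the non-recursive part. The coefficient naming (`b` for the feed-forward part, `a` for the feedback part, with `a[0]` assumed to be 1) is a common textbook convention and is an assumption here, not something stated on the slide.

```python
def iir_direct_form_2(b, a, samples):
    """Canonical IIR filter with a[0] assumed to be 1.

    w[n] = x[n] - a[1]*w[n-1] - a[2]*w[n-2] - ...   (recursive part)
    y[n] = b[0]*w[n] + b[1]*w[n-1] + ...            (non-recursive part)
    """
    order = max(len(a), len(b)) - 1
    # Pad both coefficient lists to the same length.
    a = list(a) + [0.0] * (order + 1 - len(a))
    b = list(b) + [0.0] * (order + 1 - len(b))
    state = [0.0] * order  # shared delay line: w[n-1], w[n-2], ...
    outputs = []
    for x in samples:
        w = x - sum(a[k] * state[k - 1] for k in range(1, order + 1))
        y = b[0] * w + sum(b[k] * state[k - 1] for k in range(1, order + 1))
        state = ([w] + state)[:order]  # shift the delay line by one sample
        outputs.append(y)
    return outputs


# One-pole example: y[n] = x[n] + 0.5 * y[n-1]
print(iir_direct_form_2([1.0], [1.0, -0.5], [1.0, 0.0, 0.0, 0.0]))
```

The impulse response of the one-pole example keeps halving and never reaches zero exactly in ideal arithmetic, in contrast to the finite response of the FIR structure.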

Programmable Signal Processors • Based on RISC architecture • At least one fast array multiplier (fixed or floating point) • Most algorithms MAC intensive • High MAC rates using multi-stage pipelined architecture • Cost effective

FPGA’s • Can provide more bandwidth • Multiple MAC cells on a chip • Useful in high-bandwidth applications like wireless and multimedia • More efficient in implementing certain algorithms • CORDIC • Number Theoretic Transforms • Error-correction algorithms

FPGA vs PDSP • PDSP • Complicated algorithms that contain several if-then-else constructs • FPGA • Front-end applications • FIR filters • CORDIC algorithms • FFT’s

Target Device – Xilinx XC4000 • Basic logic element – Configurable Logic Block • Two separate 4-input, 1-output Lookup Tables • General purpose logic functions • Fast carry • One 3-input, 1-output LUT two combine two LUTs • Two flip flops • Five levels of routing • From CLB to CLB to long lines spanning the entire chip • Important in issues of speed • Can be used as 16x2 or 32x1 RAM or ROM

Xilinx XC4000 CLB

Multiply Accumulate Units • DSP algorithms are MAC intensive • Several approaches • Array approach • Addition using ripple carry methods • Linear convolution sum • L consecutive multiplications • L – 1 addition operations per sample • N x N-bit multipliers need to be fused together with an accumulator • Full N x N product is 2N bits wide, 2N-1 for signed numbers
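
The product-width rule can be checked by brute force. The sketch below assumes two's-complement operands and N = 8; the helper names are illustrative. It also shows the one corner case behind the signed figure: only the product of the two most negative operands needs the full 2N bits.

```python
def bits_unsigned(value):
    """Minimum number of bits for a non-negative integer."""
    return max(1, value.bit_length())


def bits_twos_complement(value):
    """Minimum two's-complement width that can represent value."""
    if value >= 0:
        return value.bit_length() + 1  # one extra bit for the sign
    return (-value - 1).bit_length() + 1


N = 8
lo, hi = -(1 << (N - 1)), (1 << (N - 1)) - 1  # signed operand range

unsigned_width = bits_unsigned(((1 << N) - 1) ** 2)
signed_width_all = max(
    bits_twos_complement(a * b)
    for a in range(lo, hi + 1)
    for b in range(lo, hi + 1)
)
signed_width_typical = max(
    bits_twos_complement(a * b)
    for a in range(lo, hi + 1)
    for b in range(lo, hi + 1)
    if not (a == lo and b == lo)  # exclude (-2^(N-1)) * (-2^(N-1))
)
print(unsigned_width, signed_width_all, signed_width_typical)
```

For N = 8 this gives 16 bits unsigned, 16 bits signed when the corner case is included, and 15 bits (2N-1) when it is excluded.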

MAC Unit • MAC Components • 8 x 8 bit combinatorial array multiplier • 16-bit accumulator • Word sizes constrained by FPGA density • Larger word sizes possible if MAC units per chip reduced

Multiplier • One CLB per partial product bit • 2-input AND gate generates each partial product • Addition logic • 64 CLB’s used • Signed Multiplication • Basic Cell Structure • Sum • Carry • xi AND ai
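
A bit-level model of the basic cell may help here. The sketch below is an assumption-laden simplification: it handles unsigned operands only (the slide covers signed multiplication), and the function names are illustrative. Each call to `multiplier_cell` stands for one cell, that is, an AND gate producing a partial product bit plus the addition logic producing Sum and Carry.

```python
def multiplier_cell(x_bit, a_bit, sum_in, carry_in):
    """One array cell: AND gate for the partial product plus a full adder."""
    partial = x_bit & a_bit
    total = partial + sum_in + carry_in
    return total & 1, total >> 1  # (sum_out, carry_out)


def array_multiply(x, a, n=8):
    """Unsigned n x n multiply built only from multiplier_cell calls."""
    acc = [0] * (2 * n)  # running sum bits, LSB first
    for j in range(n):  # one row of cells per multiplier bit a_j
        a_bit = (a >> j) & 1
        carry = 0
        for i in range(n):  # one cell per multiplicand bit x_i
            x_bit = (x >> i) & 1
            acc[i + j], carry = multiplier_cell(x_bit, a_bit, acc[i + j], carry)
        acc[j + n] = carry  # the row's carry-out lands in the next free column
    return sum(bit << k for k, bit in enumerate(acc))


print(array_multiply(200, 37), 200 * 37)
```

With n = 8 the two loops visit 64 cells, which lines up with the 64 CLBs quoted for the 8 x 8 multiplier.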

Multiplier Implementation • a_k ≠ 0 • Accumulation of X·2^k • a_k = 0 • No operation
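
The rule on this slide is the shift-and-add view of multiplication: each nonzero bit a_k of the multiplier adds a copy of X shifted left by k positions. A minimal sketch, assuming an unsigned multiplier `a` of `n` bits (the function name is illustrative):

```python
def shift_add_multiply(x, a, n=8):
    """Multiply x by the n-bit unsigned value a using shifts and adds."""
    product = 0
    for k in range(n):
        if (a >> k) & 1:  # a_k != 0: accumulate x * 2**k
            product += x << k
        # a_k == 0: no operation
    return product


print(shift_add_multiply(13, 11))  # 11 = 0b1011, so 13*1 + 13*2 + 13*8
```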

Adder Implementation • 16-bit adder • 9 CLB’s • 7 for the middle 14 bits, each configured as a 2-bit adder • 1 each for MSB and LSB • Dedicated CLB carry logic • Improved efficiency of adders • Cout of a CLB can only be connected to a CLB above or below it • Vertical array • Delay of 20.5ns
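
The CLB partitioning can be modelled as a chain of adder slices with the carry rippling from one slice to the next. The sketch below only mirrors the slice widths given on the slide (one bit, seven times two bits, one bit); what each CLB does internally is not modelled, and the function names are illustrative.

```python
def add_slice(a, b, carry_in, width):
    """Add two width-bit fields plus a carry; return (sum_bits, carry_out)."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width


def add16(a, b):
    """16-bit ripple-carry add split across nine slices, as on the slide."""
    widths = [1] + [2] * 7 + [1]  # LSB slice, seven 2-bit slices, MSB slice
    result, carry, shift = 0, 0, 0
    for width in widths:
        mask = (1 << width) - 1
        field_sum, carry = add_slice(
            (a >> shift) & mask, (b >> shift) & mask, carry, width
        )
        result |= field_sum << shift
        shift += width  # the next slice sits directly above this one
    return result, carry


print(add16(1234, 4321))
print(add16(0xFFFF, 1))  # carry ripples through all nine slices
```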

MAC Implementation • Performance • 100ns multiplier delay • 10 MHz • 73 CLB’s

FIR Filter MAC Unit • MAC unit with 4 multipliers and an adder tree • Pipeline registers increase clock speed • 4 terms summed every clock cycle • 4 taps: Sampling rate = clock frequency • 8 taps: Sampling rate = clock frequency/2 • Maximum sampling frequency • M = number of multipliers • T = multiplier delay • N = number of filter taps
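
The relation between tap count and sampling rate follows from how many clock cycles one output takes. The sketch below is a behavioural model only: it ignores pipeline latency and assumes the tap count is processed in groups of `multipliers` products per clock cycle. All names are illustrative.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, as a hardware adder tree would."""
    while len(values) > 1:
        if len(values) % 2:
            values = values + [0]  # pad odd levels with a zero input
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]


def fir_output(coeffs, window, multipliers=4):
    """Compute one FIR output; window[k] is x[n-k]. Returns (y, cycles)."""
    acc, cycles = 0, 0
    for start in range(0, len(coeffs), multipliers):
        stop = start + multipliers
        # All multipliers work in parallel within one clock cycle.
        products = [c * x for c, x in zip(coeffs[start:stop], window[start:stop])]
        acc += adder_tree(products)
        cycles += 1
    return acc, cycles


print(fir_output([1, 2, 3, 4], [1, 1, 1, 1]))  # 4 taps: one cycle per sample
print(fir_output([1] * 8, [2] * 8))            # 8 taps: two cycles per sample
```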

FIR Filters • Performance • 100ns multiplier delay • 22.5ns adder delay • Routing delay may be up to 75ns • 10 MHz clock • Sampling rates of 40/N MHz
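
The 40/N MHz figure is consistent with a maximum sampling frequency of f_max = M / (N·T), where M is the number of multipliers, T the multiplier delay and N the number of taps. That relation is inferred here from the quoted numbers rather than copied from the original slide, so treat it as an assumption. A small calculation under that assumption:

```python
def max_sampling_rate_mhz(multipliers, multiplier_delay_ns, taps):
    """f_max = M / (N * T), with T in nanoseconds and the result in MHz."""
    return multipliers * 1e3 / (multiplier_delay_ns * taps)


for taps in (4, 8, 16):
    rate = max_sampling_rate_mhz(multipliers=4, multiplier_delay_ns=100, taps=taps)
    print(taps, "taps:", rate, "MHz")
```

With M = 4 and T = 100 ns this gives 10 MHz for 4 taps, 5 MHz for 8 taps and 2.5 MHz for 16 taps, matching 40/N MHz.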

Pipelined MAC Units • Multiplier delay is a major limitation on maximum sampling rates • Pipelined array multipliers • Execution of separate multiplications overlaps • Carry propagating addition delay in last row of multiplier can be minimized • High sampling frequency can be achieved • Can be applied to previously mentioned FIR filters

Pipelined MAC Units • Basic cells identical to unpipelined ones • Include pipeline registers • To propagate multiplier and multiplicand bits to the destination • To propagate product bits that have been completed, done in parallel with new batch of product bits • N x N multiplier • Carry propagate adder replaced with N rows of half adders with pipeline registers between the rows • Allows carry propagation of only one position between any two consecutive rows • Clock speed depends only on the delay in multiplier cells
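
The idea of replacing a carry-propagate adder with rows of half adders can be illustrated on plain integers. In the sketch below each loop iteration stands for one row: every bit position gets a half adder (XOR for the sum, AND for the carry), and a carry moves exactly one position before the next row sees it. This is a generic illustration of the principle, not a model of the exact cell layout in the pipelined multiplier; the function name is illustrative.

```python
def half_adder_rows(sum_word, carry_word):
    """Resolve sum_word + carry_word using rows of half adders.

    Returns (result, rows_used).
    """
    rows = 0
    while carry_word:
        # Half adder per bit: XOR gives the sum bit, AND gives the carry,
        # which is shifted by one position for the next row.
        sum_word, carry_word = sum_word ^ carry_word, (sum_word & carry_word) << 1
        rows += 1
    return sum_word, rows


print(half_adder_rows(0b0101, 0b0010))  # no overlapping bits: one row
print(half_adder_rows(0xFF, 0x01))      # worst case for 8 bits
```

Because each row only has to pass a carry to its neighbour, the delay per row stays constant regardless of word length, which is why the clock speed depends only on the cell delay.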

Pipelined MAC Units • For multiple tap filters • Accumulation of results needed through feedback of past output • Done by a set of full adders immediately below the diagonal of the array, feeding back outputs of full adders to their inputs through a single register • Clock rate • Approaches 100MHz for XC4000

4 x 4 Multiplier with 6-Bit Accumulator • 4 MSB’s of multiplier fed back for accumulation • Output clocked out and accumulator reset after process complete • Filter coefficients and delayed inputs fed to multiplier in synchronized data streams • Arrivals corresponding to basic clock rate • N tap filter requires N+1 clock cycles for computation of one output
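
The N+1 cycle count can be sketched with a behavioural model: one coefficient and sample pair enters per clock cycle, and one further cycle clocks the result out and resets the accumulator. Attributing the extra cycle to the output and reset step is an assumption based on the bullets above, and pipeline fill latency is ignored. The function name is illustrative.

```python
def pipelined_mac_output(coeffs, window):
    """Accumulate one FIR output; window[k] is x[n-k]. Returns (y, cycles)."""
    acc, cycles = 0, 0
    for c, x in zip(coeffs, window):  # synchronized coefficient/data streams
        acc += c * x  # product fed back through the accumulator
        cycles += 1
    output = acc  # output clocked out
    acc = 0       # accumulator reset for the next output sample
    cycles += 1
    return output, cycles


y, cycles = pipelined_mac_output([1, 2, 3], [4, 5, 6])
print(y, cycles)
```

Under this model a filter clocked at f_clk produces outputs at f_clk / (N + 1).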

FPGA Implementation • Routing delay critical • 3ns for output pipeline register to stabilize after clocking • Output then routed • Then 4.5ns delay in the next CLB • Total minimum delay 7.5ns • In addition, 3ns from pad to input • Some CLB’s can be used as registers between input pads and cells, preventing reduction of clock speed

FPGA Implementation • 8 x 8 multiplier and 12-bit accumulator • 4.6ns worst case routing delay • 12.1ns worst case logic path delay • 80MHz clock rate • 2 MAC units can be accommodated in XC4013
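
The clock rates quoted on the performance slides can be cross-checked by inverting the critical path delay. The sketch below assumes the quoted delay is the full register-to-register period, which the slides do not state explicitly. The function name is illustrative.

```python
def max_clock_mhz(critical_path_ns):
    """Upper bound on clock rate for a given critical path delay."""
    return 1e3 / critical_path_ns


print(round(max_clock_mhz(100.0), 1))  # unpipelined multiplier delay
print(round(max_clock_mhz(12.1), 1))   # pipelined worst case logic path
print(round(max_clock_mhz(7.5), 1))    # minimum register plus CLB delay
```

A 100 ns multiplier delay corresponds to the 10 MHz figure for the unpipelined MAC, and 12.1 ns gives roughly 82.6 MHz, in line with the quoted 80MHz clock rate.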

Conclusion • FPGA approach to digital filter implementation • Higher sampling rates than traditional DSP chips • Lower costs than ASICs for moderate volume • More flexibility • MAC units on a single FPGA • FIR Filter Implementation

Questions
