High Speed FIR Filter Implementation Using Add and Shift Method

Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner
University of California, Santa Barbara
ICCD 2006, San Jose, California, October 2006

Outline
• Introduction
• FIR filter implementation
• Traditional methods
  • MAC (Multiply Accumulate) implementation
  • DA (Distributed Arithmetic) implementation
• New method
  • Add and Shift method and CSE (Common Subexpression Elimination)
• Experiments and results
  • Resource utilization
  • Power consumption
• Conclusion

Introduction
• Extensive use of FPGAs in computationally intensive applications such as DSP
• More logic resources available in current FPGAs
• Broad applications of FIR filters in multimedia and communications
• Need for efficient design methods to save area and power
• Research motivation
  • Develop a more efficient implementation method for FIR filters that consumes less area at comparable performance.
  • Develop a unified tool for performing redundancy elimination, scheduling and module assignment.
  • Perform physically aware optimizations.
  • Architecture design exploration for ASIC and FPGA implementations (Distributed Arithmetic based, adder-shifter based, multiplier-adder based).

FIR Filter: MAC Implementation
• L-tap FIR filter
• Convolution of the latest L input samples, where L is the number of filter coefficients h[k] and x[n] is the input time series:

$$y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k]$$

• Disadvantages
  • Large area on the FPGA because of the multipliers, even though the full flexibility of general-purpose multipliers is not required
  • Limited number of embedded resources such as MAC engines and multipliers in FPGAs
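As a reference point for the later slides, the convolution sum can be sketched in a few lines of Python. This is only a software model of the arithmetic, not the hardware architecture; the function name `fir_mac` and the choice to treat samples before the start of the input as zero are assumptions made for illustration.

```python
def fir_mac(h, x):
    """Direct-form FIR: y[n] = sum_{k=0}^{L-1} h[k] * x[n-k].

    Samples before the start of x are assumed to be zero.
    """
    L = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(L):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


if __name__ == "__main__":
    print(fir_mac([1, 2, 3], [1, 0, 0, 2]))
```

Each output sample costs L general-purpose multiplications, which is the area cost listed under the disadvantages above.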

FIR Filter: DA (Distributed Arithmetic) Implementation
• An alternative to the MAC implementation, and the most common FPGA FIR implementation because of the LUT-rich architecture of FPGAs. The filter output is the inner product

$$y = \sum_{n=0}^{N-1} c[n]\, x[n]$$

• The variable x[n] can be represented by

$$x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b, \qquad x_b[n] \in \{0, 1\}$$

where $x_b[n]$ is the b-th bit of x[n] and B is the input width. The inner product can be rewritten as follows:

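FIR Filter: DA (Distributed Arithmetic) Implementation (cont’d)

$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
  &= c[0]\left(x_{B-1}[0]\, 2^{B-1} + x_{B-2}[0]\, 2^{B-2} + \dots + x_0[0]\, 2^0\right) \\
  &\quad + c[1]\left(x_{B-1}[1]\, 2^{B-1} + x_{B-2}[1]\, 2^{B-2} + \dots + x_0[1]\, 2^0\right) \\
  &\quad + \dots \\
  &\quad + c[N-1]\left(x_{B-1}[N-1]\, 2^{B-1} + x_{B-2}[N-1]\, 2^{B-2} + \dots + x_0[N-1]\, 2^0\right) \\
  &= \left(c[0]\, x_{B-1}[0] + c[1]\, x_{B-1}[1] + \dots + c[N-1]\, x_{B-1}[N-1]\right) 2^{B-1} \\
  &\quad + \left(c[0]\, x_{B-2}[0] + c[1]\, x_{B-2}[1] + \dots + c[N-1]\, x_{B-2}[N-1]\right) 2^{B-2} \\
  &\quad + \dots \\
  &\quad + \left(c[0]\, x_0[0] + c[1]\, x_0[1] + \dots + c[N-1]\, x_0[N-1]\right) 2^0 \\
  &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n]
\end{aligned}
$$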
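The final form of the derivation suggests how a lookup table can replace the multipliers: the inner sum depends only on the N bits $x_b[0], \dots, x_b[N-1]$, so all $2^N$ possible values can be precomputed. The sketch below illustrates this for unsigned B-bit inputs, matching the unsigned representation used in the derivation. The names `build_da_lut` and `da_inner_product` are hypothetical, and handling of signed inputs is deliberately left out.

```python
def build_da_lut(c):
    """Precompute the inner sum for every N-bit address.

    Entry `addr` holds the sum of c[n] over all n whose bit n in addr is 1.
    """
    N = len(c)
    return [sum(c[n] for n in range(N) if (addr >> n) & 1)
            for addr in range(1 << N)]


def da_inner_product(c, x, B):
    """Bit-serial DA: y = sum_b 2^b * LUT[bit b of every x[n]].

    Assumes each x[n] is an unsigned integer that fits in B bits.
    """
    lut = build_da_lut(c)
    y = 0
    for b in range(B):                      # one bit position per step
        addr = 0
        for n, xn in enumerate(x):
            addr |= ((xn >> b) & 1) << n    # gather bit b of x[n] into the address
        y += lut[addr] << b                 # scale by 2^b and accumulate
    return y


if __name__ == "__main__":
    c = [3, -1, 4, 2]
    x = [5, 7, 0, 15]
    print(da_inner_product(c, x, B=4), sum(ci * xi for ci, xi in zip(c, x)))
```

The table grows as $2^N$, which is why practical designs usually split long filters into several smaller tables; that partitioning is not shown here.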

DA (Distributed Arithmetic) Implementation: Serial
[Figure: A serial DA filter block diagram]
• n+1 clock cycles are needed for an n-bit input symmetrical filter to generate the output.
• Performance is limited because the next input sample can be processed only after every bit of the current input samples has been processed.
• The tradeoff here is performance for area.

DA (Distributed Arithmetic) Implementation: Parallel
• The performance of the circuit can be improved by modifying the architecture to a parallel architecture which processes the data bits in groups.
• Increasing the number of bits sampled has a significant effect on resource utilization on the FPGA.
  • More LUTs
  • Larger scaling accumulator
[Figure: A 2-bit parallel DA filter block diagram]
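The serial versus parallel tradeoff can be illustrated with a small software model that consumes k bits of every input per step. It is a sketch under simplifying assumptions: inputs are unsigned, the function name `da_parallel` is hypothetical, and the returned step count models only the bit-processing loop, not the exact clock-cycle count of any particular hardware design.

```python
def da_parallel(c, x, B, k):
    """DA inner product that processes k bit positions per step.

    Returns (y, steps). Each step performs up to k table lookups,
    standing in for k LUT copies working side by side.
    """
    N = len(c)
    lut = [sum(c[n] for n in range(N) if (a >> n) & 1) for a in range(1 << N)]
    y, steps = 0, 0
    for base in range(0, B, k):
        partial = 0
        for b in range(base, min(base + k, B)):
            addr = 0
            for n, xn in enumerate(x):
                addr |= ((xn >> b) & 1) << n
            partial += lut[addr] << (b - base)   # combine the k lookups
        y += partial << base                     # scaling accumulator
        steps += 1
    return y, steps


if __name__ == "__main__":
    c, x = [3, -1, 4, 2], [5, 7, 0, 15]
    print(da_parallel(c, x, B=4, k=1))  # fully serial
    print(da_parallel(c, x, B=4, k=2))  # 2 bits per step
```

Doubling k halves the number of steps in this model but doubles the number of lookups that must happen at once, which mirrors the extra LUTs and the larger scaling accumulator noted above.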

CSE (Common Subexpression Elimination)
• Linear systems can be modeled using polynomials. Expressions consist of the +, −, and << operators.
• Polynomial formulation, where $L^i$ denotes a left shift by i bits:

$$C \times X = \sum_i \pm X L^i$$

$$(14)_{10} \times X = (1110)_2 \times X = (X \ll 3) + (X \ll 2) + (X \ll 1) = XL^3 + XL^2 + XL^1$$
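The polynomial formulation can be checked with a short sketch that expands a non-negative constant into its shift terms from the plain binary representation, as in the 14 × X example. The helper names `shift_terms` and `mul_by_const` are hypothetical, and the sketch uses only the + terms; signed-digit recodings that would also use − terms are not attempted here.

```python
def shift_terms(c):
    """Shift amounts i such that bit i of the non-negative constant c is 1."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def mul_by_const(c, x):
    """Compute c * x using only left shifts and additions."""
    return sum(x << i for i in shift_terms(c))


if __name__ == "__main__":
    print(shift_terms(14))        # shift amounts for (1110)_2
    print(mul_by_const(14, 5))    # same value as 14 * 5
```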

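CSE Example

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= 2X_0 + X_1 - X_2 - 2X_3 \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - 2X_1 + 2X_2 - X_3
\end{aligned}
$$

In matrix form:

$$
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 \\
2 & 1 & -1 & -2 \\
1 & -1 & -1 & 1 \\
1 & -2 & 2 & -1
\end{bmatrix}
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}
$$

With each multiplication by 2 written as a left shift L:

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$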

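CSE Example
Common subexpressions: $D_0 = (X_0 + X_3)$, $D_1 = (X_1 - X_2)$

Before extraction:

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$

After substituting $D_0$:

$$
\begin{aligned}
Y_0 &= D_0 + X_1 + X_2 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= D_0 - X_1 - X_2 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$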

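CSE Example
After substituting $D_1 = (X_1 - X_2)$:

$$
\begin{aligned}
Y_0 &= D_0 + X_1 + X_2 \\
Y_1 &= X_0 L + D_1 - X_3 L \\
Y_2 &= D_0 - X_1 - X_2 \\
Y_3 &= X_0 - D_1 L - X_3
\end{aligned}
$$

Common subexpressions: $D_2 = (X_1 + X_2)$, $D_3 = (X_0 - X_3)$

After substituting $D_2$:

$$
\begin{aligned}
Y_0 &= D_0 + D_2 \\
Y_1 &= X_0 L + D_1 - X_3 L \\
Y_2 &= D_0 - D_2 \\
Y_3 &= X_0 - D_1 L - X_3
\end{aligned}
$$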

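CSE Example
Original expressions (12 additions, 4 shifts):

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$

After common subexpression elimination (8 additions, 2 shifts):

$$
\begin{aligned}
D_0 &= X_0 + X_3 & Y_0 &= D_0 + D_2 \\
D_1 &= X_1 - X_2 & Y_1 &= D_1 + D_3 L \\
D_2 &= X_1 + X_2 & Y_2 &= D_0 - D_2 \\
D_3 &= X_0 - X_3 & Y_3 &= D_3 - D_1 L
\end{aligned}
$$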
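The equivalence of the two forms can be checked numerically. The sketch below evaluates both sets of expressions, with L implemented as a left shift by one bit; the function names `transform_direct` and `transform_cse` are chosen here for illustration only.

```python
import random


def transform_direct(x0, x1, x2, x3):
    """Original expressions: 12 additions/subtractions, 4 shifts."""
    y0 = x0 + x1 + x2 + x3
    y1 = (x0 << 1) + x1 - x2 - (x3 << 1)
    y2 = x0 - x1 - x2 + x3
    y3 = x0 - (x1 << 1) + (x2 << 1) - x3
    return y0, y1, y2, y3


def transform_cse(x0, x1, x2, x3):
    """Expressions after CSE: 8 additions/subtractions, 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(1000):
        xs = [rng.randint(-1000, 1000) for _ in range(4)]
        assert transform_direct(*xs) == transform_cse(*xs)
    print("direct and CSE forms agree on 1000 random inputs")
```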

FIR Filter Add/Shift Implementation: Replacing Constant Multiplication by Multiplier Block

FIR Filter Add/Shift Implementation: Registered Adder at no Additional Cost
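The block diagrams for these two slides are not recoverable, so the following is only a plausible software model, assuming a transposed-form filter: every coefficient multiplies the same current input, so all products can come from one shift-and-add multiplier block, and each adder in the tap chain is followed by a register. The names `times_const` and `fir_transposed` are hypothetical, and the multiplier block here expands each coefficient independently without sharing subexpressions.

```python
def times_const(c, x):
    """Multiply x by the integer constant c using shifts and adds only."""
    mag = sum(x << i for i in range(abs(c).bit_length()) if (abs(c) >> i) & 1)
    return -mag if c < 0 else mag


def fir_transposed(h, x):
    """Transposed-form FIR with a shift-add multiplier block (assumed structure)."""
    L = len(h)
    regs = [0] * (L - 1)   # one register after each adder in the tap chain
    y = []
    for sample in x:
        products = [times_const(c, sample) for c in h]   # multiplier block
        y.append(products[0] + (regs[0] if regs else 0))
        # each register stores its tap product plus the next register's old value
        regs = [products[k + 1] + (regs[k + 1] if k + 1 < L - 1 else 0)
                for k in range(L - 1)]
    return y


if __name__ == "__main__":
    print(fir_transposed([1, 2, 3], [1, 0, 0, 2]))
```

In this model the registers sit directly behind the adders, which is the pairing that the "registered adder" slide title appears to refer to.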

Extracting Common Subexpressions
F1 = A + B + C + D
F2 = A + B + C + E
[Figure: Unoptimized expression trees]
[Figure: Optimization by extracting the common expression (A + B + C)]
[Figure: Optimization by extracting the common expression (A + B)]

Synchronization
• Extra registers are needed to synchronize the intermediate values, such that new values for A, B, C, D, E, F can be read in every clock cycle.
[Figure: Calculating registers required for fastest evaluation]

Experiment Results: Resource Utilization/Performance
[Panel: Filter implementation using the add and shift method]
[Panel: Filter implementation using Xilinx Coregen (PDA)]

Experiment Results: Resource Utilization

Experiment Results: Power Consumption

Creating MAC Filters Using Xilinx Coregen

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks, Resource Utilization

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks, Performance

Conclusion/Observations
• Presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low area, low power and high speed implementations of FIR filters.
• Validated our techniques on Virtex II/IV devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques.
  • An average reduction of 58.7% in the number of LUTs, and about 25% reduction in the number of slices and FFs.
  • Better performance in most of the cases, even though our algorithm does not optimize for performance.
  • Observed up to 50% reduction in dynamic power consumption.
• Higher performance as the filter size increases.
  • The critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders.
