High Speed FIR Filter Implementation Using Add and Shift Method
Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner
University of California, Santa Barbara
ICCD 2006, San Jose, California, October 2006

Outline
• Introduction
• FIR filter implementation
  • Traditional methods
    • MAC (Multiply Accumulate) implementation
    • DA (Distributed Arithmetic) implementation
  • New method
    • Add and Shift method and CSE (Common Subexpression Elimination)
• Experiments and results
  • Resource utilization
  • Power consumption
• Conclusion

Introduction
• Extensive use of FPGAs in computationally intensive applications such as DSP
• More logic resources available in current FPGAs
• Broad applications of FIR filters in multimedia and communications
• Need for efficient design methods to save area and power
• Research motivation
  • Develop a more efficient implementation method for FIR filters that consumes less area at comparable performance.
  • Develop a unified tool for performing redundancy elimination, scheduling and module assignment.
  • Perform physically aware optimizations.
  • Explore architecture designs for ASIC and FPGA implementations (Distributed Arithmetic based, adder-shifter based, multiplier-adder based).

FIR Filter: MAC Implementation
• L-tap FIR filter
  • Convolution of the latest L input samples, where L is the number of filter coefficients h[k] and x[n] is the input time series:
    $$y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k]$$
• Disadvantages
  • Large area on FPGA due to multipliers, even though the full flexibility of general-purpose multipliers is not required
  • Limited number of embedded resources such as MAC engines and multipliers in FPGAs

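The convolution sum above can be sketched in a few lines of Python. This is only an illustration of the multiply-accumulate structure, not the authors' hardware; the function name `fir_mac` and the zero-padding of samples before n = 0 are assumptions made for the example.

```python
def fir_mac(h, x):
    """Direct-form FIR: y[n] = sum_{k=0}^{L-1} h[k] * x[n-k].

    Assumption for this sketch: x[m] is treated as 0 for m < 0.
    """
    L = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(L):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


if __name__ == "__main__":
    print(fir_mac([1, 2, 3], [1, 0, 0, 4]))
```

Each output sample costs L multiplications, which is the multiplier area the slide lists as a disadvantage.
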
FIR Filter: DA (Distributed Arithmetic) Implementation
• An alternative to the MAC implementation. DA is the most common FPGA FIR implementation because it suits the LUT-rich architecture of FPGAs.
  $$y = \sum_{n=0}^{N-1} c[n]\, x[n]$$
• Each input x[n] can be written in terms of its bits:
  $$x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b, \qquad x_b[n] \in \{0, 1\}$$
  where x_b[n] is the b-th bit of x[n] and B is the input width. The inner product can then be rewritten as follows:

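FIR Filter: DA (Distributed Arithmetic) Implementation (cont'd)
$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
  &= c[0]\,\bigl(x_{B-1}[0]\,2^{B-1} + x_{B-2}[0]\,2^{B-2} + \dots + x_0[0]\,2^0\bigr) \\
  &\quad + c[1]\,\bigl(x_{B-1}[1]\,2^{B-1} + x_{B-2}[1]\,2^{B-2} + \dots + x_0[1]\,2^0\bigr) \\
  &\quad + \dots \\
  &\quad + c[N-1]\,\bigl(x_{B-1}[N-1]\,2^{B-1} + x_{B-2}[N-1]\,2^{B-2} + \dots + x_0[N-1]\,2^0\bigr) \\
  &= \bigl(c[0]\,x_{B-1}[0] + c[1]\,x_{B-1}[1] + \dots + c[N-1]\,x_{B-1}[N-1]\bigr)\,2^{B-1} \\
  &\quad + \bigl(c[0]\,x_{B-2}[0] + c[1]\,x_{B-2}[1] + \dots + c[N-1]\,x_{B-2}[N-1]\bigr)\,2^{B-2} \\
  &\quad + \dots \\
  &\quad + \bigl(c[0]\,x_0[0] + c[1]\,x_0[1] + \dots + c[N-1]\,x_0[N-1]\bigr)\,2^0 \\
  &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n]
\end{aligned}
$$
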
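The final form of the derivation says that, for each bit position b, the inner sum depends only on the N bits x_b[0..N-1], so it can be precomputed in a lookup table. The following is a minimal sketch of that idea, assuming unsigned B-bit inputs as in the bit expansion above; the names `build_da_lut` and `da_inner_product` are hypothetical and do not come from the slides.

```python
def build_da_lut(c):
    """LUT entry `addr` holds the sum of c[n] over every n whose bit is set in addr."""
    N = len(c)
    return [sum(c[n] for n in range(N) if (addr >> n) & 1)
            for addr in range(1 << N)]


def da_inner_product(c, x, B):
    """Bit-serial DA: y = sum_b 2^b * LUT[bit b of every x[n]].

    Assumption for this sketch: every x[n] is an unsigned integer below 2**B.
    """
    lut = build_da_lut(c)
    y = 0
    for b in range(B):                       # one bit position per step
        addr = 0
        for n, xn in enumerate(x):
            addr |= ((xn >> b) & 1) << n     # gather bit b of each input
        y += lut[addr] << b                  # scaling accumulator: weight by 2^b
    return y


if __name__ == "__main__":
    c, x = [3, -1, 4, 2], [5, 7, 0, 9]
    print(da_inner_product(c, x, B=4), sum(ci * xi for ci, xi in zip(c, x)))
```

No multiplier appears anywhere: the table has 2^N entries, and the only arithmetic is shifting and adding.
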
DA (Distributed Arithmetic) Implementation: Serial
[Figure: A Serial DA Filter Block Diagram]
• n+1 clock cycles are needed for a symmetric filter with n-bit inputs to generate the output.
• Performance is limited because the next input sample can be processed only after every bit of the current input samples has been processed.
• The design trades performance for area.

DA (Distributed Arithmetic) Implementation: Parallel
• Performance can be improved by changing to a parallel architecture that processes the data bits in groups.
• Increasing the number of bits sampled per cycle has a significant effect on FPGA resource utilization:
  • More LUTs
  • Larger scaling accumulator
[Figure: A 2-bit parallel DA Filter Block Diagram]

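The serial versus parallel trade-off can be illustrated with a small model that processes p bits per cycle. It is a sketch under stated assumptions: inputs are unsigned B-bit integers, one loop iteration stands for one clock cycle, and any extra cycle the real architecture needs (such as the n+1 figure quoted for the serial symmetric filter) is ignored. The name `da_parallel` is hypothetical.

```python
def da_parallel(c, x, B, p):
    """Process p input bits per modeled cycle; returns (result, cycles).

    Assumptions: unsigned B-bit inputs; p LUT copies work side by side.
    """
    N = len(c)
    lut = [sum(c[n] for n in range(N) if (a >> n) & 1) for a in range(1 << N)]
    y, cycles = 0, 0
    for base in range(0, B, p):                  # one iteration models one cycle
        for b in range(base, min(base + p, B)):  # p lookups in the same cycle
            addr = 0
            for n, xn in enumerate(x):
                addr |= ((xn >> b) & 1) << n
            y += lut[addr] << b
        cycles += 1
    return y, cycles


if __name__ == "__main__":
    c, x = [3, -1, 4, 2], [5, 7, 0, 9]
    print(da_parallel(c, x, B=4, p=1))  # bit-serial
    print(da_parallel(c, x, B=4, p=2))  # 2-bit parallel
```

Doubling p halves the modeled cycle count but needs twice as many table lookups per cycle, which matches the slide's point about more LUTs and a larger accumulator.
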
CSE (Common Subexpression Elimination)
• Linear systems can be modeled using polynomials. Expressions consist of the +, -, and << operators.
• Polynomial formulation, where L^i denotes a left shift by i bits:
  $$C \times X = \sum_i \pm X L^i$$
  $$(14)_{10} \times X = (1110)_2 \times X = (X \ll 3) + (X \ll 2) + (X \ll 1) = XL^3 + XL^2 + XL^1$$

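The constant-multiplication example can be checked with a short sketch. It uses the plain binary expansion of a non-negative constant, so only + terms appear; the ± in the general formulation (signed-digit forms) is not modeled here. The helper names are hypothetical.

```python
def shift_add_terms(c):
    """Shift amounts i for which bit i of the non-negative constant c is set."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def multiply_by_constant(c, x):
    """Compute c * x using only shifts and additions."""
    return sum(x << i for i in shift_add_terms(c))


if __name__ == "__main__":
    print(shift_add_terms(14))           # shifts used for 14 = (1110)_2
    print(multiply_by_constant(14, 5))   # same value as 14 * 5
```
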
CSE Example
Y0 = X0 + X1 + X2 + X3
Y1 = 2X0 + X1 – X2 – 2X3
Y2 = X0 – X1 – X2 + X3
Y3 = X0 – 2X1 + 2X2 – X3

In matrix form:
$$
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
=
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}
$$

In shift notation (L is a left shift by one bit):
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 – X2 – X3L
Y2 = X0 – X1 – X2 + X3
Y3 = X0 – X1L + X2L – X3

CSE Example: extracting D0 and D1
D0 = (X0 + X3)
D1 = (X1 – X2)

After substituting D0:
Y0 = D0 + X1 + X2
Y1 = X0L + X1 – X2 – X3L
Y2 = D0 – X1 – X2
Y3 = X0 – X1L + X2L – X3

After substituting D1:
Y0 = D0 + X1 + X2
Y1 = X0L + D1 – X3L
Y2 = D0 – X1 – X2
Y3 = X0 – D1L – X3

CSE Example: extracting D2 and D3
D2 = (X1 + X2)
D3 = (X0 – X3)

After substituting D2:
Y0 = D0 + D2
Y1 = X0L + D1 – X3L
Y2 = D0 – D2
Y3 = X0 – D1L – X3

CSE Example: result
Before (12 additions, 4 shifts):
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 – X2 – X3L
Y2 = X0 – X1 – X2 + X3
Y3 = X0 – X1L + X2L – X3

After (8 additions, 2 shifts):
D0 = X0 + X3
D1 = X1 – X2
D2 = X1 + X2
D3 = X0 – X3
Y0 = D0 + D2
Y1 = D1 + D3L
Y2 = D0 – D2
Y3 = D3 – D1L

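The before and after forms of this example can be checked against each other numerically. The sketch below writes both versions with one-bit left shifts and compares them over a small range of integer inputs; the function names are hypothetical, and the exhaustive range of -3..3 is just a convenient sanity check, not a proof.

```python
from itertools import product


def transform_original(x0, x1, x2, x3):
    """Unoptimized form: 12 additions/subtractions and 4 shifts."""
    y0 = x0 + x1 + x2 + x3
    y1 = (x0 << 1) + x1 - x2 - (x3 << 1)
    y2 = x0 - x1 - x2 + x3
    y3 = x0 - (x1 << 1) + (x2 << 1) - x3
    return y0, y1, y2, y3


def transform_cse(x0, x1, x2, x3):
    """After CSE: 8 additions/subtractions and 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)


if __name__ == "__main__":
    ok = all(transform_original(*v) == transform_cse(*v)
             for v in product(range(-3, 4), repeat=4))
    print("forms agree:", ok)
```

The saving comes from Y1 = X0L + D1 - X3L collapsing to D1 + D3L, and Y3 = X0 - D1L - X3 collapsing to D3 - D1L, once D3 = X0 - X3 is available.
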
FIR Filter Add/Shift Implementation: Replacing Constant Multiplication by Multiplier Block

FIR Filter Add/Shift Implementation: Registered Adder at no Additional Cost

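The block diagrams for these two slides did not survive, so the following is only a sketch of one structure consistent with their titles. It assumes a transposed-form filter, in which the current sample is multiplied by every coefficient at once (the role a multiplier block would play) and each tap ends in an adder followed by a register. The function name `fir_transposed` and the use of ordinary multiplication as a stand-in for the shift-and-add block are assumptions.

```python
def fir_transposed(h, x):
    """Transposed-form FIR: every product uses the current sample x[n].

    Assumption: s[k] models the register after the adder of tap k.
    """
    L = len(h)
    s = [0] * (L + 1)                        # s[L] always stays 0
    y = []
    for xn in x:
        products = [hk * xn for hk in h]     # stand-in for the multiplier block
        y.append(products[0] + s[1])
        # Registered adders: each tap adds its product to the next tap's register.
        s = [0] + [products[k] + s[k + 1] for k in range(1, L)] + [0]
    return y


if __name__ == "__main__":
    print(fir_transposed([1, 2, 3], [1, 0, 0, 4]))
```

Because all products share the single input xn, common subexpressions across coefficients can be shared inside the block, which is where the CSE step applies.
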
Extracting Common Subexpressions
F1 = A + B + C + D
F2 = A + B + C + E
[Figure panels: Unoptimized Expression Trees; Optimization: Extracting Common Expression (A + B + C); Extracting Common Expression (A + B)]

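The choice between extracting (A + B + C) and (A + B) is a trade-off between adder count and adder depth. The sketch below compares three netlists for F1 and F2; since the original figure is not available, the tree shapes are a reconstruction and should be read as an assumption, as are all intermediate names (`t1`, `d1`, and so on).

```python
# Each entry is one two-input adder: name -> (operand, operand).
UNOPTIMIZED = {"t1": ("A", "B"), "t2": ("C", "D"), "F1": ("t1", "t2"),
               "t3": ("A", "B"), "t4": ("C", "E"), "F2": ("t3", "t4")}
EXTRACT_ABC = {"d1": ("A", "B"), "d2": ("d1", "C"),
               "F1": ("d2", "D"), "F2": ("d2", "E")}
EXTRACT_AB = {"d1": ("A", "B"), "t1": ("C", "D"), "t2": ("C", "E"),
              "F1": ("d1", "t1"), "F2": ("d1", "t2")}


def depth(net, node):
    """Number of adders on the longest path from a primary input to node."""
    if node not in net:
        return 0
    return 1 + max(depth(net, op) for op in net[node])


def evaluate(net, node, values):
    """Evaluate node, reading primary inputs from the values dict."""
    if node not in net:
        return values[node]
    a, b = net[node]
    return evaluate(net, a, values) + evaluate(net, b, values)


def summarize(net):
    """Return (adder count, critical-path depth in adders)."""
    return len(net), max(depth(net, "F1"), depth(net, "F2"))


if __name__ == "__main__":
    for name, net in [("unoptimized", UNOPTIMIZED),
                      ("extract A+B+C", EXTRACT_ABC),
                      ("extract A+B", EXTRACT_AB)]:
        print(name, summarize(net))
```

In this reconstruction, extracting (A + B + C) uses the fewest adders but lengthens the critical path by one adder, while extracting (A + B) saves one adder and keeps the original depth.
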
Synchronization
• Extra registers are needed to synchronize the intermediate values, so that new values for A, B, C, D, E, F can be read in every clock cycle.
[Figure: Calculating registers required for fastest evaluation]

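One way to estimate the extra registers is to balance operand arrival times in a pipelined adder network. The sketch below is not the authors' calculation, whose figure is not available. It assumes every adder is registered (one cycle of latency), primary inputs arrive at cycle 0, and each late-needed operand gets its own delay chain with no sharing between chains. It reuses F1 = A + B + C + D and F2 = A + B + C + E as a hypothetical example.

```python
def balance_registers(net, outputs):
    """Count delay registers needed so operands of each adder arrive together.

    Assumptions: registered adders (1 cycle each), inputs ready at cycle 0,
    no sharing of delay chains, and all outputs aligned to the same cycle.
    """
    ready = {}

    def ready_cycle(node):
        if node not in net:
            return 0                      # primary input
        if node not in ready:
            ready[node] = 1 + max(ready_cycle(op) for op in net[node])
        return ready[node]

    regs = 0
    for ops in net.values():
        latest = max(ready_cycle(op) for op in ops)
        regs += sum(latest - ready_cycle(op) for op in ops)
    final = max(ready_cycle(o) for o in outputs)
    regs += sum(final - ready_cycle(o) for o in outputs)
    return regs


# Hypothetical netlists: name -> (operand, operand), one adder per entry.
EXTRACT_ABC = {"d1": ("A", "B"), "d2": ("d1", "C"),
               "F1": ("d2", "D"), "F2": ("d2", "E")}
EXTRACT_AB = {"d1": ("A", "B"), "t1": ("C", "D"), "t2": ("C", "E"),
              "F1": ("d1", "t1"), "F2": ("d1", "t2")}

if __name__ == "__main__":
    print(balance_registers(EXTRACT_ABC, ["F1", "F2"]))
    print(balance_registers(EXTRACT_AB, ["F1", "F2"]))
```

Under these assumptions the deeper (A + B + C) network needs delay registers on C, D and E, while the balanced (A + B) network needs none, so adder savings have to be weighed against synchronization registers.
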
Experiment Results: Resource Utilization/Performance
[Tables: Filter Implementation Using Add and Shift Method; Filter Implementation Using Xilinx Coregen (PDA)]

Experiment Results: Resource Utilization

Experiment Results: Power Consumption

Creating MAC Filters Using Xilinx Coregen

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks (Resource Utilization)

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks (Performance)

Conclusion/Observations
• Presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low-area, low-power and high-speed implementations of FIR filters.
• Validated our techniques on Virtex II/IV devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques:
  • An average reduction of 58.7% in the number of LUTs, and a reduction of about 25% in the number of slices and FFs.
  • Better performance in most cases, even though our algorithm does not optimize for performance.
  • Up to 50% reduction in dynamic power consumption.
• Higher performance as the filter size increases.
  • The critical path in our design consists of adders, while in the MAC method it consists of multipliers and adders.