
High Speed FIR Filter Implementation Using Add and Shift Method

Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner. University of California, Santa Barbara. ICCD 2006, San Jose, California, October 2006.


Presentation Transcript


  1. High Speed FIR Filter Implementation Using Add and Shift Method. Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner, University of California, Santa Barbara. ICCD 2006, San Jose, California, October 2006. UC Santa Barbara ICCD 2006

  2. Outline
     • Introduction
     • FIR filter implementation
       • Traditional methods
         • MAC (Multiply Accumulate) implementation
         • DA (Distributed Arithmetic) implementation
       • New method
         • Add and Shift method and CSE (Common Subexpression Elimination)
     • Experiments and results
       • Resource utilization
       • Power consumption
     • Conclusion

  3. Introduction
     • Extensive use of FPGAs in computationally intensive applications such as DSP
     • More logic resources available in current FPGAs
     • Broad applications of FIR filters in multimedia and communications
     • Need for efficient design methods to save area and power
     • Research motivation
       • Develop a more efficient implementation method for FIR filters that consumes less area at comparable performance.
       • Develop a unified tool for performing redundancy elimination, scheduling, and module assignment.
       • Perform physically aware optimizations.
       • Explore architecture designs for ASIC and FPGA implementations (Distributed Arithmetic based, adder-shifter based, multiplier-adder based).

  4. FIR Filter: MAC Implementation
     • L-tap FIR filter
       • The output is the convolution of the filter coefficients with the latest L input samples, where L is the number of coefficients h[k] of the filter and x[n] is the input time series:
         $$y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k]$$
     • Disadvantages
       • Large area on the FPGA because of the multipliers, even though the full flexibility of general-purpose multipliers is not required
       • Limited number of embedded resources such as MAC engines and multipliers in FPGAs
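
The convolution sum on this slide can be sketched in a few lines of Python. This is only an illustrative software model, not the authors' hardware design: the function name `fir_mac` and the assumption that samples before the start of the input are zero are introduced here for the example.

```python
def fir_mac(h, x):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n-k].

    Assumption for this sketch: x[m] is treated as 0 for m < 0.
    """
    L = len(h)  # number of taps
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(L):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


# An impulse followed by a second impulse three samples later.
print(fir_mac([1, 2, 3], [1, 0, 0, 1]))  # prints [1, 2, 3, 1]
```

Each output sample costs L multiplications, which is the cost the later slides try to remove.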

  5. FIR Filter: DA (Distributed Arithmetic) Implementation
     • An alternative to the MAC implementation. DA is the most common FPGA FIR implementation because of the LUT-rich architecture of FPGAs. The filter output is the inner product
       $$y = \sum_{n=0}^{N-1} c[n]\, x[n]$$
     • Each variable x[n] can be represented by
       $$x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b, \qquad x_b[n] \in \{0, 1\}$$
       where $x_b[n]$ is the b-th bit of x[n] and B is the input width. The inner product can be rewritten as follows:

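  6. FIR Filter: DA (Distributed Arithmetic) Implementation (cont'd)
     $$\begin{aligned}
     y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
       &= c[0]\,\bigl(x_{B-1}[0]\, 2^{B-1} + x_{B-2}[0]\, 2^{B-2} + \dots + x_0[0]\, 2^0\bigr) \\
       &\quad + c[1]\,\bigl(x_{B-1}[1]\, 2^{B-1} + x_{B-2}[1]\, 2^{B-2} + \dots + x_0[1]\, 2^0\bigr) \\
       &\quad + \dots \\
       &\quad + c[N-1]\,\bigl(x_{B-1}[N-1]\, 2^{B-1} + x_{B-2}[N-1]\, 2^{B-2} + \dots + x_0[N-1]\, 2^0\bigr) \\
       &= \bigl(c[0]\, x_{B-1}[0] + c[1]\, x_{B-1}[1] + \dots + c[N-1]\, x_{B-1}[N-1]\bigr)\, 2^{B-1} \\
       &\quad + \bigl(c[0]\, x_{B-2}[0] + c[1]\, x_{B-2}[1] + \dots + c[N-1]\, x_{B-2}[N-1]\bigr)\, 2^{B-2} \\
       &\quad + \dots \\
       &\quad + \bigl(c[0]\, x_0[0] + c[1]\, x_0[1] + \dots + c[N-1]\, x_0[N-1]\bigr)\, 2^0 \\
       &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n]
     \end{aligned}$$
     where n = 0, 1, …, N-1 and b = 0, 1, …, B-1.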
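
The last line of the derivation says that, for each bit position b, the inner sum depends only on the N bits $x_b[0], \dots, x_b[N-1]$, so it can be precomputed in a table with $2^N$ entries. The following is a minimal bit-serial sketch of that idea, assuming unsigned B-bit inputs as on the slides. The function names `build_da_lut` and `da_inner_product` are hypothetical and the code is a software model, not the authors' or any vendor's implementation.

```python
def build_da_lut(c):
    """Table entry for address a: sum of c[n] over every bit n set in a."""
    N = len(c)
    return [sum(c[n] for n in range(N) if (addr >> n) & 1)
            for addr in range(1 << N)]


def da_inner_product(c, x, B):
    """Compute sum_b 2^b * sum_n c[n] * x_b[n] for unsigned B-bit samples x[n]."""
    lut = build_da_lut(c)
    y = 0
    for b in range(B):
        # Gather bit b of every sample into one N-bit table address.
        addr = 0
        for n, xn in enumerate(x):
            addr |= ((xn >> b) & 1) << n
        y += lut[addr] << b  # scaling accumulator: weight the lookup by 2^b
    return y


c = [3, -1, 4, 2]
x = [5, 7, 0, 9]
print(da_inner_product(c, x, B=4))           # prints 26
print(sum(cn * xn for cn, xn in zip(c, x)))  # prints 26 (direct inner product)
```

No multiplier appears in the loop: each bit position costs one table lookup, one shift, and one addition.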

  7. DA (Distributed Arithmetic) Implementation: Serial
     [Figure: a serial DA filter block diagram]
     • n+1 clock cycles are needed for an n-bit input symmetrical filter to generate the output.
     • Performance is limited because the next input sample can be processed only after every bit of the current input samples has been processed.
     • The design trades performance for area.

  8. DA (Distributed Arithmetic) Implementation: Parallel
     • The performance of the circuit can be improved by changing to a parallel architecture that processes the data bits in groups.
     • Increasing the number of bits sampled has a significant effect on FPGA resource utilization:
       • More LUTs
       • Larger scaling accumulator
     [Figure: a 2-bit parallel DA filter block diagram]
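
The serial and parallel variants differ only in how many bit positions are consumed per step. The sketch below models that trade-off under simplifying assumptions made for this example: unsigned inputs, one table lookup per bit in the group, and a cycle count that covers only the lookup steps (so it does not reproduce the n+1 figure quoted for the symmetrical serial filter). The name `da_parallel` and its `bits_per_cycle` parameter are hypothetical.

```python
def da_parallel(c, x, B, bits_per_cycle=2):
    """Distributed-arithmetic inner product, consuming several bits per step.

    Returns (result, number_of_steps). In hardware, each bit in a group
    would need its own copy of the lookup table, which is where the extra
    LUTs come from.
    """
    N = len(c)
    lut = [sum(c[n] for n in range(N) if (a >> n) & 1) for a in range(1 << N)]
    y, cycles = 0, 0
    for base in range(0, B, bits_per_cycle):
        partial = 0
        for b in range(base, min(base + bits_per_cycle, B)):
            addr = 0
            for n, xn in enumerate(x):
                addr |= ((xn >> b) & 1) << n
            partial += lut[addr] << (b - base)  # combine lookups within the group
        y += partial << base                    # wider scaling accumulator
        cycles += 1
    return y, cycles


c, x = [3, -1, 4, 2], [5, 7, 0, 9]
print(da_parallel(c, x, B=4, bits_per_cycle=1))  # prints (26, 4)
print(da_parallel(c, x, B=4, bits_per_cycle=2))  # prints (26, 2)
```

Doubling the group size halves the number of steps in this model while doubling the number of lookups performed per step.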

  9. CSE (Common Subexpression Elimination)
     • Linear systems can be modeled using polynomials. Expressions consist of the +, -, and << operators.
     • Polynomial formulation, where $L^i$ denotes a left shift by i bits:
       $$C \times X = \sum_i \pm X L^i$$
       $$(14)_{10} \times X = (1110)_2 \times X = (X \ll 3) + (X \ll 2) + (X \ll 1) = X L^3 + X L^2 + X L^1$$
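
The decomposition of a constant multiplication into shifts and additions can be checked with a short sketch. It covers only the plain binary expansion of a non-negative constant, so every term has a plus sign. The helper names `shift_add_terms` and `multiply_by_constant` are introduced here for illustration.

```python
def shift_add_terms(c):
    """Shift amounts i for which bit i of the non-negative constant c is set."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def multiply_by_constant(c, x):
    """Compute c * x using only shifts and additions."""
    return sum(x << i for i in shift_add_terms(c))


print(shift_add_terms(14))          # prints [1, 2, 3], i.e. X L^1 + X L^2 + X L^3
print(multiply_by_constant(14, 5))  # prints 70
```

The ± in the polynomial form also allows subtracted terms. For example, 14 = 16 - 2 gives (X << 4) - (X << 1), which needs one operation instead of two; the sketch above does not search for such forms.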

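  10. CSE Example
      Y0 = X0 + X1 + X2 + X3
      Y1 = 2X0 + X1 - X2 - 2X3
      Y2 = X0 - X1 - X2 + X3
      Y3 = X0 - 2X1 + 2X2 - X3
      In matrix form:
      $$\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} \begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}$$
      With each multiplication by 2 written as a left shift L:
      Y0 = X0 + X1 + X2 + X3
      Y1 = X0L + X1 - X2 - X3L
      Y2 = X0 - X1 - X2 + X3
      Y3 = X0 - X1L + X2L - X3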

  11. CSE Example (cont'd)
      Common subexpressions:
      D0 = (X0 + X3)
      D1 = (X1 - X2)
      Before:
      Y0 = X0 + X1 + X2 + X3
      Y1 = X0L + X1 - X2 - X3L
      Y2 = X0 - X1 - X2 + X3
      Y3 = X0 - X1L + X2L - X3
      After substituting D0:
      Y0 = D0 + X1 + X2
      Y1 = X0L + X1 - X2 - X3L
      Y2 = D0 - X1 - X2
      Y3 = X0 - X1L + X2L - X3

  12. CSE Example (cont'd)
      After substituting D1:
      Y0 = D0 + X1 + X2
      Y1 = X0L + D1 - X3L
      Y2 = D0 - X1 - X2
      Y3 = X0 - D1L - X3
      Common subexpressions:
      D2 = (X1 + X2)
      D3 = (X0 - X3)
      After substituting D2:
      Y0 = D0 + D2
      Y1 = X0L + D1 - X3L
      Y2 = D0 - D2
      Y3 = X0 - D1L - X3

  13. CSE Example (cont'd)
      Original expressions (12 additions, 4 shifts):
      Y0 = X0 + X1 + X2 + X3
      Y1 = X0L + X1 - X2 - X3L
      Y2 = X0 - X1 - X2 + X3
      Y3 = X0 - X1L + X2L - X3
      After CSE (8 additions, 2 shifts):
      D0 = X0 + X3
      D1 = X1 - X2
      D2 = X1 + X2
      D3 = X0 - X3
      Y0 = D0 + D2
      Y1 = D1 + D3L
      Y2 = D0 - D2
      Y3 = D3 - D1L
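
The equivalence of the original and optimized expressions can be confirmed numerically. The sketch below transcribes both forms from the slides, reading L as a left shift by one bit, and compares them over a small range of integer inputs. The function names are chosen here for illustration.

```python
import itertools


def transform_direct(x0, x1, x2, x3):
    """Original expressions: 12 additions/subtractions and 4 shifts."""
    y0 = x0 + x1 + x2 + x3
    y1 = (x0 << 1) + x1 - x2 - (x3 << 1)
    y2 = x0 - x1 - x2 + x3
    y3 = x0 - (x1 << 1) + (x2 << 1) - x3
    return y0, y1, y2, y3


def transform_cse(x0, x1, x2, x3):
    """After CSE: 8 additions/subtractions and 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)


# Exhaustive check over a small input range, including negative values.
all_equal = all(
    transform_direct(*v) == transform_cse(*v)
    for v in itertools.product(range(-3, 4), repeat=4)
)
print(all_equal)                    # prints True
print(transform_cse(1, 2, 3, 4))    # prints (10, -7, 0, -1)
```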

  14. FIR Filter Add/Shift Implementation: Replacing Constant Multiplication by Multiplier Block

  15. FIR Filter Add/Shift Implementation: Registered Adder at no Additional Cost

  16. Extracting Common Subexpressions
      F1 = A + B + C + D
      F2 = A + B + C + E
      [Figure panels: unoptimized expression trees; optimization by extracting the common expression (A + B + C); optimization by extracting the common expression (A + B)]
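
The figure for this slide is not part of the transcript, so the following sketch rebuilds the three cases under assumptions of its own: two-input adders, a balanced tree for the unoptimized case, and the tree shapes written out in the code. It counts adders and adder levels for each choice of extracted expression. The `Add` class and helper names are hypothetical.

```python
class Add:
    """One two-input adder; leaves are plain strings naming the inputs."""

    def __init__(self, left, right):
        self.left, self.right = left, right


def count_adders(outputs):
    """Count distinct Add objects reachable from the outputs (shared ones once)."""
    seen, stack = set(), list(outputs)
    while stack:
        node = stack.pop()
        if isinstance(node, Add) and node not in seen:
            seen.add(node)
            stack += [node.left, node.right]
    return len(seen)


def depth(node):
    """Number of adder levels between the inputs and this node."""
    if not isinstance(node, Add):
        return 0
    return 1 + max(depth(node.left), depth(node.right))


def summarize(outputs):
    return count_adders(outputs), max(depth(o) for o in outputs)


def unoptimized():
    return [Add(Add("A", "B"), Add("C", "D")), Add(Add("A", "B"), Add("C", "E"))]


def extract_abc():
    t = Add(Add("A", "B"), "C")  # shared (A + B + C)
    return [Add(t, "D"), Add(t, "E")]


def extract_ab():
    t = Add("A", "B")            # shared (A + B)
    return [Add(t, Add("C", "D")), Add(t, Add("C", "E"))]


print(summarize(unoptimized()))  # prints (6, 2)
print(summarize(extract_abc()))  # prints (4, 3)
print(summarize(extract_ab()))   # prints (5, 2)
```

Under these assumptions, extracting (A + B + C) uses the fewest adders but adds a level of delay, while extracting (A + B) saves one adder and keeps the original depth.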

  17. Synchronization
      • Extra registers are needed to synchronize the intermediate values so that new values for A, B, C, D, E, F can be read in every clock cycle.
      [Figure: calculating the registers required for fastest evaluation]
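
One simple way to estimate the extra registers is sketched below. It is an assumed model, not necessarily the calculation shown in the missing figure: every adder output is registered, all primary inputs arrive in the same cycle, and an operand that is ready earlier than its partner is delayed by one register per cycle of difference, with no sharing of delay chains. The function name and the netlists, which reuse the F1/F2 example from the previous slide, are illustrative.

```python
def balancing_registers(adders):
    """Count delay registers needed to align the two operands of every adder.

    adders maps each adder's output name to its two operand names and must be
    listed in dependency order. Names that are not adder outputs are primary
    inputs, available at level 0.
    """
    level, extra = {}, 0
    for out, (a, b) in adders.items():
        la, lb = level.get(a, 0), level.get(b, 0)
        level[out] = 1 + max(la, lb)  # the registered adder adds one cycle
        extra += abs(la - lb)         # delay the operand that is ready earlier
    return extra, level


# Sharing (A + B + C): fewer adders, but C, D, and E must be delayed.
share_abc = {"T1": ("A", "B"), "T": ("T1", "C"), "F1": ("T", "D"), "F2": ("T", "E")}
# Sharing (A + B): every adder sees operands from the same level.
share_ab = {"T": ("A", "B"), "U": ("C", "D"), "V": ("C", "E"),
            "F1": ("T", "U"), "F2": ("T", "V")}

print(balancing_registers(share_abc)[0])  # prints 5
print(balancing_registers(share_ab)[0])   # prints 0
```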

  18. Experiment Results: Resource Utilization/Performance [Panels: filter implementation using the add and shift method; filter implementation using Xilinx Coregen (PDA)]

  19. Experiment Results: Resource Utilization

  20. Experiment Results: Power Consumption

  21. Creating MAC Filters Using Xilinx Coregen

  22. Experiment Results: Comparison with MAC Filters Using Multiplier Blocks

  23. Experiment Results: Comparison with MAC Filters Using Multiplier Blocks, Resource Utilization

  24. Experiment Results: Comparison with MAC Filters Using Multiplier Blocks, Performance

  25. Conclusion/Observations
      • Presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low-area, low-power, and high-speed implementations of FIR filters.
      • Validated our techniques on Virtex II/IV devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques.
        • An average reduction of 58.7% in the number of LUTs, and a reduction of about 25% in the number of slices and FFs
        • Better performance in most cases, even though our algorithm does not optimize for performance
        • Up to 50% reduction in dynamic power consumption
      • Higher performance as the filter size increases.
        • The critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders.
