
A New Class of High Performance FFTs

Dr. J. Greg Nash, Centar (www.centar.net), jgregnash@centar.net. High Performance Embedded Computing (HPEC) Workshop, 19-21 September 2006.


Presentation Transcript


  1. A New Class of High Performance FFTs • Dr. J. Greg Nash, Centar (www.centar.net), jgregnash@centar.net • High Performance Embedded Computing (HPEC) Workshop, 19-21 September 2006

  2. New Base-4 DFT Matrix Equation • Traditional DFT matrix form: Z = CX • New matrix form for DFT†: Y = WM ∘ (CM1 X), Z = CM2 Y^t ("∘" = element-by-element multiply) • CM1 and CM2 contain only elements from the set {1, -1, j, -j} • CM1 X and CM2 Y^t involve only complex additions/subtractions • Twiddle factor matrix WM is of size N/4 x N/4, rather than the N x N of C • 16x fewer multiplies than the traditional DFT equation (Z = CX) †J. G. Nash, "Computationally efficient systolic architecture for computing the discrete Fourier transform," IEEE Transactions on Signal Processing, Volume 53, Issue 12, Dec. 2005, pp. 4640-4651.

  3. Find Systolic Architecture Using SPADE† • Flow diagram (box labels): Mathematical Algorithm; Input Code; FPGA Architectural Constraints; Objective Functions; Automatic Search for Space-Time Transformations, T; Simulator, Graphical Outputs • FPGA architectural constraints: 2-D mesh array; fine-grained PEs (registers, adder, mux); linear arrays of multipliers, memory • Input code:
       for j to N/4 do
         for k to N/4 do Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k], i=1..4); od;
         for k to 4 do Z[k,j] := add(CM2[k,i]*Y[j,i], i=1..N/4); od
       od;
     †SPADE: Symbolic Parallel Algorithm Development Environment
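The loop nest above can be tried out in software. The following is a minimal Python sketch, not the SPADE input or the author's implementation: the function names, the input mapping X[i][k] = x[i*(N/4) + k], the output mapping Z[k][j] -> index j + (N/4)*k, and the restriction to N divisible by 16 are all assumptions chosen so that the result matches a direct DFT.

```python
import cmath

def dft_direct(x):
    """Reference DFT, Z = C X, with C[m][n] = exp(-2*pi*j*m*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * m * n / N) for n in range(N))
            for m in range(N)]

def w4(p):
    """Exact p-th power of -j, so CM1/CM2 entries are exactly 1, -j, -1 or j."""
    return (1, -1j, -1, 1j)[p % 4]

def dft_base4(x):
    """Y = WM o (CM1 X), Z = CM2 Y^t, following the loop nest on the slide.

    Index mappings are assumptions of this sketch, not taken from the slides.
    """
    N = len(x)
    if N % 16:
        raise ValueError("this sketch assumes N is a multiple of 16")
    M = N // 4
    X = [[x[i * M + k] for k in range(M)] for i in range(4)]      # 4 x M
    CM1 = [[w4(j * i) for i in range(4)] for j in range(M)]       # M x 4
    CM2 = [[w4(k * i) for i in range(M)] for k in range(4)]       # 4 x M
    WM = [[cmath.exp(-2j * cmath.pi * j * k / N) for k in range(M)]
          for j in range(M)]                                      # M x M twiddles
    # Y[j][k] = WM[j][k] * sum_i CM1[j][i] * X[i][k]  (one multiply per element)
    Y = [[WM[j][k] * sum(CM1[j][i] * X[i][k] for i in range(4))
          for k in range(M)] for j in range(M)]
    # Z[k][j] = sum_i CM2[k][i] * Y[j][i]
    Z = [[sum(CM2[k][i] * Y[j][i] for i in range(M)) for j in range(M)]
         for k in range(4)]
    # Flatten so that Z[k][j] lands at output index j + M*k.
    return [Z[k][j] for k in range(4) for j in range(M)]
```

In this sketch the only general complex multiplies are the (N/4)^2 products with WM; the CM1 and CM2 products are written as multiplications for brevity but only involve factors of 1, -1, j and -j.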

  4. Functional Operation • Processing flow for DFT of length N = N1 * N2 • Stage 1: N2 column DFTs (Xci) of length N1 • Stage 2: Twiddle multiplication • Stage 3: N1 row DFTs (Xri) of length N2 • Systolic adder arrays for matrix multiplication • N1/4 x 4 array for column multiplies CM1 Xci and CM2 Ytci • N2/4 x 4 array for row multiplies CM1 Xri and CM2 Ytri • N2/4 x 4 array is implemented virtually on one row of N1/4 x 4 array • Uses systolic 1-D array matrix multiplication
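The three-stage flow above can be checked numerically. This is a minimal sketch of the standard N = N1 * N2 row/column decomposition, not a model of the systolic hardware: the helper names and the index mappings (input n = n1*N2 + n2, output m = k1 + N1*k2) are assumptions of the sketch, and each short DFT is computed directly rather than with the base-4 matrix form.

```python
import cmath

def dft(v):
    """Direct DFT of a short vector (stands in for a column or row DFT)."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def dft_two_stage(x, N1, N2):
    """DFT of length N = N1 * N2 via column DFTs, twiddles, then row DFTs."""
    N = N1 * N2
    if len(x) != N:
        raise ValueError("len(x) must equal N1 * N2")
    # Arrange the input as an N1 x N2 matrix (assumed mapping n = n1*N2 + n2).
    A = [[x[n1 * N2 + n2] for n2 in range(N2)] for n1 in range(N1)]
    # Stage 1: N2 column DFTs of length N1; cols[n2][k1].
    cols = [dft([A[n1][n2] for n1 in range(N1)]) for n2 in range(N2)]
    # Stage 2: twiddle multiplication by exp(-2*pi*j*n2*k1/N).
    B = [[cols[n2][k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / N)
          for n2 in range(N2)] for k1 in range(N1)]
    # Stage 3: N1 row DFTs of length N2; rows[k1][k2].
    rows = [dft(B[k1]) for k1 in range(N1)]
    # Output index m = k1 + N1*k2.
    return [rows[m % N1][m // N1] for m in range(N)]
```

For example, `dft_two_stage(x, 4, 8)` should agree with `dft(x)` for any 32-point input, up to roundoff.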

  5. FFT Systolic Architecture • Example architecture for N = 1024 (N1 = N2 = 32) • Simple PEs, locally connected • Higher clock speeds • Easier design/test/maintainability • Lower power • Efficient use of FPGA fabric • Simple control • Small memory blocks (one per PE) • Faster read/write times • Lower power • Linear structure (scales in N/S direction) • Matches fabric of FPGA linear distributed embedded elements (e.g., memory and multipliers)

  6. Enhanced Functionality • Transform size N not restricted to powers of two • N = 256n (n = 1, 2, 3, ...) • More reachable points • Uniform distribution of points • Circuit is scalable • Any DFT size can be computed on the same hardware with sufficient memory • Larger FFT circuits constructed by replication of identical 4x4 PE array processing blocks • Low computational latency • Pipeline depth is small compared with traditional pipelined FFTs • 1-D and 2-D transforms possible on the same circuit
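To illustrate the "more reachable points" bullet, the short sketch below lists the transform sizes of the form N = 256n up to a limit and compares them with the powers of two in the same range. The function names and the choice to start the power-of-two list at 256 are assumptions made for the comparison.

```python
def sizes_256n(limit):
    """Transform sizes N = 256n (n = 1, 2, 3, ...) not exceeding limit."""
    return list(range(256, limit + 1, 256))

def sizes_pow2(limit, start=256):
    """Power-of-two transform sizes from start up to limit (start is assumed)."""
    out = []
    n = start
    while n <= limit:
        out.append(n)
        n *= 2
    return out

if __name__ == "__main__":
    limit = 4096
    print(len(sizes_256n(limit)), "sizes of the form 256n up to", limit)
    print(len(sizes_pow2(limit)), "power-of-two sizes up to", limit)
```

Up to N = 4096 this gives 16 sizes spaced uniformly 256 points apart, against 5 power-of-two sizes.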

  7. Block Floating Point/Floating Point Operation • Multiple "regions", each with its own block floating point and floating point circuitry (32 regions in a 1024-point FFT) • Column DFTs use block floating point and row DFTs use floating point • Higher dynamic range and lower signal to noise ratio • Number of regions increases with transform size • Supports streaming FFTs • Comparison of "single tone", random frequency and phase data sets (DR = dynamic range, "noise" = roundoff noise):
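As background for the block floating point bullets, here is a minimal sketch of the general technique: a block of fixed-point values shares one exponent, and the whole block is shifted right when its largest magnitude would overflow the word. This is a generic illustration under assumed parameters (16-bit words, integer (re, im) pairs, the name `bfp_normalize`), not a description of the circuitry in the presentation.

```python
def bfp_normalize(block, exponent, bits=16):
    """Scale a block of integer (re, im) pairs to fit in `bits` signed bits.

    All values share one exponent; each right shift by one bit is recorded
    by incrementing that exponent. Word width and data layout are assumptions.
    """
    limit = (1 << (bits - 1)) - 1          # largest positive signed value
    peak = max(max(abs(re), abs(im)) for re, im in block)
    shift = 0
    while (peak >> shift) > limit:
        shift += 1
    # Python's >> on negative integers is an arithmetic (flooring) shift.
    scaled = [(re >> shift, im >> shift) for re, im in block]
    return scaled, exponent + shift

if __name__ == "__main__":
    scaled, exp = bfp_normalize([(40000, -3), (100, 70000)], exponent=0)
    print(scaled, exp)
```

The small value -3 in the example loses most of its precision after the shift, which is the roundoff cost of sharing one exponent across a block.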

  8. Performance Comparison: 256-point DFT • Altera block floating point circuit • “Streaming” (continuous data in and out) • Comparable dynamic range and signal to (roundoff) noise ratio • Both circuits mapped to Altera Stratix II EP2S15F484C3 FPGA • Altera circuit from Megacore FFT v2.2.0 • Results from timing analysis (Altera Quartus 5.1 software)

  9. Preliminary Figure of Merit • Altera block floating point circuits • "Streaming" (continuous data in and out) • Comparable dynamic range and signal to noise ratio • Circuits mapped to Altera Stratix II FPGAs • Altera circuit from Megacore FFT v2.2.0 • FOM = Area (ALMs) x Throughput (Cycles/DFT) / Clock (MHz) • *Estimate (no timing analysis or layout)
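The figure of merit defined above is simple arithmetic, sketched below. The function name and the sample numbers are hypothetical and are not results from the presentation; they only show how the three quantities combine.

```python
def figure_of_merit(area_alms, cycles_per_dft, clock_mhz):
    """FOM = Area (ALMs) x Throughput (Cycles/DFT) / Clock (MHz).

    Cycles divided by MHz is time per DFT in microseconds, so the result
    is an area-time product in ALM-microseconds per transform.
    """
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return area_alms * cycles_per_dft / clock_mhz

if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(figure_of_merit(area_alms=1000, cycles_per_dft=256, clock_mhz=200))
```

Under this definition a smaller value means less area-time spent per transform.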

  10. Performance Comparison: 256-point DFT

  11. Comparative Features • Transform size N not restricted to powers of two • Circuit is scalable • Uses block floating point and floating point • Higher throughput • Low computational latency • Based on small, simple PE (adder), locally connected • 1-D or 2-D transforms
