This presentation introduces a new base-4 DFT matrix equation and a systolic architecture derived from it under FPGA architectural constraints. Benefits for FFT circuits include scalability, low latency, and enhanced functionality.
A New Class of High Performance FFTs
Dr. J. Greg Nash, Centar (www.centar.net), jgregnash@centar.net
High Performance Embedded Computing (HPEC) Workshop, 19-21 September 2006
New Base-4 DFT Matrix Equation
• Traditional DFT matrix form: $Z = CX$, where $C$ is the $N \times N$ coefficient matrix
• New matrix form for the DFT†: $Y = W_M \circ (C_{M1} X)$, $Z = C_{M2} Y^t$, where "$\circ$" = element-by-element multiply
• $C_{M1}$ and $C_{M2}$ contain only elements from the set $\{1, -1, j, -j\}$
• $C_{M1} X$ and $C_{M2} Y^t$ only involve complex additions/subtractions
• Twiddle factor matrix $W_M$ is of size $N/4 \times N/4$ rather than the $N \times N$ of $C$
• 16× fewer multiplies than the traditional DFT equation ($Z = CX$)
†J. G. Nash, “Computationally efficient systolic architecture for computing the discrete Fourier transform,” IEEE Transactions on Signal Processing, vol. 53, no. 12, Dec. 2005, pp. 4640-4651.
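As a rough check of the 16× figure, the multiply counts can be compared directly. This is only a sketch: it counts every matrix entry as one complex multiply and assumes, as the bullets state, that the products with CM1 and CM2 need no multipliers, so that only the element-by-element product with WM contributes.

\[
\underbrace{N^2}_{Z = CX}
\quad\text{vs.}\quad
\underbrace{\left(\tfrac{N}{4}\right)^2 = \tfrac{N^2}{16}}_{Y = W_M \circ (C_{M1} X)},
\qquad
N = 1024:\quad 1{,}048{,}576 \ \text{vs.}\ 65{,}536 .
\]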
Find Systolic Architecture Using SPADE†
• Mathematical algorithm (input code):
    for j to N/4 do
      for k to N/4 do
        Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k], i=1..4);
      od;
      for k to 4 do
        Z[k,j] := add(CM2[k,i]*Y[j,i], i=1..N/4);
      od
    od;
• FPGA architectural constraints
  - 2-D mesh array
  - Fine-grained PEs (registers, adder, mux)
  - Linear arrays of multipliers, memory
• Objective functions
• Automatic search for space-time transformations, T
• Simulator, graphical outputs
†SPADE: Symbolic Parallel Algorithm Development Environment
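The loop nest above can be exercised in software before any hardware mapping. The sketch below is a 0-based Python transliteration. The slides do not define the matrix entries or the index mappings, so the definitions used here (`X[i][k] = x[M*i + k]`, `CM1[j][i] = W4^(j*i)`, `CM2[k][i] = W4^(k*i)`, `WM[j][k] = exp(-2πi·j·k/N)`, output bin `j + M*k`, and the restriction to N divisible by 16) are assumptions chosen to make the equation reproduce a DFT. They are not taken from the cited paper.

```python
import cmath

# Powers of exp(-2j*pi/4). Multiplying by these needs only swaps and sign
# changes, which is why CM1 and CM2 products reduce to additions/subtractions.
W4 = [1, -1j, -1, 1j]

def direct_dft(x):
    """Reference O(N^2) DFT, Z = C X."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def base4_dft(x):
    """Loop nest from the slide, with ASSUMED matrix definitions (see lead-in)."""
    N = len(x)
    if N % 16 != 0:
        raise ValueError("this sketch assumes N is a multiple of 16")
    M = N // 4
    X = [[x[M * i + k] for k in range(M)] for i in range(4)]        # 4 x M
    CM1 = [[W4[(j * i) % 4] for i in range(4)] for j in range(M)]   # M x 4
    CM2 = [[W4[(k * i) % 4] for i in range(M)] for k in range(4)]   # 4 x M
    WM = [[cmath.exp(-2j * cmath.pi * j * k / N) for k in range(M)]
          for j in range(M)]                                        # M x M
    Y = [[0j] * M for _ in range(M)]
    Z = [[0j] * M for _ in range(4)]
    for j in range(M):
        for k in range(M):
            Y[j][k] = WM[j][k] * sum(CM1[j][i] * X[i][k] for i in range(4))
        for k in range(4):
            Z[k][j] = sum(CM2[k][i] * Y[j][i] for i in range(M))
    # Assumed output mapping: Z[k][j] holds DFT bin j + M*k.
    return [Z[m // M][m % M] for m in range(N)]
```

Comparing `base4_dft(x)` with `direct_dft(x)` for a short test vector is a quick way to confirm that the assumed mappings are self-consistent.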
Functional Operation
• Processing flow for a DFT of length N = N1 × N2
  - Stage 1: N2 column DFTs ($X_{ci}$) of length N1
  - Stage 2: Twiddle multiplication
  - Stage 3: N1 row DFTs ($X_{ri}$) of length N2
• Systolic adder arrays for matrix multiplication
  - N1/4 × 4 array for column multiplies $C_{M1} X_{ci}$ and $C_{M2} Y^t_{ci}$
  - N2/4 × 4 array for row multiplies $C_{M1} X_{ri}$ and $C_{M2} Y^t_{ri}$
  - The N2/4 × 4 array is implemented virtually on one row of the N1/4 × 4 array
  - Uses systolic 1-D array matrix multiplication
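The three-stage flow can be illustrated in a few lines of Python. This is a minimal sketch of the standard row/column decomposition, not of the systolic hardware. The input layout (`x[n2*r + c]` in row `r`, column `c`) and the output ordering (bin `k1 + n1*k2`) are assumptions, and the short DFTs are computed directly rather than with the base-4 matrix form.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT used for the short column and row transforms."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def row_column_dft(x, n1, n2):
    """DFT of length n1*n2 via column DFTs, twiddles, then row DFTs."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # Stage 1: n2 column DFTs of length n1 (column c holds x[c], x[c+n2], ...).
    cols = [dft([x[n2 * r + c] for r in range(n1)]) for c in range(n2)]
    # Stage 2: twiddle multiplication by exp(-2j*pi*k1*c/n).
    tw = [[cols[c][k1] * cmath.exp(-2j * cmath.pi * k1 * c / n)
           for c in range(n2)] for k1 in range(n1)]
    # Stage 3: n1 row DFTs of length n2.
    rows = [dft(tw[k1]) for k1 in range(n1)]
    # rows[k1][k2] is output bin k1 + n1*k2; return bins in natural order.
    return [rows[k % n1][k // n1] for k in range(n)]
```

For the N = 1024 example, the call would be `row_column_dft(x, 32, 32)`. Nothing in this sketch requires n1 and n2 to be powers of two.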
FFT Systolic Architecture
Example architecture for N = 1024 (N1 = N2 = 32)
• Simple PEs, locally connected
  - Higher clock speeds
  - Easier design/test/maintainability
  - Lower power
  - Efficient use of FPGA fabric
  - Simple control
• Small memory blocks (one per PE)
  - Faster read/write times
  - Lower power
• Linear structure (scales in N/S direction)
  - Matches fabric of FPGA linear distributed embedded elements (e.g., memory and multipliers)
Enhanced Functionality
• Transform size N not restricted to powers of two
  - N = 256n (n = 1, 2, 3, ...)
  - More reachable points
  - Uniform distribution of points
• Circuit is scalable
  - Any DFT size can be computed on the same hardware with sufficient memory
  - Larger FFT circuits constructed by replication of identical 4x4 PE array processing blocks
• Low computational latency
  - Pipeline depth is small compared with traditional pipelined FFTs
• 1-D and 2-D transforms possible on the same circuit
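To make "more reachable points" concrete, the short sketch below lists the transform sizes N = 256n up to a chosen limit next to the power-of-two sizes in the same range. The limit of 4096 and the function names are illustrative choices, not values from the slides.

```python
def multiples_of_256(limit):
    """Transform sizes N = 256n (n = 1, 2, 3, ...) not exceeding limit."""
    return [256 * n for n in range(1, limit // 256 + 1)]

def powers_of_two(low, high):
    """Power-of-two transform sizes in the closed range [low, high]."""
    sizes = []
    size = 1
    while size <= high:
        if size >= low:
            sizes.append(size)
        size *= 2
    return sizes

uniform = multiples_of_256(4096)     # 256, 512, 768, ..., 4096
radix2 = powers_of_two(256, 4096)    # 256, 512, 1024, 2048, 4096
print(len(uniform), len(radix2))     # prints: 16 5
```

The N = 256n sizes are evenly spaced, whereas the gap between neighbouring power-of-two sizes doubles at every step.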
Block Floating Point/Floating Point Operation
• Multiple “regions”, each with its own block floating point and floating point circuitry (32 regions in a 1024-point FFT)
• Column DFTs use block floating point and row DFTs use floating point
• Higher dynamic range and higher signal-to-noise ratio
• Number of regions increases with transform size
• Supports streaming FFTs
• Comparison of “single tone”, random frequency and phase data sets (DR = dynamic range, “noise” = roundoff noise)
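For readers unfamiliar with block floating point, the sketch below shows the general idea: every value in a block shares one exponent, chosen so that the largest value fits the mantissa width. It is a generic textbook-style illustration under assumed parameters (integer inputs, a signed two's-complement mantissa, truncating shifts). The slides do not describe how the circuit's regions implement it.

```python
def block_float_encode(values, mantissa_bits):
    """Encode integers as signed mantissas plus one shared exponent."""
    limit = (1 << (mantissa_bits - 1)) - 1      # largest positive mantissa
    peak = max(abs(v) for v in values)
    exponent = 0
    while (peak >> exponent) > limit:           # grow until the peak fits
        exponent += 1
    # Arithmetic right shift: low-order bits of every value are discarded.
    mantissas = [v >> exponent for v in values]
    return mantissas, exponent

def block_float_decode(mantissas, exponent):
    """Rescale mantissas by the shared exponent."""
    return [m << exponent for m in mantissas]

mantissas, exponent = block_float_encode([1000, -37, 5, -1000], 8)
print(mantissas, exponent)                       # prints: [125, -5, 0, -125] 3
print(block_float_decode(mantissas, exponent))   # prints: [1000, -40, 0, -1000]
```

In this example the small value 5 is lost because it shares an exponent with 1000. Giving each region its own exponent limits how far one large value can degrade the others, which is one plausible reading of why the number of regions grows with transform size.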
Performance Comparison: 256-point DFT
• Altera block floating point circuit
• “Streaming” (continuous data in and out)
• Comparable dynamic range and signal-to-(roundoff)-noise ratio
• Both circuits mapped to Altera Stratix II EP2S15F484C3 FPGA
• Altera circuit from Megacore FFT v2.2.0
• Results from timing analysis (Altera Quartus 5.1 software)
Preliminary Figure of Merit
• Altera block floating point circuits
• “Streaming” (continuous data in and out)
• Comparable dynamic range and signal-to-noise ratio
• Circuits mapped to Altera Stratix II FPGAs
• Altera circuit from Megacore FFT v2.2.0
FOM = Area (ALMs) × Throughput (Cycles/DFT) / Clock (MHz)
*Estimate (no timing analysis or layout)
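The figure of merit is simple enough to compute by hand, but a small helper makes the units explicit. Since cycles divided by MHz gives microseconds, the FOM is area multiplied by time per DFT, so a smaller value indicates a more area-time-efficient circuit under this metric. The input numbers below are hypothetical placeholders, not measurements from the presentation.

```python
def figure_of_merit(area_alms, cycles_per_dft, clock_mhz):
    """FOM = Area (ALMs) x Throughput (cycles/DFT) / Clock (MHz)."""
    if clock_mhz <= 0:
        raise ValueError("clock_mhz must be positive")
    return area_alms * cycles_per_dft / clock_mhz

# Hypothetical circuit: 2000 ALMs, 256 cycles per DFT, 200 MHz clock.
print(figure_of_merit(2000, 256, 200))   # prints: 2560.0
```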
Comparative Features
• Transform size N not restricted to powers of two
• Circuit is scalable
• Uses block floating point and floating point
• Higher throughput
• Low computational latency
• Based on small, simple PE (adder), locally connected
• 1-D or 2-D transforms