FFT in Hardware and Software
Background • Core Algorithm • Original Algorithm, the DFT, O(n^2) complexity • New Algorithm, the FFT (Fast Fourier Transform), O(n log2(n)) complexity, depending on implementation.
DFT Computation • A summation over the whole input array for every single element in the output array. • Computationally very inefficient: n output elements, each needing n input terms, gives O(n^2) operations.
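The summation described on this slide can be sketched directly. This is a minimal illustration, not the presenters' code; the function name `dft` and the sign convention W = exp(-2πj/n) are assumptions.

```python
import cmath

def dft(x):
    """Direct DFT: each output element sums over the whole input, so O(n^2)."""
    n = len(x)
    out = []
    for k in range(n):          # one pass per output element
        acc = 0j
        for t in range(n):      # full summation over the input array
            acc += x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
        out.append(acc)
    return out
```

The two nested loops over n are where the O(n^2) cost comes from.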
FFT Computation • A much more computationally efficient algorithm • Works using the divide and conquer principle: split the input into even and odd samples, transform each half, then combine. • Published by Cooley and Tukey in 1965 [2]
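The divide and conquer idea can be shown with a short recursive radix-2 sketch. It assumes the input length is a power of two and uses the convention W = exp(-2πj/n); the name `fft` is illustrative, and a production implementation would typically be iterative and in place.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT sketch; input length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor times odd term
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each level of recursion does O(n) work and there are log2(n) levels, which gives the O(n log2(n)) total.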
FFT Butterfly Operations • Butterfly arrangement of computations • Repeated on successive pairs of input data • Then half as many times on alternating pairs • Then half again as many times on every fourth element • …
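To make the pairing pattern concrete, the sketch below lists which array positions each butterfly combines at every stage. It assumes the standard radix-2 decimation-in-time layout (input already in bit-reversed order); the helper name `butterfly_pairs` is hypothetical. In that layout the spacing between paired elements doubles each stage (1, 2, 4, ...) while the number of independent groups halves.

```python
def butterfly_pairs(n):
    """Per stage, the index pairs combined by each butterfly (n a power of two)."""
    stages = []
    span = 1                      # distance between the two inputs of a butterfly
    while span < n:
        pairs = []
        for start in range(0, n, 2 * span):   # one group per block of 2*span elements
            for j in range(span):
                pairs.append((start + j, start + j + span))
        stages.append(pairs)
        span *= 2
    return stages

for stage, pairs in enumerate(butterfly_pairs(8), start=1):
    print(f"stage {stage}: {pairs}")
```

For n = 8 this yields three stages of four butterflies each.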
The Butterfly • X[n] = xe[n] + W_N^n · xo[n] • X[n+N/2] = xe[n] − W_N^n · xo[n] • Simple operations repeated many times
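A single butterfly is small enough to write out in full. The sketch below assumes the usual twiddle factor definition W_N = exp(-2πj/N); the function name and argument order are illustrative only.

```python
import cmath

def butterfly(xe, xo, n, N):
    """One radix-2 butterfly: returns (X[n], X[n + N/2])."""
    w = cmath.exp(-2j * cmath.pi * n / N)  # twiddle factor W_N^n (assumed convention)
    t = w * xo                             # the single complex multiplication
    return xe + t, xe - t                  # one addition, one subtraction
```

The product `w * xo` is computed once and reused for both outputs, which is why a butterfly needs one multiply step and one add step.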
8-point FFT Demonstration: The Entire Calculation • [Figure: 8-point butterfly diagram from Input Array to Output; legend: Multiplication by W factor, Addition]
Why Hardware? • Even more speed for FFT • Extremely parallelizable • A whole layer can be done in two FPGA clock cycles • 1 multiply cycle • 1 add cycle • (Assuming sufficient multipliers)
Hardware Problems • Complexity • Input speed • Output speed • If the FPGA computation takes 24.4 ns but transferring the input data takes 20 s, what gain is there? • e.g. 20 s (input) + 24.4 ns (compute) + 20 s (output) ≈ 40 s!
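The arithmetic on this slide can be checked with a few lines. The figures are the slide's own illustrative numbers; the function name `total_time` is hypothetical.

```python
def total_time(compute_s, transfer_in_s, transfer_out_s):
    """End-to-end time when transfers and compute happen one after another."""
    return transfer_in_s + compute_s + transfer_out_s

compute = 24.4e-9            # FPGA compute time from the slide, in seconds
total = total_time(compute, 20.0, 20.0)
compute_fraction = compute / total

print(f"total: {total:.9f} s")
print(f"fraction spent computing: {compute_fraction:.2e}")
```

With these numbers the compute step is a negligible share of the total, so speeding it up further changes almost nothing; only a faster bus helps.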
Mitigation of Hardware Problems • Use a faster bus • AMD Opteron’s HyperTransport • 20.8 GB/s (166.4 Gb/s) per link (version 3) • Modules that fit into an AMD 64-bit Opteron socket • http://www.drccomputer.com/pages/modules.html - Xilinx-based module • http://www.xtremedatainc.com/xd1000_brief.html - Altera-based module
Mitigation of Hardware Problems • Put the FPGA on the die with the DSP • Needs silicon vendor support • FPGA can access memory on a very wide bus (e.g. 128 bits per cycle) • Implement the entire project in the FPGA • Time consuming to program • Possibly insufficient room on the FPGA
8-point FFT Demonstration: In Hardware • [Figure: 8-point butterfly diagram from Input Array to Output; legend: Multiplication by W factor, Addition]
Why Not Software? • Each butterfly must be done sequentially • Only slight parallelism is available on a DSP like the TigerSHARC • Each butterfly can be done in 2 cycles (after optimization).
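A rough cycle count shows how the sequential and layer-parallel cases differ. This is a back-of-the-envelope sketch under stated assumptions: 2 cycles per butterfly in software, 2 cycles per whole layer in hardware with enough multipliers, and no memory, loop, or transfer overhead. The function names are hypothetical.

```python
def sequential_cycles(n, cycles_per_butterfly=2):
    """Software estimate: (n/2) butterflies per stage, log2(n) stages, one at a time."""
    stages = n.bit_length() - 1          # log2(n) for a power-of-two n
    return (n // 2) * stages * cycles_per_butterfly

def parallel_cycles(n, cycles_per_layer=2):
    """Hardware estimate: every butterfly in a layer runs at once."""
    stages = n.bit_length() - 1
    return stages * cycles_per_layer

for n in (8, 1024):
    print(n, sequential_cycles(n), parallel_cycles(n))
```

Under these assumptions an 8-point FFT needs 24 cycles sequentially versus 6 in parallel, and the gap grows linearly with n.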
Results of Testing • Linear Profiling of FFT Algorithm in C++
Results of Testing • Profiling of VHDL on FPGA • Butterfly takes 24.377ns to execute • 62% is computational, 38% is routing on FPGA
Product Offerings • Most DSP Vendors • Many FPGA Vendors (IP – Intellectual Property) • Microcontroller Vendors (e.g. Blackfin) • FFTW – The Fastest Fourier Transform in the West • AMD Math Core Library • Intel Library • Highly optimized for the expected hardware
Published Results • The Radix 4 version delivers a 1 K points complex processing time of 25 microseconds at 200-MHz system speeds and uses only about 10 percent of the resources in a mid-range Stratix device. The Radix 2 is half the size of the Radix 4 and offers a 1 K points complex processing time of 50 microseconds at 200-MHz system speeds. Additional versions of the new cores are under development. [6]
References
[1] Signals, Systems, and Transforms
[2] James W. Cooley and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19, 297–301 (1965).
[3] http://www.drccomputer.com/pages/modules.html - Xilinx-based module
[4] http://www.xtremedatainc.com/xd1000_brief.html - Altera-based module
[5] http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_2353,00.html
[6] http://www.us.design-reuse.com/news/news5650.html
[7] http://www.4dsp.com/fft.htm