FFT: Accelerator Project
Rohit Prakash, Anand Silodia
Work done till now
• Studied various FFT algorithms
• Implemented radix-4, recursive and iterative algorithms
• Optimized these implementations
• Compared the results with FFTW
Result
• FFTW fares better than our implementation
Current Objectives
• Validate the number of complex computations in our implementation against the theoretical count
• Document the work done so far
• Make a website for the project
• Study the FFTW code (and figure out the reasons for its efficiency)
• Run the code with the Intel compiler (icc) / Visual C++
Validating the computations
• Incorrect theoretical formula found on cnx.org
• Theoretical formula for the number of complex computations:
  (11/4) * n * log4(n) = 8960 (correct)
  (3/4) * n * log4(n) = 3840 (incorrect)
• Actual count: 8960
Documentation and website
• Website of the project: www.cse.iitd.ac.in/~cs1030186/btp
• Includes the details and results of our experiments (up to last week)
Running on the Intel compiler (icc)
• No improvement
• Possible reasons:
  • Tested on an Intel Pentium Mobile
  • This processor does not support optimizations such as exploiting SSE3 instructions (-fast flag)
FFTW code
• 56,489+ LOC (contains code written in OCaml and C)
• We decided to study why FFTW is so fast before going into the code itself
• Texts we came across in this context:
  • The Design and Implementation of FFTW3 (Matteo Frigo and Steven G. Johnson)
  • Documentation of FFTW
Why is FFTW fast?
• The transform is computed by an executor, composed of highly optimized, composable blocks of C code called codelets
• At runtime, a ‘planner’ finds an efficient way to compose codelets: it measures the speed of different plans and chooses the best using a dynamic programming algorithm
• The executor interprets the plan with negligible overhead
• Codelets are generated automatically and are fast
Contd…
• The executor implements the recursive divide-and-conquer Cooley-Tukey FFT algorithm
• Basically, it adapts to the hardware in order to maximize performance
• ‘Performance has little to do with the number of operations. Fast code must exploit instruction level parallelism of the processor. It is important to write the code in such a way that C compiler can schedule it efficiently’
Contd…
• It uses some tricky optimizations like –
• It also exploits SIMD instructions
Further plan?
• Since FFTW supports MPI and adapts itself to the given hardware architecture, we may use it as is.
References
• www.fftw.org
• The Design and Implementation of FFTW3 (Matteo Frigo and Steven G. Johnson)
• The Fastest Fourier Transform in the West (Matteo Frigo and Steven G. Johnson)