FFT Accelerator Project. Date: 14th April, 2007. Rohit Prakash (2003CS10186), Anand Silodia (2003CS50210)
Objectives • Validation • VTune Memory analysis
Validation: radix-16. Claim: radix-16 requires more computations than radix-4. Radix-4 FFT algorithm.
Expanding it into 4 parts relates the input points of the radix-4 butterfly to its output points.
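For reference, the standard radix-4 decimation-in-frequency split can be written as follows. This is a sketch in assumed notation (W_N = e^{-j 2 pi / N}, output index k = 4r + l with l = 0..3); the slides may have used the decimation-in-time form instead.

```latex
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad W_N = e^{-j 2\pi / N}

% Split the sum into four blocks of length N/4:
X(k) = \sum_{n=0}^{N/4-1} \Big[ x(n) + (-j)^{k}\, x\big(n + \tfrac{N}{4}\big)
       + (-1)^{k}\, x\big(n + \tfrac{N}{2}\big)
       + j^{k}\, x\big(n + \tfrac{3N}{4}\big) \Big] W_N^{nk}

% With k = 4r + l, l = 0, 1, 2, 3:
X(4r + l) = \sum_{n=0}^{N/4-1} \Big[ x(n) + (-j)^{l}\, x\big(n + \tfrac{N}{4}\big)
       + (-1)^{l}\, x\big(n + \tfrac{N}{2}\big)
       + j^{l}\, x\big(n + \tfrac{3N}{4}\big) \Big] W_N^{nl}\, W_{N/4}^{nr}
```

The bracket is the radix-4 butterfly: its only factors are 1, -j, -1 and j. The factor W_N^{nl} is the twiddle factor, which equals 1 for l = 0.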
Radix-4 butterfly. Trivial multiplications (by 1, -j, -1 and j): no extra complex multiplications. Twiddle factors: 3 complex multiplications.
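A minimal C++ sketch of such a butterfly, assuming a decimation-in-frequency layout, shows where the 3 complex multiplications come from. The names radix4_butterfly, mul_neg_j, mul_pos_j, w1, w2 and w3 are illustrative and not taken from the project code.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Multiplying by -j or +j only swaps the real and imaginary parts and
// flips one sign, so it needs no real multiplications.
inline cd mul_neg_j(cd z) { return cd(z.imag(), -z.real()); }
inline cd mul_pos_j(cd z) { return cd(-z.imag(), z.real()); }

// In-place radix-4 butterfly on x[base], x[base + stride],
// x[base + 2*stride], x[base + 3*stride].
// w1, w2, w3 are the twiddle factors (assumed to be W^n, W^2n, W^3n);
// they are the only 3 general complex multiplications.
void radix4_butterfly(std::vector<cd>& x, std::size_t base, std::size_t stride,
                      cd w1, cd w2, cd w3) {
    assert(base + 3 * stride < x.size());
    const cd a = x[base];
    const cd b = x[base + stride];
    const cd c = x[base + 2 * stride];
    const cd d = x[base + 3 * stride];

    // Additions and subtractions only.
    const cd s0 = a + c, s1 = a - c;
    const cd s2 = b + d, s3 = b - d;

    x[base]              = s0 + s2;                    // no twiddle factor
    x[base + stride]     = (s1 + mul_neg_j(s3)) * w1;  // a - jb - c + jd
    x[base + 2 * stride] = (s0 - s2) * w2;             // a - b + c - d
    x[base + 3 * stride] = (s1 + mul_pos_j(s3)) * w3;  // a + jb - c - jd
}
```

With all three twiddle factors set to 1, the routine reduces to a plain 4-point DFT, which is an easy way to check it.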
In radix-16, these are full complex-number multiplications (not trivial ones, as in the case of radix-4).
Evaluating 16 terms in the innermost loop: mostly complex-number multiplications (very few trivial cases). In total, 16 x 16 = 256 terms are involved. These terms form the innermost loop of our program and also constitute the radix-16 butterfly.
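The 16 x 16 term count can be illustrated with a direct 16-point DFT kernel. This is a sketch under the assumption that the butterfly is evaluated as a plain double loop, which is what the count suggests. The function name dft16_direct and the counter terms are hypothetical, and the per-stage twiddle factors are left out.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cd = std::complex<double>;

// Direct 16-point DFT: each of the 16 outputs is a sum of 16 products,
// so the nested loops evaluate 16 x 16 = 256 terms.
// `terms` reports how many products were formed.
std::vector<cd> dft16_direct(const std::vector<cd>& x, int& terms) {
    assert(x.size() == 16);
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cd> X(16);
    terms = 0;
    for (int k = 0; k < 16; ++k) {
        cd sum(0.0, 0.0);
        for (int n = 0; n < 16; ++n) {
            // W_16^(nk) = exp(-j*2*pi*n*k/16); only n*k mod 16 matters.
            const cd w = std::polar(1.0, -two_pi * ((n * k) % 16) / 16.0);
            sum += x[n] * w;
            ++terms;
        }
        X[k] = sum;
    }
    return X;
}
```

A real implementation would precompute the 16 distinct values of W_16 rather than call std::polar in the inner loop. The sketch keeps the call only to make each term explicit.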
Radix-16 butterfly (figure: most branch multipliers are general complex factors; only the multipliers j, -1 and -j are trivial). Conclusion: the number of complex computations in radix-16 is far greater than in radix-4.
Analysis (as compared to radix-4) • Radix-16 is computationally more expensive • Radix-16 is slightly slower for small input sizes (16, 256, 4096) but much faster for large input sizes (65536, 1048576, 16777216) • This can be verified from the comparison graphs shown earlier • The results are completely different if we compile with g++ (instead of icpc)
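One way such a comparison might be timed is sketched below. The helper mean_time_us and its transform parameter are hypothetical stand-ins for one FFT call; this is not the measurement code behind the graphs.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock time per call, in microseconds, of `transform`
// over `runs` repetitions.
double mean_time_us(const std::function<void()>& transform, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int r = 0; r < runs; ++r) {
        transform();
    }
    const std::chrono::duration<double, std::micro> elapsed =
        clock::now() - start;
    return elapsed.count() / runs;
}
```

Calling this once per input size for the radix-4 and radix-16 routines, with the same source built by both g++ and icpc, would give the per-compiler comparison. The transform should write its result somewhere observable, since an optimizing compiler may otherwise remove the work being timed.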
Why does radix-16 run faster than radix-4 (with icpc)? • According to a website, higher-radix FFT implementations gain speed more from implicit loop unrolling or other compiler benefits than from the computational reduction itself • This might explain our results
New Implementation (radix-8) • The number of complex multiplications in the inner loop is lower (as compared to radix-16 and radix-4) • It involves multiplying by special constants (e.g. (1/√2)(1 - i), etc.) • Multiplying by these complex numbers is nearly trivial (it takes fewer operations than a conventional complex multiplication)
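The saving can be sketched in C++ as follows, assuming the special constants are the eighth roots of unity W_8^k = exp(-2*pi*i*k/8). The names mul_w8_1, mul_neg_i, mul_w8 and kInvSqrt2 are illustrative, not taken from the project code.

```cpp
#include <cassert>
#include <complex>

using cd = std::complex<double>;

// 1/sqrt(2), the only non-trivial real constant needed here.
constexpr double kInvSqrt2 = 0.70710678118654752440;

// Multiply z by W_8^1 = (1 - i)/sqrt(2) with 2 real multiplications and
// 2 real additions (a general complex product needs 4 and 2).
inline cd mul_w8_1(cd z) {
    const double a = z.real(), b = z.imag();
    // (a + ib)(1 - i) = (a + b) + i(b - a)
    return cd(kInvSqrt2 * (a + b), kInvSqrt2 * (b - a));
}

// Multiply z by W_8^2 = -i: swap the parts and negate one.
inline cd mul_neg_i(cd z) { return cd(z.imag(), -z.real()); }

// Multiply z by W_8^k for k = 0..7 by composing the cheap cases.
inline cd mul_w8(cd z, int k) {
    assert(k >= 0 && k < 8);
    if (k & 4) z = -z;            // W_8^4 = -1
    if (k & 2) z = mul_neg_i(z);  // W_8^2 = -i
    if (k & 1) z = mul_w8_1(z);   // W_8^1 = (1 - i)/sqrt(2)
    return z;
}
```

Every multiplier inside a radix-8 butterfly is one of these eight values, so only the inter-stage twiddle factors need a general complex multiplication.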
Validation (for radix-8). The number of complex additions can still be brought down by eliminating redundant computations.
VTune Profiling for Radix-16 • Input size = 16, runs = 100,000
Memory Usage (Task Manager) Radix-16
Further Plan • Implement these on hardware • MPI version
References • Jones, D., Radix-4 FFT Algorithms, Connexions Web Site: www.cnx.org • Cormen, Leiserson, Rivest, Introduction to Algorithms • Intel website