FFT Accelerator Project

Date: 14th April, 2007
Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)

Objectives
• Validation
• VTune
• Memory analysis

Validation: radix-16
CLAIM: Radix-16 requires more computations than radix-4

Radix-4 FFT algorithm

Expanding it into 4 parts:
• Input points of radix-4 butterfly
• Output points of radix-4 butterfly
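The equations from this slide are not available here. As a hedged reconstruction, assuming the standard decimation-in-time form (the slide may have used decimation in frequency instead), the split into 4 parts can be written by substituting n = 4m + r:

```latex
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad W_N = e^{-j 2\pi / N}

X(k) = \sum_{r=0}^{3} W_N^{rk} \sum_{m=0}^{N/4-1} x(4m + r)\, W_{N/4}^{mk}
```

Each inner sum is an N/4-point DFT, and the three factors W_N^{rk} with r = 1, 2, 3 are the only general complex multipliers in the outer sum.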

Radix-4 Butterfly
• Trivial multiplications (by ±1 and ±j): no extra complex multiplications
• 3 complex multiplications (by the twiddle factors)
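A minimal sketch of one radix-4 butterfly, assuming a decimation-in-time layout. The names radix4_butterfly, mul_neg_j, w1, w2 and w3 are illustrative and not taken from the project code. It shows where the 3 complex multiplications occur and why the remaining work needs none:

```cpp
#include <array>
#include <complex>

using cplx = std::complex<double>;

// Multiply by -j without a complex multiplication: (a + jb)(-j) = b - ja.
inline cplx mul_neg_j(const cplx& z) { return cplx(z.imag(), -z.real()); }

// One radix-4 butterfly (decimation in time assumed).
// in: four sub-transform values; w1, w2, w3: twiddle factors W^k, W^2k, W^3k.
std::array<cplx, 4> radix4_butterfly(const std::array<cplx, 4>& in,
                                     const cplx& w1, const cplx& w2,
                                     const cplx& w3) {
    // The only three general complex multiplications.
    const cplx a = in[0];
    const cplx b = in[1] * w1;
    const cplx c = in[2] * w2;
    const cplx d = in[3] * w3;

    // 4-point DFT core: additions and subtractions only; the factors
    // +-1 and +-j reduce to sign changes and real/imaginary swaps.
    const cplx s0 = a + c;
    const cplx s1 = a - c;
    const cplx s2 = b + d;
    const cplx s3 = mul_neg_j(b - d);
    return {{s0 + s2, s1 + s3, s0 - s2, s1 - s3}};
}
```

With all three twiddle factors set to 1, the function reduces to a plain 4-point DFT, which is a convenient way to check it.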

Radix-16 FFT Algorithm

These are complex number multiplications (not trivial as in the case of radix-4)

Evaluating 16 terms in the innermost loop
• Mostly complex number multiplications (very few trivial cases)
• Total 16 x 16 = 256 terms are involved (this forms the innermost loop of our program)
• These terms also constitute the radix-16 butterfly
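The 16 x 16 structure can be sketched as a direct 16-point butterfly. This is an illustration under assumptions, not the project's inner loop: the name radix16_butterfly is hypothetical, the twiddle factors are recomputed on every term for clarity (a practical version would precompute them), and the twiddles between FFT stages are left out.

```cpp
#include <array>
#include <cmath>
#include <complex>

using cplx = std::complex<double>;

// Direct 16-point butterfly: 16 outputs, each the sum of 16 terms,
// so 16 x 16 = 256 terms in total.
std::array<cplx, 16> radix16_butterfly(const std::array<cplx, 16>& in) {
    const double pi = std::acos(-1.0);
    std::array<cplx, 16> out{};
    for (int k = 0; k < 16; ++k) {
        cplx sum(0.0, 0.0);
        for (int n = 0; n < 16; ++n) {
            // W_16^(nk); the exponent is reduced modulo 16.
            const double angle = -2.0 * pi * ((n * k) % 16) / 16.0;
            sum += in[n] * cplx(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}
```

Feeding in a unit impulse should give 16 ones, and a constant input should concentrate everything in output 0.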

Radix-16 Butterfly 

Radix-16 butterfly
[Figure: butterfly branches labelled with their multipliers; most are general complex values, and only j, -1 and -j are trivial]
Conclusion: The number of complex computations in radix-16 is far greater than in radix-4

Radix-16 butterfly
[Figure: branches labelled complex, j, complex, -1, complex, -j, complex]

But…

Analysis (as compared to radix-4)
• Radix-16 is computationally more expensive
• Radix-16 is slightly slower for small input sizes (16, 256, 4096) but much faster for large input sizes (65536, 1048576, 16777216)
• This can be verified from the comparison graphs shown earlier
• The results are completely different if we use g++ (instead of icpc)

Radix-4 vs radix-16 (g++)

Radix-4 vs. radix-16 (g++)

Why does radix-16 run faster than radix-4 (with icpc)?
• According to a website: higher-radix FFT implementations gain more program speed from implicit loop unrolling and other compiler benefits than from the computational reduction itself
• This might explain our results

New Implementation (radix-8)
• The number of complex multiplications in the inner loop is lower (as compared to radix-16 and radix-4)
• It involves multiplying by special twiddle factors (e.g. (1/√2)(1 - j), etc.)
• Multiplying by these complex numbers is nearly trivial (it takes fewer operations than a conventional complex multiplication)
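To illustrate why these special factors are cheap, here is a small sketch for the eighth root of unity W_8 = (1 - j)/√2. The name mul_w8 is hypothetical. Since (a + jb)(1 - j) = (a + b) + j(b - a), the product needs 2 real multiplications and 2 real additions, where a general complex multiplication needs 4 and 2.

```cpp
#include <cmath>
#include <complex>

using cplx = std::complex<double>;

// Multiply z = a + jb by W_8 = (1 - j)/sqrt(2):
// the result is ((a + b) + j(b - a)) / sqrt(2).
inline cplx mul_w8(const cplx& z) {
    const double inv_sqrt2 = 1.0 / std::sqrt(2.0);
    const double a = z.real();
    const double b = z.imag();
    return cplx((a + b) * inv_sqrt2, (b - a) * inv_sqrt2);
}
```

The other odd powers of W_8 follow the same pattern with different signs.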

Validation (for radix-8)
• The number of complex additions can still be brought down by eliminating redundant computations

Radix-4 vs. radix-8 (icpc)

VTune Profiling for Radix-16
• Input size = 16, runs = 100,000
• Input size = 256, runs = 1,000
• Input size = 4,096, runs = 500
• Input size = 65,536, runs = 100
• Input size = 1,048,576, runs = 100
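Small inputs are run many more times than large ones so that the total measured time is long enough to be meaningful. The sketch below shows one way such repeated runs could be timed. It is not the project's harness: the name average_seconds is hypothetical, and it assumes a compiler with C++11 `<chrono>` support.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock seconds per call of `work` over `runs` repetitions.
double average_seconds(const std::function<void()>& work, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int r = 0; r < runs; ++r) {
        work();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / runs;
}
```

For example, passing a lambda that calls the FFT on a 16-point input with runs = 100000 would mirror the first configuration above.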

Memory Usage (Task Manager) Radix-16

Further Plan
• Implement these on hardware
• MPI version

References
• Jones, D., Radix-4 FFT Algorithms, Connexions Web Site: www.cnx.org
• Cormen, T. H., Leiserson, C. E., Rivest, R. L., Introduction to Algorithms
• Intel website

Thank You
