EADS: Accelerator Project Rohit Prakash (2003CS10186) Anand Silodia (2003CS50210)
Speed up scientific application • Diagram labels: Application, Candidate Partition, Performance Prediction, Choose next partition
Timelines (tentative) • 28th January: Figure out the best FFT algorithm • Compare the algorithms on the following parameters: • Execution time • No. of multiplications • No. of additions • 19th February: Study hardware implementation of FFT • ....
Terminology • Radix: the size of an FFT decomposition • Twiddle factors: the coefficients used to combine results from a previous stage to form inputs to the next stage
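To make the term concrete, here is a minimal sketch of how a twiddle factor can be computed. The function name `twiddle` is hypothetical, and the positive exponent follows the convention ω = e^(2πi/m) used in the pseudocode later in these slides; many texts use the negative exponent instead.

```python
import cmath

def twiddle(n, k):
    """Return w_n^k = exp(2*pi*i*k/n) (positive-exponent convention, an assumption)."""
    return cmath.exp(2j * cmath.pi * k / n)

# Example: the twiddle factors w_16^0 .. w_16^3 used by the first butterflies
# of a size-16 stage.
factors = [twiddle(16, k) for k in range(4)]
```

For radix 4, the fixed factors inside a butterfly are powers of w_4 = i, which is why a radix-4 butterfly needs no general multiplications beyond the twiddles.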
First Implementation • Implemented a recursive radix-4 FFT • Analysed it using gprof • Looked into other FFT implementations: • iterative • parallel • split-radix
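The slides do not show the recursive code itself, so the following is only a sketch of what a recursive radix-4 decimation-in-time FFT can look like. The name `fft4_recursive` is hypothetical, the input length is assumed to be a power of 4, and the positive-exponent twiddle convention is assumed.

```python
import cmath

def fft4_recursive(a):
    """Recursive radix-4 FFT sketch; len(a) must be a power of 4."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    # Transform the four interleaved subsequences a[r], a[r+4], a[r+8], ...
    f0, f1, f2, f3 = (fft4_recursive(a[r::4]) for r in range(4))
    q = n // 4
    out = [0j] * n
    for j in range(q):
        # Twiddle factor recomputed on every call: simple but wasteful.
        w = cmath.exp(2j * cmath.pi * j / n)
        t = f0[j]
        u = w * f1[j]
        v = w * w * f2[j]
        x = w * w * w * f3[j]
        out[j] = t + u + v + x
        out[j + q] = t + 1j * u - v - 1j * x
        out[j + 2 * q] = t - u + v - x
        out[j + 3 * q] = t - 1j * u - v + 1j * x
    return out
```

Each level allocates four new sublists and recomputes its twiddle factors, so a sketch like this has the same kind of memory and recomputation overhead that a recursive implementation typically suffers from.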
Analysis of the implementation • Considered FFT of 1024 random points (double) • Results from gprof -> • No. of Complex multiplications : 21760 • No. of Complex additions : 7680 • (Each complex multiplication consists of 4 real multiplications and 2 real additions) • (Each complex addition/subtraction consists of 2 real additions/subtractions)
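The complex operation counts above can be converted into real operation counts using the two conversion rules stated on the slide. A small sketch (the function name `real_op_counts` is hypothetical):

```python
def real_op_counts(complex_mults, complex_adds):
    """Convert complex operation counts to real ones.

    Each complex multiplication = 4 real multiplications + 2 real additions.
    Each complex addition/subtraction = 2 real additions/subtractions.
    """
    real_mults = 4 * complex_mults
    real_adds = 2 * complex_mults + 2 * complex_adds
    return real_mults, real_adds

# Figures reported for the recursive radix-4 run on 1024 points.
print(real_op_counts(21760, 7680))  # prints (87040, 58880)
```

So the 21760 complex multiplications and 7680 complex additions correspond to 87040 real multiplications and 58880 real additions.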
Problems with this implementation • Inefficient use of memory (recursive procedure) • Wasted computations (some factors are computed multiple times) • Most of the time is spent computing twiddle factors (complex-number multiplications)
2nd Implementation • Radix-4 iterative in-place implementation
iterativeFFT(a)
    BitReversal(a, A)              // copy a into A in permuted order (for radix-4 the reordering is by base-4 digits)
    n ← length(a)
    for (s ← 1 to log4(n))         // logarithm is of base 4
    {
        m ← 4^s
        ω ← e^(2πi/m)
        for (k ← 0 to n-1 by m)
        {
            τ ← 1
            for (j ← 0 to m/4 - 1)
            {
                t ← A[k+j]
                u ← τ A[k+j+m/4]
                v ← τ^2 A[k+j+2*m/4]
                x ← τ^3 A[k+j+3*m/4]
                A[k+j]       ← t + u + v + x
                A[k+j+m/4]   ← t + (i)u - v - (i)x
                A[k+j+2*m/4] ← t - u + v - x
                A[k+j+3*m/4] ← t - (i)u - v + (i)x
                τ ← τ * ω
            }
        }
    }
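The pseudocode above can be sketched in runnable form as follows. This is an illustration under stated assumptions, not the authors' code: the names `digit_reverse4` and `fft4_iterative` are hypothetical, the input length is assumed to be a power of 4, and the input reordering is done by reversing base-4 digits, which is what the radix-4 butterfly ordering requires.

```python
import cmath

def digit_reverse4(i, digits):
    """Reverse the base-4 digits of i, treating i as a `digits`-digit number."""
    r = 0
    for _ in range(digits):
        r = (r << 2) | (i & 3)
        i >>= 2
    return r

def fft4_iterative(a):
    """Iterative radix-4 FFT sketch; len(a) must be a power of 4."""
    n = len(a)
    stages = 0
    while 4 ** stages < n:
        stages += 1
    if 4 ** stages != n:
        raise ValueError("length must be a power of 4")
    # Reordered copy of the input; all later stages update A in place.
    A = [complex(a[digit_reverse4(i, stages)]) for i in range(n)]
    for s in range(1, stages + 1):
        m = 4 ** s
        q = m // 4
        w_m = cmath.exp(2j * cmath.pi / m)
        for k in range(0, n, m):
            w = 1 + 0j
            for j in range(q):
                t = A[k + j]
                u = w * A[k + j + q]
                v = w * w * A[k + j + 2 * q]
                x = w * w * w * A[k + j + 3 * q]
                A[k + j] = t + u + v + x
                A[k + j + q] = t + 1j * u - v - 1j * x
                A[k + j + 2 * q] = t - u + v - x
                A[k + j + 3 * q] = t - 1j * u - v + 1j * x
                w *= w_m  # running twiddle update, as in the pseudocode
    return A
```

Note that the inner loop runs over j = 0 .. m/4 - 1, so each block of m elements is covered by exactly m/4 butterflies.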
Analysis of this implementation • Considered FFT of 1024 random points (double) • Results from gprof -> • No. of Complex multiplications : 14080 • No. of Complex additions/subtractions : 7680 • (Each complex multiplication consists of 4 real multiplications and 2 real additions) • (Each complex addition/subtraction consists of 2 real additions/subtractions)
Improvements • Precompute twiddle factors • Trade multiplications for additions • (a complex multiplication can be done with 3 real multiplies and 5 real adds rather than the usual 4 real multiplies and 2 real adds) • Use compiler flags (10%-15% lower execution time on some systems) • -O3 • -march=pentiumpro • -ffast-math • -fomit-frame-pointer
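The 3-multiply complex product mentioned above can be sketched as follows. This is one well-known arrangement, not necessarily the one the authors used; the function name `cmul3` is hypothetical.

```python
def cmul3(a, b, c, d):
    """Compute (a + ib) * (c + id) with 3 real multiplies and 5 real adds.

    Returns the (real, imaginary) parts of the product.
    """
    k1 = c * (a + b)   # 1 add, 1 multiply
    k2 = a * (d - c)   # 1 add, 1 multiply
    k3 = b * (c + d)   # 1 add, 1 multiply
    real = k1 - k3     # = ac - bd
    imag = k1 + k2     # = ad + bc
    return real, imag

# (1 + 2i) * (3 + 4i) = -5 + 10i
print(cmul3(1, 2, 3, 4))  # prints (-5, 10)
```

When c + id is a twiddle factor known in advance, the quantities d - c and c + d could also be precomputed, leaving 3 multiplies and 3 adds per product.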
Some results • Precomputing twiddle factors: • No. of complex multiplications: 8960 • 5120 fewer complex multiplications • Trading multiplications for additions: • Did not show any appreciable decline in execution time • Using compiler flags: • Drastic improvement in execution time
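One way to precompute twiddle factors is a single lookup table shared by all stages. The sketch below is an assumption about how this could be done, with hypothetical names `make_twiddle_table` and `stage_twiddles`; it uses the positive-exponent convention from the pseudocode.

```python
import cmath

def make_twiddle_table(n):
    """Table of w_n^k = exp(2*pi*i*k/n) for k = 0 .. n-1, computed once."""
    return [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]

def stage_twiddles(table, m, j):
    """Return (w, w^2, w^3) for w = w_m^j using lookups only.

    Assumes m divides len(table) and 0 <= j < m/4, so all indices are in range.
    """
    step = len(table) // m
    return table[j * step], table[2 * j * step], table[3 * j * step]

table = make_twiddle_table(1024)
w, w2, w3 = stage_twiddles(table, 16, 3)
```

With a table like this, the powers of the twiddle factor and its running update become memory reads rather than complex multiplications, which is consistent with the drop from 14080 to 8960 complex multiplications reported above.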
Further enhancements possible • Use a higher radix: 8, 16, 32, etc. • Use split-radix or Winograd algorithms • If the input data is real, substantial savings are possible • Use Fast Bit-Reversal method (IEEE D.M.W. Evans)
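The gain for real input comes from conjugate symmetry: for a real sequence, X[n-k] is the complex conjugate of X[k], so only about half of the outputs need to be computed. The sketch below demonstrates the symmetry with a direct (slow) DFT rather than an FFT; the names `dft_bin` and `dft_real` are hypothetical.

```python
import cmath

def dft_bin(a, k):
    """One output bin X[k] = sum_m a[m] * exp(2*pi*i*m*k/n), computed directly."""
    n = len(a)
    return sum(a[m] * cmath.exp(2j * cmath.pi * m * k / n) for m in range(n))

def dft_real(a):
    """DFT of a real sequence, computing only bins 0 .. n/2 explicitly."""
    n = len(a)
    half = [dft_bin(a, k) for k in range(n // 2 + 1)]
    # For real input, X[n-k] is the complex conjugate of X[k].
    mirrored = [half[n - k].conjugate() for k in range(n // 2 + 1, n)]
    return half + mirrored
```

A production real-input FFT would exploit this symmetry inside the algorithm itself; this sketch only shows why roughly half of the work is redundant.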
Resources • Rivest, Cormen • Numerical Recipes in C • IEEE papers • Conversion of Digit-Reversed to Bit-Reversed order in FFT algorithms (Panos E. and C.S. Burrus) • The Design and Implementation of FFTW3 (Matteo Frigo and Steven G. Johnson) • cnx.org • Other fft implementations on the net • Best: fftw