200 likes | 323 Views
Parallel FFT. Sathish Vadhiyar. Sequential FFT – Quick Review. Twiddle factor – primitive n th root of unity in complex plane -. Sequential FFT – Quick Review. (n/2) th root of unity 2 (n/2)-point DFTs.
E N D
Parallel FFT Sathish Vadhiyar
Sequential FFT – Quick Review • Twiddle factor – primitive nth root of unity in complex plane -
Sequential FFT – Quick Review • (n/2)th root of unity • 2 (n/2)-point DFTs
Sequential FFT – quick review X(0) Y(0) wn0 wn0 wn0 X(4) Y(1) wn2 wn1 wn4 X(2) Y(2) wn4 wn2 wn0 X(6) Y(3) wn6 wn3 wn4 X(1) Y(4) wn0 wn0 wn4 X(5) Y(5) wn4 wn2 wn5 X(3) Y(6) wn0 wn4 wn6 X(7) Y(7) wn4 wn6 wn7
Sequential FFT – recursive solution 1. procedure R_FFT(X, Y, n, w) 2. if (n=1) then Y[0] := X[0] else 3. begin 4. R_FFT(<X(0), X(2), …, X[n-2]>, <Q[0], Q[1], …, Q[n/2]>, n/2, w2) 5. R_FFT(<X(1), X(3), …, X[n-1]>, <T[0], T[1], …, T[n/2]>, n/2, w2) 6. for i := 0 to n-1 do 7. Y[i] := Q[i mod (n/2)] + wiT(i mod (n/2)]; 8. end R_FFT
Sequential FFT – iterative solution 1. procedure ITERATIVE_FFT(X, Y, n) 2. begin 3. r := log n; 4. for i:= 0 to n-1 do R[i] := X[i]; 5. for m:= 0 to r-1 do 6. begin 7. for i:= 0 to n-1 do S[i] := R[i]; 8. for i:= 0 to n-1 do 9. begin /* Let (b0, b1, b2, … br-1) be the binary representation of i */ 10. j := (b0 … bm-10bm+1 .. br-1); 11. k := (b0 … bm-11bm+1 .. br-1); 12. R[i] := S[j] + S[k] x w(bmbm-1..b00..0) ; 13. endfor; 14. endfor; 15. for i:= 0 to n-1 do Y[i] := R[i]; 16. end ITERATIVE_FFT
Example of w calculation For a given m and i, the power of w is computed by reversing the order of the m+1 most significant bits and padding them by 0’s to the right.
Parallel FFT – Binary exchange X(0) Y(0) 000 P0 001 X(1) Y(4) 010 X(2) Y(2) P1 011 X(3) Y(6) 100 X(4) Y(1) P2 101 X(5) Y(5) X(3) Y(3) 110 P3 111 X(7) Y(7) d r
Binary Exchange • d – number of bits for representing processes; r – number of bits representing the elements • The d most significant bits of element i indicate the process that the element belongs to. • Only the first d of the r iterations require communication • In a given iteration, a process communicates with only one other process • Total execution time - ? (n/P)logN + logP(l) + (n/P)logP (b)
Parallel FFT – 2D Transpose P0 P1 P2 P3 P0 P1 P2 P3 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7 8 9 10 11 8 9 10 11 12 13 14 15 12 13 14 15 m = 0 m = 1 Phase 1 – FFTs along columns
Parallel FFT – 2D Transpose P0 P1 P2 P3 P0 P1 P2 P3 0 1 2 3 0 4 8 12 4 5 6 7 1 5 9 13 8 9 10 11 2 6 10 14 12 13 14 15 3 7 11 15 Phase 2 – Transpose
Parallel FFT – 2D Transpose P0 P1 P2 P3 P0 P1 P2 P3 0 4 8 12 0 4 8 12 1 5 9 13 1 5 9 13 2 6 10 14 2 6 10 14 3 7 11 15 3 7 11 15 m = 2 m = 3 Phase 3 – FFTs along columns
2D Transpose • In general, n elements arranged as √n x √n • p processes arranged along columns. Each process owns √n/p columns • Each process does √n/p FFTs of size √n each • Parallel runtime – 2(√n/p)√nlog√n + (p-1)(l)+ n/p(b)
3D Transpose • n1/3 x n1/3 x n1/3 elements • √p x √p processes • Steps ? • Parallel runtime – (n/p)logn(c) + 2(√p-1)(l) + 2(n/p)(b)
In general • For q dimensions: • Parallel runtime – (n/p)logn + (q-1)(p1/(q-1) -1) [l]+ (q-1)(n/p) [b] • Time due to latency decreases; due to bandwidth increases • For implementation – only 2D and 3D transposes are feasible. Moreover, there are restrictions on n and p in terms of q.
Choice of algorithm • Binary exchange – small latency, large bandwidth • 2D transpose – large latency, small bandwidth • Other transposes lie between binary exchange and 2D transpose • For a given parallel computer, based on l and b, different algorithms can give different performances for different problem sizes
FFTW • Self-optimizing package • Adapting FFT computation to the details of the hardware • FFTW has codelet library. • Executor is the implementation of C-T algorithm composed of certain Codelets • Composition of executor is the plan. • FFTW has codelets for computing FFTs of integers upto 16 and powers of 2 upto 64.
FFTW • E.g., for solving a 128-point FFT: • divide-and-conquer(128, 4) • divide-and-conquer(32,8) • solve(4) • Optimal plan depends on processor, memory architecture and compiler • Uses search space techniques to conduct experiments with a select set of plan and obtain the optimum.