Parallel FFT

Sathish Vadhiyar

Sequential FFT – Quick Review • Twiddle factor w: a primitive nth root of unity in the complex plane
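For reference, the n-point DFT that the review builds on can be written as follows, with ω standing for the twiddle factor w. The sign of the exponent is an assumption here, since the transcript does not preserve the formula; the opposite sign gives an equally valid primitive root.

```latex
\omega = e^{2\pi i / n}, \qquad
Y[i] = \sum_{k=0}^{n-1} X[k]\,\omega^{ik}, \quad 0 \le i < n
```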

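Sequential FFT – Quick Review • w^2 is a primitive (n/2)th root of unity • So an n-point DFT splits into two (n/2)-point DFTs, one over the even-indexed inputs and one over the odd-indexed inputs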
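The split can be spelled out by separating even and odd input indices. This is a sketch of the standard derivation, with ω for the twiddle factor w and Q, T named after the arrays used in the recursive procedure:

```latex
Y[i] = \sum_{k=0}^{n/2-1} X[2k]\,(\omega^{2})^{ki}
     \;+\; \omega^{i} \sum_{k=0}^{n/2-1} X[2k+1]\,(\omega^{2})^{ki}
     \;=\; Q[i \bmod (n/2)] + \omega^{i}\, T[i \bmod (n/2)]
```

Q and T are the (n/2)-point DFTs of the even-indexed and odd-indexed inputs, each taken with respect to ω², which is why the index wraps around modulo n/2.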

Sequential FFT – quick review • [Figure: butterfly network of an 8-point FFT. Inputs enter in bit-reversed order X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7); outputs leave in natural order Y(0), Y(1), …, Y(7); the edges of the three stages are labelled with powers of wn.]

Sequential FFT – recursive solution
procedure R_FFT(X, Y, n, w)
  if (n = 1) then Y[0] := X[0] else
  begin
    R_FFT(<X[0], X[2], …, X[n-2]>, <Q[0], Q[1], …, Q[n/2-1]>, n/2, w^2);
    R_FFT(<X[1], X[3], …, X[n-1]>, <T[0], T[1], …, T[n/2-1]>, n/2, w^2);
    for i := 0 to n-1 do
      Y[i] := Q[i mod (n/2)] + w^i T[i mod (n/2)];
  end R_FFT
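A minimal runnable sketch of the recursive procedure, assuming n is a power of two and w = e^(2πi/n) (the sign convention is an assumption; the slides only require a primitive nth root of unity). The function name `r_fft` and the list-returning interface are choices made for this illustration.

```python
import cmath

def r_fft(x, w=None):
    """Recursive FFT of a list whose length is a power of two."""
    n = len(x)
    if w is None:
        # Assumed convention: w = e^(2*pi*i/n), a primitive nth root of unity.
        w = cmath.exp(2j * cmath.pi / n)
    if n == 1:
        return [x[0]]
    q = r_fft(x[0::2], w * w)   # DFT of even-indexed inputs, root w^2
    t = r_fft(x[1::2], w * w)   # DFT of odd-indexed inputs, root w^2
    half = n // 2
    # Combine step: Y[i] = Q[i mod n/2] + w^i * T[i mod n/2]
    return [q[i % half] + (w ** i) * t[i % half] for i in range(n)]

if __name__ == "__main__":
    print(r_fft([1, 2, 3, 4]))
```

The two half-size calls do the same amount of work, which gives the usual n log n operation count.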

Sequential FFT – iterative solution
procedure ITERATIVE_FFT(X, Y, n)
begin
  r := log2 n;
  for i := 0 to n-1 do R[i] := X[i];
  for m := 0 to r-1 do
  begin
    for i := 0 to n-1 do S[i] := R[i];
    for i := 0 to n-1 do
    begin
      /* Let (b_0 b_1 … b_{r-1}) be the binary representation of i */
      j := (b_0 … b_{m-1} 0 b_{m+1} … b_{r-1});
      k := (b_0 … b_{m-1} 1 b_{m+1} … b_{r-1});
      R[i] := S[j] + S[k] × w^(b_m b_{m-1} … b_0 0 … 0);
    endfor;
  endfor;
  for i := 0 to n-1 do Y[i] := R[i];
end ITERATIVE_FFT
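The iterative procedure can be sketched in Python as below, assuming n is a power of two and w = e^(2πi/n) (an assumed sign convention). The helper `bit_reverse` is introduced for this illustration. Tracing the procedure shows that position i of the result holds the DFT output whose index is the bit reversal of i, which matches the output ordering Y(0), Y(4), Y(2), … in the binary-exchange figure.

```python
import cmath

def bit_reverse(v, width):
    """Reverse the lowest `width` bits of v."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (v & 1)
        v >>= 1
    return out

def iterative_fft(x):
    """Iterative FFT; result position i holds output index bit_reverse(i)."""
    n = len(x)
    r = n.bit_length() - 1
    assert 1 << r == n, "n must be a power of two"
    w = cmath.exp(2j * cmath.pi / n)   # assumed sign convention
    R = list(x)
    for m in range(r):
        S = list(R)
        shift = r - 1 - m              # position of bit b_m counted from the right
        mask = 1 << shift
        for i in range(n):
            j = i & ~mask              # i with b_m cleared
            k = i | mask               # i with b_m set
            # Reverse the m+1 most significant bits of i, pad with zeros.
            exponent = bit_reverse(i >> shift, m + 1) << shift
            R[i] = S[j] + S[k] * (w ** exponent)
    return R

if __name__ == "__main__":
    y = iterative_fft([1, 2, 3, 4])
    # Reorder into natural index order for display.
    print([y[bit_reverse(i, 2)] for i in range(4)])
```

Iteration m pairs elements whose indices differ only in bit b_m, starting from the most significant bit; this is the property the parallel formulations exploit.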

Example of w calculation • For a given m and i, the power of w is obtained by taking the m+1 most significant bits of i, reversing their order, and padding the result with zeros on the right to r bits.
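A small worked example of this rule, written as a sketch; the function name `twiddle_exponent` is hypothetical. For n = 8 (r = 3) and i = 5 = (101), the three iterations use the exponents 4, 2 and 5.

```python
def twiddle_exponent(i, m, r):
    """Power of w used for element i in iteration m of an r-bit FFT."""
    bits = format(i, f"0{r}b")           # b_0 b_1 ... b_{r-1}, b_0 most significant
    top = bits[: m + 1]                  # the m+1 most significant bits
    padded = top[::-1] + "0" * (r - m - 1)
    return int(padded, 2)

if __name__ == "__main__":
    # i = 5 = (101): m=0 -> (100) = 4, m=1 -> (010) = 2, m=2 -> (101) = 5
    print([twiddle_exponent(5, m, 3) for m in range(3)])   # [4, 2, 5]
```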

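Parallel FFT – Binary exchange • 8 elements on 4 processes: each index has r = 3 bits, and its d = 2 most significant bits give the owning process
P0: 000 X(0) → Y(0); 001 X(1) → Y(4)
P1: 010 X(2) → Y(2); 011 X(3) → Y(6)
P2: 100 X(4) → Y(1); 101 X(5) → Y(5)
P3: 110 X(6) → Y(3); 111 X(7) → Y(7)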

Binary Exchange • d – number of bits needed to represent a process (d = log p); r – number of bits needed to represent an element index (r = log n) • The d most significant bits of element index i identify the process that owns the element • Only the first d of the r iterations require communication • In a given iteration, a process communicates with only one other process • Total execution time: (n/p) log n + (log p) l + (n/p)(log p) b, where l is the per-message latency and b is the per-element transfer time
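The communication pattern described above can be checked with a short sketch. It assumes n and p are powers of two with p ≤ n and a block distribution in which the top d bits of an index give the owner; the function names are hypothetical.

```python
def owner(i, n, p):
    """Process owning element i: the d most significant of its r index bits."""
    r = n.bit_length() - 1
    d = p.bit_length() - 1
    return i >> (r - d)

def exchange_partners(rank, p):
    """Partner of `rank` in each of the d communicating iterations."""
    d = p.bit_length() - 1
    # Iteration m flips bit b_m of the element index; for m < d this is
    # bit (d - 1 - m) of the process rank.
    return [rank ^ (1 << (d - 1 - m)) for m in range(d)]

def needs_communication(m, n, p):
    """True if iteration m pairs elements that live on different processes."""
    r = n.bit_length() - 1
    mask = 1 << (r - 1 - m)
    return any(owner(i, n, p) != owner(i ^ mask, n, p) for i in range(n))

if __name__ == "__main__":
    print(exchange_partners(0, 4))                              # [2, 1]
    print([needs_communication(m, 8, 4) for m in range(3)])     # [True, True, False]
```

For the 8-element, 4-process layout, P0 exchanges with P2 in iteration 0 and with P1 in iteration 1, and iteration 2 is purely local.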

Parallel FFT – 2D Transpose • Phase 1 – FFTs along columns (iterations m = 0 and m = 1) • 16 elements as a 4 × 4 array; process Pj owns column j
  P0  P1  P2  P3
   0   1   2   3
   4   5   6   7
   8   9  10  11
  12  13  14  15

Parallel FFT – 2D Transpose • Phase 2 – Transpose
Before:                After:
  P0  P1  P2  P3         P0  P1  P2  P3
   0   1   2   3          0   4   8  12
   4   5   6   7          1   5   9  13
   8   9  10  11          2   6  10  14
  12  13  14  15          3   7  11  15

Parallel FFT – 2D Transpose • Phase 3 – FFTs along columns of the transposed array (iterations m = 2 and m = 3)
  P0  P1  P2  P3
   0   4   8  12
   1   5   9  13
   2   6  10  14
   3   7  11  15

2D Transpose • In general, the n elements are arranged as a √n × √n array • The p processes are arranged along the columns; each process owns √n/p columns • In each FFT phase, each process computes √n/p FFTs of size √n • Parallel runtime: 2(√n/p) √n log √n + (p-1) l + (n/p) b
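Why the transpose makes every phase local can be checked with a short sketch. It assumes n is a power of four, that element i sits at row i // √n and column i mod √n (as in the 4 × 4 example), and that iteration m pairs index i with the index that differs from it only in bit b_m. The function name is hypothetical.

```python
import math

def butterfly_locality(n):
    """For each iteration m, report whether paired elements share a column or a row."""
    r = n.bit_length() - 1
    side = math.isqrt(n)
    assert 1 << r == n and side * side == n, "n must be a power of four"
    result = []
    for m in range(r):
        mask = 1 << (r - 1 - m)          # iteration m pairs i with i XOR mask
        same_col = all(i % side == (i ^ mask) % side for i in range(n))
        same_row = all(i // side == (i ^ mask) // side for i in range(n))
        result.append("column" if same_col else "row" if same_row else "neither")
    return result

if __name__ == "__main__":
    print(butterfly_locality(16))   # ['column', 'column', 'row', 'row']
```

The first half of the iterations pair elements within a column, so they are local under a column distribution. The second half pair elements within a row, and the transpose turns those rows into columns so that they become local as well.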

3D Transpose • n^(1/3) × n^(1/3) × n^(1/3) elements • √p × √p processes • Steps? • Parallel runtime: (n/p)(log n) c + 2(√p - 1) l + 2(n/p) b

In general • For q dimensions: • Parallel runtime: (n/p) log n + (q-1)(p^(1/(q-1)) - 1) l + (q-1)(n/p) b • As q increases, the time due to latency decreases and the time due to bandwidth increases • In practice, only the 2D and 3D transposes are feasible to implement; moreover, q places restrictions on n and p.

Choice of algorithm • Binary exchange – small latency cost, large bandwidth cost • 2D transpose – large latency cost, small bandwidth cost • Other transposes lie between binary exchange and the 2D transpose • For a given parallel computer, depending on its l and b, different algorithms give the best performance for different problem sizes
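The trade-off can be explored by evaluating the runtime expressions from the earlier slides. This is a rough sketch: the function names are hypothetical, `c` (time per computation step) defaults to 1 as an assumption, and the values of l and b in the usage example are made up for illustration rather than measured on any machine.

```python
import math

def binary_exchange_time(n, p, l, b, c=1.0):
    """(n/p) log n compute + (log p) l latency + (n/p)(log p) b bandwidth."""
    return c * (n / p) * math.log2(n) + l * math.log2(p) + b * (n / p) * math.log2(p)

def transpose_time(n, p, q, l, b, c=1.0):
    """q-dimensional transpose: (q-1)(p^(1/(q-1)) - 1) l + (q-1)(n/p) b."""
    latency = (q - 1) * (p ** (1.0 / (q - 1)) - 1) * l
    bandwidth = (q - 1) * (n / p) * b
    return c * (n / p) * math.log2(n) + latency + bandwidth

if __name__ == "__main__":
    n, p = 1 << 20, 64
    for l, b in [(1000.0, 0.1), (10.0, 5.0)]:   # illustrative values only
        print(l, b,
              binary_exchange_time(n, p, l, b),
              transpose_time(n, p, 2, l, b),
              transpose_time(n, p, 3, l, b))
```

With a large l relative to b, the log p latency term favours binary exchange; with a large b, the single (n/p) b term favours the 2D transpose.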

FFTW • Self-optimizing package • Adapts the FFT computation to the details of the hardware • FFTW has a library of codelets • The executor is an implementation of the Cooley-Tukey (C-T) algorithm composed of selected codelets • The way the executor is composed is called the plan • FFTW has codelets for computing FFTs of every integer size up to 16 and of powers of 2 up to 64

FFTW • E.g., for solving a 128-point FFT: • divide-and-conquer(128, 4) • divide-and-conquer(32, 8) • solve(4) • The optimal plan depends on the processor, the memory architecture and the compiler • Uses search techniques to run experiments on a selected set of plans and pick the fastest one
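To make the idea of a plan concrete, here is a toy Python sketch that executes the example plan above. It is not FFTW's API or its internals: `execute`, `direct_dft`, and the representation of a plan as a list of radices are all hypothetical, and `divide-and-conquer(n, r)` is read here as "split the size-n problem into r sub-FFTs of size n/r". The sign convention w = e^(2πi/n) is also an assumption.

```python
import cmath

def direct_dft(x):
    """Stand-in for a codelet: direct O(n^2) DFT of a small input."""
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * cmath.pi * i * k / n) for k in range(n))
            for i in range(n)]

def execute(plan, x):
    """Run a plan: each radix in `plan` is one divide-and-conquer step."""
    if not plan:
        return direct_dft(x)                 # solve(n) at the bottom of the plan
    n = len(x)
    radix, rest = plan[0], plan[1:]
    assert n % radix == 0, "radix must divide the problem size"
    m = n // radix
    # Sub-problem j works on inputs j, j + radix, j + 2*radix, ...
    subs = [execute(rest, x[j::radix]) for j in range(radix)]
    # Cooley-Tukey combine: Y[i] = sum_j w^(i*j) * F_j[i mod m]
    return [sum(cmath.exp(2j * cmath.pi * i * j / n) * subs[j][i % m]
                for j in range(radix))
            for i in range(n)]

if __name__ == "__main__":
    x = [complex(k % 7) for k in range(128)]
    # divide-and-conquer(128, 4) -> divide-and-conquer(32, 8) -> solve(4)
    y = execute([4, 8], x)
    print(max(abs(a - b) for a, b in zip(y, direct_dft(x))))
```

Different radix lists, such as `[2, 2, 2]` or `[8, 4]`, compute the same transform with different memory-access patterns, which is the kind of choice a planner would time experimentally.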
