Automatic Tuning for Parallel FFTs Daisuke Takahashi University of Tsukuba, Japan Second French-Japanese PAAP Workshop
Outline • Background • Objectives • Approach • Block Six-Step/Nine-Step FFT Algorithm • Automatic Tuning for Parallel FFTs • Performance Results • Conclusion
Background • The fast Fourier transform (FFT) is an algorithm widely used today in science and engineering. • Parallel FFT algorithms on distributed-memory parallel computers have been well studied. • Many numerical libraries with automatic performance tuning have been developed, e.g., ATLAS, FFTW, and I-LIB.
Background (cont’d) • One goal for large FFTs is to minimize the number of cache misses. • Many FFT algorithms work well when data sets fit into a cache. • When a problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically. • We modified the conventional six-step FFT algorithm to reuse data in the cache memory. → We call this a “block six-step FFT”.
Related Work • FFTW [Frigo and Johnson (MIT)] • A recursive call structure is employed to access the memory hierarchy. • This technique is very effective when the total amount of data is not much larger than the cache size. • For 1-D parallel MPI FFT, the six-step FFT is used. • http://www.fftw.org • SPIRAL [Pueschel et al. (CMU)] • The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization for digital signal processing (DSP) algorithms. • http://www.spiral.net
FFTE: A High-Performance FFT Library • FFTE is a Fortran subroutine library for computing the fast Fourier transform (FFT) in one or more dimensions. • It includes complex, mixed-radix, and parallel transforms. • It supports shared- and distributed-memory parallel computers (OpenMP, MPI, and OpenMP + MPI). • It also supports Intel’s SSE2/SSE3 instructions. • HPC Challenge benchmark: FFTE’s 1-D parallel FFT routine has been incorporated into the HPC Challenge (HPCC) benchmark. • http://www.ffte.jp
Objectives • To improve performance, we need to select the optimal parameters according to the computational environment and the problem size. • We implement an automatic tuning facility for the parallel 1-D FFT routine in the FFTE library.
Discrete Fourier Transform (DFT) • The DFT is given by $y(k) = \sum_{j=0}^{n-1} x(j)\,\omega_n^{jk}$, $0 \le k \le n-1$, where $\omega_n = e^{-2\pi i/n}$ and $i = \sqrt{-1}$.
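As a point of reference (this is not FFTE code), the definition above can be evaluated directly; the O(n^2) C loop below is a minimal sketch, useful only for checking fast implementations on small sizes.

```c
#include <complex.h>
#include <math.h>

/* Direct O(n^2) evaluation of y(k) = sum_j x(j) * w_n^(j*k),
   with w_n = exp(-2*pi*i/n).  Reference only; an FFT computes
   the same result in O(n log n). */
void dft(int n, const double complex *x, double complex *y)
{
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < n; k++) {
        double complex s = 0.0;
        for (int j = 0; j < n; j++)
            s += x[j] * cexp(-2.0 * pi * I * (double)j * (double)k / (double)n);
        y[k] = s;
    }
}
```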
2-D Formulation • If $n$ has factors $n_1$ and $n_2$ ($n = n_1 \times n_2$), then, with $j = j_1 + j_2 n_1$ and $k = k_2 + k_1 n_2$, the DFT can be written as $y(k_2, k_1) = \sum_{j_1=0}^{n_1-1} \left[ \left( \sum_{j_2=0}^{n_2-1} x(j_1, j_2)\,\omega_{n_2}^{j_2 k_2} \right) \omega_{n_1 n_2}^{j_1 k_2} \right] \omega_{n_1}^{j_1 k_1}$, i.e., $n_1$ FFTs of length $n_2$, a twiddle-factor multiplication, and $n_2$ FFTs of length $n_1$.
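To make the index mapping concrete, the short C check below (illustrative only, not part of FFTE) verifies numerically that the splitting j = j1 + j2*n1, k = k2 + k1*n2 turns the single twiddle factor w_n^(jk) into the product of the three factors appearing in the 2-D formulation.

```c
#include <complex.h>
#include <math.h>
#include <stdio.h>

/* w_m^e = exp(-2*pi*i*e/m) */
static double complex w(int m, int e)
{
    const double pi = 3.14159265358979323846;
    return cexp(-2.0 * pi * I * (double)e / (double)m);
}

int main(void)
{
    const int n1 = 4, n2 = 8, n = n1 * n2;   /* small example sizes */
    double maxerr = 0.0;

    for (int j1 = 0; j1 < n1; j1++)
      for (int j2 = 0; j2 < n2; j2++)
        for (int k1 = 0; k1 < n1; k1++)
          for (int k2 = 0; k2 < n2; k2++) {
              int j = j1 + j2 * n1;          /* input index mapping  */
              int k = k2 + k1 * n2;          /* output index mapping */
              /* w_n^(jk) = w_n2^(j2*k2) * w_n^(j1*k2) * w_n1^(j1*k1) */
              double complex lhs = w(n, (j * k) % n);
              double complex rhs = w(n2, j2 * k2) * w(n, j1 * k2) * w(n1, j1 * k1);
              double err = cabs(lhs - rhs);
              if (err > maxerr) maxerr = err;
          }
    printf("max |lhs - rhs| = %g\n", maxerr);   /* should be ~1e-15 */
    return 0;
}
```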
Six-Step FFT Algorithm • [diagram] Transpose → $n_1$ individual $n_2$-point FFTs → twiddle-factor multiplication → Transpose → $n_2$ individual $n_1$-point FFTs → Transpose
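A minimal serial sketch of the six steps in C is given below; it is our illustration, not FFTE code. The input is viewed as an n2 x n1 array (j1 fastest), and the "individual FFTs" are stood in by a naive DFT for brevity; a real implementation would call a fast radix kernel there.

```c
#include <complex.h>
#include <math.h>
#include <stdlib.h>

static const double PI = 3.14159265358979323846;

/* naive m-point DFT of one row, stand-in for a fast FFT kernel */
static void dft_row(int m, const double complex *in, double complex *out)
{
    for (int k = 0; k < m; k++) {
        double complex s = 0.0;
        for (int j = 0; j < m; j++)
            s += in[j] * cexp(-2.0 * PI * I * (double)j * k / m);
        out[k] = s;
    }
}

/* a is r0 x c0 (row-major); b becomes c0 x r0 with b[c][r] = a[r][c] */
static void transpose(int r0, int c0, const double complex *a, double complex *b)
{
    for (int r = 0; r < r0; r++)
        for (int c = 0; c < c0; c++)
            b[c * r0 + r] = a[r * c0 + c];
}

/* Six-step FFT of x (length n = n1*n2, element x(j1 + j2*n1) at x[j2*n1+j1]);
   the result y(k2 + k1*n2) is stored at y[k1*n2 + k2]. */
void six_step_fft(int n1, int n2, const double complex *x, double complex *y)
{
    int n = n1 * n2;
    double complex *a = malloc((size_t)n * sizeof *a);
    double complex *b = malloc((size_t)n * sizeof *b);

    /* Step 1: transpose the n2 x n1 input view into an n1 x n2 array. */
    transpose(n2, n1, x, a);

    /* Step 2: n1 individual n2-point FFTs (one per row). */
    for (int j1 = 0; j1 < n1; j1++)
        dft_row(n2, &a[j1 * n2], &b[j1 * n2]);

    /* Step 3: twiddle-factor multiplication by w_n^(j1*k2). */
    for (int j1 = 0; j1 < n1; j1++)
        for (int k2 = 0; k2 < n2; k2++)
            b[j1 * n2 + k2] *= cexp(-2.0 * PI * I * (double)j1 * k2 / n);

    /* Step 4: transpose n1 x n2 -> n2 x n1. */
    transpose(n1, n2, b, a);

    /* Step 5: n2 individual n1-point FFTs. */
    for (int k2 = 0; k2 < n2; k2++)
        dft_row(n1, &a[k2 * n1], &b[k2 * n1]);

    /* Step 6: transpose n2 x n1 -> n1 x n2. */
    transpose(n2, n1, b, y);

    free(a); free(b);
}
```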
Block Six-Step FFT Algorithm • [diagram] Partial transpose → individual $n_2$-point FFTs → Transpose → individual $n_1$-point FFTs → Partial transpose • The two full transposes of the six-step FFT are replaced by partial (blocked) transposes so that each block of columns is transformed while it remains in cache.
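As a rough illustration of the blocking idea (our sketch, not FFTE's actual code), the first transpose, the n2-point FFTs, and the twiddle multiplication can be fused and applied to nb columns at a time, so that the working set of each pass is about nb * n2 complex elements and can stay in cache.

```c
#include <complex.h>
#include <math.h>

static const double PI = 3.14159265358979323846;

/* naive stand-in for a fast n2-point FFT kernel (see the six-step sketch) */
static void dft_row(int m, const double complex *in, double complex *out)
{
    for (int k = 0; k < m; k++) {
        double complex s = 0.0;
        for (int j = 0; j < m; j++)
            s += in[j] * cexp(-2.0 * PI * I * (double)j * k / m);
        out[k] = s;
    }
}

/* First phase of a blocked six-step FFT: x is the n2 x n1 input view
   (j1 fastest), out is the n1 x n2 array of twiddled row FFTs.
   Columns are processed nb at a time; buf must hold nb*n2 elements. */
void blocked_first_phase(int n1, int n2, int nb,
                         const double complex *x, double complex *out,
                         double complex *buf)
{
    int n = n1 * n2;
    for (int jb = 0; jb < n1; jb += nb) {
        int bw = (jb + nb <= n1) ? nb : n1 - jb;   /* width of this block */

        /* partial transpose: gather columns jb..jb+bw-1 into a compact
           bw x n2 block that fits in cache */
        for (int j2 = 0; j2 < n2; j2++)
            for (int j1 = jb; j1 < jb + bw; j1++)
                buf[(j1 - jb) * n2 + j2] = x[j2 * n1 + j1];

        /* n2-point FFT of each row of the block, plus twiddle factors,
           written to the output array while the block is still in cache */
        for (int j1 = jb; j1 < jb + bw; j1++) {
            dft_row(n2, &buf[(j1 - jb) * n2], &out[j1 * n2]);
            for (int k2 = 0; k2 < n2; k2++)
                out[j1 * n2 + k2] *= cexp(-2.0 * PI * I * (double)j1 * k2 / n);
        }
    }
}
```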
3-D Formulation • For very large FFTs, we should switch to a 3-D formulation. • If $n$ has factors $n_1$, $n_2$, and $n_3$ ($n = n_1 \times n_2 \times n_3$), then, with $j = j_1 + j_2 n_1 + j_3 n_1 n_2$ and $k = k_3 + k_2 n_3 + k_1 n_2 n_3$, the DFT can be written as $y(k_3, k_2, k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} \sum_{j_3=0}^{n_3-1} x(j_1, j_2, j_3)\,\omega_{n_3}^{j_3 k_3}\,\omega_{n_2 n_3}^{j_2 k_3}\,\omega_{n_2}^{j_2 k_2}\,\omega_{n_1 n_2 n_3}^{j_1 (k_3 + k_2 n_3)}\,\omega_{n_1}^{j_1 k_1}$.
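A worked example of why the 3-D split helps, using an illustrative size $n = 2^{30}$, 16-byte double-complex elements, and a 32 KB L1 data cache (the size of the Woodcrest Xeon used later); the exact numbers are ours, not from the slides:

```latex
\begin{align*}
\text{2-D split } (n_1 = n_2 = 2^{15}):\quad
  & 2^{15} \times 16\,\text{B} = 512\,\text{KB per column FFT} \gg 32\,\text{KB L1}\\
\text{3-D split } (n_1 = n_2 = n_3 = 2^{10}):\quad
  & 2^{10} \times 16\,\text{B} = 16\,\text{KB per column FFT} < 32\,\text{KB L1}
\end{align*}
```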
Parallel Block Nine-Step FFT • [diagram] Partial transpose → all-to-all communication → partial transpose → ... → partial transpose • The inter-node transposition is carried out by all-to-all communication; the remaining partial transposes and FFT steps are node-local, following the block algorithm.
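The global transposition between the distributed FFT stages is the all-to-all step shown in the diagram. A minimal MPI sketch of that exchange (ours, not an FFTE routine) looks like this: the data destined for each process are packed contiguously, exchanged with MPI_Alltoall, and then locally rearranged.

```c
#include <complex.h>
#include <mpi.h>

/* Exchange a locally held block of n_local = n/p complex elements so that
   each process ends up with the slab it needs for the next FFT stage.
   sendbuf is assumed to be packed so that the chunk for process q occupies
   sendbuf[q*chunk .. (q+1)*chunk-1], with chunk = n_local / p. */
void global_transpose(int n_local, double complex *sendbuf,
                      double complex *recvbuf, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int chunk = n_local / p;       /* complex elements per process pair */

    /* Send each complex element as a pair of doubles; older MPI
       implementations may lack a C double-complex datatype. */
    MPI_Alltoall(sendbuf, 2 * chunk, MPI_DOUBLE,
                 recvbuf, 2 * chunk, MPI_DOUBLE, comm);

    /* A local (partial) transpose of recvbuf would follow here to restore
       the row/column order expected by the next FFT stage. */
}
```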
Automatic Tuning for Parallel FFTs • If the factorization condition on $n$ is satisfied, we can choose arbitrary $n_1$, $n_2$, and $n_3$, where $n = n_1 \times n_2 \times n_3$. • In the original FFTE library, a fixed choice of these factors was used. • The blocking parameter (block size) can also be varied. • For a given $n$, the best block size is determined by the L2 cache size. • In the original FFTE, a fixed block size was used for the Xeon processor. • We implemented an automatic tuning facility that varies the factors $n_1$, $n_2$, $n_3$ and the block size.
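One straightforward way to realize the tuning described above, sketched here under the assumption of a hypothetical timing hook run_parallel_fft(n1, n2, n3, nb) (not an actual FFTE entry point), is an exhaustive search over candidate factorizations and block sizes, keeping the fastest combination.

```c
#include <float.h>
#include <mpi.h>

/* Hypothetical driver: performs one parallel 1-D FFT of size n1*n2*n3
   with block size nb; provided elsewhere.  Not FFTE's API. */
void run_parallel_fft(long n1, long n2, long n3, int nb);

/* Time every candidate (n1, n2, n3) factorization and block size for a
   fixed n, and remember the fastest combination. */
void autotune(int ncand, const long cand[][3],
              int nblk, const int blocks[],
              long best_f[3], int *best_nb)
{
    double best = DBL_MAX;
    for (int c = 0; c < ncand; c++) {
        for (int b = 0; b < nblk; b++) {
            MPI_Barrier(MPI_COMM_WORLD);          /* synchronize before timing */
            double t0 = MPI_Wtime();
            run_parallel_fft(cand[c][0], cand[c][1], cand[c][2], blocks[b]);
            double t = MPI_Wtime() - t0;

            /* take the slowest process as the measured time */
            double tmax;
            MPI_Allreduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

            if (tmax < best) {
                best = tmax;
                best_f[0] = cand[c][0];
                best_f[1] = cand[c][1];
                best_f[2] = cand[c][2];
                *best_nb = blocks[b];
            }
        }
    }
}
```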
Performance Results • To evaluate parallel 1-D FFTs, we compared: • FFTE (ver. 4.0) • FFTE (ver. 4.0) with automatic tuning • FFTW (ver. 3.2alpha3); “mpi-bench” with the “PATIENT” planner was used. • Target parallel machine: a 16-node dual-core Xeon PC cluster (Woodcrest 2.4 GHz, 2 GB SDRAM/node, Linux 2.6.18), interconnected through a Gigabit Ethernet switch. • Open MPI 1.2.5 was used as the communication library. • The compilers used were Intel C compiler 10.1 and Intel Fortran compiler 10.1.
Results of Automatic Tuning on dual-core Xeon 2.4GHz PC cluster
Discussion • For N = 2^28 and P = 32, FFTE with automatic tuning runs about 1.25 times faster than FFTW. • Since FFTW uses the six-step FFT, each column FFT does not fit into the L1 data cache. • Moreover, FFTE exploits the SSE3 instructions. • These two factors explain why FFTE outperforms FFTW here. • We can clearly see that the all-to-all communication overhead contributes significantly to the execution time.
Conclusions • We proposed an automatic tuning method for parallel 1-D FFTs on distributed-memory parallel computers. • A blocking algorithm for parallel 1-D FFTs utilizes cache memory effectively. • The results of the automatic tuning show that the default parameters of FFTE are not always optimal. • The performance of FFTE with automatic tuning is better than that of FFTW.