A Volumetric 3-D FFT on Clusters of Multi-Core Processors Daisuke Takahashi University of Tsukuba, Japan Third French-Japanese PAAP Workshop
Outline • Background • Objectives • Approach • 3-D FFT Algorithm • Volumetric 3-D FFT Algorithm • Performance Results • Conclusion
Background • The fast Fourier transform (FFT) is an algorithm widely used today in science and engineering. • Parallel 3-D FFT algorithms on distributed-memory parallel computers have been well studied. • November 2008 TOP500 Supercomputing Sites • Roadrunner: 1,105.00 TFlops (129,600 cores) • Jaguar (Cray XT5 QC 2.3 GHz): 1,059.00 TFlops (150,152 cores) • The number of cores in such systems keeps increasing.
Background (cont'd) • A typical decomposition for performing a parallel 3-D FFT is slabwise. • A 3-D array of size n1 x n2 x n3 is distributed along the third dimension, n3. • n3 must be greater than or equal to the number of MPI processes. • This becomes an issue with very large node counts for a massively parallel cluster of multi-core processors.
Related Work • Scalable framework for 3-D FFTs on the Blue Gene/L supercomputer [Eleftheriou et al. 03, 05] • Based on a volumetric decomposition of data. • Scales well up to 1,024 nodes for 3-D FFTs of size 128x128x128. • 3-D FFT on the 6-D network torus QCDOC parallel supercomputer [Fang et al. 07] • 3-D FFTs of size 128x128x128 can scale well on QCDOC up to 4,096 nodes.
Objectives • Implementation and evaluation of a highly scalable 3-D FFT on massively parallel clusters of multi-core processors. • Reduce the communication time for larger numbers of MPI processes. • A comparison between the 1-D and 2-D distributions for the 3-D FFT.
Approach • Some previously presented volumetric 3-D FFT algorithms [Eleftheriou et al. 03, 05; Fang et al. 07] use a 3-D distribution for the 3-D FFT. • These schemes require three all-to-all communications. • We use a 2-D distribution for the volumetric 3-D FFT. • It requires only two all-to-all communications.
3-D FFT • The 3-D discrete Fourier transform (DFT) is given by the definition below.
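A standard way to write the 3-D DFT of an n1 x n2 x n3 array (the symbols n1, n2, n3, omega are chosen here for illustration, not taken from the slide):

$$
y(k_1,k_2,k_3)=\sum_{j_3=0}^{n_3-1}\sum_{j_2=0}^{n_2-1}\sum_{j_1=0}^{n_1-1}
x(j_1,j_2,j_3)\,\omega_{n_1}^{\,j_1k_1}\,\omega_{n_2}^{\,j_2k_2}\,\omega_{n_3}^{\,j_3k_3},
\qquad \omega_{n_r}=e^{-2\pi i/n_r},\quad 0\le k_r\le n_r-1.
$$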
1-D distribution along the z-axis • With a slab decomposition: 1. FFTs in the x-axis, 2. FFTs in the y-axis, 3. FFTs in the z-axis (see the sketch below).
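A minimal sketch of the slab scheme using mpi4py and NumPy (my own illustration, not the author's implementation; the array sizes, variable names, and packing layout are assumptions):

```python
# Illustrative sketch of the slab (1-D) decomposition: each of the P MPI
# processes owns n3/P planes of an n1 x n2 x n3 array, and a single
# all-to-all redistributes the data before the z-direction FFTs.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P = comm.Get_size()
n1 = n2 = n3 = 64                       # assumed sizes; n3 >= P, n2 and n3 divisible by P

# Local slab, indexed [z_local, y, x]: n3/P planes of size n2 x n1.
rng = np.random.default_rng(comm.Get_rank())
local = rng.standard_normal((n3 // P, n2, n1)) + 0j

# Steps 1-2: FFTs along x and y are entirely local to each slab.
local = np.fft.fft(local, axis=2)       # x-axis
local = np.fft.fft(local, axis=1)       # y-axis

# Step 3 needs whole z-columns, so redistribute with ONE all-to-all:
# split the y dimension into P blocks and exchange block p with process p.
send = np.ascontiguousarray(
    local.reshape(n3 // P, P, n2 // P, n1).transpose(1, 0, 2, 3))
recv = np.empty_like(send)              # shape (P, n3//P, n2//P, n1)
comm.Alltoall(send, recv)

# Each process now holds all z for its n2/P share of y: do the z-FFTs.
pencils = recv.reshape(n3, n2 // P, n1)
pencils = np.fft.fft(pencils, axis=0)   # z-axis
```

Steps 1 and 2 are purely local; the single all-to-all over all P processes before step 3 is the communication the cost model on the later slides accounts for.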
2-D distribution along the y- and z-axes • With a volumetric domain decomposition: 1. FFTs in the x-axis, 2. FFTs in the y-axis, 3. FFTs in the z-axis (see the sketch below).
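A corresponding sketch of the 2-D (pencil) decomposition, again my own illustration under assumed sizes and a square process grid; the point to notice is that each of the two all-to-alls stays inside a row or column sub-communicator of only Py or Pz processes:

```python
# Illustrative sketch of the 2-D (volumetric/pencil) decomposition:
# P = Py x Pz processes, each initially owning an x-pencil of an
# n1 x n2 x n3 array; only two all-to-alls are needed, each confined
# to a row or column sub-communicator of the process grid.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()
Py = Pz = int(round(P ** 0.5))           # assumes a square Py x Pz grid
assert Py * Pz == P
pz, py = divmod(rank, Py)                # this process's grid coordinates
row = comm.Split(color=pz, key=py)       # Py processes sharing a z block
col = comm.Split(color=py, key=pz)       # Pz processes sharing an x block

n1 = n2 = n3 = 64                        # assumed sizes, divisible by Py and Pz
rng = np.random.default_rng(rank)
# x-pencil, indexed [z, y, x]: all of x, n2/Py of y, n3/Pz of z.
a = rng.standard_normal((n3 // Pz, n2 // Py, n1)) + 0j

a = np.fft.fft(a, axis=2)                # 1. FFTs along x (local)

# First all-to-all (within the row): trade x blocks for full y columns.
send = np.ascontiguousarray(
    a.reshape(n3 // Pz, n2 // Py, Py, n1 // Py).transpose(2, 0, 1, 3))
recv = np.empty_like(send)
row.Alltoall(send, recv)
a = recv.transpose(1, 0, 2, 3).reshape(n3 // Pz, n2, n1 // Py)

a = np.fft.fft(a, axis=1)                # 2. FFTs along y (local)

# Second all-to-all (within the column): trade y blocks for full z columns.
send = np.ascontiguousarray(
    a.reshape(n3 // Pz, Pz, n2 // Pz, n1 // Py).transpose(1, 0, 2, 3))
recv = np.empty_like(send)
col.Alltoall(send, recv)
a = recv.reshape(n3, n2 // Pz, n1 // Py)

a = np.fft.fft(a, axis=0)                # 3. FFTs along z (local)
```

Compared with the slab sketch, each exchange involves far fewer processes, which is what reduces the latency term analyzed on the next slides.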
Communication time of the 1-D distribution • Let us assume, for an n-point FFT: • Latency of communication: L (sec) • Bandwidth: W (bytes/s) • Number of processors: P • One all-to-all communication is required. • The communication time of the 1-D distribution (sec) is estimated below.
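A reconstruction of this estimate, assuming 16-byte double-precision complex elements and an all-to-all realized as P-1 pairwise exchanges of n/P^2 elements each (these modelling assumptions are mine, not stated on the slide):

$$
T_{\text{1-D}} \;=\; (P-1)\left(L+\frac{16\,n}{P^{2}\,W}\right)\;\approx\;P\,L+\frac{16\,n}{P\,W}\quad\text{(sec)}
$$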
Communication time of the 2-D distribution • Two all-to-all communications are required: • simultaneous all-to-all communications among the groups of processors along the y-axis. • simultaneous all-to-all communications among the groups of processors along the z-axis. • The communication time of the 2-D distribution (sec) is estimated below.
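Under the same assumptions as above, and additionally assuming a square sqrt(P) x sqrt(P) process grid, each phase is an all-to-all among sqrt(P) processes exchanging n/(P*sqrt(P)) elements per pair, so:

$$
T_{\text{2-D}} \;=\; 2\left(\sqrt{P}-1\right)\left(L+\frac{16\,n}{P\sqrt{P}\,W}\right)\;\approx\;2\sqrt{P}\,L+\frac{32\,n}{P\,W}\quad\text{(sec)}
$$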
Comparing communication time • Communication time of the 1-D distribution: one all-to-all among all P processors (see above). • Communication time of the 2-D distribution: two all-to-alls among smaller groups of processors (see above). • Comparing the two estimates, the communication time of the 2-D distribution is less than that of the 1-D distribution for larger numbers of processors P and larger latency L (see below).
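Subtracting the two approximate estimates above makes the trade-off explicit (again under my modelling assumptions):

$$
T_{\text{1-D}}-T_{\text{2-D}} \;\approx\; \left(P-2\sqrt{P}\right)L \;-\; \frac{16\,n}{P\,W},
$$

so the 2-D distribution wins once the latency saved by replacing one all-to-all over P processes with two all-to-alls over sqrt(P) processes outweighs its doubled data volume, i.e. for sufficiently large P or L.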
Performance Results • To evaluate parallel 3-D FFTs, we compared: • the 1-D distribution • the 2-D distribution • for two FFT problem sizes (a smaller and a larger transform) on 1 to 4,096 cores. • Target parallel machine: T2K-Tsukuba system (256 nodes, 4,096 cores). • The flat MPI programming model was used. • MVAPICH 1.2.0 was used as the communication library. • The compiler used was Intel Fortran compiler 10.1.
T2K-Tsukuba System • Specification • Number of nodes: 648 (Appro Xtreme-X3 Server) • Theoretical peak performance: 95.4 TFlops • Node configuration: 4 sockets of quad-core AMD Opteron 8356 (Barcelona, 2.3 GHz) • Total main memory size: 20 TB • Network interface: DDR InfiniBand Mellanox ConnectX HCA x 4 • Network topology: Fat tree • Full-bisection bandwidth: 5.18 TB/s
Computation Node of T2K-Tsukuba: [block diagram] four quad-core Opteron sockets, each with 4 x 2 GB 667 MHz DDR2 DIMMs (dual-channel, registered), linked by HyperTransport at 8 GB/s full-duplex; NVIDIA nForce 3600/3050 bridges provide PCI-Express x16/x8 links (4 GB/s full-duplex) to two pairs of Mellanox MHGH28-XTC ConnectX HCAs (1.2 µs MPI latency, 4X DDR 20 Gb/s), with PCI-X, SAS, and USB attached via the I/O hub.
Discussion (1/2) • For the smaller FFT, we can clearly see that communication overhead dominates the execution time. • In this case, the total working set size is only 1 MB. • On the other hand, the 2-D distribution scales well up to 4,096 cores for the larger FFT. • Performance on 4,096 cores is over 401 GFlops, about 1.1% of the theoretical peak. • Performance excluding the all-to-all communications is over 10 TFlops, about 26.7% of the theoretical peak.
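As a consistency check (assuming 4 double-precision flops per cycle per Barcelona core, which matches the 95.4 TFlops system peak quoted earlier):

$$
4096\ \text{cores}\times 2.3\ \text{GHz}\times 4\ \tfrac{\text{flops}}{\text{cycle}}\approx 37.7\ \text{TFlops},\qquad
\frac{401\ \text{GFlops}}{37.7\ \text{TFlops}}\approx 1.1\%,\qquad
\frac{10\ \text{TFlops}}{37.7\ \text{TFlops}}\approx 27\%.
$$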
Discussion (2/2) • For smaller numbers of cores, the performance of the 1-D distribution is better than that of the 2-D distribution. • This is because the total communication volume of the 1-D distribution is half that of the 2-D distribution. • However, for larger numbers of cores, the performance of the 2-D distribution is better than that of the 1-D distribution due to the latency.
Conclusions • We implemented a volumetric parallel 3-D FFT on clusters of multi-core processors. • We showed that the 2-D distribution improves performance effectively by reducing the communication time for larger numbers of MPI processes. • The proposed volumetric parallel 3-D FFT algorithm is most advantageous on massively parallel clusters of multi-core processors. • We achieved a performance of over 401 GFlops on the T2K-Tsukuba system with 4,096 cores for the larger FFT.