
A Volumetric 3-D FFT on Clusters of Multi-Core Processors



  1. A Volumetric 3-D FFT on Clusters of Multi-Core Processors Daisuke Takahashi University of Tsukuba, Japan Third French-Japanese PAAP Workshop

  2. Outline • Background • Objectives • Approach • 3-D FFT Algorithm • Volumetric 3-D FFT Algorithm • Performance Results • Conclusion Third French-Japanese PAAP Workshop

  3. Background • The fast Fourier transform (FFT) is an algorithm widely used today in science and engineering. • Parallel 3-D FFT algorithms on distributed-memory parallel computers have been well studied. • November 2008 TOP500 Supercomputing Sites • Roadrunner: 1,105.00 TFlops (129,600 Cores) • Jaguar (Cray XT5 QC 2.3GHz): 1,059.00 TFlops (150,152 Cores) • The number of cores per system keeps increasing. Third French-Japanese PAAP Workshop

  4. Background (cont’d) • A typical decomposition for performing a parallel 3-D FFT is slabwise. • A 3-D array of size n1 × n2 × n3 is distributed along the third dimension, n3. • n3 must be greater than or equal to the number of MPI processes. • This becomes an issue with very large node counts for a massively parallel cluster of multi-core processors. Third French-Japanese PAAP Workshop

  5. Related Work • Scalable framework for 3-D FFTs on the Blue Gene/L supercomputer [Eleftheriou et al. 03, 05] • Based on a volumetric decomposition of data. • Scales well up to 1,024 nodes for 3-D FFTs of size 128x128x128. • 3-D FFT on the 6-D network torus QCDOC parallel supercomputer [Fang et al. 07] • 3-D FFTs of size 128x128x128 can scale well on QCDOC up to 4,096 nodes. Third French-Japanese PAAP Workshop

  6. Objectives • Implementation and evaluation of a highly scalable 3-D FFT on a massively parallel cluster of multi-core processors. • Reduce the communication time for larger numbers of MPI processes. • A comparison between the 1-D and 2-D distributions for the 3-D FFT. Third French-Japanese PAAP Workshop

  7. Approach • Some previously presented volumetric 3-D FFT algorithms [Eleftheriou et al. 03, 05; Fang et al. 07] use a 3-D distribution for the 3-D FFT. • These schemes require three all-to-all communications. • We use a 2-D distribution for the volumetric 3-D FFT. • It requires only two all-to-all communications. Third French-Japanese PAAP Workshop

  8. 3-D FFT • The 3-D discrete Fourier transform (DFT) is given by $X(k_1,k_2,k_3)=\sum_{j_3=0}^{n_3-1}\sum_{j_2=0}^{n_2-1}\sum_{j_1=0}^{n_1-1} x(j_1,j_2,j_3)\,\omega_{n_1}^{j_1 k_1}\,\omega_{n_2}^{j_2 k_2}\,\omega_{n_3}^{j_3 k_3}$, where $\omega_{n_r}=e^{-2\pi i/n_r}$ and $N=n_1 n_2 n_3$ is the total number of points. Third French-Japanese PAAP Workshop
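As a quick sanity check of this separable structure (plain NumPy, no MPI; the array size is arbitrary), the 3-D transform can be computed as three passes of 1-D FFTs, which is exactly the property the decompositions on the next two slides exploit:

```python
import numpy as np

x = np.random.rand(8, 8, 8) + 1j * np.random.rand(8, 8, 8)

y = np.fft.fft(x, axis=0)        # FFTs along x
y = np.fft.fft(y, axis=1)        # FFTs along y
y = np.fft.fft(y, axis=2)        # FFTs along z

assert np.allclose(y, np.fft.fftn(x))   # agrees with the direct 3-D DFT
```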

  9. 1-D distribution along the z-axis • With a slab decomposition: 1. FFTs along the x-axis 2. FFTs along the y-axis 3. FFTs along the z-axis (one all-to-all communication is needed beforehand to make the z-axis local). Third French-Japanese PAAP Workshop
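A minimal sketch of this slab-decomposed scheme, written here with mpi4py and NumPy rather than the Fortran/MPI code evaluated in the talk; the grid size, variable names, and packing layout are illustrative assumptions (nx and nz are taken to be divisible by the number of processes P):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()

nx = ny = nz = 64                              # global sizes; nx, nz assumed divisible by P
a = (np.random.rand(nx, ny, nz // P)           # local slab: all x, all y, nz/P planes of z
     + 1j * np.random.rand(nx, ny, nz // P))

# Steps 1 and 2: the x and y axes are entirely local, so no communication is needed.
a = np.fft.fft(a, axis=0)
a = np.fft.fft(a, axis=1)

# Step 3 needs the z axis local, which takes the single all-to-all of the slab algorithm:
# block p of the send buffer (an x-slab) goes to rank p.
send = np.ascontiguousarray(a.reshape(P, nx // P, ny, nz // P))
recv = np.empty_like(send)
comm.Alltoall(send, recv)

# Reassemble: this rank now owns x-slab `rank` and every z plane, ordered by source rank.
a = np.concatenate([recv[p] for p in range(P)], axis=2)    # shape (nx // P, ny, nz)
a = np.fft.fft(a, axis=2)                                  # step 3: FFTs along z
```

Note that after the exchange the array is left distributed along x instead of z, the usual transposed output order of slab-decomposed 3-D FFTs.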

  10. 2-D distribution along the y- and z-axes • With a volumetric domain decomposition: 1. FFTs along the x-axis 2. FFTs along the y-axis (after an all-to-all within each group of processes in the y-direction) 3. FFTs along the z-axis (after an all-to-all within each group of processes in the z-direction). Third French-Japanese PAAP Workshop
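In the 2-D (volumetric) scheme, the single global all-to-all is replaced by two all-to-alls inside much smaller groups. The sketch below, under the same mpi4py assumption and with an illustrative choice of grid shape, shows only the part this slide emphasizes: how the P = Py × Pz process grid and the two sub-communicators for the y-phase and z-phase exchanges are formed (the packing of the send buffers follows the same pattern as the slab sketch above):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()

# Factor the processes into a Py x Pz grid (a near-square grid is one possible choice).
Py = int(round(P ** 0.5))
while P % Py:                       # fall back to the nearest divisor of P
    Py -= 1
Pz = P // Py

py, pz = rank % Py, rank // Py      # this rank's coordinates in the grid

# Ranks sharing pz form one y-group: the first all-to-all (making the y axis local)
# runs simultaneously inside each of the Pz groups, over only Py processes.
comm_y = comm.Split(color=pz, key=py)

# Ranks sharing py form one z-group: the second all-to-all (making the z axis local)
# runs simultaneously inside each of the Py groups, over only Pz processes.
comm_z = comm.Split(color=py, key=pz)

# Each phase is then an Alltoall on comm_y or comm_z over roughly sqrt(P) ranks
# instead of one Alltoall over all P ranks, which is what shrinks the latency term.
```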

  11. Communication time of the 1-D distribution • Let us assume, for an N-point FFT: • Latency of communication: L (sec) • Bandwidth: W (Byte/s) • The number of processors: P • One all-to-all communication is required. • Communication time of the 1-D distribution: roughly $(P-1)\,\bigl(L + 16N/(P^2 W)\bigr)$ sec, since each processor sends one $16N/P^2$-byte message to each of the other $P-1$ processors. Third French-Japanese PAAP Workshop

  12. Communication time of the 2-D distribution • Two all-to-all communications, with the P processors arranged as a $P_y \times P_z$ grid: • $P_z$ simultaneous all-to-all communications among the $P_y$ processors in the y-axis. • $P_y$ simultaneous all-to-all communications among the $P_z$ processors in the z-axis. • Communication time of the 2-D distribution: roughly $(P_y-1)\,\bigl(L + 16N/(P P_y W)\bigr) + (P_z-1)\,\bigl(L + 16N/(P P_z W)\bigr)$ sec. Third French-Japanese PAAP Workshop

  13. Comparing communication time • Communication time of the 1-D distribution: roughly $(P-1)\,\bigl(L + 16N/(P^2 W)\bigr)$ sec. • Communication time of the 2-D distribution: roughly $(P_y-1)\,\bigl(L + 16N/(P P_y W)\bigr) + (P_z-1)\,\bigl(L + 16N/(P P_z W)\bigr)$ sec. • Comparing the two expressions, the latency term shrinks from about $P\,L$ to about $2\sqrt{P}\,L$ (for $P_y = P_z = \sqrt{P}$) while the transferred data volume only doubles, so the communication time of the 2-D distribution is less than that of the 1-D distribution for larger numbers of processors P and larger latency L. Third French-Japanese PAAP Workshop
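To get a concrete feel for this trade-off, the reconstructed cost expressions above can be evaluated numerically. The latency below is the MPI latency quoted on slide 16; the per-process bandwidth and the FFT size are purely assumed values, so only the trend, not the absolute times, is meaningful:

```python
L = 1.2e-6            # s, MPI latency of the ConnectX HCA (slide 16)
W = 1.0e9             # B/s effective per-process bandwidth -- an assumed value
N = 2 ** 24           # FFT points (complex128, 16 bytes each) -- an assumed size

def t_1d(P):
    """One all-to-all over all P processes."""
    return (P - 1) * (L + 16 * N / (P * P * W))

def t_2d(P):
    """Two all-to-all phases over Py- and Pz-sized groups (square grid assumed)."""
    Py = Pz = int(P ** 0.5)
    return ((Py - 1) * (L + 16 * N / (P * Py * W))
            + (Pz - 1) * (L + 16 * N / (P * Pz * W)))

for P in (64, 256, 1024, 4096):
    print(f"P={P:5d}   1-D: {t_1d(P)*1e3:7.3f} ms   2-D: {t_2d(P)*1e3:7.3f} ms")
```

With these assumed values the 1-D model is cheaper at 64 processes, but the 2-D model is far cheaper at 4,096 processes, mirroring the crossover discussed on slide 20.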

  14. Performance Results • To evaluate parallel 3-D FFTs, we compared • the 1-D distribution • and the 2-D distribution • using two FFT problem sizes (a smaller and a larger transform) on 1 to 4,096 cores. • Target parallel machine: • T2K-Tsukuba system (256 nodes, 4,096 cores). • The flat MPI programming model was used. • MVAPICH 1.2.0 was used as the communication library. • The compiler used was the Intel Fortran compiler 10.1. Third French-Japanese PAAP Workshop

  15. T2K-Tsukuba System • Specification • Number of nodes: 648 (Appro Xtreme-X3 Server) • Theoretical peak performance: 95.4 TFlops • Node configuration: 4-socket quad-core AMD Opteron 8356 (Barcelona, 2.3 GHz) • Total main memory size: 20 TB • Network interface: DDR InfiniBand Mellanox ConnectX HCA x 4 • Network topology: Fat tree • Full-bisection bandwidth: 5.18 TB/s Third French-Japanese PAAP Workshop

  16. Computation Node of T2K-Tsukuba (block diagram): four Opteron sockets linked by HyperTransport (8 GB/s full-duplex), each socket with 4 x 2 GB 667 MHz registered dual-channel DDR2 DIMMs, NVIDIA nForce 3600/3050 bridges with PCI-Express x16/x8 and PCI-X slots, and four Mellanox MHGH28-XTC ConnectX HCAs (4X DDR 20 Gb/s, 1.2 µs MPI latency). Third French-Japanese PAAP Workshop

  17. (Performance results figure) Third French-Japanese PAAP Workshop

  18. Discussion (1/2) • For the smaller FFT, we can clearly see that communication overhead dominates the execution time. • In this case, the total working set size is only 1 MB. • On the other hand, the 2-D distribution scales well up to 4,096 cores for the larger FFT. • Performance on 4,096 cores is over 401 GFlops, about 1.1% of theoretical peak. • Excluding the all-to-all communications, performance is over 10 TFlops, about 26.7% of theoretical peak. Third French-Japanese PAAP Workshop
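The quoted peak percentages follow from the per-core peak implied by slide 15 (95.4 TFlops over 648 nodes of 16 cores, i.e. about 9.2 GFlops per core); a quick arithmetic check:

```python
per_core_peak = 95.4e12 / (648 * 16)      # ~9.2 GFlops per Opteron core (slide 15)
peak_4096 = 4096 * per_core_peak          # theoretical peak of the 4,096 cores used

print(401e9 / peak_4096)                  # ~0.011 -> the quoted "about 1.1%" of peak
print(10.0e12 / peak_4096)                # ~0.27  -> close to the quoted 26.7%
```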

  19. (Performance results figure) Third French-Japanese PAAP Workshop

  20. Discussion (2/2) • For smaller numbers of MPI processes, the performance of the 1-D distribution is better than that of the 2-D distribution. • This is because the total communication amount of the 1-D distribution is half that of the 2-D distribution. • However, for larger numbers of MPI processes, the performance of the 2-D distribution is better than that of the 1-D distribution due to the latency. Third French-Japanese PAAP Workshop

  21. (Performance results figure) Third French-Japanese PAAP Workshop

  22. Conclusions • We implemented a volumetric parallel 3-D FFT on clusters of multi-core processors. • We showed that a 2-D distribution improves performance effectively by reducing the communication time for larger numbers of MPI processes. • The proposed volumetric parallel 3-D FFT algorithm is most advantageous on massively parallel clusters of multi-core processors. • We successfully achieved performance of over 401 GFlops on the T2K-Tsukuba system with 4,096 cores for the larger FFT. Third French-Japanese PAAP Workshop
