The FFTE Library and the HPC Challenge (HPCC) Benchmark Suite Daisuke Takahashi Center for Computational Sciences/ Graduate School of Systems and Information Engineering University of Tsukuba First French-Japanese PAAP Workshop
Outline • HPC Challenge (HPCC) Benchmark Suite • Overview • The Benchmark Tests • Example Results • FFTE: A High-Performance FFT Library • Background • Related Works • Block Six-Step/Nine-Step FFT Algorithm • Performance Results • Conclusion and Future Work
Overview of the HPC Challenge (HPCC) Benchmark Suite • HPC Challenge (HPCC) is a suite of tests that examine the performance of HPC architectures using kernels. • The suite provides benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g., • Spatial locality • Temporal locality
The Benchmark Tests • The HPC Challenge benchmark currently consists of 7 performance tests: • HPL (High Performance Linpack) • DGEMM (matrix-matrix multiplication) • STREAM (sustainable memory bandwidth) • PTRANS (A = A + B^T, parallel matrix transpose) • RandomAccess (integer updates to random memory locations) • FFT (complex 1-D discrete Fourier transform) • b_eff (MPI latency/bandwidth test)
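The contrast between the STREAM and RandomAccess access patterns can be sketched in a few lines. This is only an illustration of the two kernels' memory behaviour, not the HPCC reference code: the function names are hypothetical, and Python's `random` module stands in for the benchmark's own pseudo-random stream.

```python
import random

def stream_triad(b, c, scalar):
    """STREAM-style triad a[i] = b[i] + scalar * c[i] (unit-stride access)."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]

def random_update(table, num_updates, seed=0):
    """RandomAccess-style kernel: XOR pseudo-random values into random slots."""
    rng = random.Random(seed)  # stand-in generator, NOT the HPCC random stream
    size = len(table)
    for _ in range(num_updates):
        value = rng.getrandbits(64)
        table[value % size] ^= value  # scattered access, little locality
    return table

a = stream_triad([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], 2.0)
t = random_update(list(range(16)), 64)
```

The triad walks memory with unit stride, while the update loop touches slots in an unpredictable order. Because XOR is its own inverse, replaying the same update sequence restores the original table.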
Targeted Application Areas in the Memory Access Locality Space • [Figure: benchmarks and application areas placed in a plane whose axes are spatial locality and temporal locality. Benchmarks shown: PTRANS, STREAM, HPL, DGEMM, RandomAccess, FFT. Application areas shown: CFD, Radar X-section, TSP, DSP.]
HPCC Testing Scenarios • Local (S-STREAM, S-RandomAccess, S-DGEMM, S-FFTE) • Only a single MPI process computes. • Embarrassingly parallel (EP-STREAM, EP-RandomAccess, EP-DGEMM, EP-FFTE) • All processes compute and do not communicate (explicitly). • Global (G-HPL, G-PTRANS, G-RandomAccess, G-FFTE) • All processes compute and communicate. • Network only (RandomRing Bandwidth, etc.)
Sample results page • http://icl.cs.utk.edu/hpcc/hpcc_results.cgi
The winners of the 2006 HPC Challenge Class 1 Awards • G-HPL: 259 TFlop/s • IBM Blue Gene/L (131072 Procs) • G-RandomAccess: 35 GUPS • IBM Blue Gene/L (131072 Procs) • G-FFTE: 2311 GFlop/s • IBM Blue Gene/L (131072 Procs) • EP-STREAM-Triad (system): 160 TB/s • IBM Blue Gene/L (131072 Procs)
FFTE: A High-Performance FFT Library • FFTE is a Fortran subroutine library for computing the Fast Fourier Transform (FFT) in one or more dimensions. • It includes complex, mixed-radix and parallel transforms. • It targets shared-memory and distributed-memory parallel computers (OpenMP, MPI and OpenMP + MPI). • It also supports Intel’s SSE2/SSE3 instructions. • The FFTE library can be obtained from http://www.ffte.jp
Background • One goal for large FFTs is to minimize the number of cache misses. • Many FFT algorithms work well when the data set fits into the cache. • When a problem exceeds the cache size, however, the performance of these FFT algorithms decreases dramatically. • The conventional six-step FFT algorithm requires • Two multicolumn FFTs. • Three data transpositions. → These are the chief bottlenecks on cache-based processors.
Related Works • FFTW [Frigo and Johnson (MIT)] • Recursive calls are used to access main memory hierarchically. • This technique is very effective when the total amount of data is not much larger than the cache size. • For the parallel FFT, the conventional six-step FFT is used. • http://www.fftw.org • SPIRAL [Pueschel et al. (CMU)] • The goal of SPIRAL is to push the limits of automation in software and hardware development and optimization for digital signal processing (DSP) algorithms. • http://www.spiral.net
Approach • Some previously presented six-step FFT algorithms separate the multicolumn FFTs from the transpositions. • Taking the opposite approach, we combine the multicolumn FFTs and transpositions to reduce the number of cache misses. • We modify the conventional six-step FFT algorithm to reuse data in the cache memory. → We call this a “block six-step FFT”.
Discrete Fourier Transform (DFT) • The DFT is given by $y(k) = \sum_{j=0}^{n-1} x(j)\,\omega_n^{jk}$, $0 \le k \le n-1$, where $\omega_n = e^{-2\pi i/n}$.
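As a reference point, the DFT definition can be evaluated directly in O(n^2) operations. The sketch below assumes the forward-transform sign convention $\omega_n = e^{-2\pi i/n}$; the function name `dft` is illustrative and is not part of FFTE.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT: y(k) = sum_j x(j) * exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * ((j * k) % n) / n) for j in range(n))
        for k in range(n)
    ]

y = dft([1, 2, 3, 4])
```

For the input `[1, 2, 3, 4]` the result is `[10, -2+2j, -2, -2-2j]` up to rounding error.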
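2-D Formulation • If $n$ has factors $n_1$ and $n_2$ ($n = n_1 n_2$), then the indices can be written as $j = j_1 + j_2 n_1$ and $k = k_2 + k_1 n_2$, and the DFT becomes $y(k_2, k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} x(j_1, j_2)\,\omega_{n_2}^{j_2 k_2}\,\omega_{n_1 n_2}^{j_1 k_2}\,\omega_{n_1}^{j_1 k_1}$, where $\omega_m = e^{-2\pi i/m}$.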
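The two-dimensional factorization can be checked numerically against the direct definition. The sketch below assumes the index maps $j = j_1 + j_2 n_1$ and $k = k_2 + k_1 n_2$ and the twiddle $\omega_m = e^{-2\pi i/m}$; all helper names are hypothetical.

```python
import cmath

def omega(m, e):
    """Twiddle factor omega_m^e = exp(-2*pi*i*e/m)."""
    return cmath.exp(-2j * cmath.pi * (e % m) / m)

def direct_dft(x):
    """Reference O(n^2) DFT."""
    n = len(x)
    return [sum(x[j] * omega(n, j * k) for j in range(n)) for k in range(n)]

def dft_2d(x, n1, n2):
    """Evaluate the DFT through the 2-D formulation with n = n1 * n2."""
    n = n1 * n2
    y = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            total = 0j
            for j1 in range(n1):
                # Inner n2-point sum over j2, then the two remaining twiddles.
                inner = sum(x[j1 + j2 * n1] * omega(n2, j2 * k2) for j2 in range(n2))
                total += inner * omega(n, j1 * k2) * omega(n1, j1 * k1)
            y[k2 + k1 * n2] = total
    return y

x = [complex(v, -v % 3) for v in range(12)]
max_err = max(abs(a - b) for a, b in zip(dft_2d(x, 3, 4), direct_dft(x)))
```

`max_err` stays at rounding-error level, which confirms that the three twiddle factors together reproduce $\omega_n^{jk}$.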
Six-Step FFT Algorithm • [Figure: data layout through the six-step FFT, with stages labeled Transpose, individual FFTs, Transpose, and Transpose.]
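The six steps can be sketched as follows, assuming the usual textbook arrangement: transpose, $n_1$ individual $n_2$-point FFTs, twiddle multiplication, transpose, $n_2$ individual $n_1$-point FFTs, transpose. This is an illustrative pure-Python version, not FFTE code, and `small_dft` is a direct O(m^2) stand-in for a real FFT kernel.

```python
import cmath

def omega(m, e):
    """Twiddle factor omega_m^e = exp(-2*pi*i*e/m)."""
    return cmath.exp(-2j * cmath.pi * (e % m) / m)

def small_dft(v):
    """Direct DFT used here in place of an optimized FFT kernel."""
    m = len(v)
    return [sum(v[j] * omega(m, j * k) for j in range(m)) for k in range(m)]

def six_step_fft(x, n1, n2):
    n = n1 * n2
    # Step 1: transpose so that each j1 owns a contiguous row of n2 elements.
    a = [[x[j1 + j2 * n1] for j2 in range(n2)] for j1 in range(n1)]
    # Step 2: n1 individual n2-point FFTs.
    b = [small_dft(row) for row in a]
    # Step 3: twiddle-factor multiplication by omega_n^(j1*k2).
    for j1 in range(n1):
        for k2 in range(n2):
            b[j1][k2] *= omega(n, j1 * k2)
    # Step 4: transpose.
    c = [[b[j1][k2] for j1 in range(n1)] for k2 in range(n2)]
    # Step 5: n2 individual n1-point FFTs.
    d = [small_dft(row) for row in c]
    # Step 6: transpose into natural order, y[k2 + k1*n2].
    return [d[k2][k1] for k1 in range(n1) for k2 in range(n2)]

x = [complex(v % 5, v % 3) for v in range(32)]
err = max(abs(p - q) for p, q in zip(six_step_fft(x, 4, 8), small_dft(x)))
```

Steps 1, 4 and 6 each sweep the whole array without doing any arithmetic, which is why the transpositions dominate once the data no longer fit in cache.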
Block Six-Step FFT Algorithm • [Figure: data layout through the block six-step FFT, with stages labeled Partial Transpose, individual FFTs, Transpose, and Partial Transpose.]
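One way to picture the blocking idea is to gather a few rows at a time into a small work buffer, transform and twiddle them while they are (conceptually) cache-resident, and store them straight back. The sketch below is a simplified illustration under that assumption. The block size `nb` and all function names are hypothetical, the final transpose is fused into the last store rather than blocked, and FFTE's actual loop structure may differ.

```python
import cmath

def omega(m, e):
    """Twiddle factor omega_m^e = exp(-2*pi*i*e/m)."""
    return cmath.exp(-2j * cmath.pi * (e % m) / m)

def small_dft(v):
    """Direct DFT used here in place of an optimized FFT kernel."""
    m = len(v)
    return [sum(v[j] * omega(m, j * k) for j in range(m)) for k in range(m)]

def block_six_step_fft(x, n1, n2, nb):
    n = n1 * n2
    x = list(x)  # working copy, element (j1, j2) stored at x[j1 + j2*n1]
    for jb in range(0, n1, nb):
        rows = range(jb, min(jb + nb, n1))
        # Partial transpose: gather nb strided rows into a small work buffer.
        work = [[x[j1 + j2 * n1] for j2 in range(n2)] for j1 in rows]
        for row, j1 in zip(work, rows):
            out = small_dft(row)  # n2-point FFT on buffered data
            for k2 in range(n2):
                # Twiddle and store back while the block is still "hot".
                x[j1 + k2 * n1] = out[k2] * omega(n, j1 * k2)
    y = [0j] * n
    for k2 in range(n2):
        # The n1 elements for a fixed k2 are now contiguous in x.
        col = small_dft(x[k2 * n1:(k2 + 1) * n1])
        for k1 in range(n1):
            y[k2 + k1 * n2] = col[k1]
    return y

x = [complex(v % 7, -(v % 4)) for v in range(24)]
err = max(abs(p - q) for p, q in zip(block_six_step_fft(x, 6, 4, 4), small_dft(x)))
```

In this sketch the result does not depend on `nb`. The block size only changes how much data each pass touches before the data are reused.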
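3-D Formulation • For very large FFTs, we should switch to a 3-D formulation. • If $n$ has factors $n_1$, $n_2$ and $n_3$ ($n = n_1 n_2 n_3$), then with $j = j_1 + j_2 n_1 + j_3 n_1 n_2$ and $k = k_3 + k_2 n_3 + k_1 n_2 n_3$ the DFT becomes $y(k_3, k_2, k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} \sum_{j_3=0}^{n_3-1} x(j_1, j_2, j_3)\,\omega_{n_3}^{j_3 k_3}\,\omega_{n_2 n_3}^{j_2 k_3}\,\omega_{n_2}^{j_2 k_2}\,\omega_{n}^{j_1 k_3}\,\omega_{n_1 n_2}^{j_1 k_2}\,\omega_{n_1}^{j_1 k_1}$, where $\omega_m = e^{-2\pi i/m}$.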
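The three-dimensional factorization can be evaluated as three passes of short transforms with twiddle multiplications in between. The sketch below assumes the index maps $j = j_1 + j_2 n_1 + j_3 n_1 n_2$ and $k = k_3 + k_2 n_3 + k_1 n_2 n_3$; the helper names are hypothetical and the direct sums stand in for real FFT kernels.

```python
import cmath

def omega(m, e):
    """Twiddle factor omega_m^e = exp(-2*pi*i*e/m)."""
    return cmath.exp(-2j * cmath.pi * (e % m) / m)

def direct_dft(x):
    """Reference O(n^2) DFT."""
    n = len(x)
    return [sum(x[j] * omega(n, j * k) for j in range(n)) for k in range(n)]

def dft_3d(x, n1, n2, n3):
    n = n1 * n2 * n3
    # Pass 1: n3-point transforms over j3, then twiddle omega_{n2*n3}^(j2*k3).
    a = {}
    for j1 in range(n1):
        for j2 in range(n2):
            for k3 in range(n3):
                s = sum(x[j1 + j2 * n1 + j3 * n1 * n2] * omega(n3, j3 * k3)
                        for j3 in range(n3))
                a[j1, j2, k3] = s * omega(n2 * n3, j2 * k3)
    # Pass 2: n2-point transforms over j2, then the two j1-dependent twiddles.
    b = {}
    for j1 in range(n1):
        for k3 in range(n3):
            for k2 in range(n2):
                s = sum(a[j1, j2, k3] * omega(n2, j2 * k2) for j2 in range(n2))
                b[j1, k3, k2] = s * omega(n, j1 * k3) * omega(n1 * n2, j1 * k2)
    # Pass 3: n1-point transforms over j1, stored at k = k3 + k2*n3 + k1*n2*n3.
    y = [0j] * n
    for k3 in range(n3):
        for k2 in range(n2):
            for k1 in range(n1):
                y[k3 + k2 * n3 + k1 * n2 * n3] = sum(
                    b[j1, k3, k2] * omega(n1, j1 * k1) for j1 in range(n1))
    return y

x = [complex(v % 5, v % 7) for v in range(24)]
err = max(abs(p - q) for p, q in zip(dft_3d(x, 2, 3, 4), direct_dft(x)))
```

Each pass only needs transforms of length $n_1$, $n_2$ or $n_3$, so with suitable factors every short transform can fit in cache even when $n$ itself does not.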
Parallel Block Nine-Step FFT • [Figure: data layout through the parallel block nine-step FFT, with stages labeled Partial Transpose, All-to-all comm., Partial Transpose, and Partial Transpose.]
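The all-to-all exchange acts as a global transpose of blocks between processes. The sketch below simulates that exchange pattern inside a single Python process. It does not call MPI, and `all_to_all` is a hypothetical helper that mirrors the semantics of `MPI_Alltoall`, where rank p's q-th block ends up as rank q's p-th block.

```python
def all_to_all(send):
    """Simulate an all-to-all: send[p][q] is the block rank p sends to rank q.

    Returns recv, where recv[q][p] is the block rank q received from rank p.
    """
    nprocs = len(send)
    return [[send[p][q] for p in range(nprocs)] for q in range(nprocs)]

# Three simulated ranks, each holding one row of a 3 x 3 arrangement of blocks.
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
cols = all_to_all(rows)
```

After the exchange each simulated rank holds one column of the block arrangement, which is the distributed transpose needed between the local FFT passes.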
Operation Counts for an $n$-point FFT • Conventional FFT algorithms (e.g., Cooley-Tukey FFT, Stockham FFT) • Arithmetic operations: • Main memory accesses: • Block Nine-Step FFT • Arithmetic operations: • Main memory accesses (ideal case):
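A rough way to compare the two approaches is to count words moved per pass over the data. The estimate below is a sketch under stated assumptions, not the figures from the original slide: it assumes a radix-2 transform, 10 real floating-point operations per butterfly, two real words per complex element, a conventional FFT that streams the whole array through main memory once per stage, and a blocked FFT whose three passes each read and write the array exactly once.

```latex
% Counting model (assumed): radix-2 butterflies, 2 real words per complex value.
\begin{align*}
\text{Arithmetic (radix-2):}\quad
  & \tfrac{n}{2}\ \text{butterflies/stage} \times 10\ \text{flops}
    \times \log_2 n\ \text{stages} = 5n\log_2 n \\
\text{Memory, one sweep per stage:}\quad
  & (2n + 2n)\ \text{words/stage} \times \log_2 n\ \text{stages} = 4n\log_2 n \\
\text{Memory, three blocked passes:}\quad
  & 3 \times (2n + 2n)\ \text{words} = 12n
\end{align*}
```

Under this model the arithmetic work is unchanged by blocking, while main memory traffic drops from growing with $\log_2 n$ to a constant number of sweeps.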
Performance Results • To evaluate the implemented parallel FFT, we compared • The implemented parallel FFT, named FFTE (ver. 4.0, supports SSE3, using MPI) • FFTW (ver. 2.1.5, does not support SSE3, using MPI) • Target parallel machine: • A 32-node dual PC SMP cluster (Irwindale 3 GHz, 1 GB DDR2-400 SDRAM / node, Linux 2.4.17-1smp). • Interconnected through a Gigabit Ethernet switch. • LAM/MPI 7.1.1 was used as the communication library. • The compilers used were gcc 4.0.2 and g77 3.2.3.
Discussion • For N = 2^29 and P = 32, FFTE runs about 1.72 times faster than FFTW. • The performance of FFTE remains high even for larger problem sizes, owing to cache blocking. • Since FFTW uses the conventional six-step FFT, each column FFT does not fit into the L1 data cache. • Moreover, FFTE exploits the SSE3 instructions. • These are the three reasons why FFTE has the advantage over FFTW.
Conclusion and Future Work • The block nine-step FFT algorithm is most advantageous on processors that have a considerable gap between the speed of the cache memory and that of the main memory. • Towards petascale computing systems: • Exploiting multi-level parallelism: • SIMD or vector accelerator • Multi-core • Multi-socket • Multi-node • Reducing the number of main memory accesses. • Improving the all-to-all communication performance. • In G-FFTE, the all-to-all communication occurs three times.