Scaling of a Pseudospectral DNS Turbulence Code and a Parallel 3D FFT Library Dmitry Pekurovsky dmitry@sdsc.edu San Diego Supercomputer Center
Examples of turbulence are common Disorderly, nonlinear fluctuations in 3D space and time that span a wide range of interacting scales
DNS examples of special interest • Study effects of Reynolds and Schmidt numbers on local isotropy and intermittency • Study mixing of passive contaminants and active chemical species with a range of molecular diffusivities • Requires simulations at high Reynolds and Schmidt numbers • Study dispersion in the Lagrangian frame • Capture effects of anisotropy • Include Coriolis forces (rotating frame)
Powerful computers and efficient algorithms are required for DNS • Capturing the physics at small scales requires simulation on grids of the largest possible size • Largest simulations in U.S.: 2048^3 • Worldwide: Earth Simulator 4096^3 (Kaneda et al. 2002) • Desired in the near future: 4096^3 and beyond, with more complex physics • NSF turbulence model problem: 12288^3 • Memory requirements grow as N^3, the number of grid points • Computational work grows even faster than the number of grid points, since the number of time steps must also grow to maintain numerical stability
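To make the last point concrete, here is a standard back-of-the-envelope estimate (my own, not from the slides): with a CFL-limited time step, dt scales like 1/N, so the step count grows linearly with resolution while each step costs O(N^3 log N) in FFTs:

$$\mathrm{work} \;\sim\; \underbrace{N^3 \log N}_{\text{per step}} \times \underbrace{N}_{\text{steps}} \;=\; O\!\left(N^4 \log N\right)$$

Doubling the resolution thus multiplies memory by 8 but compute by roughly 16.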
DNS code P.K. Yeung, D. Donzis (Georgia Tech) • Code written in Fortran 90 with MPI • Time evolution: 2nd-order Runge-Kutta • Spatial derivative calculation: pseudospectral method • Typically, FFTs are done in all 3 dimensions • Consider the 3D FFT as a compute-intensive kernel representative of the performance characteristics of the full code (the 3D FFT is ~70% of the computation and ~95% of the communication) • Input is real, output is complex; or vice versa SDSC SAC (Strategic Applications Collaboration) project • Focus on scalability
Typical parallel 3D FFT today • Grid: N x N x N – how to do a parallel FFT? • Transpose strategy, as opposed to direct strategy: make sure all data in the direction of the 1D transform resides in one processor's memory, and parallelize over the remaining dimensions • 1D decomposition • Each CPU has several planes (slabs) • Local geometry: N x N x L, where L = N/P • Transform in the 2 dimensions local to the planes first • Use an established library for the 1D FFT, e.g. ESSL, FFTW • Transpose data to localize the third dimension • Local geometry: N x L x N • Complete the transform by operating along the third dimension • Nearly everyone has been using this approach. Examples: • PESSL • NAS Parallel Benchmarks
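A bare-bones Fortran/MPI sketch of this transpose strategy (illustrative only: the FFT calls via ESSL or FFTW are elided, P is assumed to divide N, and a real code would pack the send buffer into per-destination N x L x L blocks before the exchange):

  program slab_transpose_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 64
    integer :: p, me, l, ierr
    complex, allocatable :: slab(:,:,:), work(:,:,:)

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
    l = n / p                        ! slab thickness L = N/P

    allocate(slab(n, n, l))          ! local geometry: N x N x L
    allocate(work(n, l, n))          ! after transpose: N x L x N
    slab = cmplx(real(me), 0.0)

    ! ...2D FFTs in x and y, local to each z-plane, go here...

    ! One all-to-all exchange localizes z (send buffer packing elided)
    call MPI_Alltoall(slab, n*l*l, MPI_COMPLEX, &
                      work, n*l*l, MPI_COMPLEX, MPI_COMM_WORLD, ierr)

    ! ...1D FFTs along z (now fully local, length N) go here...

    call MPI_Finalize(ierr)
  end program slab_transpose_sketch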
1D decomposition scaling • Communication: 1 call to MPI_Alltoall or an equivalent all-to-all exchange • A demanding operation: performance depends on the interconnect's bisection bandwidth. Definition: cut the network in two and count the bandwidth of the links cut • This parameter depends on the bandwidth of each link and on the network topology (e.g. fat-tree for Datastar, 3D torus for Blue Gene) • Can scale OK (depending on the interconnect), but only up to a point: P <= N • For a 4096^3 grid, no more than 4096 processors • => At Petascale a new approach is needed
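Why bisection bandwidth is the bottleneck, as a rough estimate (my own, not from the slides): each of the P tasks holds N^3/P elements and sends essentially all of them during the transpose; with tasks split evenly across the bisection, about half the total volume must cross it, giving a lower bound on the transpose time of

$$T_{\mathrm{transpose}} \;\gtrsim\; \frac{N^3\,w}{2B}$$

where w is the element size in bytes and B the bisection bandwidth. Note that P cancels out: adding processors does not shrink this term unless B grows with the machine.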
One possible solution Hybrid MPI/OpenMP approach, using shared-memory multiprocessor platforms and/or multi-core processors • Each MPI task spawns Q threads • Do MPI communication over P = N tasks, with Q threads working on one plane • Often does not work well, due to issues such as false sharing and thread-management overhead • Still limited to N x Q cores
2D decomposition [Figure: the N x N x N domain split into pencils along the x, y, and z axes]
2D decomposition • 2D decomposition: processors form a square grid P1 x P2 • Columns (pencils) instead of slabs • Local geometry: • Stage 1: N x K x M • Stage 2: K x N x M • Stage 3: K x M x N • K = N/P1, M = N/P2 • At each stage, transform along the local dimension (length N) • Software scaling limit removed! Now the limit is P <= N^2 (worked instance below)
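A worked instance with illustrative numbers (my own, assuming even divisibility): N = 4096 on P = 32768 cores with P1 = 32, P2 = 1024 gives

$$K = \frac{4096}{32} = 128, \qquad M = \frac{4096}{1024} = 4,$$

so each task holds a 4096 x 128 x 4 pencil at stage 1, about 2.1 million points. The P <= N^2 limit here would be 4096^2, roughly 16.8 million cores.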
Performance of DNS (2D) on Blue Gene at SDSC and IBM's T.J. Watson Lab, and on SDSC's Datastar [Figure: scaling results] VN: two processors per node; CO: one processor per node
Communication and performance: 2D vs. 1D • 1D: 1 all-to-all exchange over P tasks • 2D: 2 all-to-all exchanges in groups: • P2 groups of P1 tasks each • P1 groups of P2 tasks each • Which is better? • Message size shrinks as the exchange group grows • Some links go unused in each of the 2 transposes, while the total data volume is the same as in the 1D case • 1D decomposition ends up winning • But again: 1D is only good up to P = N
Choosing P1 and P2 • P1 x P2 = P • If P <= N, choose P1 = 1 (1D decomposition is better) • If P > N, in most cases it's best to choose P1 = P/N • Better communication pattern • In some cases, better cache use • Blue Gene observation: a sweet spot occurs when P1 equals Px of the torus, thanks to use of the wraparound links • How do I know Px on my machine? • Ask your consultant
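The rule of thumb above, as a tiny Fortran sketch (a hypothetical helper, not part of P3DFFT; assumes N divides P when P > N and ignores the torus sweet spot):

  program choose_grid
    implicit none
    integer :: p, n, p1, p2
    p = 8192                 ! illustrative task count
    n = 4096                 ! illustrative grid size
    if (p <= n) then
       p1 = 1                ! 1D decomposition is better
    else
       p1 = p / n            ! rule of thumb from this slide
    end if
    p2 = p / p1
    print *, 'proc_dims =', p1, p2    ! here: 2 x 4096
  end program choose_grid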
P3DFFT • Open source library for efficient, highly scalable 3D FFT on parallel platforms • Built on top of an optimized 1D FFT library • Currently ESSL or FFTW • In the future, more libraries • Uses 2D decomposition, with option for 1D. • Available at http://www.sdsc.edu/us/resources/p3dfft.php
P3DFFT: more features • Implements real-to-complex (R2C) and complex-to-real (C2R) 3D transforms • Written in Fortran 90 with MPI • Implemented as an F90 module • Single or double precision • Arbitrary dimensions • Handles many uneven cases (Ni does not have to be divisible by Pj) • Assumes Ni >= Pj • Can do either an in-place or an out-of-place transform • In the out-of-place case the input and output arrays must not overlap • Memory requirements: input and output arrays + 1 buffer
P3DFFT: Storage • R2C transform input: • Global: (Nx,Ny,Nz) real array • Local: (Nx,Ny/P1,Nz/P2) real array • R2C output: • Global: (Nx/2,Ny,Nz) complex array • Conjugate symmetry: F(N-i) = F*(i) • F(N/2+1) through F(N-1) are redundant • F(0) and F(N/2) are real – pack them in one pseudo-complex pair • Total N/2 complex storage units • Local: (Nx/(2P1),Ny/P2,Nz) complex • C2R: input and output interchanged from R2C
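Spelling out the counting behind the N/2 figure (standard DFT identities; the packing convention is the one just described): for real input $x_j$, the transform $F_k = \sum_{j=0}^{N-1} x_j e^{-2\pi i jk/N}$ satisfies $F_{N-k} = F_k^*$, so only $k = 0, \dots, N/2$ are independent. Of these, $k = 1, \dots, N/2-1$ are genuinely complex, and the two real numbers $F_0$ and $F_{N/2}$ pack into one more complex slot:

$$\left(\tfrac{N}{2} - 1\right) + 1 \;=\; \tfrac{N}{2} \ \text{complex storage units.}$$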
Building P3DFFT • Subdirectory build/ • Edit makefile for your architecture • Specify -DSINGLE_PREC or -DDOUBLE_PREC • Default is single precision • Specify -DESSL or -DFFTW • Choose P3DFFT_ROOT • Type ‘make’ – compiles and installs the library in the P3DFFT_ROOT directory • On SDSC platforms, it is already installed in /usr/local/apps/p3dfft • Subdirectory sample/ • Contains sample programs • For the manual, see the README file in the top-level directory
Using P3DFFT • ‘use p3dfft’ • Defines mytype – 4- or 8-byte reals • Initialization: p3dfft_setup(proc_dims,nx,ny,nz) • proc_dims(2) – two integers, P1 and P2 • For 1D decomposition, set P1 = 1 • Sets up the 2D grid of MPI communicators in rows and columns • Initializes local array dimensions • Local dimensions: get_dims(istart,iend,isize,Q) integer istart(3),iend(3),isize(3) • Set Q=1 for R2C input, Q=2 for R2C output
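A minimal initialization sketch using only the calls named on this slide (grid size and proc_dims are illustrative; error handling omitted):

  program p3dfft_setup_example
    use p3dfft
    implicit none
    include 'mpif.h'
    integer, parameter :: nx = 128, ny = 128, nz = 128
    integer :: proc_dims(2), nprocs, me, ierr
    integer :: istart(3), iend(3), isize(3)

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)

    proc_dims = (/ 1, nprocs /)       ! P1 = 1: 1D decomposition
    call p3dfft_setup(proc_dims, nx, ny, nz)

    call get_dims(istart, iend, isize, 1)   ! Q=1: R2C input layout
    if (me == 0) print *, 'local real-space size:', isize

    call MPI_Finalize(ierr)
  end program p3dfft_setup_example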
Using P3DFFT (2) • Now allocate your arrays • When ready to call the Fourier transform: • p3dfft_ftran_r2c(IN,OUT,flg_inplace) – forward transform, kernel exp(-2πikx/N) • p3dfft_btran_c2r(IN,OUT,flg_inplace) – backward (inverse) transform, kernel exp(+2πikx/N) • flg_inplace: boolean, true if doing an in-place transform
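A sketch of a full forward/backward round trip with normalization, essentially the sequence the lab slide below asks for (treat it as a skeleton; the 1/N^3 factor assumes unnormalized transforms):

  program p3dfft_roundtrip
    use p3dfft
    implicit none
    include 'mpif.h'
    integer, parameter :: nx = 128, ny = 128, nz = 128
    integer :: proc_dims(2), nprocs, ierr
    integer :: rs(3), re(3), rsize(3)      ! real-space bounds (Q=1)
    integer :: cs(3), ce(3), csize(3)      ! wave-space bounds (Q=2)
    real(mytype), allocatable :: u(:,:,:), u2(:,:,:)
    complex(mytype), allocatable :: uhat(:,:,:)

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    proc_dims = (/ 1, nprocs /)
    call p3dfft_setup(proc_dims, nx, ny, nz)
    call get_dims(rs, re, rsize, 1)
    call get_dims(cs, ce, csize, 2)

    allocate(u(rsize(1), rsize(2), rsize(3)))
    allocate(u2(rsize(1), rsize(2), rsize(3)))
    allocate(uhat(csize(1), csize(2), csize(3)))
    call random_number(u)                  ! any test field will do

    call p3dfft_ftran_r2c(u, uhat, .false.)     ! forward, out of place
    call p3dfft_btran_c2r(uhat, u2, .false.)    ! backward, out of place

    ! Unnormalized transforms return N^3 times the input
    u2 = u2 / (real(nx,mytype) * real(ny,mytype) * real(nz,mytype))
    print *, 'max round-trip error:', maxval(abs(u2 - u))

    call MPI_Finalize(ierr)
  end program p3dfft_roundtrip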
Implementation and Performance Each processor performs a (large) number of 1D FFTs • Memory-access stride is a crucial parameter • FFT in X, stride 1: 28% of peak on Datastar • Transpose in X & Y • FFT in Y, stride Nx/P1: 25% of peak on Datastar • Transpose in Y & Z • FFT in Z, stride (Nx Ny)/P: 10% of peak on Datastar • The alternative is to rearrange the data in memory so that the stride is 1, at the cost of an extra pass over memory • ESSL is about 3 times faster than FFTW
Future work • Include more array distribution schemes • Expand options for memory ordering • Include more transform types • Derivatives • Complex-to-complex • C interface • Convenience functions • Break down in stages • Expand choice of 1D FFT libraries • Questions? Comments? dmitry@sdsc.edu
Lab assignments • Write a program that would: • initialize P3DFFT • allocate input and output 3D arrays (choose a global size that's not too big: 128^3 or 256^3 is reasonable) • initialize the arrays to your favorite functional form, such as a sine wave • Do a forward transform • Compare the data with what you expect • Do a backward transform • After multiplying by a normalization factor, compare the results with the original array • Time the performance of the computational part • Run at a few grid sizes and processor counts, and study performance • Do the same for an in-place transform (output overwrites input) If you get lost, look in the sample/ directory. But try not to look!