Parallel FFTs in 3D: Testing different implementation schemes
Luis Acevedo-Arreguin, Benjamin Byington, Erinna Chen, Adrienne Traxler
Department of Applied Mathematics and Statistics, UC Santa Cruz

Introduction

Motivation
• Many problems in geophysics and astrophysics can be modeled with the Navier-Stokes equations, e.g. the Earth’s magnetic field, mantle convection, the solar convection zone, and ocean circulation.
• Various numerical methods can be used to solve these problems; because of geometry and symmetry, many numerical codes employ a spectral method for the fluid dynamics.
  – Solutions are written as sums of periodic functions and their corresponding weights (spectral coefficients).
  – The solutions are continuous, so values can be obtained at positions that do not lie on the grid.
• The computationally challenging part is evaluating the nonlinear terms in the equations.
  – Working directly with spectral coefficients, evaluating the nonlinear terms in 3D costs O(n³).
  – If the nonlinear terms are evaluated in grid (real) space, the cost is reduced to O(n²).
  – Fast Fourier transforms (FFTs), at O(n² log n), make it numerically tractable to transform frequently from real to spectral space (forward) and from spectral to real space (backward).

Figure 1: (left) Magnetic field polarity reversal caused by fluid motion in the Earth’s core (G. Glatzmaier, UCSC). (right) Layering phenomenon caused by double diffusion, similar to thermohaline staircases in the ocean (S. Stellmach, UCSC).

The Parallel Problem
• If the code is serial, FFTW takes all the work out of optimizing the FFT.
• In general we want large domains, in order to resolve the fluid dynamics at realistic geophysical/astrophysical parameters, and that means running on large distributed clusters.
• We would like to keep using FFTW because it is the fastest algorithm (in the West): the autotuning is done for us.
• Multiple parallel schemes can be employed:
  – Transpose-based: data along a single direction is kept in-processor; once that FFT is complete, the domain is re-decomposed along another direction.
  – Fully parallel: leave the data in place and send pieces to other processors as necessary.
• Which is faster? It is not obvious at the outset.

Schemes

1. A naïve parallel-FFT
• The data are decomposed into vectors along the three directions. Each vector, whether a row, a column, or a stack, is sent to a slave processor. A master processor distributes tasks, receives the 1D FFT computed for each vector by the slave processors, and sends more work to the processors that have finished their previous assignments.
• This scheme aims to compute 3D FFTs of datasets that exhibit clustering, i.e. no homogeneity in density, so some regions require more intensive computation than others.
• The scheme is tested on NERSC Franklin using the fftw 3.1.1 module.

Figure 2: Schematic representation of a naïve implementation of a 3D FFT. Rows, columns, and finally stacks are sent to slave processors, which perform 1D FFTs on the received vectors.

2. Transpose-based parallel FFTs
• The transpose-based parallel FFT uses FFTW to perform a 2D FFT (an optimized in-processor implementation) and then redistributes all the data for the FFT in the third dimension.
• The optimal communication scheme for the transpose before the final FFT is not clear:
  a) Complete all 2D planes in-processor, call MPI_ALLTOALL to transpose the data, block until the entire transposed dataset has arrived, then perform the final FFT.
  b) Complete a single 2D plane in-processor, call MPI_ALLTOALL to transpose each plane (overlapping communication with computation?), block until the entire transposed dataset has arrived, then perform the final FFT.
  c) Complete a single 2D plane in-processor, use an explicit asynchronous communication scheme, and perform the final FFT as the data become available.
• A minimal code sketch of scheme (a) is given below.

Figure 3: The starting setup: the data are initially decomposed into a stack of x-y planes, so each processor can do a local FFT in two dimensions. The data are then transposed so that each processor holds a set of y-z slices, allowing the final transform (in the z direction) to be completed. The reverse procedure requires a similar sequence (transform, transpose, complete the transform).
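To make the transpose concrete, here is a minimal sketch of scheme (a) in C, using FFTW for the local transforms and a single blocking MPI_ALLTOALL for the redistribution. This is not the poster’s actual code: the grid sizes NX, NY, NZ, the slab layout, and the use of a complex-to-complex transform are illustrative assumptions, and NX and NZ are assumed divisible by the number of MPI ranks.

```c
/* Sketch of transpose-based scheme (a): 2-D FFTs on the local x-y planes,
 * one blocking MPI_Alltoall to redistribute the data, then 1-D FFTs in z.
 * Illustrative assumptions: complex-to-complex transform, z-slab input,
 * NX and NZ divisible by the number of ranks.  Build: mpicc ... -lfftw3 -lm */
#include <mpi.h>
#include <fftw3.h>

#define NX 64
#define NY 64
#define NZ 64

int main(int argc, char **argv)
{
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    int lz = NZ / nproc;   /* local z-planes in the initial slab          */
    int lx = NX / nproc;   /* local x-values owned after the transpose    */

    fftw_complex *slab   = fftw_malloc(sizeof(fftw_complex) * lz * NY * NX);
    fftw_complex *sbuf   = fftw_malloc(sizeof(fftw_complex) * lz * NY * NX);
    fftw_complex *rbuf   = fftw_malloc(sizeof(fftw_complex) * lx * NY * NZ);
    fftw_complex *pencil = fftw_malloc(sizeof(fftw_complex) * lx * NY * NZ);

    /* ... fill slab[(z*NY + y)*NX + x] with the local part of the field ... */

    /* (1) 2-D FFT of every local x-y plane, done in-processor by FFTW.     */
    fftw_plan p2d = fftw_plan_many_dft(2, (int[]){NY, NX}, lz,
                                       slab, NULL, 1, NY * NX,
                                       slab, NULL, 1, NY * NX,
                                       FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p2d);

    /* (2) Pack: the block destined for rank r holds our lz planes, all y,
     *     and the x-range that rank r will own after the transpose.        */
    for (int r = 0; r < nproc; ++r)
        for (int z = 0; z < lz; ++z)
            for (int y = 0; y < NY; ++y)
                for (int xl = 0; xl < lx; ++xl) {
                    int src = (z * NY + y) * NX + (r * lx + xl);
                    int dst = ((r * lz + z) * NY + y) * lx + xl;
                    sbuf[dst][0] = slab[src][0];
                    sbuf[dst][1] = slab[src][1];
                }

    /* (3) Scheme (a): one blocking all-to-all after all planes are done.   */
    MPI_Alltoall(sbuf, 2 * lz * NY * lx, MPI_DOUBLE,
                 rbuf, 2 * lz * NY * lx, MPI_DOUBLE, MPI_COMM_WORLD);

    /* (4) Unpack so that z is contiguous, then finish with 1-D FFTs in z.  */
    for (int s = 0; s < nproc; ++s)
        for (int z = 0; z < lz; ++z)
            for (int y = 0; y < NY; ++y)
                for (int xl = 0; xl < lx; ++xl) {
                    int src = ((s * lz + z) * NY + y) * lx + xl;
                    int dst = (xl * NY + y) * NZ + (s * lz + z);
                    pencil[dst][0] = rbuf[src][0];
                    pencil[dst][1] = rbuf[src][1];
                }

    fftw_plan p1d = fftw_plan_many_dft(1, (int[]){NZ}, lx * NY,
                                       pencil, NULL, 1, NZ,
                                       pencil, NULL, 1, NZ,
                                       FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p1d);

    fftw_destroy_plan(p2d);
    fftw_destroy_plan(p1d);
    fftw_free(slab); fftw_free(sbuf); fftw_free(rbuf); fftw_free(pencil);
    MPI_Finalize();
    return 0;
}
```

Schemes (b) and (c) would replace the single MPI_Alltoall with per-plane all-to-alls, or with explicit MPI_Isend/MPI_Irecv pairs, so that the communication for finished planes can overlap the transforms still in progress.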
3. Parallel version of FFTW 3.2, alpha
• The most recent version of FFTW includes an implementation of a 3D parallel FFT.
• It is unclear how this is implemented: whether it performs a fully parallel FFT (3D domain decomposition) or a transpose-based FFT.
• However, if it autotunes the implementation of the 3D parallel FFT, it allows for greater portability and hides the cluster topology.
• It potentially provides a parallel implementation in which the end user needs no knowledge of parallelism in order to compute the FFT, mirroring the serial case, where one simply uses FFTW on any computer. A minimal usage sketch is given after the acknowledgments.

Acknowledgments: Thanks to Professor Nicholas Brummell from UC Santa Cruz for his help on FFTs after class, to Professor James Demmel from UC Berkeley for his teaching on parallel computing, and to Vasily Volkov for his feedback on our homework.
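For comparison with Section 3, below is a minimal sketch of handing the entire 3D transform to FFTW’s own MPI interface. The call names follow the MPI interface as documented for FFTW 3.3, which grew out of the 3.2 alpha releases, so the alpha’s names may differ slightly; the grid sizes are again illustrative.

```c
/* Sketch of letting FFTW's MPI interface handle the whole 3-D parallel FFT.
 * Interface names as documented for FFTW 3.3 (descended from the 3.2 alpha).
 * Build: mpicc ... -lfftw3_mpi -lfftw3 -lm */
#include <mpi.h>
#include <fftw3-mpi.h>

#define NX 64
#define NY 64
#define NZ 64

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* FFTW chooses the decomposition: this rank owns local_n0 planes of the
     * first dimension, starting at global index local_0_start.              */
    ptrdiff_t local_n0, local_0_start;
    ptrdiff_t alloc_local = fftw_mpi_local_size_3d(NX, NY, NZ, MPI_COMM_WORLD,
                                                   &local_n0, &local_0_start);

    fftw_complex *data = fftw_malloc(sizeof(fftw_complex) * alloc_local);

    /* Plan once (this is where FFTW autotunes), then execute as often as the
     * time-stepping loop requires.                                           */
    fftw_plan plan = fftw_mpi_plan_dft_3d(NX, NY, NZ, data, data,
                                          MPI_COMM_WORLD,
                                          FFTW_FORWARD, FFTW_MEASURE);

    /* ... fill data[(xl*NY + y)*NZ + z] for xl = 0..local_n0-1 ...           */
    fftw_execute(plan);

    fftw_destroy_plan(plan);
    fftw_free(data);
    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
}
```

The planner call (here with FFTW_MEASURE) is where the autotuning happens, so the plan should be created once and reused every time step; note that planning with FFTW_MEASURE overwrites the array, which is why the data are filled in after the plan is created.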