CS 471 Final Project 2d Advection/Wave Equation Using Fourier Methods December 10, 2003 Jose L. Rodriguez jlrod@cs.unm.edu
Project Description • Use a Spectral Method (Fourier Method) for the equation: • Use the JST Runge-Kutta Time Integrator for each time step.
Algorithm • For each time step that we take, we do s substages:
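The substage formula itself is not reproduced in this transcript. As a minimal sketch, assuming the commonly used low-storage form of the JST (Jameson-Schmidt-Turkel) scheme, u^(k) = u^n + alpha_k * dt * R(u^(k-1)) with alpha_k = 1/(s-k+1), one time step might look like the code below. The names `jst_rk_step`, `rhs_fn` and `decay_rhs` are hypothetical, and the coefficients are an assumption rather than something taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Right-hand side callback: writes R(u) into out (hypothetical interface). */
typedef void (*rhs_fn)(const double *u, double *out, size_t n);

/* One time step of an s-stage low-storage Runge-Kutta scheme
 * (assumed JST form, not taken from the slides):
 *   u^(0)   = u^n
 *   u^(k)   = u^n + alpha_k * dt * R(u^(k-1)),  alpha_k = 1/(s-k+1),  k = 1..s
 *   u^(n+1) = u^(s)
 * u holds u^n on entry and u^(n+1) on exit; u0 and r are scratch arrays of length n. */
void jst_rk_step(double *u, double *u0, double *r, size_t n,
                 int s, double dt, rhs_fn rhs)
{
    assert(s >= 1);
    for (size_t i = 0; i < n; ++i)
        u0[i] = u[i];                      /* save u^n */
    for (int k = 1; k <= s; ++k) {
        double alpha = 1.0 / (double)(s - k + 1);
        rhs(u, r, n);                      /* r = R(u^(k-1)) */
        for (size_t i = 0; i < n; ++i)
            u[i] = u0[i] + alpha * dt * r[i];
    }
}

/* Example right-hand side for experimentation: linear decay du/dt = -u. */
void decay_rhs(const double *u, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = -u[i];
}
```

In the project itself, the right-hand side would be the spectral evaluation of the spatial derivatives (an fft, a multiplication by the wavenumbers, and an ifft), so each substage costs one forward and one inverse transform.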
Code Development • Develop serial C code based on the given Matlab code, using the FFTW libraries for the fft and ifft calls • Very straightforward • The code was verified simply by comparing its output with the Matlab result • Develop parallel C code based on the serial C code • The FFTW libraries provide fft and ifft calls that do all the MPI calls for you. • The tricky part of this development was placing the data correctly on each processor for the fft and ifft calls. • The code was again verified by comparison with the Matlab result
Usage of FFTW Libraries in Parallel: Function Calls Note: message passing is transparent to the user
Usage of FFTW Libraries in Parallel: MPI Data Layout • The transform data used by the MPI FFTW routines is distributed: a distinct portion of it resides with each process involved in the transform. This allows the transform to be parallelized, for example, over a cluster of workstations, each with its own separate memory, so that you can take advantage of the total memory of all the processors you are parallelizing over. • In particular, the array is divided according to the rows (first dimension) of the data: each process gets a subset of the rows of the data. (This is sometimes called a "slab decomposition.") One consequence of this is that you can't take advantage of more processors than you have rows (e.g. a 64x64x64 matrix can use at most 64 processors). This isn't usually much of a limitation, however, as each processor needs a fair amount of data in order for the parallel-computation benefits to outweigh the communication costs. Taken from the FFTW website/documentation
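To illustrate the slab decomposition described above, here is a small sketch of an even block distribution of rows over processes. It is only an illustration: the library chooses its own distribution, which should be queried from it rather than assumed, and the names `slab` and `slab_for_rank` are hypothetical.

```c
#include <assert.h>

/* Rows owned by one process in a simple block ("slab") distribution. */
typedef struct {
    int local_nx;       /* number of rows on this process      */
    int local_x_start;  /* global index of the first local row */
} slab;

/* Split nx rows over nprocs processes as evenly as possible; the first
 * (nx % nprocs) ranks get one extra row. With nprocs > nx, the surplus
 * ranks own zero rows, which is the "no more processors than rows" limit. */
slab slab_for_rank(int nx, int nprocs, int rank)
{
    assert(nx > 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    int base = nx / nprocs;
    int extra = nx % nprocs;
    slab s;
    s.local_nx = base + (rank < extra ? 1 : 0);
    s.local_x_start = rank * base + (rank < extra ? rank : extra);
    return s;
}
```

For a 256-row array on 8 processes, this sketch gives every rank 32 rows. With 4 rows on 8 processes, ranks 4 through 7 would own nothing.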
Usage of FFTW Libraries in Parallel: MPI Data Layout These calls are needed to create the fft and ifft plans and to find out how much memory each process must allocate
Usage of FFTW Libraries in Parallel: MPI Data Layout Using Row-Major Format ilocal_x_start tells us which row of the global 2d array the local data starts at, and ilocal_nx tells us how many rows we have on the current processor.
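The row-major indexing can be sketched as follows, assuming each process stores its slab as a contiguous array of local_nx rows with ny columns each. The helper names are hypothetical, and the fill value (the element's global linear index) is chosen only to make the mapping visible.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i_local, j) in a row-major local slab with ny columns. */
size_t local_offset(int i_local, int j, int ny)
{
    assert(ny > 0 && i_local >= 0 && j >= 0 && j < ny);
    return (size_t)i_local * (size_t)ny + (size_t)j;
}

/* Global row index corresponding to local row i_local. */
int global_row(int i_local, int local_x_start)
{
    return local_x_start + i_local;
}

/* Fill the local slab so that each element holds its global linear index,
 * i.e. global_row * ny + j. */
void fill_with_global_index(double *data, int local_nx, int local_x_start, int ny)
{
    for (int i = 0; i < local_nx; ++i)
        for (int j = 0; j < ny; ++j)
            data[local_offset(i, j, ny)] =
                (double)(global_row(i, local_x_start) * ny + j);
}
```

Initial conditions are set the same way: loop over the local rows, convert each to its global row with the start offset, and evaluate the initial field at that global grid point.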
Parallel Results • Two versions written • A Non-Efficient version that is not optimized for FFTw MPI calls: • An extra work array is not used. • An extra un-transposing of data is done prior to coming out of fft calls. • An Efficient version that is optimized for FFTw MPI calls: • An extra work array is used • Data is left transposed so that an extra communication step of un-transposing data is not done
We begin to see some scaling; however, efficiency starts to taper off, indicating that much of the time is spent in communication.
Overall, we see the same trend as N increases: there is some scaling as the number of processors increases, but the speedup starts to flatten and the efficiency steadily decreases.
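The scaling and efficiency figures discussed here are presumably the usual ones: speedup is the serial time divided by the parallel time, and efficiency is the speedup divided by the number of processors. A minimal sketch under that assumption (the function names are hypothetical, and no measured timings from the project are reproduced):

```c
#include <assert.h>

/* Speedup S = T_serial / T_parallel. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency E = S / p; 1.0 would be ideal scaling. */
double efficiency(double t_serial, double t_parallel, int nprocs)
{
    assert(nprocs > 0);
    return speedup(t_serial, t_parallel) / (double)nprocs;
}
```

As a made-up example, a run that takes 8 s serially and 2 s on 8 processors has a speedup of 4 and an efficiency of 0.5: the run is faster, but half of the aggregate processor time goes to something other than useful computation.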
N=256, 10 Iterations The Sea of Black for the Non-Efficient Version
Communication goes on between every pair of processors with MPI_Sendrecv, since each processor needs data from every other processor. We can actually see here when an fft is being performed.
8 processors and 16 processors: same trend of communication.
The sea of white for the Efficient Version. N=256, 10 Iterations
The Efficient Version uses MPI_Alltoall for its communication between all processors.
Again, we can see when an fft call is being performed: it shows up as a white bar for each process.
8 processors and 16 processors: same trend of communication.
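Both versions exchange data between every pair of processes, because a slab-decomposed 2d fft needs a global transpose. The sketch below is a simple counting model of one such transpose for an N x N array on p processes, assuming p divides N evenly. It does not describe how the library actually schedules its messages, and the function names are hypothetical.

```c
#include <assert.h>

/* Number of other processes each process must exchange a block with. */
int messages_per_process(int nprocs)
{
    assert(nprocs > 0);
    return nprocs - 1;
}

/* Elements in each exchanged block: an (n/p) x (n/p) tile of the array.
 * Assumes nprocs divides n evenly. */
long elements_per_message(int n, int nprocs)
{
    assert(n > 0 && nprocs > 0 && n % nprocs == 0);
    long tile = n / nprocs;
    return tile * tile;
}

/* Total elements one process sends during a single transpose. */
long elements_sent_per_process(int n, int nprocs)
{
    return (long)messages_per_process(nprocs) * elements_per_message(n, nprocs);
}
```

For N=256, this model gives 7 messages of 1024 elements per process on 8 processes and 15 messages of 256 elements on 16 processes. The messages become more numerous and smaller as p grows, so per-message latency takes up a growing share of the run time.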
Conclusions • A lot of time is spent in communication since each process communicates with every other process. • Efficiency goes down as a result: as the number of processes increases for a given size N, more communication is needed. • We saw some scaling, but this starts to drop off as the number of processors increases (efficiency issues). • Time Spent on this project • Code Development: ~8 hours with debugging • Data Collection: ~2 days • Overall: Quite a bit of time