Scalable Parallelization of CPAIMD using Charm++ Ramkumar Vadali (on behalf of Parallel Programming Lab, UIUC)
CPAIMD • CPAIMD (Car-Parrinello Ab Initio Molecular Dynamics) is used to study key chemical/biological processes • Parallelization is made difficult by • Multiple 3-D FFTs • Phases with a low computation-to-communication ratio • MPI-based solutions suffer from limited scalability
Charm++ • Uses the approach of processor virtualization • Divide the work into virtual processors (VPs) • Typically many more VPs than physical processors • The runtime schedules each VP for execution • Advantages: • Computation and communication can be overlapped (between VPs) • The number of VPs can be independent of the number of processors • Others: dynamic load balancing, checkpointing, etc.
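As a minimal sketch (not from the talk) of what virtualization looks like in Charm++ code: a 1-D chare array is created with many more elements (VPs) than physical processors, and the runtime maps and schedules them. The module and class names (virt, Worker) and the factor of 8 VPs per processor are illustrative choices, not part of the CPAIMD code.

    // virt.ci -- Charm++ interface file
    mainmodule virt {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry [reductiontarget] void done();
      };
      array [1D] Worker {
        entry Worker();
        entry void compute();
      };
    };

    // virt.C
    #include "virt.decl.h"

    CProxy_Main mainProxy;                  // readonly: set once in Main, visible on all processors

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg *m) {
        delete m;
        mainProxy = thisProxy;
        int nVP = 8 * CkNumPes();           // many more VPs than physical processors
        CProxy_Worker workers = CProxy_Worker::ckNew(nVP);
        workers.compute();                  // broadcast: every VP gets its share of work
      }
      void done() { CkExit(); }             // called once all VPs have contributed
    };

    class Worker : public CBase_Worker {
     public:
      Worker() {}
      void compute() {
        // per-VP work goes here; while one VP waits on communication,
        // the runtime schedules other VPs on the same processor (overlap)
        contribute(CkCallback(CkReductionTarget(Main, done), mainProxy));
      }
    };

    #include "virt.def.h"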
Orthonormalization • At the end of every iteration, after updating the electron configuration • Need to compute (from the states) • a “correlation” matrix S, where S[i,j] depends on the entire data of states i and j • its transform T • then update the state values • The computation of S has to be distributed • Compute S[i,j,p], where p is the plane number • Sum over p to get S[i,j] • The actual conversion S -> T is sequential
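A hedged sketch of the distributed computation described above. The per-plane inner-product form of S is an assumption (the slide only states that S[i,j] depends on the full data of states i and j); what is from the slide is that per-plane partial results S[i,j,p] are summed over p:

\[
S_{ij} \;=\; \sum_{p} S_{ij}^{(p)}, \qquad
S_{ij}^{(p)} \;=\; \sum_{g \,\in\, \text{plane } p} \psi_i^{*}(g)\, \psi_j(g),
\]

so each plane p computes its partial contribution S[i,j,p] locally, a reduction over p assembles the full matrix S, and only the subsequent S -> T conversion runs sequentially.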
3-D FFT Implementation • “Dense” 3-D FFT • “Sparse” 3-D FFT (in g-space only points inside the cutoff sphere are non-zero, so many planes are only partially filled)
Parallel FFT Library • Slab-based parallelization • We do not re-implement the sequential routines • We utilize the 1-D and 2-D FFT routines provided by FFTW • Allows for • multiple 3-D FFTs simultaneously • multiple data sets within the same set of slab objects • Useful because 3-D FFTs are used frequently in CP computations
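A hedged illustration of the "reuse FFTW" point above: each slab object can simply hold an FFTW plan for its plane rather than re-implementing the transform. This sketch uses the FFTW3 API; the actual library may use a different FFTW version and plan setup, and the SlabPlan name is made up here.

    #include <fftw3.h>

    // One x-y plane of the 3-D grid, transformed in place with a pre-built plan.
    struct SlabPlan {
      int nx, ny;
      fftw_complex *data;     // nx x ny plane, row-major
      fftw_plan plan;

      SlabPlan(int nx_, int ny_) : nx(nx_), ny(ny_) {
        data = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * nx * ny);
        // 2-D FFT of the plane; the remaining 1-D FFTs along Z happen after the transpose
        plan = fftw_plan_dft_2d(nx, ny, data, data, FFTW_FORWARD, FFTW_ESTIMATE);
      }
      void run() { fftw_execute(plan); }
      ~SlabPlan() { fftw_destroy_plan(plan); fftw_free(data); }
    };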
Usage

mainmodule app {
    extern module fftlib;
    readonly CProxy_SrcArray srcArray;
    readonly CProxy_DestArray destArray;

    array [1D] SrcArray: NormalSlabArray {
        entry SrcArray(NormalFFTinfo &conf);
        entry void doIt(int id);
    };
    array [1D] DestArray: NormalSlabArray {
        entry DestArray(NormalFFTinfo &conf);
        entry void doIt(int id);
    };
    mainchare main {
        entry main(CkArgMsg *m);
    };
};
Usage (contd.)

main::main(CkArgMsg *m) {
    int y;
    int dim = 16;
    int srcDim[2]  = {dim, dim};
    int destDim[2] = {dim, dim};

    srcArray  = CProxy_SrcArray::ckNew();
    destArray = CProxy_DestArray::ckNew();

    NormalFFTinfo src_info(srcArray, destArray, srcDim, destDim, true, NULL),
                  dest_info(srcArray, destArray, srcDim, destDim, false, NULL);

    for (y = 0; y < dim; y++) {
        destArray(y).insert(dest_info);
        srcArray(y).insert(src_info);
    }
    destArray.doneInserting();
    srcArray.doneInserting();

    srcArray.doFFT(0, 0);   // That is it!
}
AMPI Interface for the Library • AMPI (Adaptive MPI): runs MPI programs on top of Charm++ • MPI ranks are implemented as user-level threads • Dynamic load balancing is still available • Conversion from MPI to AMPI is easy • Only need to remove global variables • Has been done for large legacy codes • Rocket simulation (RocFlo, RocSolid)
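A hedged illustration of the "remove global variables" step: AMPI ranks are user-level threads that may share an address space, so per-rank state must move out of globals into something each rank owns. All names here are hypothetical.

    // Before (plain MPI): a true global -- unsafe once several AMPI ranks
    // (threads) run in the same process.
    //   static int g_nsteps;

    // After: formerly-global, per-rank state is bundled into an object that
    // each rank creates and passes around explicitly.
    struct RankState {
      int nsteps;
      double dt;
    };

    void integrate(RankState &st) {
      for (int s = 0; s < st.nsteps; s++) {
        // ... advance the simulation by st.dt using only state reachable from st ...
      }
    }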
Three Phases • Initialize the library: init_fftlib(MPI_Comm _srcComm, MPI_Comm _destComm, int *srcMap, int *destMap, int size[3]); • Call start on each processor: start_fft(complex *srcData, int nplanes); • Wait for the operation to finish: wait_fft(complex *destData, int nplanes);
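A hedged usage sketch of the three-phase interface above. Only the three function signatures come from the slide; the header name fftlib.h, the complex type coming from it, and the block plane-to-rank map are assumptions made for illustration.

    #include <mpi.h>
    #include "fftlib.h"   // assumed header declaring init_fftlib/start_fft/wait_fft and complex

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int nprocs;
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      int size[3] = {64, 64, 64};              // grid dimensions
      int nplanes = size[0] / nprocs;          // planes owned by this rank (assume it divides evenly)

      // Assumed block map: plane p is owned by rank p / nplanes
      int *srcMap  = new int[size[0]];
      int *destMap = new int[size[0]];
      for (int p = 0; p < size[0]; p++)
        srcMap[p] = destMap[p] = p / nplanes;

      MPI_Comm srcComm, destComm;
      MPI_Comm_dup(MPI_COMM_WORLD, &srcComm);
      MPI_Comm_dup(MPI_COMM_WORLD, &destComm);

      init_fftlib(srcComm, destComm, srcMap, destMap, size);   // phase 1: initialize

      complex *src  = new complex[nplanes * size[1] * size[2]];
      complex *dest = new complex[nplanes * size[1] * size[2]];
      // ... fill src with this rank's planes ...

      start_fft(src, nplanes);      // phase 2: start the 3-D FFT
      // ... other computation can overlap with the transform here ...
      wait_fft(dest, nplanes);      // phase 3: block until the result is available

      MPI_Finalize();
      return 0;
    }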
Problem Decomposition • Plane decomposition • When P ≤ total # of planes • One or more planes assigned to each processor • A map is used to determine the assignment • The 3-D FFT is performed by doing a 2-D FFT on each plane (the X and Y dimensions), then a transpose, then 1-D FFTs along the Z dimension
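The last bullet above is just the separability of the discrete Fourier transform; writing it out (standard DFT identity, not from the slide):

\[
\hat{x}(k_x,k_y,k_z)
  = \sum_{z=0}^{N_z-1} e^{-2\pi i\, k_z z / N_z}
    \underbrace{\left[\sum_{x=0}^{N_x-1}\sum_{y=0}^{N_y-1}
      e^{-2\pi i\,(k_x x / N_x + k_y y / N_y)}\, x(x,y,z)\right]}_{\text{2-D FFT of plane } z}
\]

Each plane z can do its 2-D FFT independently; after the transpose, the outer sum becomes a set of independent 1-D FFTs along Z, one per (k_x, k_y) line.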
Problem Decomposition (contd.) • Pencil decomposition • When P > total # of planes • Each processor is assigned a “pencil” (a line of the grid), allowing scaling well beyond the number of planes • Perform each 1-D FFT sequentially, then transpose • 3 sets of 1-D FFTs • 2 transposes
Multiple 3-D FFTs • Create a new communicator for each 3-D FFT • Initialize the library once for each • AMPI internally creates the data structures for doing the 3-D FFT
Communication Optimization • The 3-D FFT involves several all-to-all transposes • All-to-all communication can be optimized • The main cost is the software overhead of sending a large number of messages • Message combining can be used to optimize all-to-all personalized communication (AAPC) • Implemented as part of the communication library in AMPI/Charm++
Communication Optimization (contd.) • Organize processors in a 2-D (virtual) mesh • A message from (x1,y1) to (x2,y2) goes via (x1,y2) • Each processor sends about 2·√P combined messages instead of P − 1
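As illustrative arithmetic (numbers not from the talk): with P = 1024 processors arranged as a 32 × 32 virtual mesh, each processor sends 2 · (32 − 1) = 62 combined messages instead of 1023 direct ones.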
Summary • We achieved scalable parallelization of CPAIMD • Using the parallel FFT library eases programming significantly • It can be used through the AMPI/Charm++ interface