
Scalable Parallelization of CPAIMD using Charm++



Presentation Transcript


  1. Scalable Parallelization of CPAIMD using Charm++ Ramkumar Vadali (on behalf of Parallel Programming Lab, UIUC)

  2. CPAIMD
  • Studying key chemical/biological processes
  • Made difficult by
    • Multiple 3-D FFTs
    • Phases with low computation/communication ratio
  • MPI-based solutions suffer from limited scalability

  3. Charm++
  • Uses the approach of virtualization
    • Divide the work into virtual processors (VPs)
    • Typically many more VPs than physical processors
    • The runtime schedules each VP for execution
  • Advantages:
    • Computation and communication can be overlapped (between VPs)
    • The number of VPs can be chosen independently of the number of processors
    • Others: load balancing, checkpointing, etc.
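
To make the VP idea concrete, here is a minimal over-decomposition sketch in Charm++ (hypothetical module and class names, not the CPAIMD code): a 1-D chare array with many more elements than processors, where each element handles a slice of the work and contributes to a reduction when finished.

    // vp.ci (hypothetical example)
    mainmodule vp {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry [reductiontarget] void done();
      };
      array [1D] Worker {
        entry Worker();
        entry void compute();
      };
    };

    // vp.C (hypothetical example)
    #include "vp.decl.h"

    /*readonly*/ CProxy_Main mainProxy;

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg *m) {
        delete m;
        mainProxy = thisProxy;
        int nVP = 8 * CkNumPes();                 // many more VPs than processors
        CProxy_Worker workers = CProxy_Worker::ckNew(nVP);
        workers.compute();                        // broadcast to every VP
      }
      void done() { CkExit(); }                   // reached once every VP has contributed
    };

    class Worker : public CBase_Worker {
     public:
      Worker() {}
      Worker(CkMigrateMessage *m) {}
      void compute() {
        // Each VP works on its own slice of the problem; while one VP waits
        // for communication, the runtime schedules other VPs on the same processor.
        contribute(CkCallback(CkReductionTarget(Main, done), mainProxy));
      }
    };

    #include "vp.def.h"

Because many Workers map onto each processor, one VP's communication latency can be hidden by another VP's computation, and the same decomposition runs unchanged on any number of processors.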

  4. CPAIMD Iteration

  5. Parallel Implementation

  6. Orthonormalization
  • At the end of every iteration, after updating the electron configuration
  • Need to compute (from the states)
    • a “correlation” matrix S, where S[i,j] depends on the entire data of states i and j
    • its transform T
  • Update the values
  • The computation of S has to be distributed
    • Compute S[i,j,p], where p is the plane number
    • Sum over p to get S[i,j]
  • The actual conversion S -> T is sequential
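
In symbols, a sketch of the decomposition described above (assuming the usual overlap-matrix form, with \psi_i(g) the coefficients of state i and the g-space points partitioned into planes p):

    S_{ij} = \sum_{p} S^{(p)}_{ij}, \qquad
    S^{(p)}_{ij} = \sum_{g \in \text{plane } p} \psi_i^{*}(g)\, \psi_j(g)

Each plane contributes a partial matrix S^{(p)}; the partial matrices are summed in a reduction to assemble S, and the conversion S -> T is then performed sequentially on the assembled matrix.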

  7. Orthonormalization

  8. Computation/Communication Overlap

  9. 3-D FFT Implementation: “Dense” 3-D FFT vs. “Sparse” 3-D FFT

  10. Parallel FFT Library
  • Slab-based parallelization
  • We do not re-implement the sequential routines
    • Utilize the 1-D and 2-D FFT routines provided by FFTW
  • Allows for
    • Multiple 3-D FFTs simultaneously
    • Multiple data sets within the same set of slab objects
  • Useful, as 3-D FFTs are frequently used in CP computations

  11. Multiple Parallel 3-D FFTs

  12. Multiple Data Sets

  13. Usage

    // Interface (.ci) file of an application that uses the FFT library
    mainmodule app {
      extern module fftlib;

      readonly CProxy_SrcArray srcArray;
      readonly CProxy_DestArray destArray;

      array [1D] SrcArray : NormalSlabArray {
        entry SrcArray(NormalFFTinfo &conf);
        entry void doIt(int id);
      };
      array [1D] DestArray : NormalSlabArray {
        entry DestArray(NormalFFTinfo &conf);
        entry void doIt(int id);
      };

      mainchare main {
        entry main(CkArgMsg *m);
      };
    };

  14. Usage (contd.)

    main::main(CkArgMsg *m)
    {
      int y;
      int dim = 16;
      int srcDim[2]  = {dim, dim};
      int destDim[2] = {dim, dim};

      // Create the (initially empty) source and destination slab arrays
      srcArray  = CProxy_SrcArray::ckNew();
      destArray = CProxy_DestArray::ckNew();

      NormalFFTinfo src_info (srcArray, destArray, srcDim, destDim, true,  NULL),
                    dest_info(srcArray, destArray, srcDim, destDim, false, NULL);

      // Insert one slab object per plane
      for (y = 0; y < dim; y++) {
        destArray(y).insert(dest_info);
        srcArray(y).insert(src_info);
      }
      destArray.doneInserting();
      srcArray.doneInserting();

      srcArray.doFFT(0, 0); // That is it!
    }

  15. AMPI Interface for Library
  • AMPI (Adaptive MPI): for running MPI programs using Charm++
    • Uses user-level threads
    • Still have dynamic load balancing
  • Conversion from MPI to AMPI is easy
    • Need to remove global variables
    • Has been done for large legacy codes
      • Rocket simulation (RocFlo, RocSolid)
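
Privatizing a global is usually a mechanical change; a minimal sketch of the before/after (hypothetical variable and function names):

    #include <mpi.h>

    // Before: a file-scope global, unsafe under AMPI because several
    // user-level threads (virtual MPI ranks) share one address space.
    //   static double g_energy;

    // After: the former global lives in a per-rank struct, allocated in
    // main() and passed explicitly to the routines that need it.
    struct RankState {
        double energy;
    };

    static void accumulate(RankState *st, double de) { st->energy += de; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        RankState st = { 0.0 };   // one instance per (virtual) rank
        accumulate(&st, 1.0);
        MPI_Finalize();
        return 0;
    }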

  16. Three Phases
  • Initialize the library:
    init_fftlib(MPI_Comm _srcComm, MPI_Comm _destComm, int *srcMap, int *destMap, int size[3]);
  • Call start on each processor:
    start_fft(complex *srcData, int nplanes);
  • Wait for the operation to finish:
    wait_fft(complex *destData, int nplanes);
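
A hedged sketch of how the three phases fit together in an AMPI program; only the three calls above come from the slide, while the header name fftlib.h, the buffer names, and the exact meaning of srcMap/destMap are assumptions:

    #include <mpi.h>
    #include "fftlib.h"   // assumed header declaring init_fftlib/start_fft/wait_fft and `complex`

    void do_one_3d_fft(MPI_Comm srcComm, MPI_Comm destComm,
                       int *srcMap, int *destMap, int size[3],
                       complex *srcData, complex *destData, int nplanes)
    {
        // Phase 1: build plans and transpose schedules once.
        init_fftlib(srcComm, destComm, srcMap, destMap, size);

        // Phase 2: every processor kicks off its local part of the FFT.
        start_fft(srcData, nplanes);

        // ...other computation can proceed here, overlapping with the
        //    transpose communication...

        // Phase 3: block until the transformed planes have arrived.
        wait_fft(destData, nplanes);
    }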

  17. Problem Decomposition
  • Plane decomposition
    • Used when P ≤ total # of planes
    • One or more planes are assigned to each processor
    • A map is used to determine the assignment
    • The 3-D FFT is performed by doing 2-D FFTs on the planes (the X and Y dimensions), then a transpose, and then FFTs along the Z dimension (sketched below)
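
For illustration only (this is not the library's code, and CPAIMD itself uses the sparse variant), the data movement of the plane decomposition can be written out in plain MPI + FFTW 3, assuming an n×n×n grid, P ranks with n divisible by P, and m = n/P planes per rank:

    #include <mpi.h>
    #include <fftw3.h>
    #include <cstring>

    // Plane-decomposed 3-D FFT sketch.  Each rank owns m = n/P Z planes,
    // stored as planes[(zl*n + x)*n + y] with zl the local plane index.
    void fft3d_plane_decomposed(fftw_complex *planes, int n, MPI_Comm comm)
    {
        int rank, P;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &P);
        const int m = n / P;

        // Step 1: batched 2-D FFT in (x,y) on every locally held plane.
        int n2[2] = { n, n };
        fftw_plan p2d = fftw_plan_many_dft(2, n2, m,
                                           planes, NULL, 1, n * n,
                                           planes, NULL, 1, n * n,
                                           FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(p2d);

        // Step 2: all-to-all transpose; rank d receives the X slab x in [d*m, d*m+m).
        size_t local = (size_t)n * n * m;
        fftw_complex *sendbuf = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * local);
        fftw_complex *recvbuf = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * local);
        fftw_complex *lines   = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * local);
        for (int d = 0; d < P; d++)
            for (int zl = 0; zl < m; zl++)
                for (int xl = 0; xl < m; xl++)       // each row of n y-values is contiguous
                    std::memcpy(&sendbuf[(((size_t)d * m + zl) * m + xl) * n],
                                &planes[((size_t)zl * n + d * m + xl) * n],
                                n * sizeof(fftw_complex));
        MPI_Alltoall(sendbuf, 2 * m * m * n, MPI_DOUBLE,
                     recvbuf, 2 * m * m * n, MPI_DOUBLE, comm);

        // Reorder so Z becomes contiguous: lines[(xl*n + y)*n + z], z = s*m + zl.
        for (int s = 0; s < P; s++)
            for (int zl = 0; zl < m; zl++)
                for (int xl = 0; xl < m; xl++)
                    for (int y = 0; y < n; y++) {
                        fftw_complex *src = &recvbuf[(((size_t)s * m + zl) * m + xl) * n + y];
                        fftw_complex *dst = &lines[((size_t)xl * n + y) * n + s * m + zl];
                        (*dst)[0] = (*src)[0];
                        (*dst)[1] = (*src)[1];
                    }

        // Step 3: batched 1-D FFTs of length n along Z (m*n lines, stride 1).
        int n1[1] = { n };
        fftw_plan p1d = fftw_plan_many_dft(1, n1, m * n,
                                           lines, NULL, 1, n,
                                           lines, NULL, 1, n,
                                           FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(p1d);

        // Return the result, now in the transposed X-slab layout.
        std::memcpy(planes, lines, sizeof(fftw_complex) * local);
        fftw_destroy_plan(p2d);  fftw_destroy_plan(p1d);
        fftw_free(sendbuf);  fftw_free(recvbuf);  fftw_free(lines);
    }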

  18. Problem Decomposition (contd.)
  • Pencil decomposition
    • Used when P > total # of planes
    • Each processor is assigned a “pencil” (a 1-D column of the grid)
    • Perform the 1-D FFTs sequentially, then transpose
    • 3 sets of 1-D FFTs
    • 2 transposes

  19. Multiple 3-D FFTs
  • Create a new communicator for each
  • Initialize the library for each
  • AMPI internally creates the data structures for doing each 3-D FFT
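
A small sketch of the first two bullets (communicator handling only; how the library associates its internal state with each communicator is not shown, and the remaining init_fftlib arguments are the placeholders from the earlier slides):

    #include <mpi.h>
    #include "fftlib.h"   // assumed header, as in the earlier sketch

    // One duplicated communicator pair per concurrent 3-D FFT, so the
    // transposes of different transforms stay on separate communicators.
    void setup_concurrent_ffts(int nFFTs, int *srcMap, int *destMap, int size[3])
    {
        for (int f = 0; f < nFFTs; f++) {
            MPI_Comm srcComm, destComm;
            MPI_Comm_dup(MPI_COMM_WORLD, &srcComm);
            MPI_Comm_dup(MPI_COMM_WORLD, &destComm);
            init_fftlib(srcComm, destComm, srcMap, destMap, size);
        }
    }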

  20. Communication Optimization
  • The 3-D FFT requires several all-to-all transposes
  • This all-to-all communication can be optimized
    • The main overhead is the software overhead of sending a large number of messages
    • Message combining can be used to optimize all-to-all personalized communication (AAPC)
  • Implemented as part of the communication library in AMPI/Charm++

  21. Communication Optimization (contd.)
  • Organize the processors in a 2-D (virtual) mesh
  • A message from (x1,y1) to (x2,y2) goes via (x1,y2)
  • About 2√P messages per processor instead of P-1
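
The routing rule can be sketched as a helper that computes the intermediate hop (assuming P is a perfect square and a row-major rank-to-mesh mapping; the function name is hypothetical):

    #include <cmath>

    // Row-major 2-D virtual mesh over P ranks (P assumed to be a perfect square).
    // A message from src = (x1,y1) to dst = (x2,y2) is first sent to (x1,y2),
    // combined there with other messages bound for column y2, then forwarded.
    int mesh_intermediate(int src, int dst, int P)
    {
        int side = (int)std::lround(std::sqrt((double)P));
        int x1 = src / side;        // row of the source
        int y2 = dst % side;        // column of the destination
        return x1 * side + y2;      // first hop: source's row, destination's column
    }

Combining the messages that share a destination column at the intermediate processor is what cuts the per-processor message count from P-1 to roughly 2(√P - 1).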

  22. Summary
  • We achieved scalable parallelization of CPAIMD
  • Using parallel FFT libraries eases programming significantly
  • The library can also be used through the AMPI/Charm++ interface
