Scalable Parallelization of CPAIMD using Charm++ Ramkumar Vadali (on behalf of Parallel Programming Lab, UIUC)
CPAIMD • CPAIMD (Car-Parrinello Ab Initio Molecular Dynamics) is used to study key chemical/biological processes • Parallelization is made difficult by • Multiple 3-D FFTs • Phases with a low computation-to-communication ratio • MPI-based solutions suffer from limited scalability
Charm++ • Uses the approach of processor virtualization • Divide the work into virtual processors (VPs) • Typically many more VPs than physical processors • The runtime schedules each VP for execution • Advantages: • Computation and communication can be overlapped (between VPs) • The number of VPs can be independent of the number of processors • Others: dynamic load balancing, checkpointing, etc.
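As a minimal sketch (not from the talk) of what virtualization looks like in Charm++ code: a 1-D chare array is created with many more elements (VPs) than physical processors, and the runtime maps and schedules them. The module and class names (virt, Worker) and the factor of 8 VPs per processor are illustrative choices, not part of the CPAIMD code.

    // virt.ci -- Charm++ interface file
    mainmodule virt {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry [reductiontarget] void done();
      };
      array [1D] Worker {
        entry Worker();
        entry void compute();
      };
    };

    // virt.C
    #include "virt.decl.h"

    CProxy_Main mainProxy;                  // readonly: set once in Main, visible on all processors

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg *m) {
        delete m;
        mainProxy = thisProxy;
        int nVP = 8 * CkNumPes();           // many more VPs than physical processors
        CProxy_Worker workers = CProxy_Worker::ckNew(nVP);
        workers.compute();                  // broadcast: every VP gets its share of work
      }
      void done() { CkExit(); }             // called once all VPs have contributed
    };

    class Worker : public CBase_Worker {
     public:
      Worker() {}
      void compute() {
        // per-VP work goes here; while one VP waits on communication,
        // the runtime schedules other VPs on the same processor (overlap)
        contribute(CkCallback(CkReductionTarget(Main, done), mainProxy));
      }
    };

    #include "virt.def.h"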
Orthonormalization • At the end of every iteration, after updating the electron configuration • Need to compute (from the states) • a “correlation” matrix S, where S[i,j] depends on the entire data of states i and j • its transform T • then update the state values • The computation of S has to be distributed • Compute S[i,j,p], where p is the plane number • Sum over p to get S[i,j] • The actual conversion S -> T is sequential
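A hedged sketch of the distributed computation described above. The per-plane inner-product form of S is an assumption (the slide only states that S[i,j] depends on the full data of states i and j); what is from the slide is that per-plane partial results S[i,j,p] are summed over p:

\[
S_{ij} \;=\; \sum_{p} S_{ij}^{(p)}, \qquad
S_{ij}^{(p)} \;=\; \sum_{g \,\in\, \text{plane } p} \psi_i^{*}(g)\, \psi_j(g),
\]

so each plane p computes its partial contribution S[i,j,p] locally, a reduction over p assembles the full matrix S, and only the subsequent S -> T conversion runs sequentially.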
3-D FFT Implementation • “Dense” 3-D FFT • “Sparse” 3-D FFT (in g-space only points inside the cutoff sphere are non-zero, so many planes are only partially filled)
Parallel FFT Library • Slab-based parallelization • We do not re-implement the sequential routines • We utilize the 1-D and 2-D FFT routines provided by FFTW • Allows for • multiple 3-D FFTs simultaneously • multiple data sets within the same set of slab objects • Useful because 3-D FFTs are used frequently in CP computations
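A hedged illustration of the "reuse FFTW" point above: each slab object can simply hold an FFTW plan for its plane rather than re-implementing the transform. This sketch uses the FFTW3 API; the actual library may use a different FFTW version and plan setup, and the SlabPlan name is made up here.

    #include <fftw3.h>

    // One x-y plane of the 3-D grid, transformed in place with a pre-built plan.
    struct SlabPlan {
      int nx, ny;
      fftw_complex *data;     // nx x ny plane, row-major
      fftw_plan plan;

      SlabPlan(int nx_, int ny_) : nx(nx_), ny(ny_) {
        data = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * nx * ny);
        // 2-D FFT of the plane; the remaining 1-D FFTs along Z happen after the transpose
        plan = fftw_plan_dft_2d(nx, ny, data, data, FFTW_FORWARD, FFTW_ESTIMATE);
      }
      void run() { fftw_execute(plan); }
      ~SlabPlan() { fftw_destroy_plan(plan); fftw_free(data); }
    };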
Usage

mainmodule app {
    extern module fftlib;
    readonly CProxy_SrcArray srcArray;
    readonly CProxy_DestArray destArray;

    array [1D] SrcArray: NormalSlabArray {
        entry SrcArray(NormalFFTinfo &conf);
        entry void doIt(int id);
    };
    array [1D] DestArray: NormalSlabArray {
        entry DestArray(NormalFFTinfo &conf);
        entry void doIt(int id);
    };
    mainchare main {
        entry main(CkArgMsg *m);
    };
};
Usage (contd.)

main::main(CkArgMsg *m) {
    int y;
    int dim = 16;
    int srcDim[2]  = {dim, dim};
    int destDim[2] = {dim, dim};

    srcArray  = CProxy_SrcArray::ckNew();
    destArray = CProxy_DestArray::ckNew();

    NormalFFTinfo src_info(srcArray, destArray, srcDim, destDim, true, NULL),
                  dest_info(srcArray, destArray, srcDim, destDim, false, NULL);

    for (y = 0; y < dim; y++) {
        destArray(y).insert(dest_info);
        srcArray(y).insert(src_info);
    }
    destArray.doneInserting();
    srcArray.doneInserting();

    srcArray.doFFT(0, 0);   // That is it!
}
AMPI Interface for the Library • AMPI (Adaptive MPI): runs MPI programs on top of Charm++ • MPI ranks are implemented as user-level threads • Dynamic load balancing is still available • Conversion from MPI to AMPI is easy • Only need to remove global variables • Has been done for large legacy codes • Rocket simulation (RocFlo, RocSolid)
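A hedged illustration of the "remove global variables" step: AMPI ranks are user-level threads that may share an address space, so per-rank state must move out of globals into something each rank owns. All names here are hypothetical.

    // Before (plain MPI): a true global -- unsafe once several AMPI ranks
    // (threads) run in the same process.
    //   static int g_nsteps;

    // After: formerly-global, per-rank state is bundled into an object that
    // each rank creates and passes around explicitly.
    struct RankState {
      int nsteps;
      double dt;
    };

    void integrate(RankState &st) {
      for (int s = 0; s < st.nsteps; s++) {
        // ... advance the simulation by st.dt using only state reachable from st ...
      }
    }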
Three Phases • Initialize the library: init_fftlib(MPI_Comm _srcComm, MPI_Comm _destComm, int *srcMap, int *destMap, int size[3]); • Call start on each processor: start_fft(complex *srcData, int nplanes); • Wait for the operation to finish: wait_fft(complex *destData, int nplanes);
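A hedged usage sketch of the three-phase interface above. Only the three function signatures come from the slide; the header name fftlib.h, the complex type coming from it, and the block plane-to-rank map are assumptions made for illustration.

    #include <mpi.h>
    #include "fftlib.h"   // assumed header declaring init_fftlib/start_fft/wait_fft and complex

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int nprocs;
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      int size[3] = {64, 64, 64};              // grid dimensions
      int nplanes = size[0] / nprocs;          // planes owned by this rank (assume it divides evenly)

      // Assumed block map: plane p is owned by rank p / nplanes
      int *srcMap  = new int[size[0]];
      int *destMap = new int[size[0]];
      for (int p = 0; p < size[0]; p++)
        srcMap[p] = destMap[p] = p / nplanes;

      MPI_Comm srcComm, destComm;
      MPI_Comm_dup(MPI_COMM_WORLD, &srcComm);
      MPI_Comm_dup(MPI_COMM_WORLD, &destComm);

      init_fftlib(srcComm, destComm, srcMap, destMap, size);   // phase 1: initialize

      complex *src  = new complex[nplanes * size[1] * size[2]];
      complex *dest = new complex[nplanes * size[1] * size[2]];
      // ... fill src with this rank's planes ...

      start_fft(src, nplanes);      // phase 2: start the 3-D FFT
      // ... other computation can overlap with the transform here ...
      wait_fft(dest, nplanes);      // phase 3: block until the result is available

      MPI_Finalize();
      return 0;
    }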
Problem Decomposition • Plane decomposition • When P ≤ total # of planes • One or more planes assigned to each processor • A map is used to determine the assignment • The 3-D FFT is performed by doing a 2-D FFT on each plane (the X and Y dimensions), then a transpose, then 1-D FFTs along the Z dimension
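The last bullet above is just the separability of the discrete Fourier transform; writing it out (standard DFT identity, not from the slide):

\[
\hat{x}(k_x,k_y,k_z)
  = \sum_{z=0}^{N_z-1} e^{-2\pi i\, k_z z / N_z}
    \underbrace{\left[\sum_{x=0}^{N_x-1}\sum_{y=0}^{N_y-1}
      e^{-2\pi i\,(k_x x / N_x + k_y y / N_y)}\, x(x,y,z)\right]}_{\text{2-D FFT of plane } z}
\]

Each plane z can do its 2-D FFT independently; after the transpose, the outer sum becomes a set of independent 1-D FFTs along Z, one per (k_x, k_y) line.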
Problem Decomposition (contd.) • Pencil decomposition • When P > total # of planes • Each processor is assigned a “pencil” (a line of the grid), allowing scaling well beyond the number of planes • Perform each 1-D FFT sequentially, then transpose • 3 sets of 1-D FFTs • 2 transposes
Multiple 3-D FFTs • Create a new communicator for each 3-D FFT • Initialize the library once for each • AMPI internally creates the data structures for doing the 3-D FFT
Communication Optimization • The 3-D FFT involves several all-to-all transposes • All-to-all communication can be optimized • The main cost is the software overhead of sending a large number of messages • Message combining can be used to optimize all-to-all personalized communication (AAPC) • Implemented as part of the communication library in AMPI/Charm++
Communication Optimization (contd.) • Organize processors in a 2-D (virtual) mesh • A message from (x1,y1) to (x2,y2) goes via (x1,y2) • Each processor sends about 2·√P combined messages instead of P − 1
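As illustrative arithmetic (numbers not from the talk): with P = 1024 processors arranged as a 32 × 32 virtual mesh, each processor sends 2 · (32 − 1) = 62 combined messages instead of 1023 direct ones.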
Summary • We achieved scalable parallelization of CPAIMD • Using the parallel FFT library eases programming significantly • It can be used through the AMPI/Charm++ interface