Learn about parallel computing and how it can be useful, including different parallel paradigms, how to parallelize problems, and an overview of the Message Passing Interface (MPI) standard.
FLASH Tutorial, May 13, 2004: Parallel Computing and MPI
What is Parallel Computing? And why is it useful
• Parallel computing is more than one CPU working together on one problem
• It is useful when
  • the problem is large and would take very long on a single processor
  • the data are too big to fit in the memory of one processor
• When to parallelize
  • when the problem can be subdivided into relatively independent tasks
• How much to parallelize
  • as long as the speedup relative to a single processor stays of the order of the number of processors
Parallel Paradigms
• SIMD – Single Instruction, Multiple Data
  • processors work in lock-step
• MIMD – Multiple Instruction, Multiple Data
  • processors do their own thing, with occasional synchronization
• Shared memory
  • one-sided communication
• Distributed memory
  • message passing
• Loosely coupled
  • the process on each CPU is fairly self-contained and relatively independent of processes on other CPUs
• Tightly coupled
  • CPUs need to communicate with each other frequently
How to Parallelize
• Divide the problem into a set of mostly independent tasks
  • partitioning the problem
• Tasks get their own data
  • localize each task
• They operate on their own data for the most part
  • try to make each task self-contained
• Occasionally
  • data may be needed from other tasks
    • inter-process communication
  • synchronization may be required between tasks
    • global operations
• Map tasks to different processors
  • one processor may get more than one task
  • the task distribution should be well balanced
New Code Components
• Initialization
• Query parallel state
  • identify process
  • identify number of processes
• Exchange data between processes
  • local, global
• Synchronization
  • barriers, blocking communication, locks
• Finalization
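A minimal sketch of these components in C, using MPI (described on the next slide). Only initialization, process identification, a synchronization point, and finalization are shown; the computation and data exchange are left as a comment.

/* Minimal sketch of the new code components a parallel program needs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);                     /* initialization        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* identify this process */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);     /* number of processes   */

    printf("Process %d of %d starting\n", rank, nprocs);

    /* ... exchange data, compute on the local piece of the problem ... */

    MPI_Barrier(MPI_COMM_WORLD);                /* synchronization       */
    MPI_Finalize();                             /* finalization          */
    return 0;
}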
MPI
• Message Passing Interface: the standard for the distributed-memory model of parallelism
• MPI-2 adds support for one-sided communication, commonly associated with shared-memory operations
• Works with communicators: collections of processes
  • MPI_COMM_WORLD is the default communicator
• Has support for the lowest-level communication operations as well as composite operations
• Has blocking and non-blocking operations
Communicators
(Diagram: two communicators, COMM1 and COMM2.)
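As an illustration of working with communicators, the sketch below splits MPI_COMM_WORLD into two sub-communicators by even/odd rank, similar in spirit to COMM1 and COMM2 in the diagram; the even/odd split is only an example, not how FLASH actually builds its communicators.

/* Sketch: split MPI_COMM_WORLD into two sub-communicators by even/odd rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int world_rank, subrank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Processes with the same "color" end up in the same new communicator. */
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);

    MPI_Comm_rank(subcomm, &subrank);
    printf("world rank %d has rank %d in sub-communicator %d\n",
           world_rank, subrank, color);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}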
Low-level Operations in MPI
• MPI_Init
• MPI_Comm_size
  • find the number of processes
• MPI_Comm_rank
  • find my process number (rank)
• MPI_Send / MPI_Recv
  • communicate with other processes one at a time
• MPI_Bcast
  • global data transmission
• MPI_Barrier
  • synchronization
• MPI_Finalize
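A small, self-contained sketch that exercises most of the calls listed above: rank 0 broadcasts a value, each even rank then sends it to the next odd rank, and a barrier synchronizes everyone before finalization. The value 42 and the even/odd pairing are purely illustrative.

/* Sketch of the low-level MPI operations: Bcast, Send/Recv, Barrier. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) value = 42;                 /* data known only on rank 0   */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* global transmission */

    if (rank % 2 == 0 && rank + 1 < nprocs) {
        /* point-to-point: even rank sends to the odd rank to its right */
        MPI_Send(&value, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
    } else if (rank % 2 == 1) {
        MPI_Recv(&value, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Barrier(MPI_COMM_WORLD);               /* synchronization */
    MPI_Finalize();
    return 0;
}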
Advanced Constructs in MPI
• Composite operations
  • Gather/Scatter
  • Allreduce
  • Alltoall
• Cartesian grid operations
  • Shift
• Communicators
  • creating subgroups of processes to operate on
• User-defined datatypes
• I/O
  • parallel file operations
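As an example of a composite operation, the sketch below computes a global sum with MPI_Allreduce, the kind of reduction used to accumulate a physical quantity over all processes. The local value here is just a stand-in.

/* Sketch of a composite operation: global sum via MPI_Allreduce. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double local_sum, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local_sum = (double)(rank + 1);   /* stand-in for a locally computed quantity */

    /* Every process ends up with the sum over all processes. */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %g\n", global_sum);
    MPI_Finalize();
    return 0;
}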
Communication Patterns
(Diagram: point-to-point, collective, one-to-all broadcast, all-to-all, and shift patterns among processes 0–3.)
Communication Overheads
• Latency vs. bandwidth
• Blocking vs. non-blocking
  • overlap
  • buffering and copy
• Scale of communication
  • nearest neighbor
  • short range
  • long range
• Volume of data
• Resource contention for links
• Efficiency
  • hardware, software, communication method
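The sketch below illustrates overlapping communication with computation: non-blocking MPI_Isend/MPI_Irecv are posted first, local work proceeds while the messages are in flight, and MPI_Waitall is called only when the remote data is actually needed. do_local_work() is a hypothetical placeholder.

/* Sketch of hiding communication latency behind local computation. */
#include <mpi.h>

static void do_local_work(void) { /* interior work needing no remote data */ }

int main(int argc, char *argv[])
{
    int rank, nprocs;
    double sendval = 1.0, recvval = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;

    /* Post the communication first ... */
    MPI_Irecv(&recvval, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... compute while the messages are in flight ... */
    do_local_work();

    /* ... and wait only when the remote data is needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}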
Parallelism in FLASH
• Short-range communications
  • nearest neighbor
• Long-range communications
  • regridding
• Other global operations
  • all-reduce operations on physical quantities
  • specific to solvers
    • multipole method
    • FFT-based solvers
Domain Decomposition
(Diagram: the computational domain distributed over processors P0, P1, P2, P3.)
Border Cells / Ghost Points
• When solnData is split across processors, each processor needs data from the others
• A layer of cells is needed from each neighboring processor
• These ghost cells must be updated every time step
Border/Ghost Cells: Short-Range Communication
Two MPI Methods for Doing It

Method 1: Cartesian topology
• MPI_Cart_create – create the topology
• MPE_Decomp1d – domain decomposition on the topology
• MPI_Cart_shift – who is on the left/right?
• MPI_Sendrecv – fill ghost cells on the left
• MPI_Sendrecv – fill ghost cells on the right

Method 2: manual decomposition
• MPI_Comm_rank, MPI_Comm_size
• manually decompose the grid over processors
• calculate the left/right neighbors
• MPI_Send / MPI_Recv – carefully, to avoid deadlocks
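A sketch of the first method for a 1-D decomposition with one guard layer: build a Cartesian topology, find the left/right neighbors with MPI_Cart_shift, and fill the ghost cells with two MPI_Sendrecv calls. The block size N and the array u are illustrative, and MPE_Decomp1d is omitted here.

/* Sketch: 1-D Cartesian topology with ghost-cell exchange via MPI_Sendrecv. */
#include <mpi.h>

#define N 16   /* interior cells per process (illustrative) */

int main(int argc, char *argv[])
{
    int nprocs, dims[1] = {0}, periods[1] = {0};
    int left, right;
    double u[N + 2];               /* u[0] and u[N+1] are ghost cells */
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Build the 1-D process topology. */
    dims[0] = nprocs;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 1, &cart);

    /* Who is on my left/right?  (MPI_PROC_NULL at the domain boundary.) */
    MPI_Cart_shift(cart, 0, 1, &left, &right);

    for (int i = 0; i < N + 2; i++) u[i] = 0.0;   /* stand-in initial data */

    /* Send my last interior cell right, receive my left ghost cell. */
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left,  0, cart, MPI_STATUS_IGNORE);
    /* Send my first interior cell left, receive my right ghost cell. */
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  1,
                 &u[N + 1], 1, MPI_DOUBLE, right, 1, cart, MPI_STATUS_IGNORE);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}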
Adaptive Grid Issues
• The discretization is not uniform
• Simple left-right guard-cell fills are inadequate
• Adjacent grid points may not be mapped to nearest neighbors in the processor topology
• Redistribution of work becomes necessary
Regridding
• Change in the number of cells/blocks
• Some processors get more work than others
  • load imbalance
• Redistribute data to even out the work on all processors
  • long-range communications
  • large quantities of data moved
Other Parallel Operations in FLASH
• Global max/sum etc. (Allreduce)
  • physical quantities
  • in solvers
  • performance monitoring
• Alltoall
  • FFT-based solver on the uniform grid (UG)
• User-defined datatypes and file operations
  • parallel I/O
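As a sketch of parallel file operations, the example below has every process write its local block of doubles into one shared file at a rank-dependent offset using a collective MPI-IO write. The file name and block size are illustrative and do not reflect FLASH's actual I/O scheme.

/* Sketch of parallel I/O with MPI-IO: one shared file, one slice per process. */
#include <mpi.h>

#define NLOCAL 8   /* values per process (illustrative) */

int main(int argc, char *argv[])
{
    int rank;
    double buf[NLOCAL];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NLOCAL; i++) buf[i] = rank;   /* stand-in data */

    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each process writes its contiguous slice; the write is collective. */
    offset = (MPI_Offset)rank * NLOCAL * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, NLOCAL, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}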