MPI – An introduction
by Jeroen van Hunen

• What is MPI and why should we use it?
• Simple example + some basic MPI functions
• Other frequently used MPI functions
• Compiling and running code with MPI
• Domain decomposition
• Stokes solver
• Tracers/markers
• Performance
• Documentation
What is MPI?

• Mainly a data-communication tool: the "Message-Passing Interface"
• Allows parallel calculation on distributed-memory machines
• Usually the Single-Program-Multiple-Data (SPMD) principle is used:
  all processors perform similar tasks on their own part of the data (e.g. in domain decomposition)
• Alternative: OpenMP for shared-memory machines

Why should we use MPI?

• If sequential calculations take too long
• If sequential calculations use too much memory
Simple MPI example

The example program follows the basic structure of every MPI code:
• mpi.h contains definitions, macros, and function prototypes
• initialize MPI
• ask for the processor rank
• ask for the number of processors p
• stop MPI
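A minimal sketch of such a program (the printed message is illustrative, not the original listing):

    #include <stdio.h>
    #include <mpi.h>               /* definitions, macros, function prototypes */

    int main(int argc, char *argv[])
    {
        int rank, p;

        MPI_Init(&argc, &argv);                 /* initialize MPI            */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* ask processor rank        */
        MPI_Comm_size(MPI_COMM_WORLD, &p);      /* ask number of processors  */

        printf("Hello from processor %d of %d\n", rank, p);

        MPI_Finalize();                         /* stop MPI                  */
        return 0;
    }

Output for 4 processors: each of the four ranks prints one line; the ordering of the lines is not guaranteed.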
MPI_SEND and MPI_RECV

Syntax in C:

    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm);
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Status *status);

and in Fortran:

    MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
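A minimal usage sketch (buffer name and tag value are illustrative): rank 0 sends one integer to rank 1.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* send 1 int to destination rank 1, message tag 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive 1 int from source rank 0, message tag 0 */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }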
MPI data types

The most common predefined data types in C:

    MPI_CHAR (char), MPI_INT (int), MPI_LONG (long),
    MPI_FLOAT (float), MPI_DOUBLE (double), MPI_BYTE

and in Fortran:

    MPI_CHARACTER (CHARACTER), MPI_INTEGER (INTEGER), MPI_REAL (REAL),
    MPI_DOUBLE_PRECISION (DOUBLE PRECISION), MPI_COMPLEX (COMPLEX),
    MPI_LOGICAL (LOGICAL), MPI_BYTE
Other frequently used MPI calls

Sending and receiving at the same time, with no risk of deadlock:

    MPI_Sendrecv(sendbuf, sendcount, sendtype, dest, sendtag,
                 recvbuf, recvcount, recvtype, source, recvtag,
                 comm, &status)

… or overwrite the send buffer with the received information:

    MPI_Sendrecv_replace(buf, count, datatype, dest, sendtag,
                         source, recvtag, comm, &status)
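As an illustration (variable names assumed, not from the slides), a ring exchange in which every processor simultaneously sends to its right neighbour and receives from its left one:

    /* assumes rank and p initialized as in the first example */
    int right = (rank + 1) % p;          /* neighbour to send to      */
    int left  = (rank - 1 + p) % p;      /* neighbour to receive from */
    double mine = (double)rank, from_left;
    MPI_Status status;

    /* send and receive in one call: no deadlock regardless of ordering */
    MPI_Sendrecv(&mine,      1, MPI_DOUBLE, right, 0,
                 &from_left, 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &status);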
Other frequently used MPI calls

Synchronizing the processors, which wait for each other at the barrier:

    MPI_Barrier(comm)

Broadcasting a message from one processor to all the others:

    MPI_Bcast(buffer, count, datatype, root, comm)

Both the sending and the receiving processors use the same call to MPI_BCAST.
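For example (the parameter name is illustrative), rank 0 owns a value that all processors need:

    /* assumes rank initialized as in the first example */
    double dt = 0.0;
    if (rank == 0)
        dt = 1.0e-3;             /* e.g. read from an input file on rank 0 */

    /* same call on every rank: root 0 sends, all others receive */
    MPI_Bcast(&dt, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);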
Other frequently used MPI calls

"Reducing" (combining) data from all processors: add, find the maximum/minimum, etc.:

    MPI_Reduce(sendbuf, recvbuf, count, datatype, OP, root, comm)

OP can be one of the following:

    MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, MPI_MAXLOC, MPI_MINLOC,
    MPI_LAND, MPI_LOR, MPI_LXOR, MPI_BAND, MPI_BOR, MPI_BXOR

For results to be available at all processors, use MPI_Allreduce:

    MPI_Allreduce(sendbuf, recvbuf, count, datatype, OP, comm)
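For example (a hypothetical time-step computation; the helper function is assumed, not part of MPI): each processor finds the maximum velocity in its own block, and MPI_Allreduce makes the global maximum available everywhere:

    /* local_maximum_velocity() is a hypothetical helper for this sketch */
    double vmax_local = local_maximum_velocity();  /* max over this block */
    double vmax;

    MPI_Allreduce(&vmax_local, &vmax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    /* every rank now holds the same global maximum in vmax */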
Additional comments:

• 'wildcards' are allowed in MPI calls for:
  • source: MPI_ANY_SOURCE
  • tag: MPI_ANY_TAG
• MPI_SEND and MPI_RECV are 'blocking':
  they do not return until the message buffer can safely be (re)used
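For instance (a sketch; variable names assumed), a receive that accepts a message from any source with any tag, after which the actual source and tag can be read from the status object:

    int value;
    MPI_Status status;

    MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    /* status records who actually sent the message and with which tag */
    printf("got %d from rank %d (tag %d)\n",
           value, status.MPI_SOURCE, status.MPI_TAG);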
Deadlocks

Non-matching send/receive calls may block the code. For two processors exchanging messages, the ordering of the calls matters:

• both processors call MPI_RECV first, then MPI_SEND: deadlock
• both processors call MPI_SEND first, then MPI_RECV: depending on buffering (may deadlock for large messages)
• one processor sends while the other receives, and vice versa: safe

Don't let a processor send a message to itself with MPI_SEND/MPI_RECV; in such cases use MPI_SENDRECV.
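One common safe pattern (a sketch, not from the slides; nbr is an assumed neighbour rank): order the calls by rank parity, so a send on one side is always matched by a receive on the other:

    /* even ranks send first, odd ranks receive first */
    MPI_Status status;
    double out = 1.0, in;

    if (rank % 2 == 0) {
        MPI_Send(&out, 1, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD);
        MPI_Recv(&in,  1, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &status);
    } else {
        MPI_Recv(&in,  1, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&out, 1, MPI_DOUBLE, nbr, 0, MPI_COMM_WORLD);
    }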
Compiling and running code with MPI

Compiling:
• Fortran:
    mpif77 -o binary code.f
    mpif90 -o binary code.f
• C:
    mpicc -o binary code.c

Running in general (no queueing system):
    mpirun -np 4 binary
    mpirun -np 4 -nolocal -machinefile mach binary

Running on Gonzales (with queueing system):
    bsub -n 4 -W 8:00 prun binary
Domain decomposition

(figure: 3-D computational domain divided into blocks along x, y, and z)

• The total computational domain is divided into 'equal size' blocks
• Each processor only deals with its own block (see the sketch below)
• At block boundaries some information exchange is necessary
• The block division matters:
  • surface/volume ratio
  • number of processor boundaries
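A sketch of dividing one grid dimension of n points as evenly as possible over p processors (variable names are illustrative):

    /* assumes rank and p as before; n is the global number of grid points */
    int n_local = n / p + (rank < n % p ? 1 : 0);                  /* points in this block */
    int i_start = rank * (n / p) + (rank < n % p ? rank : n % p);  /* first global index   */
    int i_end   = i_start + n_local - 1;                           /* last global index    */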
Stokes equation: Jacobi iterative solver

• In the block interior: no MPI needed
• At a block boundary: MPI needed

(figure: 5-point stencil around node M with neighbours N, S, E, W; the block boundary runs through N, M, S, so those nodes are duplicated on both processors as N1/N2, M1/M2, S1/S2)

Interior node:
    M = 0.25*(N+S+E+W)

Boundary node: each processor computes a partial sum from the neighbours it holds:
    M1 = 0.25*(N1+S1+W)    (on the left processor)
    M2 = 0.25*(E)          (on the right processor)
    M  = M1 + M2           (combined using MPI_SENDRECV)
    M1 = M2 = M            (both copies updated with the result)

A Gauss-Seidel solver performs better, but is also slightly more difficult to implement.
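A sketch of the boundary update under these assumptions (array and neighbour names are illustrative): each processor sends its partial sums for the duplicated boundary nodes to the neighbour and receives the neighbour's partial sums in return:

    /* nb: number of duplicated nodes on this block boundary; nbr: neighbour rank */
    MPI_Status status;

    MPI_Sendrecv(my_partial,    nb, MPI_DOUBLE, nbr, 0,   /* send M1 (or M2)    */
                 their_partial, nb, MPI_DOUBLE, nbr, 0,   /* receive M2 (or M1) */
                 MPI_COMM_WORLD, &status);

    for (int i = 0; i < nb; i++)
        boundary[i] = my_partial[i] + their_partial[i];   /* M = M1 + M2 on both sides */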
Tracers/Markers

(figure: tracer trajectory crossing the boundary between proc n and proc n+1, with RK2 sub-steps k1 and k2)

2nd-order Runge-Kutta scheme:
    k1 = dt v(t, x(t))
    k2 = dt v(t+dt/2, x(t) + k1/2)
    x(t+dt) = x(t) + k2

Procedure (a sketch of one step follows below):
• Calculate the midpoint position x(t) + k1/2
• If it lies in proc n+1:
  • proc n sends the tracer coordinates to proc n+1
  • proc n+1 reports the tracer velocity back to proc n
• Calculate x(t+dt)
• If it lies in proc n+1:
  • proc n sends the tracer coordinates + function values
    permanently to proc n+1
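A sketch of one RK2 step for a single 2-D tracer, leaving out the inter-processor handoff (the velocity function v is a hypothetical placeholder for interpolation from the grid):

    /* x, z: tracer position; v(t, x, z, &vx, &vz) evaluates the flow velocity */
    double k1x, k1z, k2x, k2z;

    v(t, x, z, &k1x, &k1z);                               /* k1 = dt*v(t, x(t))       */
    k1x *= dt;  k1z *= dt;
    v(t + 0.5*dt, x + 0.5*k1x, z + 0.5*k1z, &k2x, &k2z);  /* k2 = dt*v(t+dt/2, ...)   */
    k2x *= dt;  k2z *= dt;
    x += k2x;  z += k2z;                                  /* x(t+dt) = x(t) + k2      */

The midpoint evaluation of v is where the coordinate exchange of the procedure above takes place when the midpoint lies in the neighbour's block.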
Performance

For jobs that are too small, communication quickly becomes the bottleneck.

Test problem:
• Rayleigh-Bénard convection (Ra = 10^6)
• 2-D: 64x64 finite elements, 10^4 time steps
• 3-D: 64x64x64 finite elements, 100 time steps
• Calculations performed on Gonzales
Documentation

Books:
PDF: www.hpc.unimelb.edu.au/software/mpi-docs/mpi-book.pdf