Parallel Programming using MPI
by Phil Bording
Husky Energy Chair in Oil and Gas Research, Memorial University of Newfoundland
Parallel Computing - MPI/OpenMP
Table of Contents
• Introduction
• Program Structure
• Defining Parallel Computing
• Domain Decomposition
Rank and Structure
• Each processor has a name, or rank
• Rank = name = number, used for identification
• Processor organization: structure = communicator
• The communicator defines the communication chain
How the problem communicates defines the needed structure!
Processor Rank
• 16 processors: rank 0 to 15
(Figure: 16 processors labeled by rank.)
Realize that the rank position is relative and could be structured differently as needed.
Processor Rank with Structure
• 16 processors: rank 0 to 15, arranged as a 4 x 4 grid:

   0  1  2  3
   4  5  6  7
   8  9 10 11
  12 13 14 15

Realize that the rank position is relative and could be structured differently as needed.
SPMD Parallel Computing
• Simple Code Example - the slide's version was "almost correct"; corrected here (MPI_COMM_RANK and MPI_FINALIZE are the actual routine names, and mpif.h must be included):

      Program SPMD
      Include 'mpif.h'
      Integer Rank, return_code, i
      Dimension Psi(0:100)
      Call MPI_INIT(return_code)
      Call MPI_COMM_RANK(MPI_COMM_WORLD, Rank, return_code)
      Write(6,*) Rank
      Do i = 0, Rank
         Psi(i) = i
      Enddo
      Write(6,*) (Psi(i), i = 0, Rank)
      Call MPI_FINALIZE(return_code)
      End
SPMD Parallel Computing
• Simple Code Example, assuming four parallel processes
• The output looks like this:

  0
  0.0
  2
  0.0, 1.0, 2.0
  3
  0.0, 1.0, 2.0, 3.0
  1
  0.0, 1.0

MPI has no standard for the sequence of appearance in output streams.
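The per-rank content above is deterministic even though the interleaving is not. A minimal sketch (plain Python, not MPI) of what each of the four ranks computes:

```python
# Sketch (not MPI): reproduce what each of 4 SPMD ranks computes.
# Rank r fills Psi(0..r) with the values 0..r, so its output line
# carries r+1 numbers; only the merged print order is nondeterministic.
def rank_output(rank):
    psi = [float(i) for i in range(rank + 1)]
    return rank, psi

outputs = dict(rank_output(r) for r in range(4))
print(outputs[3])   # [0.0, 1.0, 2.0, 3.0]
```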
SPMD Parallel Computing
We’ll get back to MPI coding after we figure out how we are going to do the domain decomposition.
The Omega Domain: Ω is split into subdomains Ω0, Ω1, Ω2.
Discussion Time
Domain Decomposition
• Subdivision of the problem domain into parallel regions
• Example using 2-dimensional data arrays
• Linear, one dimension, versus
• Grid of two dimensions
Single Processor Memory Arrays, Nx by Ny Dimension Array (Nx,Ny)
Multiple Processor Memory Arrays, Nx/2 by Ny/2 4 Processors Two way decomposition
Multiple Processor Memory Arrays, Nx by Ny/3 3 Processors One way decomposition
Multiple Processor Memory Arrays, Nx/3 by Ny 3 Processors One way decomposition – the other way
So which one is better? Or does it make a difference?? One way decomposition – one way or the other?
Dimension Array (Nx,Ny) becomes Dimension Array (Nx/3,Ny) or Dimension Array (Nx,Ny/3)
The Nx/3 version in Fortran has shorter do-loop lengths in the fastest-moving (first) index, which could limit performance. Further, the sharing of data via message passing will have non-unit-stride data access patterns.
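The arithmetic behind the two choices can be sketched as follows (plain Python, assuming the dimensions divide evenly by the processor count):

```python
# Sketch of the decomposition arithmetic. Fortran stores arrays
# column-major: elements along the first index (i) are contiguous in
# memory, so splitting that index shortens the fast inner loops.
def local_shape(nx, ny, p, split_first_index):
    # Split either the first (i) or second (j) dimension across p processors.
    return (nx // p, ny) if split_first_index else (nx, ny // p)

nx, ny, p = 300, 300, 3
print(local_shape(nx, ny, p, True))    # (100, 300): shorter inner loops
print(local_shape(nx, ny, p, False))   # (300, 100): full-length inner loops
```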
So the design issue becomes one of choice for the programming language: decide which language you need to use, and then create the decomposition plan.
Realize, of course, that a one-dimensional decomposition runs on Np_one processors, while a two-dimensional decomposition could have Np_two x Np_two. So in the design of your parallel code you have to be aware of your resources. Further, few if any programs scale well, and being realistic about the number of processors to be used is important in deciding how hard you want to work at the parallelization effort.
Discussion Time
Processor Interconnections
• Communication hardware connects the processors
• These wires carry data and address information
• The best interconnection is the most expensive: every machine has a direct connection to all other machines
• Because of cost we have to compromise
Processor Interconnections
• The slowest is also the cheapest
• We could just let each machine connect to some other machine in a daisy-chain fashion
• Messages would bump along until they reach their destination
• What other schemes are possible?
Processor Interconnections
• The Linear Daisy Chain
• The Binary Tree
• The Fat Tree
• The Flat Network
• The Hypercube
• The Torus
• The Ring
• The Cross Bar
• And many, many others
The Linear Daisy Chain
Processor 0 <-> Processor 1 <-> Processor 2
The Cross Bar
(Figure: Processors 0, 1, 2 on one side of a switch matrix, Processors 0, 1, 2 on the other; every pair has a crosspoint.)
Latency is O(1), but the switch is O(n^2). The fastest and most expensive.
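The cost claim is simple counting; a one-line sketch makes it concrete:

```python
# Sketch: a full crossbar between n processors needs a switch crosspoint
# for every (source, destination) pair, so the switch hardware grows as
# O(n^2) even though any single message crosses it in one hop, O(1).
def crossbar_crosspoints(n):
    return n * n   # an n x n matrix of crosspoints

print(crossbar_crosspoints(16))   # 256
```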
Let’s Look at the Binary Tree: O(log N)
Let’s Look at the Fat Tree: O(log N)+
Let’s Look at the Hypercube
Order 1, Order 2, Order 3: to build the next order, duplicate the cube and connect the corresponding edges together.
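The "duplicate and connect" construction has a clean addressing consequence, sketched here (plain Python, not part of the slide):

```python
# Sketch of hypercube connectivity: in an order-d hypercube, each of the
# 2^d nodes carries a d-bit label, and a node's neighbors are exactly the
# labels differing in one bit (flip each bit in turn via XOR).
def neighbors(node, d):
    return sorted(node ^ (1 << bit) for bit in range(d))

print(neighbors(0, 3))   # [1, 2, 4]
print(neighbors(5, 3))   # 5 = 0b101 -> [1, 4, 7]
```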
Let’s Look at the Binary Tree
• Every node can reach every other node
• Has log N connections; 32 nodes have 5 levels
• Some neighbors are far apart
• A little more expensive
• The root is a bottleneck
Let’s Look at the Fat Tree
• Every node can reach every other node
• Has log N connections; 32 nodes have 5 levels
• Some neighbors are far apart
• Even more expensive
• The root bottleneck is better managed
• Each level has multiple connections
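The "32 nodes have 5 levels" bullet is just the logarithm at work; a short sketch of the depth count:

```python
# Sketch: in a binary tree network the number of levels grows as log2(N),
# so doubling the node count adds only one level. The links near the root
# still carry traffic for whole subtrees, which is what the fat tree's
# wider upper links are meant to relieve.
import math

def tree_levels(n_nodes):
    return int(math.log2(n_nodes))

print(tree_levels(32))   # 5
print(tree_levels(64))   # 6
```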
Fortran MPI Commands

      INCLUDE 'mpif.h'
      MPI_INIT(ierr)
      MPI_COMM_SIZE(MPI_COMM_WORLD, p, ierr)
      MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
MPI Commands

      MPI_SCATTER(A, chunkA, MPI_REAL, A_local, chunkA, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      MPI_SCATTER(b, chunkb, MPI_REAL, b_local, chunkb, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      MPI_ALLGATHER(x_local, chunkb, MPI_REAL, x_new, chunkb, MPI_REAL, MPI_COMM_WORLD, ierr)
Message Passing
• Broadcast
(Figure: processor I sending to processors J, K, K+1, L.)
Broadcast from the Ith processor to all other processors.
Message Passing
• Broadcast - Scatter
(Figure: processor I sending one chunk each to J, K, K+1, L; chunks are of uniform length.)
Scatter from the Ith processor to all other processors.
Message Passing
• Gather
(Figure: processors J, K, K+1, L sending to processor I.)
Gather from all other processors to the Ith processor.
Message Passing
• Broadcast - Gather
(Figure: processors J, K, K+1, L each sending a uniform-length chunk to processor I.)
Gather to the Ith processor from all other processors.
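The bookkeeping behind scatter and gather with uniform chunks can be sketched without MPI at all (plain Python, lengths assumed to divide evenly):

```python
# Sketch (not MPI itself): scatter splits an array of length P*chunk into
# P uniform pieces, one per rank; gather is the exact inverse.
def scatter(data, nprocs):
    chunk = len(data) // nprocs          # uniform chunk length
    return [data[r * chunk:(r + 1) * chunk] for r in range(nprocs)]

def gather(pieces):
    return [x for piece in pieces for x in piece]

data = list(range(12))
pieces = scatter(data, 4)
print(pieces)                 # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
print(gather(pieces) == data) # True
```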
Message Passing
• Exchange, based on a user topology (ring or linear)
(Figure: processors I, J, K, K+1, L exchanging with their neighbors.)
Based on the connection topology, processors exchange information.
Just what is a message?
Message Content
To: You@Address
From: Me@Address
Just what is a message?
Message Content
To: You@Address:Attn Payroll
From: Me@Address:Attn Payroll
Message Structure
• To: Address (rank)
• Content (starting array/vector/word address and length)
• Tag
• Data Type
• Error Flag
• Communicator
We know who we are, so From: Address (rank) is implicit!
Messaging
• For every SEND there must be a RECEIVE!
• The transmission is two-sided: the receiver agrees to allow the sender to put the data into a memory location in the receiving process.
Message Passing
The interconnection topology is called a communicator, predefined at startup. However, the user can define his own topology, and should as needed. A problem-dependent communicator can be defined; in fact, more than one can be defined as needed.
Program Structure
(Figure: Processor Rank 0, Processor Rank 1, and Processor Rank 2 each run the same Input, Loops, Output phases, separated by sync barriers.)
MPI Send – Receive
Send on processor K: Count cells. Receive on processor L: Length ≥ Count.
Each cell holds one MPI_Data_Type, and the MPI_Data_Type must be the same on both sides!
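The matching rule above reduces to two checks, sketched here (plain Python, not MPI):

```python
# Sketch of the send/receive matching rule: the receive buffer must be at
# least as long as the incoming count, and the declared datatypes must
# agree, or the transfer is invalid.
def can_receive(send_count, recv_length, send_type, recv_type):
    return recv_length >= send_count and send_type == recv_type

print(can_receive(100, 128, 'MPI_REAL', 'MPI_REAL'))   # True
print(can_receive(100, 64, 'MPI_REAL', 'MPI_REAL'))    # False
```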
MPI Data_Types

  Type       Number of bytes
  Float      4
  Double     8
  Integer    4?
  Boolean    4
  Character  1?

A bit of care is needed between Fortran and C data types.
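The "?" entries in the table are exactly the platform-dependent ones; Python's struct module, which reports the platform's C sizes, can check the table:

```python
# Sketch: byte sizes of the C types underlying the table, as reported by
# the platform. 'f' = float, 'd' = double, 'i' = int, 'c' = char.
import struct

print(struct.calcsize('f'))   # 4
print(struct.calcsize('d'))   # 8
print(struct.calcsize('i'))   # usually 4, but platform-dependent
print(struct.calcsize('c'))   # 1
```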
The predefined C datatype constants (from mpi.h):

  #define MPI_BYTE ...
  #define MPI_PACKED ...
  #define MPI_CHAR ...
  #define MPI_SHORT ...
  #define MPI_INT ...
  #define MPI_LONG ...
  #define MPI_FLOAT ...
  #define MPI_DOUBLE ...
  #define MPI_LONG_DOUBLE ...
  #define MPI_UNSIGNED_CHAR ...
MPI Data_Type Issues
• Just what is a data type?
• How many bits?
• Big endian versus little endian?
• Whatever is used must be consistent!
• Could type conversions be automatic or transparent?
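The endianness question is easy to make concrete (plain Python sketch):

```python
# Sketch: the same 32-bit integer has a different byte order on big- and
# little-endian machines, which is why a heterogeneous cluster must agree
# on, or convert, the representation before exchanging raw bytes.
value = 1
print(value.to_bytes(4, 'big').hex())      # 00000001
print(value.to_bytes(4, 'little').hex())   # 01000000
```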
The Grid for 2 by 2
2D Decomposition
(Figure: an Nx by Ny grid split at Nx/2 and Ny/2 into four blocks, owned by processors P0, P1, P2, P3.)
The update u(i,j) = u(i-1,j) + u(i,j) needs u(i-1,j), which at a block boundary lives on a neighboring processor.
How do the processors see the variables they don’t have?
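One common answer, offered here as an assumption rather than something stated on the slide, is ghost (halo) cells: each processor keeps a copy of the boundary column it needs from its neighbor and refreshes it by message passing before each update. A minimal sketch:

```python
# Sketch of the ghost-cell idea (not MPI): ghost_left[j] plays the role of
# u(i-1, j) for this processor's first local column in the update
# u(i,j) = u(i-1,j) + u(i,j).
def update_with_ghost(local_col, ghost_left):
    return [ghost_left[j] + local_col[j] for j in range(len(local_col))]

ghost = [1.0, 1.0, 1.0]   # received from the left neighbor
mine = [2.0, 3.0, 4.0]    # this processor's first column
print(update_with_ghost(mine, ghost))   # [3.0, 4.0, 5.0]
```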