Parallel Computing/Programming using MPI

by R Phillip Bording, CSC, Inc., Center of Excellence in High Performance Computing, February 3, 2004

Table of Contents
• Introduction
• Program Structure
• Defining Parallel Computing
• Domain Decomposition

Program Structure
• How do we define parallel computing?
• Running a program using more than one processor
• Two interesting choices exist, among others:
  • Single Program using Multiple machines, different Data -> SPMD
  • Multiple Programs using Multiple machines, different Data -> MPMD

MPMD Parallel Computing
• The MPMD model has multiple source codes
• In this model each computer has a different code and different data
• The user is responsible for the program structure, which behaves differently on each machine
• Typically each machine passes data to the next one, a daisy chain

MPMD Parallel Computing
[Figure: Program 0, Program 1, Program 2]
Multiple Programs - Multiple Data

SPMD Parallel Computing
• The SPMD model has a single source code
• In the cluster model each computer has an identical copy of this code
• The user is responsible for the program structure, which can behave differently on each machine

SPMD Parallel Computing
• Other versions of the SPMD model exist
• In the cluster model each processor has a unique id number, called its rank
• The programmer can test for rank and modify the program function as needed
Single Program - Multiple Data

Rank and Structure
• Each processor has a name or rank
  • Rank = name = number
  • Identification
• Processor organization
  • Structure = Communicator
  • The communication chain
How the problem communicates defines the needed structure!

Processor Rank
• 16 Processors, Rank 0 to 15
[Figure: processors labeled with their ranks]
Realize that the rank position is relative and could be structured differently as needed.

Processor Rank with Structure
• 16 Processors, Rank 0 to 15
     0   1   2   3
     4   5   6   7
     8   9  10  11
    12  13  14  15
Realize that the rank position is relative and could be structured differently as needed.
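The 4 x 4 layout above can be turned into arithmetic. The following is a minimal sketch, assuming the ranks are laid out row by row as in the figure; the names `GridPos`, `rank_to_grid`, and `grid_to_rank` are hypothetical helpers, not MPI routines.

```c
#include <assert.h>

/* Position of a rank in a grid that is filled row by row. */
typedef struct { int row; int col; } GridPos;

/* Hypothetical helper: map a rank to (row, col) in a grid with `cols` columns. */
GridPos rank_to_grid(int rank, int cols) {
    assert(rank >= 0 && cols > 0);
    GridPos p = { rank / cols, rank % cols };
    return p;
}

/* Inverse mapping: the rank stored at (row, col). */
int grid_to_rank(int row, int col, int cols) {
    assert(row >= 0 && col >= 0 && col < cols);
    return row * cols + col;
}
```

With `cols = 4`, rank 6 lands at row 1, column 2, matching the figure. MPI also provides Cartesian topology routines (`MPI_Cart_create`, `MPI_Cart_coords`) that do this bookkeeping inside a communicator.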

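SPMD Parallel Computing
• Simple Code Example - almost correct
      Program Simple
      Include 'mpif.h'
      Integer Rank, return_code, i
      Real Psi(0:100)
!     Start MPI and ask for this process's rank
      Call MPI_Init(return_code)
      Call MPI_Comm_rank(MPI_COMM_WORLD, Rank, return_code)
      Write(6,*) Rank
!     Each rank fills and prints a different number of elements
      Do i = 0, Rank
         Psi(i) = i
      Enddo
      Write(6,*) (Psi(i), i = 0, Rank)
      Call MPI_Finalize(return_code)
      End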

SPMD Parallel Computing
• Simple Code Example - almost correct
• Assuming four parallel processes
• The output looks like this:
    0
    0.0
    2
    0.0, 1.0, 2.0
    3
    0.0, 1.0, 2.0, 3.0
    1
    0.0, 1.0
MPI has no standard for the sequence of appearance in output streams

SPMD Parallel Computing
We’ll get back to MPI coding after we figure out how we are going to do the domain decomposition.
[Figure: the Omega domain Ω and its subdomains Ω0, Ω1, Ω2]

Domain Decomposition
• Subdivision of problem domain into parallel regions
• Example using 2 dimensional data arrays
• Linear One Dimension versus Grid of Two Dimensions

Single Processor Memory Arrays, Nx by Ny Dimension Array (Nx,Ny)

Multiple Processor Memory Arrays, Nx/2 by Ny/2 4 Processors Two way decomposition

Multiple Processor Memory Arrays, Nx by Ny/3 3 Processors One way decomposition

Multiple Processor Memory Arrays, Nx/3 by Ny 3 Processors One way decomposition – the other way
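The one-way decompositions above write the local size as Nx/3 or Ny/3, which is exact only when the size divides evenly. As a minimal sketch of one common way to handle the remainder (an assumption, not something the slides specify), the hypothetical helper `block_range` below gives each rank a half-open, zero-based index range and hands the leftover indices to the lowest ranks.

```c
#include <assert.h>

/* Half-open index range [lo, hi) owned by one rank (zero-based). */
typedef struct { int lo; int hi; } Range;

/* Split n indices over nprocs ranks as evenly as possible:
   the first (n % nprocs) ranks each get one extra index. */
Range block_range(int n, int nprocs, int rank) {
    assert(n >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    int base  = n / nprocs;   /* indices every rank gets */
    int extra = n % nprocs;   /* leftover indices to hand out */
    Range r;
    r.lo = rank * base + (rank < extra ? rank : extra);
    r.hi = r.lo + base + (rank < extra ? 1 : 0);
    return r;
}
```

For example, 10 indices over 3 processors come out as [0,4), [4,7), and [7,10). Fortran arrays are normally 1-based, so add 1 to `lo` when using these bounds there.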

So which one is better? Or does it make a difference?? One way decomposition – one way or the other?

Dimension Array (Nx,Ny) becomes
Dimension Array (Nx/3,Ny) or
Dimension Array (Nx,Ny/3)
In Fortran the first index moves fastest in memory, so the Nx/3 version has shorter do loop lengths in the fastest moving index, which could limit performance. Further, the data it shares via message passing lies along a fixed first index, so it has non-unit stride data access patterns.
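The stride argument can be made concrete. Here is a minimal sketch, assuming zero-based indices and Fortran (column-major) storage; `colmajor_offset` and `pack_row` are hypothetical helpers. A boundary with a fixed first index, as in the (Nx/3,Ny) split, has to be gathered with stride nx, while a boundary with a fixed second index, as in the (Nx,Ny/3) split, is already contiguous.

```c
#include <stddef.h>

/* Offset of element (i, j), zero-based, in an array stored in Fortran
   (column-major) order with nx rows: the first index moves fastest. */
size_t colmajor_offset(size_t i, size_t j, size_t nx) {
    return i + j * nx;
}

/* Copy the boundary with fixed first index i into a contiguous buffer.
   Successive elements sit nx apart in memory: non-unit stride. */
void pack_row(const double *a, size_t nx, size_t ny, size_t i, double *buf) {
    for (size_t j = 0; j < ny; ++j)
        buf[j] = a[colmajor_offset(i, j, nx)];
}
```

Packing into a buffer like this is one way to prepare strided data for a message; the opposite boundary, with fixed j, could be sent straight from `&a[colmajor_offset(0, j, nx)]` without a copy.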

So the design issue becomes one of choice of programming language: decide which language you need to use, and then create the decomposition plan.

Realize of course that a one-dimensional decomposition has only Np_one processors, and a two-dimensional decomposition could have Np_two x Np_two. So in the design of your parallel code you would have to be aware of your resources. Further, few if any programs scale well, and being realistic about the number of processors to be used is important in deciding how hard you want to work at the parallelization effort.

Processor Interconnections
• Communication hardware connects the processors
• These wires carry data and address information
• The best interconnection is the most expensive -- all machines have a direct connection to all other machines
• Because of cost we have to compromise

Processor Interconnections
• The slowest is also the cheapest
• We could just let each machine connect to some other machine in a daisy chain fashion
• Messages would bump along until they reach their destination
• What other schemes are possible?

Processor Interconnections
• The Linear Daisy Chain
• The Binary Tree
• The Fat Tree
• The FLAT network
• The Hypercube
• The Torus
• The Ring
• The Cross Bar
• And many, many others

The Linear Daisy Chain
[Figure: Processor 0, Processor 1, Processor 2 connected in a line]

The Cross Bar
[Figure: Processors 0, 1, 2 along each side of a switch grid]
• O(1) hops between any pair, but the switch is O(n^2)
• The fastest and most expensive
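The cost trade-off between these schemes can be put into numbers. The sketch below is only counting arithmetic under simple assumptions (one link per connected pair, one switch point per crossbar intersection); the function names are hypothetical.

```c
/* Links needed when every machine connects directly to every other: n(n-1)/2. */
long full_links(long n) { return n * (n - 1) / 2; }

/* A linear daisy chain needs only n-1 links... */
long chain_links(long n) { return n > 0 ? n - 1 : 0; }

/* ...but a message from rank a to rank b bumps along |a - b| links. */
long chain_hops(long a, long b) { return a > b ? a - b : b - a; }

/* An n x n crossbar has one switch point per (input, output) pair. */
long crossbar_switch_points(long n) { return n * n; }
```

For 16 machines this gives 120 direct links for the fully connected case, 15 links but up to 15 hops for the chain, and 256 switch points for the crossbar.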

Let's Look at the Binary Tree: O(log N)

Let's Look at the Fat Tree: O(log N)+

Let's Look at the Hypercube
[Figure: hypercubes of Order 1, Order 2, Order 3]
To build the next order, duplicate the hypercube and connect each node to its copy.
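The duplicate-and-connect construction has a compact form if the nodes are numbered in binary. This is a minimal sketch assuming that conventional numbering (0 to 2^d - 1, which the slides do not state): each duplication adds one bit, and the new links join nodes that differ only in that bit. The function names are hypothetical.

```c
/* Neighbor of `node` across dimension `dim`: flip bit `dim` of the node number. */
unsigned hypercube_neighbor(unsigned node, unsigned dim) {
    return node ^ (1u << dim);
}

/* Links on a shortest path between a and b: the number of differing bits. */
unsigned hypercube_hops(unsigned a, unsigned b) {
    unsigned diff = a ^ b;
    unsigned hops = 0;
    while (diff) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}
```

In an Order 3 hypercube (8 nodes), node 0 and node 7 differ in all three bits, so they are 3 hops apart, which is the worst case for that size.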

Let's Look at the Binary Tree
• Every node can reach every other node
• Has log N levels of connections; 32 nodes have 5 levels
• Some neighbors are far apart
• Little more expensive
• Root is a bottleneck

Let's Look at the Fat Tree
• Every node can reach every other node
• Has log N levels of connections; 32 nodes have 5 levels
• Some neighbors are far apart
• Even more expensive
• Root bottleneck is better managed
• Each level has multiple connections
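The "32 nodes have 5 levels" figure is the base-2 logarithm rounded up. A minimal sketch, with `tree_levels` as a hypothetical helper name:

```c
/* Smallest number of levels L with 2^L >= n, i.e. ceil(log2(n)). */
int tree_levels(unsigned n) {
    int levels = 0;
    unsigned long long capacity = 1;  /* nodes reachable with `levels` levels */
    while (capacity < n) {
        capacity *= 2;
        ++levels;
    }
    return levels;
}
```

`tree_levels(32)` returns 5; going to 33 nodes needs a sixth level.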

Message Passing
• Broadcast
[Figure: processors I, J, K, K+1, L]
Broadcast from the Ith processor to all other processors

Message Passing
• Gather
[Figure: processors I, J, K, K+1, L]
Gather from all other processors to the Ith processor

Message Passing
• Exchange, based on user topology: Ring or Linear
[Figure: processors I, J, K, K+1, L]
Based on connection topology processors exchange information
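For an exchange, each processor first needs to know who its neighbors are. The sketch below shows the neighbor arithmetic for the two topologies named above; the function names and the `NO_NEIGHBOR` marker are hypothetical. In a ring the ends wrap around, while in a linear chain the end processors have no neighbor on one side.

```c
#include <assert.h>

#define NO_NEIGHBOR (-1)  /* hypothetical marker for "nobody on this side" */

/* Ring: rank nprocs-1 wraps around to rank 0 and vice versa. */
int ring_right(int rank, int nprocs) {
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    return (rank + 1) % nprocs;
}

int ring_left(int rank, int nprocs) {
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    return (rank - 1 + nprocs) % nprocs;
}

/* Linear chain: the two end ranks have only one neighbor. */
int linear_right(int rank, int nprocs) {
    return rank + 1 < nprocs ? rank + 1 : NO_NEIGHBOR;
}

int linear_left(int rank, int nprocs) {
    (void)nprocs;  /* unused; kept so all four helpers share one signature */
    return rank > 0 ? rank - 1 : NO_NEIGHBOR;
}
```

MPI defines `MPI_PROC_NULL` for the same purpose as `NO_NEIGHBOR`: a send or receive addressed to it completes immediately without transferring data.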

Just what is a message?
To: You@Address
From: Me@Address
Message Content

Just what is a message?
To: You@Address:Attn Payroll
From: Me@Address:Attn Payroll
Message Content

Message Structure
• To: Address(Rank)
• Content (starting array/vector/word address and length)
• Tag
• Data Type
• Error Flag
• Communicator
We know who we are, so From: Address(Rank) is implicit!
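The fields listed above can be pictured as one record. This is an illustrative sketch only: `Envelope`, `TypeCode`, and the helper functions are hypothetical names, and real MPI passes these items as separate arguments rather than as a struct. In the C binding they appear as `MPI_Send(buf, count, datatype, dest, tag, comm)`, with the error flag as the return value; in Fortran the error flag is a final argument.

```c
#include <stddef.h>

/* Hypothetical type codes standing in for MPI datatypes. */
typedef enum { TYPE_INT, TYPE_FLOAT, TYPE_DOUBLE } TypeCode;

/* Hypothetical record holding the parts of a message. */
typedef struct {
    int         dest;      /* To: rank of the receiver */
    const void *buf;       /* Content: starting address */
    int         count;     /* Content: length, in elements */
    int         tag;       /* Tag: lets the receiver tell messages apart */
    TypeCode    datatype;  /* Data Type of every element */
    int         comm;      /* Communicator id (an int here for illustration) */
} Envelope;

/* Size in bytes of one element of the given type on this machine. */
size_t type_size(TypeCode t) {
    switch (t) {
    case TYPE_INT:    return sizeof(int);
    case TYPE_FLOAT:  return sizeof(float);
    case TYPE_DOUBLE: return sizeof(double);
    }
    return 0;
}

/* Total payload size described by the envelope. */
size_t payload_bytes(const Envelope *e) {
    return (size_t)e->count * type_size(e->datatype);
}
```

There is no "From" field: the sender's rank is known from the process that makes the call.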

Messaging
• For every SEND we must have a RECEIVE!
• The transmission is one-way: the receiver agrees to allow the sender to put the data into a memory location in the receiver process.

Message Passing
• The interconnection topology is called a communicator; one (MPI_COMM_WORLD) is predefined at startup
• However, the user can define his own topology, and should as needed
• This is a problem-dependent communicator; more than one can be defined as needed

Program Structure
Processor Rank 0:  Input -> Loops -> Output
Processor Rank 1:  Input -> Loops -> Output
Processor Rank 2:  Input -> Loops -> Output
Sync-Barriers

MPI Send - Receive
• Send: Processor K, Count cells
• Receive: Processor L, buffer Length ≥ Count
• Each cell holds one MPI_Data_Type
• The MPI_Data_Type must be the same on both sides!
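The two rules on this slide, matching data types and a receive length at least as large as the send count, can be written as a single check. A minimal sketch with hypothetical names (`TypeCode`, `receive_fits`); it models the rule, not MPI's actual matching logic.

```c
/* Hypothetical type codes standing in for MPI datatypes. */
typedef enum { TYPE_INT, TYPE_FLOAT, TYPE_DOUBLE } TypeCode;

/* Returns 1 when a receive buffer of recv_len cells of recv_type can
   accept send_count cells of send_type, and 0 otherwise. */
int receive_fits(int send_count, TypeCode send_type,
                 int recv_len, TypeCode recv_type) {
    return send_type == recv_type && recv_len >= send_count;
}
```

Sending 10 doubles into a 16-double buffer passes the check; sending them into an 8-double buffer, or into a buffer of floats, does not.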

MPI Data_Types
Type        Number of bytes
Float       4
Double      8
Integer     4?
Boolean     4
Character   1?
A bit of care is needed between Fortran and C data types

#define MPI_BYTE ...
#define MPI_PACKED ...
#define MPI_CHAR ...
#define MPI_SHORT ...
#define MPI_INT ...
#define MPI_LONG ...
#define MPI_FLOAT ...
#define MPI_DOUBLE ...
#define MPI_LONG_DOUBLE ...
#define MPI_UNSIGNED_CHAR ...

MPI Data_TYPE Issues
• Just what is a data type?
• How many bits?
• Big Endian versus Little Endian?
• Whatever is used must be consistent!
• Could type conversions be automatic or transparent?
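The byte-order question can be tested directly. The sketch below, with hypothetical helper names, detects the byte order of the machine it runs on and reverses the bytes of a 32-bit word, which is the conversion needed when two machines disagree.

```c
#include <stdint.h>

/* Returns 1 on a little-endian machine, 0 on a big-endian one:
   look at which byte of a 16-bit 1 is stored first in memory. */
int is_little_endian(void) {
    uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Reverse the byte order of a 32-bit word. */
uint32_t swap32(uint32_t x) {
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```

Applying `swap32` twice returns the original value, so the same routine converts in either direction.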

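[Figure: the Nx by Ny array split among processors P0, P1, P2, P3; one axis marked 1, Nx/2-1, Nx/2, Nx and the other 1, Ny/2-1, Ny/2, Ny; u(i-1,j) and u(i,j) lie on opposite sides of a processor boundary]
u(i,j) = u(i-1,j) + u(i,j)
How do the processors see the variables they don’t have?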

The address spaces are distinct and separated.
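One common answer, offered here as an assumption about where this question leads rather than as the deck's own method, is to give each processor a ghost cell that holds a copy of its neighbor's boundary value. The sketch below imitates two separate address spaces with two separate structs and uses a plain memory copy as a stand-in for the send/receive pair; all names and the size `NLOC` are hypothetical.

```c
#include <string.h>

#define NLOC 4  /* owned cells per process (hypothetical size) */

/* One process's private memory: cell 0 is a ghost copy of the left
   neighbor's last owned cell; cells 1..NLOC are owned. */
typedef struct { double u[NLOC + 1]; } LocalArray;

/* Stand-in for a send/receive pair: copy the left neighbor's last
   owned value into the right neighbor's ghost cell. */
void exchange_ghost(const LocalArray *left, LocalArray *right) {
    memcpy(&right->u[0], &left->u[NLOC], sizeof(double));
}

/* Update the first owned cell, u(i) = u(i-1) + u(i). The u(i-1)
   operand belongs to the neighbor and is read from the ghost cell. */
void update_first_owned(LocalArray *a) {
    a->u[1] = a->u[0] + a->u[1];
}
```

After the exchange, the update at the boundary uses only local memory, which is the point: the message moves the missing value into the address space that needs it.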
