
Parallel Computing/Programming using MPI

By R Phillip Bording, CSC, Inc., Center of Excellence in High Performance Computing. February 3, 2004.


Presentation Transcript


  1. Parallel Computing/Programming using MPI by R Phillip Bording CSC, Inc Center of Excellence In High Performance Computing February 3, 2004

  2. Table of Contents • Introduction • Program Structure • Defining Parallel Computing • Domain Decomposition

  3. Program Structure • How do we define parallel computing? • Running a program using more than one processor • Two interesting choices exist, among others • Single Program using Multiple machines, different Data -> SPMD • Multiple Programs using Multiple machines, different Data -> MPMD

  4. MPMD Parallel Computing • The MPMD model has multiple source codes • In this model each computer has a different code and different data • The user is responsible for the program structure – which behaves differently on each machine • Typically each machine passes data to the next one, a daisy chain

  5. MPMD Parallel Computing [Figure: Program 0, Program 1, Program 2] Multiple Programs - Multiple Data

  6. SPMD Parallel Computing • The SPMD model has a single source code • In the cluster model each computer has an identical copy of this code • The user is responsible for the program structure – which can behave differently on each machine

  7. SPMD Parallel Computing • Other versions of the SPMD model exist • In the cluster model each processor has a unique id number, called rank. • The programmer can test for rank and modify the program function as needed. Single Program – Multiple Data

  8. Rank and Structure • Each processor has a name or rank • Rank = name = number • Identification • Processor organization • Structure = Communicator • The communication chain • How the problem communicates defines the needed structure!

  9. Processor Rank • 16 Processors – Rank 0 to 15 [Figure: processors labeled by rank; the labels 15, 0, 6, 1 and 5 are legible] • Realize that the rank position is relative and could be structured differently as needed.

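  10. Processor Rank with Structure • 16 Processors – Rank 0 to 15 [Figure: the processors labeled by rank (15, 0, 6, 1, 5 legible), next to the same ranks arranged as a 4 x 4 grid with rows 0 1 2 3, 4 5 6 7, 8 9 10 11 and 12 13 14 15] • Realize that the rank position is relative and could be structured differently as needed.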
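The 4 x 4 arrangement above is just arithmetic on the rank. The following is a minimal sketch in plain Python (no MPI library involved), assuming the row-major numbering shown in the grid; the function names are hypothetical and chosen for illustration only. In an actual MPI program the Cartesian topology routines (for example MPI_Cart_create and MPI_Cart_coords) provide this kind of mapping.

```python
def rank_to_coords(rank, ncols=4):
    """Map a rank to (row, col) in a row-major grid with ncols columns."""
    return divmod(rank, ncols)


def coords_to_rank(row, col, ncols=4):
    """Inverse mapping: (row, col) back to the rank."""
    return row * ncols + col


def grid_neighbors(rank, nrows=4, ncols=4):
    """Return the ranks directly above, below, left and right of `rank`.

    Neighbors that would fall outside the grid are simply omitted
    (no wrap-around is assumed here).
    """
    row, col = rank_to_coords(rank, ncols)
    candidates = {
        "up": (row - 1, col),
        "down": (row + 1, col),
        "left": (row, col - 1),
        "right": (row, col + 1),
    }
    return {
        name: coords_to_rank(r, c, ncols)
        for name, (r, c) in candidates.items()
        if 0 <= r < nrows and 0 <= c < ncols
    }
```

For example, rank 5 sits at row 1, column 1, and its four neighbors are ranks 1, 9, 4 and 6. The same 16 ranks could equally be treated as a ring or a line; only the mapping changes.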

  11. SPMD Parallel Computing • Simple Code Example - almost correct (the standard routine names are MPI_COMM_RANK, which also takes a communicator argument, and MPI_FINALIZE)

          Integer Rank
          Call MPI_INIT(return_code)
          Dimension Psi(0:100)
          Call MPI_Rank(Rank,return_code)
          Write(6,*) Rank
          Do i=0,Rank
             Psi(i) = i
          Enddo
          Write(6,*) (Psi(i),i=0,Rank)
          Call MPI_finish(return_code)
          End

  12. SPMD Parallel Computing • Simple Code Example - almost correct • Assuming four parallel processes • The output looks like this • 0 • 0.0 • 2 • 0.0,1.0,2.0 • 3 • 0.0,1.0,2.0,3.0 • 1 • 0.0,1.0 • MPI does not define the order in which output from different processes appears in the output stream
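The SPMD idea, one program whose behavior depends only on the rank, can be imitated without MPI. The sketch below is plain Python and only simulates the four processes in a loop; `spmd_program`, `run_all` and the `order` parameter are hypothetical names. The `order` argument stands in for the fact that the arrival order of output is not defined.

```python
def spmd_program(rank):
    """The single program every process runs; only `rank` differs."""
    # Mirrors "Do i=0,Rank: Psi(i) = i" from the Fortran slide.
    psi = [float(i) for i in range(rank + 1)]
    # Mirrors the two Write statements: first the rank, then Psi(0..Rank).
    return [str(rank), ",".join(str(x) for x in psi)]


def run_all(nprocs, order=None):
    """Collect the output of all simulated processes.

    `order` is the sequence in which the processes happen to print;
    by default they print in rank order, but nothing guarantees that.
    """
    if order is None:
        order = list(range(nprocs))
    lines = []
    for rank in order:
        lines.extend(spmd_program(rank))
    return lines
```

Calling `run_all(4, order=[0, 2, 3, 1])` reproduces the interleaving shown on the slide; any other permutation of the ranks is an equally valid outcome.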

  13. SPMD Parallel Computing • We’ll get back to MPI coding after we figure out how we are going to do the domain decomposition. [Figure: The Omega Domain Ω, with subdomains Ω0, Ω1, Ω2]

  14. Domain Decomposition • Subdivision of problem domain into parallel regions • Example using 2 dimensional data arrays • Linear One Dimension versus • Grid of Two Dimensions

  15. Single Processor Memory Arrays, Nx by Ny Dimension Array (Nx,Ny)

  16. Multiple Processor Memory Arrays, Nx/2 by Ny/2 4 Processors Two way decomposition

  17. Multiple Processor Memory Arrays, Nx by Ny/3 3 Processors One way decomposition

  18. Multiple Processor Memory Arrays, Nx/3 by Ny 3 Processors One way decomposition – the other way
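The three decompositions above differ only in how many pieces each dimension is cut into. The following sketch, in plain Python with hypothetical helper names, computes the local array shape for one processor, assuming a block decomposition in which any remainder is spread over the first few pieces and the ranks are laid out row-major on a px by py processor grid.

```python
def block_range(n, nparts, part):
    """Half-open index range [start, stop) of piece `part` when n
    indices are split into `nparts` nearly equal contiguous blocks."""
    base, extra = divmod(n, nparts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def local_shape(nx, ny, px, py, rank):
    """Shape of the sub-array owned by `rank` on a px-by-py processor grid."""
    ix, iy = divmod(rank, py)  # position of this rank in the processor grid
    x0, x1 = block_range(nx, px, ix)
    y0, y1 = block_range(ny, py, iy)
    return (x1 - x0, y1 - y0)
```

With Nx = Ny = 12, a 2 x 2 grid gives every processor a 6 x 6 block, `px=1, py=3` gives 12 x 4 (the Nx by Ny/3 case) and `px=3, py=1` gives 4 x 12 (the Nx/3 by Ny case).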

  19. So which one is better? Or does it make a difference?? One way decomposition – one way or the other?

  20. Dimension Array (Nx,Ny) becomes Dimension Array (Nx/3,Ny) or Dimension Array (Nx,Ny/3). In Fortran, the Nx/3 version has shorter do loop lengths in the fastest moving index, which could limit performance. Further, the sharing of data via message passing will have non-unit stride data access patterns.
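To see where a non-unit stride can come from: Fortran stores arrays column-major, so the first index moves fastest in memory. The sketch below is plain Python with hypothetical function names and uses 0-based offsets for simplicity. It computes the memory stride of a boundary line of the local array, on the assumption that the data exchanged with a neighbor is a line along the cut.

```python
def column_major_offset(i, j, n1):
    """Memory offset of element (i, j), 0-based, in a column-major
    array whose first dimension has length n1."""
    return i + j * n1


def boundary_stride(n1, n2, fixed_index):
    """Stride between consecutive elements of a boundary line.

    fixed_index == "first":  the line Array(i0, :), needed when the
                             first dimension is the one that was cut.
    fixed_index == "second": the line Array(:, j0), needed when the
                             second dimension is the one that was cut.
    """
    if fixed_index == "first":
        offsets = [column_major_offset(0, j, n1) for j in range(n2)]
    else:
        offsets = [column_major_offset(i, 0, n1) for i in range(n1)]
    return offsets[1] - offsets[0]
```

Taking Nx = 12 and Ny = 9 as example sizes, the local array (Nx/3, Ny) = (4, 9) has a boundary line with stride 4, while the local array (Nx, Ny/3) = (12, 3) has a contiguous boundary line with stride 1.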

  21. So the design issue becomes one of choice for the programming language • Decide which language you need to use, and then • Create the decomposition plan

  22. Realize of course that a one-dimensional decomposition has only Np_one processors, and a two-dimensional decomposition could have Np_two x Np_two. So in the design of your parallel code you have to be aware of your resources. Further, few if any programs scale well, and being realistic about the number of processors to be used is important in deciding how hard you want to work at the parallelization effort.

  23. Processor Interconnections • Communication hardware connects the processors • These wires carry data and address information • The best interconnection is the most expensive -- all machines have a direct connection to all other machines • Because of cost we have to compromise

  24. Processor Interconnections • The slowest is also the cheapest • We could just let each machine connect to some other machine in a daisy chain fashion. • Messages would bump along until they reach their destination. • What other schemes are possible?

  25. Processor Interconnections • The Linear Daisy Chain • The Binary tree • The Fat Tree • The FLAT network • The Hypercube • The Torus • The Ring • The Cross Bar • And many, many others

  26. The Linear Daisy Chain [Figure: Processor 0, Processor 1, Processor 2 connected in a line]

  27. The Cross Bar [Figure: a switch connecting Processor 0, Processor 1, Processor 2 to Processor 0, Processor 1, Processor 2] • O(1), but the switch is O(n^2) • The fastest and most expensive

  28. Let's Look at the Binary Tree • O(log N)

  29. Let's Look at the Fat Tree • O(log N)+

  30. Let's Look at the Hypercube [Figure: hypercubes of Order 1, Order 2 and Order 3] • Duplicate and connect the edges together
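The "duplicate and connect" construction can be written down directly. The sketch below is plain Python with hypothetical function names. It builds the links of a hypercube of a given order by repeatedly copying the current cube and linking every node to its copy, which is one reasonable reading of the slide. It also shows the equivalent shortcut: the neighbors of a node are the ranks that differ from it in exactly one bit.

```python
def hypercube_edges(order):
    """Build the set of links of a hypercube by duplicate-and-connect."""
    edges = set()
    n = 1  # number of nodes in the current cube (order 0 is a single node)
    for _ in range(order):
        duplicate = {(a + n, b + n) for a, b in edges}     # copy of the cube
        links = {(node, node + n) for node in range(n)}    # node <-> its copy
        edges = edges | duplicate | links
        n *= 2
    return edges


def hypercube_neighbors(rank, order):
    """Neighbors of `rank`: flip each of the `order` address bits once."""
    return sorted(rank ^ (1 << bit) for bit in range(order))
```

An order-3 hypercube built this way has 8 nodes and 12 links, and node 5 (binary 101) is connected to nodes 1, 4 and 7.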

  31. Let's Look at the Binary Tree • Every node can reach every other node • Has Log N connections, 32 nodes have 5 levels • Some neighbors are far apart • A little more expensive • Root is a bottleneck

  32. Let's Look at the Fat Tree • Every node can reach every other node • Has Log N connections, 32 nodes have 5 levels • Some neighbors are far apart • Even more expensive • Root bottleneck is better managed • Each level has multiple connections
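A rough way to compare the interconnection schemes is the worst-case number of hops a message needs. The sketch below is plain Python with a hypothetical function name. It uses the standard hop counts for idealized topologies and assumes, for the tree, that the N processors are the leaves of a complete binary tree, so 32 processors give the 5 levels mentioned above. It ignores link bandwidth and contention, which is exactly where the fat tree improves on the binary tree.

```python
def worst_case_hops(topology, n):
    """Largest number of links a message crosses between two of n processors."""
    if topology == "chain":        # linear daisy chain: end to end
        return n - 1
    if topology == "ring":         # go the shorter way around
        return n // 2
    if topology == "hypercube":    # one hop per differing address bit
        return n.bit_length() - 1  # log2(n) for n a power of two
    if topology == "tree":         # up to the root and down again
        levels = (n - 1).bit_length()  # ceil(log2(n))
        return 2 * levels
    if topology == "crossbar":     # direct connection through the switch
        return 1
    raise ValueError(f"unknown topology: {topology}")
```

For 32 processors this gives 31 hops for the chain, 16 for the ring, 10 for the tree, 5 for the hypercube and 1 for the crossbar.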

  33. Message Passing • Broadcast [Figure: processors I, J, K, K+1 and L] • Broadcast from the Ith processor to all other processors

  34. Message Passing • Gather [Figure: processors I, J, K, K+1 and L] • Gather from all other processors to the Ith processor

  35. Message Passing • Exchange – Based on User Topology • Ring or Linear [Figure: processors I, J, K, K+1 and L] • Based on connection topology, processors exchange information
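The three patterns above can be described by what each rank holds before and after the operation. The sketch below is plain Python that simulates all ranks in one list (index = rank); the function names are hypothetical, and it shows only the data movement, not how a library implements it. In MPI itself, broadcast and gather correspond to the collective calls MPI_Bcast and MPI_Gather.

```python
def broadcast(values, root):
    """After a broadcast every rank holds the root's value."""
    return [values[root] for _ in values]


def gather(values, root):
    """After a gather the root holds everyone's value, in rank order.
    The other ranks receive nothing (None here)."""
    return [list(values) if rank == root else None
            for rank in range(len(values))]


def ring_exchange(values):
    """Each rank sends its value to rank+1 (wrapping around) and
    receives the value of rank-1."""
    n = len(values)
    return [values[(rank - 1) % n] for rank in range(n)]
```

For a linear rather than ring topology the wrap-around would be dropped, so rank 0 would receive nothing from the left.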

  36. Just what is a message? Message Content To: You@Address From: Me@Address

  37. Just what is a message? Message Content To: You@Address:Attn Payroll From: Me@Address:Attn Payroll

  38. Message Structure • To: Address(Rank) • Content (starting array/vector/word address and length) • Tag • Data Type • Error Flag • Communicator • We know who we are, so From: Address(Rank) is implicit!

  39. Messaging • For every SEND we must have a RECEIVE! • The transmission is one-sided: the receiver agrees to allow the sender to put the data into a memory location in the receiver process.

  40. Message Passing • The interconnection topology is called a communicator; one is predefined at startup • However, the user can define their own topology, and should as needed • A problem-dependent communicator: more than one can be defined as needed

  41. Program Structure [Figure: Processor Rank 0, Processor Rank 1 and Processor Rank 2 each run Input, Loops, Output; Sync-Barriers are marked]

  42. MPI Send – Receive • Send: Processor K, Count • Receive: Processor L, Length ≥ Count • Each cell holds one MPI_Data_Type • MPI_Data_Type must be the same on both sides!
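The matching rules for a send and its receive can be collected into a small model. The sketch below is plain Python, not an MPI binding; the `Message` class, the `receive` function and the string data type names are all hypothetical. It encodes the points made on the slides: a message carries destination, tag, data type and communicator, the receive buffer must be at least as long as the count sent, and the data types must agree.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    dest: int                 # To: rank of the receiver
    source: int               # From: implicit for the sender, it knows its rank
    tag: int                  # lets the receiver tell messages apart
    datatype: str             # stands in for an MPI data type
    data: list = field(default_factory=list)
    comm: str = "world"       # stands in for the communicator


def receive(msg, my_rank, tag, datatype, buffer, comm="world"):
    """Try to match `msg` against a posted receive.

    Returns False if the envelope does not match this receive.
    Raises if it matches but the type or buffer length is wrong.
    On success the first len(msg.data) cells of `buffer` are filled.
    """
    if msg.dest != my_rank or msg.tag != tag or msg.comm != comm:
        return False
    if msg.datatype != datatype:
        raise TypeError("data types of send and receive differ")
    count = len(msg.data)
    if len(buffer) < count:
        raise ValueError("receive buffer shorter than message")
    buffer[:count] = msg.data
    return True
```

A receive buffer of five cells accepts a message of three values and leaves the last two cells untouched; a buffer of one cell is rejected.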

  43. MPI Data_Types • Type (number of bytes): Float (4), Double (8), Integer (4?), Boolean (4), Character (1?) • A bit of care is needed between Fortran and C data types

  44. MPI data type constants (C):

          #define MPI_BYTE           ...
          #define MPI_PACKED         ...
          #define MPI_CHAR           ...
          #define MPI_SHORT          ...
          #define MPI_INT            ...
          #define MPI_LONG           ...
          #define MPI_FLOAT          ...
          #define MPI_DOUBLE         ...
          #define MPI_LONG_DOUBLE    ...
          #define MPI_UNSIGNED_CHAR  ...

  45. MPI Data_TYPE Issues • Just what is a data type? • How many bits? • Big Endian versus Little Endian? • Whatever is used must be consistent! • Could type conversions be automatic or transparent?
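Why consistency matters can be shown with the raw bytes of a single integer. The sketch below uses Python's standard `struct` module and has nothing to do with how MPI represents data internally; the function name `to_bytes` is hypothetical. It packs a 4-byte integer in both byte orders and shows what happens when the reader assumes the wrong one.

```python
import struct


def to_bytes(value, byte_order):
    """Pack a 4-byte signed integer as big-endian or little-endian bytes."""
    fmt = {"big": ">i", "little": "<i"}[byte_order]
    return struct.pack(fmt, value)


# The same value, two different byte sequences in memory.
big = to_bytes(1, "big")        # most significant byte first
little = to_bytes(1, "little")  # least significant byte first

# Reading big-endian bytes as if they were little-endian gives a wrong value.
misread = struct.unpack("<i", big)[0]
```

Here the value 1 written big-endian and read little-endian comes back as 16777216, which is why sender and receiver must agree on the representation or have the conversion done for them.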

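  46. [Figure: the Nx by Ny array divided among processors P0, P1, P2 and P3, with index ranges 1 to Nx/2-1 and Nx/2 to Nx in one direction and 1 to Ny/2-1 and Ny/2 to Ny in the other; the update u(i,j) = u(i-1,j)+u(i,j) is shown with u(i-1,j) and u(i,j) marked] • How do the processors see the variables they don’t have?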

  47. The address spaces are distinct and separated.
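Because the address spaces are separate, a value such as u(i-1,j) that lives on a neighboring processor has to arrive as a message before the update can run. The sketch below is a one-dimensional, plain Python simulation with hypothetical function names. It assumes the update uses the old values on the right-hand side and that the value to the left of the first point is 0.0. Each simulated rank first receives its left neighbor's last value into a "ghost" cell and then updates using only local data.

```python
def serial_update(u):
    """Reference: u_new[i] = u[i-1] + u[i] on a single processor
    (the value left of the first point is taken to be 0.0)."""
    return [u[i] + (u[i - 1] if i > 0 else 0.0) for i in range(len(u))]


def distributed_update(chunks):
    """Same update when u is split into one chunk per rank.

    Step 1 (message passing): every rank except the first receives the
    last element of its left neighbor as a ghost value.
    Step 2 (local work): every rank updates its own chunk using only
    its own data plus the ghost value.
    """
    ghosts = [0.0] + [chunk[-1] for chunk in chunks[:-1]]
    result = []
    for ghost, chunk in zip(ghosts, chunks):
        padded = [ghost] + chunk
        result.append([padded[k] + padded[k + 1] for k in range(len(chunk))])
    return result
```

Splitting `[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]` into two chunks of three and concatenating the distributed result gives the same values as the serial update; without the ghost value, the first element on the second rank would be wrong.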
