
Parallel Programming using MPI


Presentation Transcript


  1. Parallel Programming using MPI, by Phil Bording, Husky Energy Chair in Oil and Gas Research, Memorial University of Newfoundland. Parallel Computing - MPI/OpenMP

  2. Table of Contents
     • Introduction
     • Program Structure
     • Defining Parallel Computing
     • Domain Decomposition

  3. Rank and Structure
     • Each processor has a name or rank
     • Rank = name = number
     • Identification
     • Processor organization
     • Structure = Communicator
     • The communication chain
     How the problem communicates defines the needed structure!

  4. Processor Rank
     • 16 Processors: Rank 0 to 15
     [Figure: diagram of processor ranks; visible labels 15, 0, 6, 1, 5]
     Realize that the rank position is relative and could be structured differently as needed.

  5. Processor Rank with Structure
     • 16 Processors: Rank 0 to 15
     [Figure: diagram of processor ranks (visible labels 15, 0, 6, 1, 5) beside the same ranks arranged as a 4 by 4 grid]
          0   1   2   3
          4   5   6   7
          8   9  10  11
         12  13  14  15
     Realize that the rank position is relative and could be structured differently as needed.
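
The 4 by 4 arrangement above can be read as a mapping between a rank and a grid position. A minimal plain-Python sketch, assuming the row-by-row numbering shown in the grid; the function names `grid_coords` and `grid_rank` are hypothetical and are not MPI calls:

```python
def grid_coords(rank, ncols=4):
    """Return (row, col) of a rank, assuming ranks are numbered row by row."""
    return divmod(rank, ncols)


def grid_rank(row, col, ncols=4):
    """Inverse mapping: the rank that sits at (row, col)."""
    return row * ncols + col


if __name__ == "__main__":
    for rank in (0, 6, 15):
        print(rank, grid_coords(rank))
```

A different structure (for example a 2 by 8 grid) only changes `ncols`; the ranks themselves stay the same.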

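  6. SPMD Parallel Computing
     • Simple Code Example - almost correct
           Program spmd
           Include 'mpif.h'
     ! Declarations must come before the first executable statement
           Integer Rank, return_code, i
           Real Psi(0:100)
           Call MPI_INIT(return_code)
     ! Every process asks for its own rank in the default communicator
           Call MPI_COMM_RANK(MPI_COMM_WORLD, Rank, return_code)
           Write(6,*) Rank
           Do i = 0, Rank
              Psi(i) = i
           Enddo
           Write(6,*) (Psi(i), i = 0, Rank)
           Call MPI_FINALIZE(return_code)
           End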

  7. SPMD Parallel Computing
     • Simple Code Example - almost correct
     • Assuming four parallel processes
     • The output looks like this:
         0
         0.0
         2
         0.0,1.0,2.0
         3
         0.0,1.0,2.0,3.0
         1
         0.0,1.0
     MPI has no standard for the sequence of appearance in output streams.
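
The unordered output can be imitated without MPI. The following is a plain-Python sketch, not the slide's program: `rank_output` and `interleaved_output` are hypothetical names, and a random shuffle stands in for whatever order the processes happen to reach the output stream.

```python
import random


def rank_output(rank):
    """The two lines one process writes: its rank, then Psi(0..rank)."""
    psi = [float(i) for i in range(rank + 1)]
    return [str(rank), ",".join(str(x) for x in psi)]


def interleaved_output(nprocs, seed=None):
    """All processes' lines, in an arbitrary process order (simulated by a shuffle)."""
    order = list(range(nprocs))
    random.Random(seed).shuffle(order)
    return [line for r in order for line in rank_output(r)]


if __name__ == "__main__":
    print("\n".join(interleaved_output(4)))
```

Each process always produces the same two lines; only the order in which the processes appear changes from run to run.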

  8. SPMD Parallel Computing
     We’ll get back to MPI coding after we figure out how we are going to do the domain decomposition.
     [Figure: the Omega domain Ω with subdomains Ω0, Ω1, Ω2]

  9. Discussion Time

  10. Domain Decomposition
     • Subdivision of problem domain into parallel regions
     • Example using 2 dimensional data arrays
     • Linear One Dimension versus
     • Grid of Two Dimensions

  11. Single Processor Memory Arrays, Nx by Ny
     Dimension Array (Nx,Ny)

  12. Multiple Processor Memory Arrays, Nx/2 by Ny/2
     4 Processors
     Two way decomposition

  13. Multiple Processor Memory Arrays, Nx by Ny/3
     3 Processors
     One way decomposition

  14. Multiple Processor Memory Arrays, Nx/3 by Ny
     3 Processors
     One way decomposition, the other way
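
The three decompositions above differ only in how many pieces each dimension is cut into. A small plain-Python sketch of the resulting local array shapes, assuming the processor counts divide Nx and Ny evenly; `local_shape` and its parameters `px`, `py` are hypothetical names:

```python
def local_shape(nx, ny, px, py):
    """Shape of each processor's block when Nx is cut into px pieces and Ny into py."""
    if nx % px or ny % py:
        raise ValueError("this sketch assumes px divides nx and py divides ny")
    return nx // px, ny // py


if __name__ == "__main__":
    nx = ny = 12
    print("two way, 4 processors:", local_shape(nx, ny, 2, 2))   # Nx/2 by Ny/2
    print("one way, 3 processors:", local_shape(nx, ny, 1, 3))   # Nx by Ny/3
    print("other way, 3 processors:", local_shape(nx, ny, 3, 1)) # Nx/3 by Ny
```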

  15. So which one is better? Or does it make a difference?
     One way decomposition: one way or the other?

  16. Dimension Array (Nx,Ny) becomes
     Dimension Array (Nx/3,Ny) or
     Dimension Array (Nx,Ny/3)
     In Fortran, the Nx/3 choice gives shorter do-loop lengths in the fastest-moving index, which could limit performance. Further, the data shared via message passing will then have non-unit-stride access patterns.
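
The stride remark follows from Fortran's column-major storage, where the first index moves fastest in memory. A plain-Python sketch of the offsets, assuming 1-based indices as in Fortran; `offset`, `stride_along_i`, and `stride_along_j` are hypothetical helper names:

```python
def offset(i, j, nx):
    """0-based memory offset of element (i, j) of a column-major Array(nx, ny)."""
    return (i - 1) + (j - 1) * nx


def stride_along_i(nx):
    """Distance in memory between (i, j) and (i+1, j)."""
    return offset(2, 1, nx) - offset(1, 1, nx)


def stride_along_j(nx):
    """Distance in memory between (i, j) and (i, j+1)."""
    return offset(1, 2, nx) - offset(1, 1, nx)


if __name__ == "__main__":
    nx = 12
    # Array(Nx, Ny/3): the shared boundary is a column (fixed j), walked along i.
    print("boundary stride, split in j:", stride_along_i(nx))
    # Array(Nx/3, Ny): the shared boundary is a row (fixed i), walked along j.
    print("boundary stride, split in i:", stride_along_j(nx // 3))
```

With the first index split, every boundary element is a full local column length apart, which is the non-unit stride the slide refers to.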

  17. So the design issue depends on the choice of programming language.
     Decide which language you need to use, and then create the decomposition plan.

  18. Realize, of course, that a one-dimensional decomposition has only Np_one processors, and a two-dimensional decomposition could have Np_two x Np_two. So in the design of your parallel code you have to be aware of your resources. Further, few if any programs scale well, and being realistic about the number of processors to be used is important in deciding how hard you want to work at the parallelization effort.

  19. Discussion Time

  20. Processor Interconnections
     • Communication hardware connects the processors
     • These wires carry data and address information
     • The best interconnection is the most expensive: all machines have a direct connection to all other machines
     • Because of cost we have to compromise

  21. Processor Interconnections
     • The slowest is also the cheapest
     • We could just let each machine connect to some other machine in a daisy chain fashion.
     • Messages would bump along until they reach their destination.
     • What other schemes are possible?

  22. Processor Interconnections
     • The Linear Daisy Chain
     • The Binary tree
     • The Fat Tree
     • The FLAT network
     • The Hypercube
     • The Torus
     • The Ring
     • The Cross Bar
     • And many, many others

  23. The Linear Daisy Chain
     [Figure: Processor 0, Processor 1, Processor 2 connected in a line]
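
How far a message "bumps along" in a daisy chain can be counted directly. A plain-Python sketch, assuming processors are numbered in the order they are wired; `chain_hops` and `ring_hops` are hypothetical names, and the ring is included only for comparison with the list of interconnections:

```python
def chain_hops(src, dst):
    """Links crossed between two processors on a linear daisy chain."""
    return abs(src - dst)


def ring_hops(src, dst, nprocs):
    """Links crossed on a ring, where the two ends are also connected."""
    d = abs(src - dst)
    return min(d, nprocs - d)


if __name__ == "__main__":
    n = 16
    print("chain, 0 to 15:", chain_hops(0, 15))
    print("ring,  0 to 15:", ring_hops(0, 15, n))
```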

  24. The Cross Bar
     [Figure: Processor 0, Processor 1, Processor 2 on each side of the switch]
     O(1), but the switch is O(n^2)
     The fastest and most expensive

  25. Let's Look at the Binary Tree: O(log N)

  26. Let's Look at the Fat Tree: O(log N)+

  27. Let's Look at the Hypercube
     [Figure: hypercubes of Order 1, Order 2, Order 3]
     Duplicate and connect the edges together
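
One common way to label a hypercube, assumed here rather than taken from the slide, is to give each node a binary number so that neighbors differ in exactly one bit. A plain-Python sketch with hypothetical names `hypercube_neighbors` and `hypercube_hops`:

```python
def hypercube_neighbors(node, order):
    """Neighbors of a node in a hypercube of the given order: flip one bit each."""
    return [node ^ (1 << k) for k in range(order)]


def hypercube_hops(a, b):
    """Shortest path length: the number of bits in which the two labels differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    print(hypercube_neighbors(0, 3))
    print(hypercube_hops(0, 7))
```

Going from order k to order k+1 duplicates the cube and links each node to its copy, which in this labeling is the node with the new top bit set.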

  28. Let's Look at the Binary Tree
     • Every node can reach every other node
     • Has Log N connections, 32 nodes have 5 levels
     • Some neighbors are far apart
     • A little more expensive
     • Root is a bottleneck
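
The level count and the "far apart neighbors" remark can be checked with a little arithmetic. This plain-Python sketch assumes the processors are the leaves of a complete binary tree, numbered left to right; that layout and the names `tree_levels` and `leaf_hops` are assumptions, not something the slide specifies:

```python
def tree_levels(n_leaves):
    """Levels of switches above n_leaves leaves in a complete binary tree."""
    return (n_leaves - 1).bit_length()


def leaf_hops(a, b):
    """Links crossed between two leaves: up to the common ancestor and back down."""
    return 2 * (a ^ b).bit_length()


if __name__ == "__main__":
    print("levels for 32 nodes:", tree_levels(32))
    print("leaves 0 and 1:", leaf_hops(0, 1))
    print("leaves 15 and 16:", leaf_hops(15, 16))
```

Leaves 15 and 16 sit side by side but only share the root, so their messages cross every level twice and compete for the root link.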

  29. Let's Look at the Fat Tree
     • Every node can reach every other node
     • Has Log N connections, 32 nodes have 5 levels
     • Some neighbors are far apart
     • Even more expensive
     • Root bottleneck is better managed
     • Each level has multiple connections

  30. Fortran MPI Commands
       INCLUDE 'mpif.h'
       MPI_INIT(ierr)
       MPI_COMM_SIZE(MPI_COMM_WORLD, p, ierr)
       MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)

  31. MPI Commands
       MPI_SCATTER(A, chunkA, MPI_REAL, A_local, chunkA, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
       MPI_SCATTER(b, chunkb, MPI_REAL, b_local, chunkb, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
       MPI_ALLGATHER(x_local, chunkb, MPI_REAL, x_new, chunkb, MPI_REAL, MPI_COMM_WORLD, ierr)
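
What these collectives do to the data can be imitated in plain Python. The sketch below is a single-process simulation, not MPI: `scatter` and `allgather` are hypothetical stand-ins in which a list of per-rank lists plays the role of the separate process memories.

```python
def scatter(sendbuf, chunk, nprocs):
    """Root's buffer cut into equal chunks; entry r is what rank r receives."""
    return [sendbuf[r * chunk:(r + 1) * chunk] for r in range(nprocs)]


def allgather(local_bufs):
    """Every rank ends up with all local buffers concatenated in rank order."""
    gathered = [x for buf in local_bufs for x in buf]
    return [list(gathered) for _ in local_bufs]


if __name__ == "__main__":
    b = [float(i) for i in range(8)]
    b_local = scatter(b, chunk=2, nprocs=4)
    print(b_local)
    x_new = allgather(b_local)
    print(x_new[0])
```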

  32. Message Passing
     • Broadcast
     [Figure: processors labeled I, J, K, K+1, L]
     Broadcast from the Ith processor to all other processors

  33. Message Passing
     • Broadcast - Scatter
     [Figure: processors labeled I, J, K, K+1, L]
     Chunk is of uniform length
     Scatter from the Ith processor to all other processors

  34. Message Passing
     • Gather
     [Figure: processors labeled I, J, K, K+1, L]
     Gather from all other processors to the Ith processor

  35. Message Passing
     • Broadcast - Gather
     [Figure: processors labeled I, J, K, K+1, L]
     Chunk is of uniform length
     Gather from the Ith processor to all other processors

  36. Message Passing
     • Exchange: Based on User Topology, Ring or Linear
     [Figure: processors labeled I, J, K, K+1, L]
     Based on connection topology, processors exchange information
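
An exchange on a ring or linear topology can be pictured as every rank passing one value to its right-hand neighbor. A plain-Python, single-process sketch with hypothetical names `ring_exchange` and `linear_exchange`; the list index stands in for the rank:

```python
def ring_exchange(values):
    """Each rank sends its value to rank+1 (wrapping) and receives from rank-1."""
    n = len(values)
    return [values[(r - 1) % n] for r in range(n)]


def linear_exchange(values):
    """Same shift on a linear chain: rank 0 has no left neighbor and receives nothing."""
    return [None] + list(values[:-1])


if __name__ == "__main__":
    print(ring_exchange([10, 20, 30]))
    print(linear_exchange([10, 20, 30]))
```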


  38. Just what is a message?
     Message Content
     To: You@Address
     From: Me@Address

  39. Just what is a message?
     Message Content
     To: You@Address:Attn Payroll
     From: Me@Address:Attn Payroll

  40. Message Structure
     • To: Address(Rank)
     • Content (starting array/vector/word address and length)
     • Tag
     • Data Type
     • Error Flag
     • Communicator
     We know who we are, so From: Address(Rank) is implicit!
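
The fields listed above can be collected into one record to see how they fit together. This is a plain-Python illustration only; the `Message` class and its field names are hypothetical, and the datatype and communicator are stored as plain strings rather than real MPI handles.

```python
from dataclasses import dataclass


@dataclass
class Message:
    dest: int                 # To: Address(Rank)
    buffer: list              # array that holds the content
    start: int                # starting position within the array
    count: int                # length, in elements
    datatype: str             # e.g. "MPI_REAL"
    tag: int                  # the "Attn:" line of the envelope
    communicator: str = "MPI_COMM_WORLD"
    # The error flag is not sent; it is what the call hands back to the caller.

    def payload(self):
        """The elements that would actually travel."""
        return self.buffer[self.start:self.start + self.count]


if __name__ == "__main__":
    m = Message(dest=1, buffer=[1.0, 2.0, 3.0, 4.0], start=1, count=2,
                datatype="MPI_REAL", tag=7)
    print(m.payload())
```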

  41. Messaging
     • For every SEND we must have a RECEIVE!
     • The transmission is one-sided: the receiver agrees to allow the sender to put the data into a memory location in the receiver process.

  42. Message Passing
     The interconnection topology is called a communicator, and one is predefined at startup.
     However, the user can define their own topology, and should as needed.
     A problem-dependent communicator can be defined; actually, more than one can be defined as needed.

  43. Program Structure
     Processor Rank 0   Processor Rank 1   Processor Rank 2
     Input              Input              Input
     Loops              Loops              Loops
     Output             Output             Output
     Sync-Barriers

  44. MPI Send - Receive
     Send: Processor K, Count
     Receive: Processor L, Length ≥ Count
     Each cell holds one MPI_Data_Type. The MPI_Data_Type must be the same on both sides!
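
The two conditions on this slide, matching data types and a receive buffer at least as long as the sent count, can be written as explicit checks. A plain-Python sketch; `receive` is a hypothetical function, and real MPI reports such problems through its own error handling rather than Python exceptions:

```python
def receive(recv_buffer, sent, send_type, recv_type):
    """Copy the sent cells into the front of recv_buffer and return the count."""
    if send_type != recv_type:
        raise TypeError("sender and receiver must use the same data type")
    if len(recv_buffer) < len(sent):
        raise ValueError("receive length must be >= send count")
    recv_buffer[:len(sent)] = sent
    return len(sent)


if __name__ == "__main__":
    buf = [0.0] * 5
    n = receive(buf, [1.0, 2.0, 3.0], "MPI_REAL", "MPI_REAL")
    print(n, buf)
```

The cells of the receive buffer beyond the sent count are left untouched.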

  45. MPI Data_Types
     Type        Number of bytes
     Float       4
     Double      8
     Integer     4?
     Boolean     4
     Character   1?
     A bit of care is needed between Fortran and C data types
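
Byte counts like these can be checked from Python's `struct` module. The sketch uses the module's standard-size mode (the `<` prefix), which fixes the sizes regardless of platform; it says nothing about what a particular Fortran or C compiler uses, which is where the question marks in the table come from. The dictionary name `STANDARD_SIZES` is hypothetical.

```python
import struct

# Format codes in standard-size mode: f = float, d = double, i = int, c = char
STANDARD_SIZES = {
    "float": struct.calcsize("<f"),
    "double": struct.calcsize("<d"),
    "int": struct.calcsize("<i"),
    "char": struct.calcsize("<c"),
}

if __name__ == "__main__":
    for name, size in STANDARD_SIZES.items():
        print(name, size)
```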

  46. #define MPI_BYTE           ...
      #define MPI_PACKED         ...
      #define MPI_CHAR           ...
      #define MPI_SHORT          ...
      #define MPI_INT            ...
      #define MPI_LONG           ...
      #define MPI_FLOAT          ...
      #define MPI_DOUBLE         ...
      #define MPI_LONG_DOUBLE    ...
      #define MPI_UNSIGNED_CHAR  ...


  48. MPI Data_TYPE Issues
     • Just what is a data type?
     • How many bits?
     • Big Endian versus Little Endian?
     • Whatever is used must be consistent!
     • Could type conversions be automatic or transparent?
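
The endian question is easy to see by packing the same integer both ways. A plain-Python sketch using the standard `struct` module; the helper names are hypothetical:

```python
import struct


def as_big_endian(value):
    """The 4 bytes of a 32-bit integer, most significant byte first."""
    return struct.pack(">i", value)


def as_little_endian(value):
    """The same integer, least significant byte first."""
    return struct.pack("<i", value)


def misread_big_as_little(value):
    """What a little-endian reader sees if handed big-endian bytes unconverted."""
    return struct.unpack("<i", as_big_endian(value))[0]


if __name__ == "__main__":
    print(as_big_endian(1))
    print(as_little_endian(1))
    print(misread_big_as_little(1))
```

The value 1 read with the wrong byte order becomes 16777216, which is why both sides must agree or convert.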

  49. The Grid for 2 by 2

  50. 2D
     [Figure: 2D grid with axis ticks 1, Nx/2-1, Nx/2, Nx and 1, Ny/2-1, Ny/2, Ny, divided among processors P0, P1, P2, P3, with the points u(i-1,j) and u(i,j) marked]
     u(i,j) = u(i-1,j) + u(i,j)
     How do the processors see the variables they don’t have?
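
One common answer, assumed here rather than stated on the slide, is that each processor keeps a copy of its neighbor's boundary value (often called a ghost or halo cell) and refreshes it by message. The plain-Python sketch below drops the j index and splits a 1D array in two; it also assumes the update reads the old values. The names `serial_update` and `decomposed_update` are hypothetical.

```python
def serial_update(u):
    """new u(i) = old u(i-1) + old u(i); the first point has no left neighbor."""
    return [u[0]] + [u[i - 1] + u[i] for i in range(1, len(u))]


def decomposed_update(u):
    """Same update with the array split between a left and a right processor."""
    half = len(u) // 2
    left, right = u[:half], u[half:]
    ghost = left[-1]  # the one value the right processor must be sent
    new_left = [left[0]] + [left[i - 1] + left[i] for i in range(1, len(left))]
    padded = [ghost] + right
    new_right = [padded[i - 1] + padded[i] for i in range(1, len(padded))]
    return new_left + new_right


if __name__ == "__main__":
    u = [1, 2, 3, 4, 5, 6]
    print(serial_update(u))
    print(decomposed_update(u))
```

Both versions give the same result, but only because the right-hand piece received the ghost value before updating its first point.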
