
Chapter 3



Presentation Transcript


  1. Chapter 3 Parallel Algorithm Design

  2. Outline • Task/channel model • Algorithm design methodology • Case studies

  3. Task/Channel Model • Parallel computation = set of tasks • Task • Program • Local memory • Collection of I/O ports • Tasks interact by sending messages through channels

  4. Task/Channel Model (figure: tasks connected by directed channels)
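The task/channel model maps directly onto message passing. Below is a minimal sketch in C with MPI (not part of the original slides; assumes an MPI implementation such as MPICH or Open MPI, built with mpicc and run with mpirun -np 2): each MPI process plays the role of a task, and a send/receive pair plays the role of a channel.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);                 /* each MPI process is one task */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                        /* producing task */
        int value = 42;                     /* lives in task-local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* channel 0 -> 1 */
    } else if (rank == 1) {                 /* consuming task */
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d over the channel\n", value);
    }
    MPI_Finalize();
    return 0;
}
```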

  5. Foster’s Design Methodology • Partitioning • Dividing the Problem into Tasks • Communication • Determine what needs to be communicated between the Tasks over Channels • Agglomeration • Group or Consolidate Tasks to improve efficiency or simplify the programming solution • Mapping • Assign tasks to the Computer Processors

  6. Foster’s Methodology

  7. Step 1: Partitioning – Divide Computation & Data into Pieces • Domain Decomposition – Data-Centric Approach • Divide up the most frequently used data • Associate the computations with the divided data • Functional Decomposition – Computation-Centric Approach • Divide up the computation • Associate the data with the divided computations • Primitive Tasks: the pieces resulting from either decomposition • The goal is to have as many of these as possible
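As a concrete illustration of domain decomposition (a sketch, not from the slides; the names are illustrative), the usual block formulas split an n-element array into p nearly equal contiguous pieces, one per primitive task:

```c
/* Block domain decomposition: primitive task i of p owns array elements
 * [block_low(i,p,n), block_high(i,p,n)). Floor division keeps all block
 * sizes within one element of each other, which aids load balancing. */
long block_low (int i, int p, long n) { return (long)i * n / p; }
long block_high(int i, int p, long n) { return (long)(i + 1) * n / p; }
```

Each task then performs the computations associated with its own block, per the data-centric rule above.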

  8. Example Domain Decompositions

  9. Example Functional Decomposition

  10. Partitioning Checklist • Lots of Tasks • e.g., at least 10× more primitive tasks than processors in the target computer • Minimize redundant computations and data • Load Balancing • Primitive tasks roughly the same size • Scalable • Number of tasks an increasing function of problem size

  11. Step 2: Communication – Determine Communication Patterns between Primitive Tasks • Local Communication • When a Task needs data from a small number of other Tasks • A channel is created from the producing Task to the consuming Task • Global Communication • When a Task needs data from many or all other Tasks • Channels for this type of communication are not created during this step
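Local communication typically becomes a boundary ("ghost") exchange between neighboring tasks. A sketch in MPI (assumptions: tasks form a 1-D chain and each owns a block of n doubles; the function and variable names are illustrative):

```c
#include <mpi.h>

/* Each task trades one boundary value with its left and right neighbors,
 * one channel per neighbor pair. MPI_PROC_NULL turns the exchanges at the
 * two ends of the chain into no-ops. */
void exchange_ghosts(const double *block, int n, int rank, int p,
                     double *left_ghost, double *right_ghost) {
    int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Sendrecv(&block[0], 1, MPI_DOUBLE, left, 0,      /* send left edge  */
                 right_ghost, 1, MPI_DOUBLE, right, 0,   /* recv from right */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&block[n - 1], 1, MPI_DOUBLE, right, 1, /* send right edge */
                 left_ghost, 1, MPI_DOUBLE, left, 1,     /* recv from left  */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```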

  12. Communication Checklist • Balanced • Communication operations balanced among tasks • Small degree • Each task communicates with only a small group of neighbors • Concurrency • Tasks can perform communications concurrently • Tasks can perform computations concurrently

  13. Step 3: Agglomeration – Group Tasks to Improve Efficiency or Simplify Programming • Increase Locality • Remove communication by agglomerating Tasks that communicate with one another • Combine groups of sending & receiving tasks • Send fewer, larger messages rather than many short messages, since each message incurs latency (see the cost sketch below) • Maintain Scalability of the Parallel Design • Be careful not to agglomerate Tasks so much that moving to a machine with more processors will not be possible • Reduce Software Engineering costs • Leveraging existing sequential code can reduce the expense of engineering a parallel algorithm
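The latency point can be made quantitative with the usual cost model (a sketch, not from the slides): if λ is the per-message latency and β the bandwidth, a message of n words costs λ + n/β. Then n separate one-word messages cost n(λ + 1/β), while a single agglomerated n-word message costs λ + n/β, a saving of (n − 1)λ.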

  14. Agglomeration Can Improve Performance • Eliminate communication between primitive tasks agglomerated into consolidated task • Combine groups of sending and receiving tasks

  15. Agglomeration Checklist • Locality of parallel algorithm has increased • Tradeoff between agglomeration and code modifications costs is reasonable • Agglomerated tasks have similar computational and communications costs • Number of tasks increases with problem size • Number of tasks suitable for likely target systems

  16. Step 4: Mapping – Assign Tasks to Processors • Maximize Processor Utilization • Ensure computation is evenly balanced across all processors • Minimize Interprocess Communication

  17. Optimal Mapping • Finding optimal mapping is NP-hard • Must rely on heuristics
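Two common mapping heuristics (a sketch, not from the slides; both are standard techniques): block mapping keeps neighboring tasks on the same processor (good locality), while cyclic mapping deals tasks out round-robin (good balance when task sizes vary):

```c
/* Map agglomerated task t to one of p processors.
 * Block mapping: contiguous runs of tasks share a processor.
 * Cyclic mapping: tasks are dealt out round-robin. */
int block_map (int t, int num_tasks, int p) { return (int)((long)t * p / num_tasks); }
int cyclic_map(int t, int p)                { return t % p; }
```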

  18. Decision Tree for Parallel Algorithm Design

  19. Mapping Goals • Mappings based on one task per processor and on multiple tasks per processor have been considered • Both static and dynamic allocation of tasks to processors have been evaluated • If dynamic allocation of tasks to processors is chosen, the task allocator should not be a bottleneck • If static allocation of tasks to processors is chosen, the ratio of tasks to processors should be at least 10 to 1

  20. Case Studies • Boundary value problem • Finding the maximum • The n-body problem • Adding data input

  21. Boundary Value Problem (figure: a rod surrounded by insulation, with ice water at both ends)

  22. Rod Cools as Time Progresses

  23. Finite Difference Approximation
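The slide's figure is not reproduced in this transcript. The approximation it presumably shows is the standard explicit update for the 1-D heat equation, with h the space step, k the time step, and u(i, j) the temperature at grid point i and time step j:

u(i, j+1) = u(i, j) + (k / h²) · (u(i−1, j) − 2 u(i, j) + u(i+1, j))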

  24. Partitioning • One data item per grid point • Associate one primitive task with each grid point • Two-dimensional domain decomposition

  25. Communication • Identify communication pattern between primitive tasks: • Each interior primitive task has three incoming and three outgoing channels

  26. Agglomeration and Mapping

  27. Execution Time Estimate • χ – time to update one element • n – number of elements • m – number of iterations • Sequential execution time: m n χ • p – number of processors • λ – message latency • Parallel execution time: m (χ ⌈n/p⌉ + 2λ)

  28. Finding the Maximum Error (figure: 6.25%)

  29. Reduction • Given associative operator ⊕ • a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an-1 • Examples • Add • Multiply • And, Or • Maximum, Minimum
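In MPI, any of these reductions is a single collective call. A minimal sketch (not from the slides; the local values are illustrative), using MPI_MAX to match the "finding the maximum" case study:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, global_max;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = (rank * rank) % 7;   /* stand-in for each task's local value */

    /* Apply the associative operator (here MPI_MAX) across all tasks;
     * the result arrives at task 0. */
    MPI_Reduce(&local, &global_max, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global maximum = %d\n", global_max);
    MPI_Finalize();
    return 0;
}
```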

  30. Parallel Reduction Evolution

  31. Parallel Reduction Evolution

  32. Parallel Reduction Evolution

  33. Binomial Trees • Subgraph of hypercube

  34. Finding Global Sum 4 2 0 7 -3 5 -6 -3 8 1 2 3 -4 4 6 -1

  35. Finding Global Sum 1 7 -6 4 4 5 8 2

  36. Finding Global Sum 8 -2 9 10

  37. Finding Global Sum 17 8

  38. Finding Global Sum • 25 (figure: binomial tree)
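Slides 34–38 trace a binomial-tree sum over 16 values down to the result 25. A sketch of the same pattern with explicit sends and receives (assumes the number of tasks p is a power of two; the exact pairing order may differ from the figures):

```c
#include <mpi.h>

/* Binomial-tree sum to task 0: in round k, every still-active task whose
 * rank has bit k set sends its partial sum to the partner rank ^ (1 << k)
 * and drops out; the partner accumulates. After log2(p) rounds, task 0
 * holds the global sum. Valid on rank 0 only. */
int binomial_sum(int local_value, int rank, int p) {
    int partial = local_value;
    for (int mask = 1; mask < p; mask <<= 1) {
        if (rank & mask) {                       /* sender: lowest set bit  */
            MPI_Send(&partial, 1, MPI_INT, rank ^ mask, 0, MPI_COMM_WORLD);
            break;                               /* this task is finished   */
        }
        int incoming;                            /* receiver: accumulate    */
        MPI_Recv(&incoming, 1, MPI_INT, rank + mask, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        partial += incoming;
    }
    return partial;
}
```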

  39. Agglomeration

  40. Agglomeration (figure: four agglomerated tasks, each computing a local sum)

  41. The n-body Problem

  42. The n-body Problem

  43. Partitioning • Domain partitioning • Assume one task per particle • Task has particle’s position, velocity vector • Iteration • Get positions of all other particles • Compute new position, velocity

  44. Gather

  45. All-gather
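In MPI, the all-gather pattern is one collective call. A sketch of the position exchange in the n-body iteration (not from the slides; the 3-double position layout and the names are illustrative assumptions):

```c
#include <mpi.h>

/* Each task owns one particle's position (3 doubles). A single
 * MPI_Allgather leaves the positions of all p particles in all_pos on
 * every task, after which each task updates its own particle's position
 * and velocity locally. */
void share_positions(const double my_pos[3], double *all_pos /* 3*p long */) {
    MPI_Allgather(my_pos, 3, MPI_DOUBLE,    /* this task's contribution    */
                  all_pos, 3, MPI_DOUBLE,   /* 3 doubles gathered per task */
                  MPI_COMM_WORLD);
}
```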

  46. Complete Graph for All-gather

  47. Hypercube for All-gather

  48. Communication Time • Complete graph vs. hypercube
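The figure is not reproduced in the transcript. Under the λ/β model (λ = message latency, β = bandwidth, n data items spread evenly over p tasks), the all-gather costs presumably being compared are:

Complete graph: (p − 1)(λ + n / (p β))
Hypercube: λ log p + (p − 1) n / (p β)

The hypercube needs only log p message rounds, at the price of larger messages in the later rounds.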

  49. Adding Data Input

  50. Scatter
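In MPI, distributing freshly read input is one scatter call. A minimal sketch (not from the slides; assumes task 0 has read n values and n is divisible by p):

```c
#include <mpi.h>

/* Task 0 holds all n input values (e.g., just read from a file).
 * MPI_Scatter deals a contiguous block of n/p values to every task,
 * including task 0 itself. */
void distribute_input(const double *all, double *mine, int n, int p) {
    MPI_Scatter(all, n / p, MPI_DOUBLE,   /* send buffer matters on root only */
                mine, n / p, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
}
```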
