Chapter 3 Parallel Algorithm Design
Outline
• Task/channel model
• Algorithm design methodology
• Case studies
Task/Channel Model
• Parallel computation = set of tasks
• Task
  • Program
  • Local memory
  • Collection of I/O ports
• Tasks interact by sending messages through channels (a minimal sketch follows the figure below)
[Figure: Task/Channel Model: tasks as nodes connected by directed channels]
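As a concrete, if minimal, illustration, the sketch below realizes one channel as an MPI point-to-point message. The model itself is language-neutral; MPI, the value 42, and the two-process layout are assumptions made for this example. Run with at least two processes (e.g., mpirun -np 2).

/* Minimal sketch of the task/channel model in MPI: task 0 produces
   a value and sends it through a "channel" (a point-to-point
   message) to task 1, which consumes it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* producing task */
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* consuming task */
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("task 1 received %d over the channel\n", value);
    }

    MPI_Finalize();
    return 0;
}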
Foster’s Design Methodology
• Partitioning
  • Divide the problem into tasks
• Communication
  • Determine what needs to be communicated between the tasks over channels
• Agglomeration
  • Group or consolidate tasks to improve efficiency or simplify the programming solution
• Mapping
  • Assign tasks to the computer’s processors
Step 1: Partitioning
Divide computation and data into pieces
• Domain decomposition: data-centric approach (see the sketch after this list)
  • Divide up the most frequently used data
  • Associate the computations with the divided data
• Functional decomposition: computation-centric approach
  • Divide up the computation
  • Associate the data with the divided computations
• Primitive tasks: the pieces resulting from either decomposition
  • The goal is to have as many of these as possible
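The sketch referenced above shows one common domain decomposition, a block decomposition of an n-element array across p tasks; the particular index formula (block sizes differing by at most one) is an illustrative assumption, not the only possible partitioning.

/* Sketch of a block domain decomposition: n data items divided into
   p contiguous blocks, one per task. This index formula yields
   blocks whose sizes differ by at most one element. */
#include <stdio.h>

int main(void)
{
    int n = 100;                              /* data items (example) */
    int p = 8;                                /* tasks (example)      */
    for (int rank = 0; rank < p; rank++) {
        int first = rank * n / p;             /* first owned index    */
        int last  = (rank + 1) * n / p - 1;   /* last owned index     */
        printf("task %d owns elements %d..%d\n", rank, first, last);
    }
    return 0;
}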
Partitioning Checklist
• Lots of tasks
  • e.g., at least 10x more primitive tasks than processors in the target computer
• Minimize redundant computations and data
• Load balancing
  • Primitive tasks roughly the same size
• Scalable
  • Number of tasks an increasing function of problem size
Step 2: Communication
Determine communication patterns between primitive tasks
• Local communication
  • When tasks need data from a small number of other tasks
  • A channel is created from each producing task to the consuming task
• Global communication
  • When a task needs data from many or all other tasks
  • Channels for this type of communication are not created during this step
Communication Checklist
• Balanced
  • Communication operations balanced among tasks
• Small degree
  • Each task communicates with only a small group of neighbors
• Concurrency
  • Tasks can perform communications concurrently
  • Tasks can perform computations concurrently
Step 3: Agglomeration
Group tasks to improve efficiency or simplify programming
• Increase locality
  • Remove communication by agglomerating tasks that communicate with one another
  • Combine groups of sending and receiving tasks
  • Send fewer, larger messages rather than many short messages, which incur more message latency
• Maintain scalability of the parallel design
  • Be careful not to agglomerate tasks so much that moving to a machine with more processors will not be possible
• Reduce software engineering costs
  • Leveraging existing sequential code can reduce the expense of engineering a parallel algorithm
Agglomeration Can Improve Performance
• Eliminate communication between primitive tasks agglomerated into a consolidated task
• Combine groups of sending and receiving tasks (a message-combining sketch follows below)
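A minimal sketch of the message-combining idea, assuming MPI and exactly two processes: transmitting K values in one message pays the message latency once, while K one-value messages pay it K times.

/* Sketch: why agglomeration favors fewer, larger messages. Sending
   K values in one message pays the message latency once; K separate
   one-value messages pay it K times. Assumes two MPI processes. */
#include <mpi.h>
#include <stdio.h>

#define K 1000

int main(int argc, char *argv[])
{
    double buf[K];
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < K; i++) buf[i] = i;
        /* combined: one latency for all K values */
        MPI_Send(buf, K, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        /* separate: K latencies; the pattern agglomeration removes */
        for (int i = 0; i < K; i++)
            MPI_Send(&buf[i], 1, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, K, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (int i = 0; i < K; i++)
            MPI_Recv(&buf[i], 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        printf("received %d values both ways\n", K);
    }

    MPI_Finalize();
    return 0;
}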
Agglomeration Checklist
• Locality of the parallel algorithm has increased
• Tradeoff between agglomeration and code-modification costs is reasonable
• Agglomerated tasks have similar computational and communication costs
• Number of tasks increases with problem size
• Number of tasks suitable for likely target systems
Step 4: Mapping
Assign tasks to processors
• Maximize processor utilization
  • Ensure computation is evenly balanced across all processors
• Minimize interprocessor communication
Optimal Mapping
• Finding the optimal mapping is NP-hard
• Must rely on heuristics
Mapping Goals
• Mapping based on one task per processor and on multiple tasks per processor has been considered
• Both static and dynamic allocation of tasks to processors have been evaluated
• If dynamic allocation of tasks to processors is chosen, the task allocator is not a bottleneck to performance
• If static allocation of tasks to processors is chosen, the ratio of tasks to processors is at least 10 to 1
Case Studies
• Boundary value problem
• Finding the maximum
• The n-body problem
• Adding data input
Boundary Value Problem
[Figure: a thin rod surrounded by insulation, with each end in ice water]
Partitioning
• One data item per grid point
• Associate one primitive task with each grid point
• Two-dimensional domain decomposition (the grid spans rod position and time)
Communication
• Identify the communication pattern between primitive tasks
• Each interior primitive task has three incoming and three outgoing channels
Agglomeration and Mapping
[Figure: primitive tasks agglomerated and mapped to processors]
Execution Time Estimate
• χ – time to update one element
• n – number of elements
• m – number of iterations
• Sequential execution time: mnχ
• p – number of processors
• λ – message latency
• Parallel execution time: m(χ⌈n/p⌉ + 2λ)
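To make the terms of this estimate concrete, here is a sketch of the agglomerated computation in MPI: each process owns a block of about n/p rod elements plus two ghost cells, and each of the m iterations performs two boundary exchanges (the 2λ term) followed by one local update pass (the χ⌈n/p⌉ term). The update rule, initial temperature, and block size are illustrative assumptions.

/* Sketch of the agglomerated boundary-value computation: each
   process holds a block of the rod plus two ghost cells. */
#include <mpi.h>
#include <string.h>

#define LOCAL_N 1000   /* elements per process (~ n/p); an assumption */
#define M       100    /* number of iterations (m) */

int main(int argc, char *argv[])
{
    double u[LOCAL_N + 2], unew[LOCAL_N + 2];  /* +2 ghost cells */
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int left  = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int i = 0; i < LOCAL_N + 2; i++)
        u[i] = 100.0;                  /* assumed initial temperature */

    for (int iter = 0; iter < M; iter++) {
        /* hold the rod's ends at the ice-water temperature */
        if (rank == 0)     u[1] = 0.0;
        if (rank == p - 1) u[LOCAL_N] = 0.0;

        /* two boundary exchanges per iteration: the 2*lambda term */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* local update over ~n/p elements: the chi*ceil(n/p) term */
        for (int i = 1; i <= LOCAL_N; i++)
            unew[i] = (u[i - 1] + 2.0 * u[i] + u[i + 1]) / 4.0;
        memcpy(&u[1], &unew[1], LOCAL_N * sizeof(double));
    }

    MPI_Finalize();
    return 0;
}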
Reduction
• Given an associative operator ⊕
• a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an-1
• Examples
  • Add
  • Multiply
  • And, Or
  • Maximum, Minimum
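In MPI such a reduction is a single collective call; below is a minimal sketch with MPI_SUM (substituting MPI_MAX, MPI_MIN, MPI_PROD, MPI_LAND, or MPI_LOR selects the other operators listed above). The per-task value rank + 1 is just an example input.

/* Sketch: reduction as a single MPI collective. Each process
   contributes one value; MPI_Reduce combines them with the given
   associative operator (here MPI_SUM) onto process 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value, sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    value = rank + 1;                     /* each task's local value */
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %d\n", sum); /* 1 + 2 + ... + p */

    MPI_Finalize();
    return 0;
}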
Binomial Trees
[Figure: binomial trees of increasing size; a binomial tree is a subgraph of a hypercube]
Finding Global Sum
[Figure sequence: a binomial-tree global sum of 16 values (4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1); each step halves the number of partial sums (1, 7, -6, 4, 4, 5, 8, 2; then 8, -2, 9, 10; then 17, 8) until only the total, 25, remains]
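The same binomial-tree pattern can be coded by hand with point-to-point messages: in each of the log2(p) steps, the upper half of the surviving tasks sends partial sums to partners in the lower half. The sketch below assumes p is a power of two and uses rank + 1 as each task's local value.

/* Sketch of a binomial-tree global sum (the pattern in the figure
   above): in each of log2(p) steps, processes in the upper half of
   the surviving group send their partial sums to partners in the
   lower half. Assumes p is a power of two. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int sum = rank + 1;                   /* each task's local value */
    for (int half = p / 2; half >= 1; half /= 2) {
        if (rank < half) {                /* receive and accumulate  */
            int partial;
            MPI_Recv(&partial, 1, MPI_INT, rank + half, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += partial;
        } else if (rank < 2 * half) {     /* send sum and drop out   */
            MPI_Send(&sum, 1, MPI_INT, rank - half, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        printf("global sum = %d\n", sum);

    MPI_Finalize();
    return 0;
}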
Agglomeration
[Figure: the reduction tree agglomerated so that each task first computes a local partial sum]
The n-body Problem
Partitioning
• Domain partitioning
• Assume one task per particle
• Task has particle's position and velocity vector
• Iteration (a one-iteration sketch follows this list)
  • Get positions of all other particles
  • Compute new position and velocity
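The sketch referenced above shows one iteration under this partitioning, assuming for simplicity one MPI process per particle, one-dimensional positions, and a softened placeholder force law; MPI_Allgather plays the role of the all-to-all channels.

/* Sketch of one n-body iteration, one particle per MPI process:
   MPI_Allgather gives every task the positions of all others (the
   all-to-all communication pattern), then each task updates its own
   particle. 1-D positions, unit masses, and the time step are
   simplifying assumptions for illustration. */
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double pos = (double)rank, vel = 0.0;  /* this task's particle */
    double *all_pos = malloc(p * sizeof(double));
    const double dt = 0.01;                /* assumed time step */

    for (int step = 0; step < 100; step++) {
        /* get positions of all other particles */
        MPI_Allgather(&pos, 1, MPI_DOUBLE, all_pos, 1, MPI_DOUBLE,
                      MPI_COMM_WORLD);

        /* accumulate the force on our particle from all the others */
        double force = 0.0;
        for (int j = 0; j < p; j++) {
            if (j == rank) continue;
            double d = all_pos[j] - pos;
            force += d / (fabs(d * d * d) + 1e-9);  /* softened 1/d^2 */
        }
        vel += force * dt;      /* compute new velocity, then position */
        pos += vel * dt;
    }

    free(all_pos);
    MPI_Finalize();
    return 0;
}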
Communication Time
[Figure: gathering all particle positions via a complete graph versus a hypercube, with the associated communication-time comparison]