Parallel Algorithm Design • Look at Ian Foster’s Methodology (PCAM) • Partition • Decompose the problem • Identify the concurrent tasks • Often the most difficult step • Communication • Often dictated by partition • Agglomeration • Often not much you can do here • Mapping • Difficult problem • Load Balancing • We will focus on Partitioning and Mapping
Preliminaries • A given problem may be partitioned in many different ways. • Tasks may be the same, different, or even indeterminate sizes • Coarse grain – large tasks • Fine grain – very small tasks • Often, partitionings are illustrated in the form of a “task dependency graph” • Directed graph • Nodes are tasks • Edges denote that the result of one task is needed for the computation of the result in another task
Task Dependency Graph • Can be a graph or adjacency matrix
Preliminaries • Degree of Concurrency • The number of tasks that can be executed in parallel • This number may change over the course of program execution. • Maximum Degree of Concurrency • The maximum number of such tasks at any point during execution • Average Degree of Concurrency • The average number of tasks that can be processed in parallel over the execution of the program • The degree of concurrency increases as the decomposition becomes finer in granularity, and vice versa. • This holds when viewed strictly from a “number of tasks” perspective. • However, the number of concurrent operations may not follow this relationship.
Preliminaries • A directed path in the task dependency graph represents a sequence of tasks that must be processed one after the other. • The longest path determines the shortest time in which the program can be executed in parallel. • The length of the longest path in a task dependency graph is called the critical path length. • What are the critical path lengths? • If each task takes 10 time units, what is the shortest parallel execution time?
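The critical path length can be computed directly from a task dependency graph. The sketch below is a minimal illustration, not taken from the slides: the graph, the task names, and the helper `critical_path_length` are hypothetical, and every task is assumed to take 10 time units. It also estimates the average degree of concurrency as total work divided by critical path length, which is one common way to define it.

```python
from functools import lru_cache

def critical_path_length(deps, cost):
    """Length of the longest weighted path in a task dependency graph (a DAG).

    deps maps a task to the tasks whose results it needs.
    cost maps every task to its execution time.
    """
    @lru_cache(maxsize=None)
    def finish(task):
        # A task can start only after all of its prerequisites have finished.
        start = max((finish(d) for d in deps.get(task, ())), default=0)
        return start + cost[task]

    return max(finish(t) for t in cost)

# Hypothetical graph: four leaf tasks feed two combiners, which feed a root.
deps = {"e": ["a", "b"], "f": ["c", "d"], "g": ["e", "f"]}
cost = {t: 10 for t in "abcdefg"}  # assumed: each task takes 10 time units

cpl = critical_path_length(deps, cost)
avg_concurrency = sum(cost.values()) / cpl  # total work / critical path length
print(cpl, round(avg_concurrency, 2))
```

For this graph the longest chain has three tasks, so no parallel schedule can finish in fewer than 30 time units, however many processors are available.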
Limits on Parallel Performance • It would appear that the parallel time can be made arbitrarily small by making the decomposition finer in granularity. • However, there is an inherent bound on how fine the granularity of a computation can be. • For example, when multiplying a dense $n \times n$ matrix with a vector, there can be no more than $n^2$ concurrent tasks. • Concurrent tasks may also have to exchange data with other tasks. This results in communication overhead. • The tradeoff between the granularity of a decomposition and associated overheads often determines performance bounds.
Partitioning Techniques • There is no single recipe that works for all problems. • We can benefit from some commonly used techniques: • Recursive Decomposition • Data Decomposition • Exploratory Decomposition • Speculative Decomposition
Recursive Decomposition • Generally suited to problems that are solved using a divide and conquer strategy. • Decompose based on sub-problems • Often results in natural concurrency as sub-problems can be solved in parallel. • Need to think recursively • parallel not sequential
Recursive Decomposition: Quicksort • Once the list has been partitioned around the pivot, each sublist can be processed concurrently. • Once each sublist has been partitioned around the pivot, each sub-sublist can be processed concurrently. • Once each sub-sublist …
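The recursive structure described above can be sketched as follows. This is an illustrative sketch only: the function name `parallel_quicksort` and the `depth` cutoff are assumptions, and the first element is used as the pivot for simplicity. In CPython, threads mainly show the task structure here and are not expected to speed up CPU-bound sorting.

```python
import threading

def parallel_quicksort(items, depth=2):
    """Quicksort in which the two sublists are sorted by separate threads.

    New threads are spawned only while depth > 0; below that the
    recursion continues serially, which bounds the number of tasks.
    """
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if x >= pivot]
    if depth <= 0:
        return (parallel_quicksort(left, 0) + [pivot]
                + parallel_quicksort(right, 0))

    results = {}

    def sort_into(key, sublist):
        results[key] = parallel_quicksort(sublist, depth - 1)

    # After partitioning, the two sublists are independent tasks.
    workers = [threading.Thread(target=sort_into, args=(key, sub))
               for key, sub in (("left", left), ("right", right))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results["left"] + [pivot] + results["right"]

print(parallel_quicksort([5, 3, 8, 1, 9, 2, 7]))
```

Note that the first partitioning step is still performed by a single task, which is why the available concurrency grows only as the recursion deepens.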
Recursive Decomposition: Finding the Min/Max/Sum • Applies to any associative and commutative operation.
procedure SERIAL_MIN (A, n)
begin
    min := A[0];
    for i := 1 to n − 1 do
        if (A[i] < min) min := A[i];
    endfor;
    return min;
end SERIAL_MIN
Recursive Decomposition: Finding the Min/Max/Sum • Rewrite using recursion and maximum partitioning • Don’t make a serial recursive routine
procedure RECURSIVE_MIN (A, n)
begin
    if (n = 1) then
        min := A[0];
    else
        lmin := RECURSIVE_MIN (A, n/2);
        rmin := RECURSIVE_MIN (&(A[n/2]), n − n/2);
        if (lmin < rmin) then
            min := lmin;
        else
            min := rmin;
        endelse;
    endelse;
    return min;
end RECURSIVE_MIN
Note: Divide the work in half each time.
Recursive Decomposition: Finding the Min/Max/Sum • Example: Find the min of {4, 9, 1, 7, 8, 11, 2, 12} [Figure: reduction tree for the example, steps 1 to 3]
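The example list can be traced step by step with a small sketch. The helper `reduction_steps` is a hypothetical name; it runs serially and only records what each step of a pairwise reduction would produce. Within one step, every pair is independent and could be handled by a separate task.

```python
def reduction_steps(values, op=min):
    """Return the values that remain after each pairwise reduction step."""
    steps = []
    current = list(values)
    while len(current) > 1:
        # Combine neighbours (0,1), (2,3), ...; an unpaired last element
        # is carried forward unchanged.
        current = [op(current[i], current[i + 1]) if i + 1 < len(current)
                   else current[i]
                   for i in range(0, len(current), 2)]
        steps.append(current)
    return steps

steps = reduction_steps([4, 9, 1, 7, 8, 11, 2, 12])
for number, remaining in enumerate(steps, start=1):
    print("step", number, remaining)
# step 1 [4, 1, 8, 2]
# step 2 [1, 2]
# step 3 [1]
```

Eight inputs need three steps, since the number of candidates halves each time.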
Recursive Decomposition: Finding the Min/Max/Sum • Strive to divide in half • Often, can be mapped to a hypercube for a very efficient algorithm • Make sure that the overhead of dividing the computation is worth it. • How much does it cost to communicate necessary dependencies?
Data Decomposition • Most common approach • Identify the data and partition across tasks • Can partition in various ways • critically impacts performance • Three approaches • Output Data Decomposition • Input Data Decomposition • Domain Decomposition
Output Data Decomposition • Often, each element of the output can be computed independently of the others • A function of the input • All may be able to share the input or have a copy of their own • Often decomposes the problem naturally. • Embarrassingly Parallel • Output data decomposition with no need for communication • Mandelbrot, Simple Ray Tracing, etc.
Output Data Decomposition • Matrix Multiplication: A * B = C • Can partition output matrix C
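One way to picture partitioning the output matrix C is to give each task a rectangular block of C. The sketch below is an assumption-laden illustration: `compute_block`, `blocked_matmul`, and the `block` parameter are hypothetical names, matrices are plain nested lists, and all tasks read the shared inputs A and B.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_block(A, B, rows, cols):
    """Compute the entries of C = A * B for the given row and column ranges."""
    inner = len(B)
    return {(i, j): sum(A[i][k] * B[k][j] for k in range(inner))
            for i in rows for j in cols}

def blocked_matmul(A, B, block=2):
    """Multiply A and B with one task per block of the output matrix."""
    n, m = len(A), len(B[0])
    C = [[0] * m for _ in range(n)]
    # Each (row range, column range) pair identifies one output block.
    tasks = [(range(r, min(r + block, n)), range(c, min(c + block, m)))
             for r in range(0, n, block)
             for c in range(0, m, block)]
    with ThreadPoolExecutor() as pool:
        for part in pool.map(lambda rc: compute_block(A, B, *rc), tasks):
            for (i, j), value in part.items():
                C[i][j] = value
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(blocked_matmul(A, B, block=1))  # four tasks, one per element of C
```

No task writes an entry owned by another task, so the blocks need no communication beyond access to the inputs.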
Output Data Decomposition • Count the instances of given itemsets
Input Data Decomposition • Applicable if the output can be naturally computed as a function of the input. • In many cases, this is the only natural decomposition because the output is not clearly known a priori • finding the minimum in a list, sorting, etc. • Associate a task with each input data partition. • Tasks communicate where necessary input is “owned” by another task.
Input Data Decomposition • Count the instances of given itemsets • Each task generates partial counts for all itemsets which must be aggregated.
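The itemset-counting example can be decomposed along either the input or the output. The sketch below contrasts the two under stated assumptions: the transactions, the itemsets, and all function names are invented for illustration, and the "tasks" are run one after another in a loop, with each loop iteration standing in for an independent task.

```python
def count_itemsets(transactions, itemsets):
    """Work of one task: count the transactions containing each itemset."""
    return {s: sum(1 for t in transactions if s <= t) for s in itemsets}

def input_decomposition(transactions, itemsets, parts=2):
    """Each task counts ALL itemsets over ITS slice of the transactions."""
    chunk = max(1, (len(transactions) + parts - 1) // parts)
    partials = [count_itemsets(transactions[i:i + chunk], itemsets)
                for i in range(0, len(transactions), chunk)]
    total = {s: 0 for s in itemsets}
    for partial in partials:  # aggregation of the partial counts
        for s, count in partial.items():
            total[s] += count
    return total

def output_decomposition(transactions, itemsets, parts=2):
    """Each task counts ITS slice of the itemsets over ALL transactions."""
    chunk = max(1, (len(itemsets) + parts - 1) // parts)
    total = {}
    for i in range(0, len(itemsets), chunk):
        total.update(count_itemsets(transactions, itemsets[i:i + chunk]))
    return total

# Hypothetical data.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "D"}, {"A", "B"}]
itemsets = [frozenset("AB"), frozenset("C"), frozenset("BD")]

by_input = input_decomposition(transactions, itemsets)
by_output = output_decomposition(transactions, itemsets)
print(by_input == by_output)
```

The input version needs an aggregation step because every task holds a partial count for every itemset, while the output version needs every task to see the whole transaction database.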
Input & Output Data Decomposition • Often, partitioning either input data or output data forces a partition of the other. • Can also consider partitioning both
Domain Decomposition • Often can be viewed as input data decomposition • May not be input data • Just domain of calculation • Split up the domain among tasks • Each task is responsible for computing the answer for its partition of the domain • Tasks may end up needing to communicate boundary values to perform necessary calculations
Domain Decomposition • Evaluate the definite integral $\int_0^1 \frac{4}{1+x^2}\,dx$ • Each task evaluates the integral over its partition of the domain • Once all tasks have finished, sum each task’s answer for the total. [Figure: the domain [0, 1] split into four partitions at 0.25, 0.5, and 0.75]
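A minimal sketch of this domain decomposition follows. It assumes the integrand is 4/(1+x^2) on [0, 1], whose exact value is π, and it assumes a midpoint rule with 10,000 steps per partition; the names `integrate` and `bounds` are illustrative choices, not part of the slides.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def f(x):
    return 4.0 / (1.0 + x * x)

def integrate(a, b, steps=10000):
    """Midpoint-rule estimate of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# Four partitions of the domain [0, 1]; one task per partition.
bounds = [0.0, 0.25, 0.5, 0.75, 1.0]
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(integrate, bounds[:-1], bounds[1:]))

total = sum(partials)  # combine the per-partition answers
print(total, abs(total - math.pi))
```

The tasks never exchange boundary values here, because each partial integral depends only on its own subinterval; the only communication is the final sum.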
Domain Decomposition • Often a natural approach for grid/matrix problems • There are algorithms for more complex domain decomposition problems • We will consider these algorithms later.
Exploratory Decomposition • In many cases, the decomposition of a problem goes hand-in-hand with its execution. • Typically, these problems involve the exploration of a state space. • Discrete optimization • Theorem proving • Game playing
Exploratory Decomposition • 15 puzzle – put the numbers in order • only move one piece at a time to a blank spot
Exploratory Decomposition • Generate successor states and assign to independent tasks.
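Generating successor states and handing each one to its own task might look like the sketch below. Everything here is an illustrative assumption: the board is a 16-tuple with 0 as the blank, each task runs a depth-limited depth-first search, and the names `successors`, `search`, and `exploratory_solve` are hypothetical. For simplicity the sketch does not cancel the remaining tasks once one of them finds a solution.

```python
from concurrent.futures import ThreadPoolExecutor

N = 4
GOAL = tuple(range(1, N * N)) + (0,)

def successors(state):
    """All states reachable by sliding one tile into the blank (0)."""
    blank = state.index(0)
    r, c = divmod(blank, N)
    result = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < N and 0 <= nc < N:
            swap = nr * N + nc
            board = list(state)
            board[blank], board[swap] = board[swap], board[blank]
            result.append(tuple(board))
    return result

def search(state, limit, path=()):
    """Depth-limited DFS; returns the states from path start to GOAL, or None."""
    if state == GOAL:
        return list(path) + [state]
    if limit == 0:
        return None
    for nxt in successors(state):
        if nxt not in path:  # do not revisit states on the current path
            found = search(nxt, limit - 1, path + (state,))
            if found:
                return found
    return None

def exploratory_solve(start, limit=6):
    """Assign each successor of the start state to an independent task."""
    if start == GOAL:
        return [start]
    branches = successors(start)
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        task = lambda s: search(s, limit - 1, (start,))
        for found in pool.map(task, branches):
            if found:
                return found
    return None

# Hypothetical start state: three moves away from the goal.
start = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0, 13, 14, 15)
solution = exploratory_solve(start)
print(len(solution) - 1, "moves")
```

Which task finds the goal first depends on where the solution lies in the state space, so the parallel version may explore more or fewer states than a serial search would.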
Exploratory Decomposition • Exploratory decomposition techniques may change the amount of work done by the parallel implementation. • Change can result in super- or sub-linear speedups
Speculative Decomposition • Sometimes, dependencies are not known a priori • Two approaches • conservative – identify independent tasks only when they are guaranteed to not have dependencies • May yield little concurrency • optimistic – schedule tasks even when they may be erroneous • May require a roll-back mechanism in the case of an error. • The speedup due to speculative decomposition can add up if there are multiple speculative stages • Examples • Concurrently evaluating all branches of a C switch statement • Discrete event simulation
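The switch-statement example can be sketched as follows. This is a hedged illustration in Python rather than C: `speculative_switch` is a hypothetical helper, and it assumes every branch is free of side effects, so that discarding the results of the branches not taken needs no roll-back.

```python
from concurrent.futures import ThreadPoolExecutor

def speculative_switch(compute_selector, branches):
    """Start every branch before the selector value is known.

    compute_selector: function returning the key of the branch to take.
    branches: dict mapping keys to side-effect-free functions.
    """
    with ThreadPoolExecutor() as pool:
        selector_future = pool.submit(compute_selector)
        branch_futures = {key: pool.submit(fn)
                          for key, fn in branches.items()}
        chosen = selector_future.result()
        # Work done by the other branches is wasted and simply discarded.
        return branch_futures[chosen].result()

result = speculative_switch(
    lambda: "b",                       # stands in for a slow condition
    {"a": lambda: sum(range(10)),      # speculated, then discarded
     "b": lambda: max(range(10))},     # the branch actually taken
)
print(result)  # 9
```

The wasted branch evaluations are the cost of speculation; the benefit appears only when computing the selector is slow enough to overlap with useful branch work.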
Speculative Decomposition: Discrete Event Simulation • The central data structure is a time-ordered event list. • Events are extracted precisely in time order, processed, and if required, resulting events are inserted back into the event list. • Consider your day today as a discrete event system: • you get up, get ready, drive to work, work, eat lunch, work some more, drive back, eat dinner, and sleep. • Each of these events may be processed independently; • however, in driving to work, you might meet with an unfortunate accident and not get to work at all. • Therefore, an optimistic scheduling of other events will have to be rolled back.
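The time-ordered event list at the heart of a discrete event simulation can be sketched with a heap. This shows only the serial core that a speculative scheme would build on; the function `simulate`, the event names, and the times and delays are all hypothetical.

```python
import heapq

def simulate(initial_events, follow_ups):
    """Process (time, name) events strictly in time order.

    follow_ups maps an event name to a list of (delay, next_name) pairs
    that are inserted back into the event list when it is processed.
    """
    event_list = list(initial_events)
    heapq.heapify(event_list)
    log = []
    while event_list:
        time, name = heapq.heappop(event_list)  # earliest event first
        log.append((time, name))
        for delay, next_name in follow_ups.get(name, []):
            heapq.heappush(event_list, (time + delay, next_name))
    return log

# Hypothetical morning, with times in hours.
follow_ups = {
    "get up": [(1, "drive to work")],
    "drive to work": [(1, "work")],
    "work": [(4, "eat lunch")],
}
log = simulate([(7, "get up")], follow_ups)
print(log)
```

An optimistic parallel version would process several of these events at once and would need to undo any event whose predecessor turned out differently, such as "work" when "drive to work" ends in an accident.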
Speculative Decomposition: Discrete Event Simulation • Simulate a network of nodes • various inputs, node delay parameters, queue sizes, service rates, etc.
Hybrid Decomposition • Often, a mix of decomposition techniques is necessary • In quicksort, recursive decomposition alone limits concurrency (Why?). A mix of data and recursive decompositions is more desirable. • In discrete event simulation, there might be concurrency in task processing. A mix of speculative decomposition and data decomposition may work well. • Even for simple problems like finding a minimum of a list of numbers, a mix of data and recursive decomposition works well.
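For the minimum-finding case, a hybrid of data and recursive decomposition might look like the sketch below. It assumes a non-empty list; `hybrid_min` and its `tasks` parameter are hypothetical names. Each task scans its own chunk serially, and the few local minima are then combined recursively.

```python
from concurrent.futures import ThreadPoolExecutor

def recursive_min(values):
    """Recursive decomposition: halve the list until one element remains."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return min(recursive_min(values[:mid]), recursive_min(values[mid:]))

def hybrid_min(values, tasks=4):
    """Data decomposition into chunks, then a recursive combine step."""
    chunk = max(1, (len(values) + tasks - 1) // tasks)
    parts = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=tasks) as pool:
        # Each task finds the minimum of its own chunk with a serial scan.
        local_minima = list(pool.map(min, parts))
    return recursive_min(local_minima)

print(hybrid_min([4, 9, 1, 7, 8, 11, 2, 12]))  # 1
```

Choosing the number of chunks to match the number of processors keeps tasks coarse enough that the cost of dividing the work stays small relative to the scan itself.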