1 / 51

Parallel Algorithms

Parallel Algorithms. Computation Models. Goal of computation model is to provide a realistic representation of the costs of programming. Model provides algorithm designers and programmers a measure of algorithm complexity which helps them decide what is “ good ” (i.e. performance-efficient).

liam
Download Presentation

Parallel Algorithms

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Parallel Algorithms

  2. Computation Models • Goal of computation model is to provide a realistic representation of the costs of programming. • Model provides algorithm designers and programmers a measure of algorithm complexity which helps them decide what is “good” (i.e. performance-efficient)

  3. Goal for Modeling • We want to develop computational models which accurately represent the cost and performance of programs • If model is poor, optimum in model may not coincide with optimum observed in practice Model Real World x optimum A optimum B Y

  4. Models of Computation What’s a model good for?? • Provides a way to think about computers. Influences design of: • Architectures • Languages • Algorithms • Provides a way of estimating how well a program will perform. Cost in model should be roughly same as cost of executing program

  5. The Random Access Machine Model RAM model of serial computers: • Memory is a sequence of words, each capable of containing an integer. • Each memory access takes one unit of time • Basic operations (add, multiply, compare) take one unit time. • Instructions are not modifiable • Read-only input tape, write-only output tape

  6. Has RAM influenced our thinking? Language design: No way to designate registers, cache, DRAM. Most convenient disk access is as streams. How do you express atomic read/modify/write? Machine & system design: It’s not very easy to modify code. Systems pretend instructions are executed in-order. Performance Analysis: Primary measures are operations/sec (MFlop/sec, MHz, ...) What’s the difference between Quicksort and Heapsort??

  7. What about parallel computers • RAM model is generally considered a very successful “bridging model” between programmer and hardware. • “Since RAM is so successful, let’s generalize it for parallel computers ...”

  8. PRAM [Parallel Random Access Machine] PRAM composed of: • P processors, each with its own unmodifiable program. • A single shared memory composed of a sequence of words, each capable of containing an arbitrary integer. • a read-only input tape. • a write-only output tape. PRAM model is a synchronous, MIMD, shared address space parallel computer. (Introduced by Fortune and Wyllie, 1978)

  9. PRAM model of computation Shared memory • p processors, each with local memory • Synchronous operation • Shared memory reads and writes • Each processor has unique id in range 1-p

  10. Characteristics • At each unit of time, a processor is either active or idle (depending on id) • All processors execute same program • At each time step, all processors execute same instruction on different data (“data-parallel”) • Focuses on concurrency only

  11. Variants of PRAM model

  12. More PRAM taxonomy • Different protocols can be used for reading and writing shared memory. • EREW - exclusive read, exclusive write A program isn’t allowed to have two processors access the same memory location at the same time. • CREW - concurrent read, exclusive write • CRCW - concurrent read, concurrent write Needs protocol for arbitrating write conflicts • CROW – concurrent read, owner write Each memory location has an official “owner” • PRAM can emulate a message-passing machine by partitioning memory into private memories.

  13. Sub-variants of CRCW • Common CRCW • CW iff all processors writing same value • Arbitrary CRCW • Arbitrary value of write set stored • Priority CRCW • Value of min-index processor stored • Combining CRCW

  14. Why study PRAM algorithms? • Well-developed body of literature on design and analysis of such algorithms • Baseline model of concurrency • Explicit model • Specify operations at each step • Scheduling of operations on processors • Robust design paradigm

  15. Work-Time paradigm • Higher-level abstraction for PRAM algorithms • WT algorithm = (finite) sequence of time steps with arbitrary number of operations at each step • Two complexity measures • Step complexity T(n) • Work complexity W(n) WT algorithm work-efficient ifW(n) = Q(TS(n)) optimal sequential Algorithm

  16. Designing PRAM algorithms • Balanced trees • Pointer jumping • Euler tours • Divide and conquer • Symmetry breaking • . . .

  17. Balanced trees • Key idea: Build balanced binary tree on input data, sweep tree up and down • “Tree” not a data structure, often a control structure (e.g., recursion)

  18. Alg : Sum • Given: Sequence a of n = 2k elements • Given: Binary associative operator + • Compute: S = a1 + ... + an

  19. WT description of sum integer B[1..n] forall i in 1 : n do B[i] := ai enddo for h = 1 to k do forall i in 1 : n/2hdo B[i] := B[2i-1] + B[2i] enddo enddo S := B[1]

  20. Points to note about WT pgm • Global program: no references to processor id • Contains both serial and concurrent operations • Semantics of forall • Order of additions different from sequential order: associativity critical

  21. Analysis of scan operation • Algorithm is correct • Q(lg n) steps, Q(n) work • EREW model • Two variants • Inclusive: as discussed • Exclusive: s1 = I, sk = x1 + ... + xk-1 • If n not power of 2, pad to next power

  22. Complexity measures of Sum • Recall definitions of step complexity T(n) and work complexity W(n) • Concurrent execution reduces number of steps

  23. How to do prefix sum ? • Input: Sequence x of n = 2kelements, binary associative operator + • Output: Sequence s of n = 2kelements, with sk = x1 + ... + xk • Example: x = [1, 4, 3, 5, 6, 7, 0, 1] s = [1, 5, 8, 13, 19, 26, 26, 27]

  24. List Ranking • List ranking problem • Given a singly linked list L with n objects, for each node, compute the distance to the end of the list • If d denotes the distance • node.d = 0 if node.next = nil • node.next.d + 1 otherwise • Serial algorithm: O(n) • Parallel algorithm • Assign one processor for each node • Assume there are as many processors as list objects • For each node i, perform • i.d = i.d + i.next.d • i.next = i.next.next // pointer jumping {

  25. List Ranking - Pointer Jumping • List_ranking(L) • for each node i, in parallel do • if i.next = nil then i.d = 0 • else i.d = 1 • while exists a node i, such that i.next != nil do • for each node i, in parallel do • if i.next != nil then • i.d = i.d + i.next.d // i updates i itself • i.next = i.next.next • Analysis • After a pointer jumping, a list is transformed into two (interleaved) lists • After that, four (interleaved) lists • Each pointer jumping doubles the number of lists and halves their length • After log n, all lists contain only one node • Total time: O(log n)

  26. List Ranking - Example

  27. List Ranking - Discussion • Synchronization is important • In step 8 (i.next = i.next.next), all processors must read right hand side before any processor write left hand side • The list ranking algorithm is EREW • If we assume in step 7 (i.d = i.d + i.next.d) all processors read i.d and then read i.next.d • If j.next = i, i and j do not read i.d concurrently • Work performance • performs O(n log n) work since n processors in O(log n) time • Work efficient • A PRAM algorithm is work efficient w.r.t another algorithm if two algorithms are within a constant factor • Is the link ranking algorithm work-efficient w.r.t the serial algorithm? • No, because O(n log n) versus O(n) • Speedup • S = n / log n

  28. Parallel Prefix on a List • Prefix computation • Input <x1, x2, .., xn>, a binary, associative operator  • Output <y1, y2, .., yn> • Prefix computation: yk = x1 x2 ..  xk • Example • if xk = 1 for k=1..n and  = + • Then yk = k, for k = 1..n • Serial algorithm: O(n) • Notation • [i, j] = xi xi+1 ..  xj • [k, k] = xk • [i, k]  [k+1, j] = [i, j] • Idea: perform prefix computation on a linked list so that • each node k contains [k, k] = xk initially • finally each node k contains [1, k] = yk

  29. Parallel Prefix on a List (2) • List_prefix(L, X) // L: list, X: <x1, x2, .., xn> • for each node i, in parallel • i.y = xi • While exists a node i such that i.next != nil do • for each node i, in parallel do • if i.next != nil then • i.next.y = i.y  i.next.y // i updates its successor • i.next = i.next.next • Analysis • Initially k-th node has [k,k] as y-value, points to (k+1)-th node • At the first iteration, • k-th node fetches [k+1,k+1] from its successor and • perform [k,k]  [k+1,k+1] resulting in [k,k+1] and • update its successor • At the second iteration • k-th node fetches [k+1,k+2] from its successor and • perform [k-1,k]  [k+1,k+2] resulting in [k-1,k+2] and • update its successor

  30. Parallel Prefix on a List (3)

  31. Parallel Prefix on a List (4) • Running time: O(log n) • After log n, all lists contain only one node • Work performed: O(n log n) • Speedup • S = n / log n

  32. Pointer jumping • Fast parallel processing of linked data structures (lists, trees) • Convention: Draw trees with edges directed from children to parents • Example: Finding the roots of forest represented as parent array P • P[i] = j if and only if (i, j) is a forest edge • P[i] = i if and only if i is a root

  33. Algorithm (Roots of forest) forall i in 1:n do S[i] := P[i] while S[i] != S[S[i]] do S[i] := S[S[i]] endwhile enddo

  34. Initial state of forest

  35. After one iteration

  36. After another iteration

  37. Concurrent Read – Finding Roots

  38. Analysis of pointer jumping • Termination detection? • At each step, tree distance between i and S[i] doubles unless S[i] is a root • CREW model • Correctness by induction on h • O(lg h) steps, O(n lg h) work • TS(n) = O(n) • Not work-efficient unless h constant

  39. Concurrent Read – Finding Roots • This is a CREW algorithm • Suppose Exclusive-Read is used, what will be the running time? • Initially only one node i has root information • First iteration: Another node reads from the node i • Totally two nodes are filled up • Second iteration: Another two nodes can reads from the two nodes • Totally four nodes are filled up • k-th iteration: 2k-1 nodes are filled up • If there are n nodes, k=log n • So Find_root with Exclusive-Read takes O(log n). • O(log log n) vs. O(log n)

  40. Euler tours • Technique for fast optimal processing of tree data • Euler circuit of directed graph: directed cycle that traverses each edge exactly once • Represent (rooted) tree by Euler circuit of its directed version

  41. Trees = balanced parentheses ((()( ) )( ) (( ) ( ) ( ) )) Key property: The parenthesis subsequence corresponding to a subtree is balanced.

  42. Computing the Depth • Problem definition • Given a binary tree with n nodes, compute the depth of each node • Serial algorithm takes O(n) time • A simple parallel algorithm • Starting from root, compute the depths level by level • Still O(n) because the height of the tree could be as high as n • Euler tour algorithm • Uses parallel prefix computation

  43. Computing the Depth (2) • Euler tour: A cycle that traverses each edge exactly once in a graph • It is a directed version of a tree • Regard an undirected edge into two directed edges • Any directed version of a tree has an Euler tour by traversing the tree • in a DFS way forming a linked list. • Employ 3*n processors • Each node i has fields i.parent, i.left, i.right • Each node i has three processors, i.A, i.B, and i.C. • Three processors in each node of the tree are linked as follows • i.A = i.left.A if i.left != nil • i.B if i.left = nil • i.B = i.right.A if i.right != nil • i.C if i.right = nil • i.C = i.parent.B if i is the left child • i.parent.C if i is the right child • nil if i.parent = nil { { {

  44. Computing the Depth (3) • Algorithm • Construct the Euler tour for the tree – O(1) time • Assign 1 to all A processors, 0 to B processors, -1 to C processors • Perform a parallel prefix computation • The depth of each node resides in its C processor • O(log n) • Actually log 3n • EREW because no concurrent read or write • Speedup • S = n/log n

  45. Computing the Depth (4)

  46. M B P P P P P P P P Broadcasting on a PRAM • “Broadcast” can be done on CREW PRAM in O(1) steps: • Broadcaster sends value to shared memory • Processors read from shared memory • Requires lg(P) steps on EREW PRAM.

  47. Concurrent Write – Finding Max • Finding max problem • Given an array of n elements, find the maximum(s) • sequential algorithm is O(n) • Data structure for parallel algorithm • Array A[1..n] • Array m[1..n]. m[i] is true if A[i] is the maximum • Use n2 processors • Fast_max(A, n) • for i = 1 to n do, in parallel • m[i] = true // A[i] is potentially maximum • for i = 1 to n, j = 1 to n do, in parallel • if A[i] < A[j] then • m[i] = false • for i = 1 to n do, in parallel • if m[i] = true then max = A[i] • return max • Time complexity: O(1)

  48. Concurrent Write – Finding Max • Concurrent-write • In step 4 and 5, processors with A[i] < A[j] write the same value ‘false’ into the same location m[i] • This actually implements m[i] = (A[i]  A[1])  …  (A[i]  A[n]) • Is this work efficient? • No, n2 processors in O(1) • O(n2) work vs. sequential algorithm is O(n) • What is the time complexity for the Exclusive-write? • Initially elements “think” that they might be the maximum • First iteration: For n/2 pairs, compare. • n/2 elements might be the maximum. • Second iteration: n/4 elements might be the maximum. • log n th iteration: one element is the maximum. • So Fast_max with Exclusive-write takes O(log n). • O(1) (CRCW) vs. O(log n) (EREW)

  49. Simulating CRCW with EREW • CRCW algorithms are faster than EREW algorithms • How much fast? • Theorem • A p-processor CRCW algorithm can be no more than O(log p) times faster than the best p-processor EREW algorithm • Proof by simulating CRCW steps with EREW steps • Assumption: A parallel sorting takes O(log n) time with n processors • When CRCW processor pi write a datum xi into a location li, EREW pi writes the pair (li, xi) into a separate location A[i] • Note EREW write is exclusive, while CRCW may be concurrent • Sort A by li • O(log p) time by assumption • Compare adjacent elements in A • For each group of the same elements, only one processor, say first, write xi into the global memory li. • Note this is also exclusive. • Total time complexity: O(log p)

  50. Simulating CRCW with EREW (2)

More Related