
Parallel Analysis of Algorithms: PRAM + CGM


Presentation Transcript


  1. Parallel Analysis of Algorithms: PRAM + CGM

  2. Outline • Parallel Performance • Parallel Models • Shared Memory (PRAM, SMP) • Distributed Memory (BSP, CGM)

  3. Question? • Professor Speedy says he has a parallel algorithm for sorting n arbitrary items in n time steps using p > 1 processors. • Do you believe him?

  4. Performance of a Parallel Algorithm • n : problem size (e.g. sort n numbers) • p : number of processors • T(p) : parallel time • Ts : sequential time (optimal sequential algorithm) • s(p) = Ts / T(p) : speedup (1 ≤ s(p) ≤ p) [Figure: speedup s(p) versus p, showing super-linear, linear (s(p) = p), and sub-linear curves]

  5. Speedup • linear speedup s(p) = p : optimal • super-linear speedup s(p) > p : impossible. Proof: Assume parallel algorithm A achieves speedup s > p on p processors, i.e. s = Ts / T > p, hence Ts > T·p. Simulate A on a sequential, single-processor machine by executing the p operations of each parallel step one after another. Then T(1) ≤ T·p < Ts, so Ts was not the optimal sequential time. Contradiction.

  6. Amdahl’s Law • Let f, 0 < f < 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup is s ≤ 1 / [f + (1-f)/p]. Proof: Let Ts be the sequential time. Then T(p) ≥ f·Ts + (1-f)·Ts / p, hence s ≤ Ts / [f·Ts + (1-f)·Ts / p] = 1 / [f + (1-f)/p].

  7. Amdahl’s Law [Figure: a computation of total time ts with serial section f·ts and parallelizable sections (1-f)·ts; (a) on one processor the whole work runs serially, (b) on p processors the parallelizable part shrinks to (1-f)·ts / p]

  8. Amdahl’s Law [Figure: running time of the same computation for P = 1, 5, 10, and 1000 processors]

  9. Amdahl’s Law s(p) ≤ 1 / [f + (1-f)/p] • f → 0 : s(p) → p • f → 1 : s(p) → 1 • f = 0.5 : s(p) = 2p / (p+1) ≤ 2 • f = 1/k : s(p) = k / [1 + (k-1)/p] ≤ k
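
The limits above are easy to check numerically. The small C++ sketch below (illustrative only; the chosen values of f and p are arbitrary examples, not from the slides) evaluates the Amdahl bound s(p) = 1 / [f + (1-f)/p] and the limiting value 1/f:

    #include <cstdio>

    // Amdahl bound: maximum speedup with serial fraction f on p processors.
    double amdahl(double f, double p) { return 1.0 / (f + (1.0 - f) / p); }

    int main() {
        const double fractions[] = {0.01, 0.1, 0.5};   // serial fractions f (example values)
        const double procs[]     = {5, 10, 1000};      // processor counts p (example values)
        for (double f : fractions) {
            for (double p : procs)
                std::printf("f=%.2f  p=%6.0f  s(p)=%6.2f\n", f, p, amdahl(f, p));
            std::printf("f=%.2f  limit 1/f = %.2f\n", f, 1.0 / f);
        }
        return 0;
    }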

  10. [Figure: speedup s(p) levelling off at the bound k as p grows]

  11. Scaled or Relative Speedup • Ts may be unknown (in fact, for most real experiments this is the case) • Relative speedup: s’(p) = T(1) / T(p) • s’(p) ≥ s(p), since T(1) ≥ Ts

  12. Efficiency • e(p) = s(p) / p : efficiency (0 ≤ e ≤ 1) • optimal (linear) speedup s(p) = p ⇒ e(p) = 1 • e’(p) = s’(p) / p : relative efficiency
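
As a small illustration of these definitions, the sketch below turns measured running times into relative speedup and relative efficiency; the timing numbers are made up for the example, not taken from the slides:

    #include <cstdio>

    // Relative speedup s'(p) = T(1)/T(p) and relative efficiency e'(p) = s'(p)/p
    // computed from measured running times (the example timings are hypothetical).
    int main() {
        const double t1 = 120.0;                       // T(1): time on one processor
        const double tp[] = {120.0, 63.0, 34.0, 19.0}; // T(p) for p = 1, 2, 4, 8
        const int    ps[] = {1, 2, 4, 8};
        for (int i = 0; i < 4; ++i) {
            double s = t1 / tp[i];          // relative speedup
            double e = s / ps[i];           // relative efficiency, 0 <= e <= 1
            std::printf("p=%d  s'(p)=%.2f  e'(p)=%.2f\n", ps[i], s, e);
        }
        return 0;
    }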

  13. Outline • Parallel Analysis of Algorithms • Models • Shared Memory (PRAM, SMP) • Distributed Memory (BSP, CGM)

  14. Shared Memory (PRAM, SMP): Parallel Random Access Machine (PRAM) • Exclusive-Read (ER) • Concurrent-Read (CR) • Exclusive-Write (EW) • Concurrent-Write (CW) [Diagram: processors proc. 1 … proc. p all connected to a shared memory of cells 1 … n]

  15. Shared Memory (PRAM, SMP): Parallel Random Access Machine (PRAM) • Concurrent-Write (CW) resolution rules: • Common: all processors must write the same value • Arbitrary: an arbitrary value “wins” • Smallest: the smallest value “wins” • Priority: the processor with the smallest ID number “wins”

  16. Shared Memory (PRAM, SMP): Parallel Random Access Machine (PRAM) • Default: CREW (Concurrent Read, Exclusive Write) • p = O(n) : fine grained, massively parallel

  17. Shared Memory (PRAM, SMP): Performance of a PRAM Algorithm • Optimal: T = O( Ts / p ) • Efficient: T = O( log^k(n) · Ts / p ) • NC: T = O( log^k(n) ) for p = polynomial(n)

  18. Shared Memory (PRAM, SMP): Example: Multiply n numbers • Input: a1, a2, …, an • Output: a1 * a2 * a3 * … * an, where * is any associative operator

  19. Shared Memory (PRAM, SMP): Algorithm 1, p = n/2 [Diagram: a balanced binary combining tree over the n inputs; in each round every active processor multiplies one pair of partial results]

  20. Shared Memory (PRAM, SMP): Analysis • p = n/2, T = O( log n ) • Ts = O(n), Ts / p = O(1) ⇒ the algorithm is efficient & NC but not optimal
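
To make Algorithm 1 concrete, here is a sequential C++ simulation of the combining tree (a sketch only; each inner-loop iteration stands in for one of the up to n/2 PRAM processors active in that round):

    #include <cstdio>
    #include <vector>

    // Simulate the PRAM tree reduction: in round r, "processor" i combines the
    // partial results at distance 2^r.  After ceil(log2 n) rounds, a[0] holds
    // the product; '*' can be any associative operator.
    long long reduce_product(std::vector<long long> a) {
        for (std::size_t stride = 1; stride < a.size(); stride *= 2)   // one PRAM round
            for (std::size_t i = 0; i + stride < a.size(); i += 2 * stride)
                a[i] = a[i] * a[i + stride];
        return a[0];
    }

    int main() {
        std::vector<long long> a = {1, 2, 3, 4, 5, 6, 7, 8};
        std::printf("product = %lld\n", reduce_product(a));   // prints 40320 = 8!
        return 0;
    }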

  21. Shared Memory (PRAM, SMP): Algorithm 2 • make available only p = n / log n processors • execute Algorithm 1 using “rescheduling”: whenever Algorithm 1 has a parallel step in which m > n / log n processors are used, simulate this step by a “phase” of ⌈m / (n / log n)⌉ steps on the n / log n processors

  22. Shared Memory (PRAM, SMP) [Diagram: the parallel steps of Algorithm 1 rescheduled onto p = n / log n processors]

  23. Shared Memory (PRAM, SMP): Analysis • # steps in phase i : ⌈(n / 2^i) / (n / log n)⌉ = ⌈log n / 2^i⌉ • T = O( Σ_{1 ≤ i ≤ log n} ⌈log n / 2^i⌉ ) = O( log n · Σ_{i ≥ 1} 1/2^i ) = O( log n ) • p = n / log n, Ts / p = O( n / [n / log n] ) = O( log n ) ⇒ the algorithm is efficient & NC & optimal
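
The rescheduling bound can also be checked by simple counting. The sketch below (the value of n is an arbitrary example) sums, over the rounds of Algorithm 1, the ⌈m / (n / log n)⌉ sub-steps needed when only p = n / log n processors are available:

    #include <cmath>
    #include <cstdio>

    // Count the simulated steps when Algorithm 1 (which uses up to n/2 processors)
    // is rescheduled onto p = n / log2(n) processors, as described on the slide.
    int main() {
        const double n = 1 << 20;               // problem size (example value)
        const double p = n / std::log2(n);      // available processors
        double steps = 0;
        for (double m = n / 2; m >= 1; m /= 2)  // round of Algorithm 1 using m processors
            steps += std::ceil(m / p);          // one "phase" of ceil(m/p) sub-steps
        std::printf("p = %.0f, simulated steps = %.0f (vs. log2 n = %.0f)\n",
                    p, steps, std::log2(n));
        return 0;
    }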

  24. Problem 2: List Ranking • Input: A linked list represented by an array P, where P(i) is the successor of node i (and P(i) = i for the last node). • Output: The distance D(i) of each node i to the last node.

  25. Algorithm: Pointer Jumping • Assign proc. i to node i • Initialize (all proc. i in parallel): D(i) := 0 if P(i) = i, else D(i) := 1 • REPEAT log n TIMES (all proc. i in parallel): D(i) := D(i) + D(P(i)); P(i) := P(P(i))
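
A direct simulation of pointer jumping is sketched below (the synchronous PRAM step is emulated by writing all new D and P values into fresh arrays before they replace the old ones; the example list is made up):

    #include <cstdio>
    #include <vector>

    // List ranking by pointer jumping.  P[i] is the successor of node i
    // (P[i] == i marks the last node); D[i] becomes the distance to the last node.
    std::vector<int> list_rank(std::vector<int> P) {
        const int n = static_cast<int>(P.size());
        std::vector<int> D(n);
        for (int i = 0; i < n; ++i) D[i] = (P[i] == i) ? 0 : 1;   // initialization
        for (int len = 1; len < n; len *= 2) {                    // ceil(log2 n) rounds
            std::vector<int> D2(n), P2(n);
            for (int i = 0; i < n; ++i) {      // all "processors" i in parallel
                D2[i] = D[i] + D[P[i]];
                P2[i] = P[P[i]];
            }
            D = D2; P = P2;
        }
        return D;
    }

    int main() {
        // Example list 3 -> 0 -> 4 -> 1 -> 2, stored as P[i] = successor of i.
        std::vector<int> P = {4, 2, 2, 0, 1};
        std::vector<int> D = list_rank(P);
        for (int i = 0; i < 5; ++i) std::printf("D[%d] = %d\n", i, D[i]);  // 3 1 0 4 2
        return 0;
    }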

  26. Analysis • p = n • T = O( log n ) • efficient & NC but not optimal

  27. Problem 3: Partial Sums • Input: a1, a2, …, an • Output: a1, a1 + a2, a1 + a2 + a3, ..., a1 + a2 + a3 + … + an

  28. Parallel Recursion • Compute (in parallel): a1 + a2, a3 + a4, a5 + a6, ..., an-1 + an • Recursively (all processors together) solve the problem for these n/2 numbers • The result is: (a1+a2), (a1+a2+a3+a4), (a1+a2+a3+a4+a5+a6), ..., (a1+…+an-3+an-2), (a1+…+an-1+an) • Fill in each missing partial sum (the gaps) by combining its predecessor with a single input element
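
The recursion translates almost literally into code. The sketch below follows the slide's three steps (pair up, recurse on the n/2 pairwise sums, fill the gaps) and assumes, for simplicity, that n is a power of two:

    #include <cstdio>
    #include <vector>

    // Prefix sums by parallel recursion (sequentially simulated).
    // Assumes n is a power of two; each loop body corresponds to one PRAM processor.
    std::vector<int> prefix_sums(const std::vector<int>& a) {
        const int n = static_cast<int>(a.size());
        if (n == 1) return a;
        std::vector<int> pairs(n / 2);
        for (int i = 0; i < n / 2; ++i)             // step 1: combine adjacent pairs
            pairs[i] = a[2 * i] + a[2 * i + 1];
        std::vector<int> half = prefix_sums(pairs); // step 2: recurse on n/2 values
        std::vector<int> s(n);
        for (int i = 0; i < n; ++i) {               // step 3: fill in the gaps
            if (i % 2 == 1)      s[i] = half[i / 2];            // even-length prefix: from recursion
            else if (i == 0)     s[i] = a[0];
            else                 s[i] = half[i / 2 - 1] + a[i]; // gap: predecessor + one element
        }
        return s;
    }

    int main() {
        std::vector<int> s = prefix_sums({1, 2, 3, 4, 5, 6, 7, 8});
        for (int x : s) std::printf("%d ", x);   // 1 3 6 10 15 21 28 36
        std::printf("\n");
        return 0;
    }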

  29. Analysis • p = n • T(n) = T(n/2) + O(1), T(1) = O(1) ⇒ T(n) = O(log n) • efficient and NC but not optimal

  30. Improving through rescheduling • set p = n / log n • simulate previous algorithm

  31. [Diagram: the parallel steps of the recursion rescheduled onto p = n / log n processors]

  32. Analysis • # steps in phase i : ⌈(n / 2^i) / (n / log n)⌉ = ⌈log n / 2^i⌉ • T = O( Σ_{1 ≤ i ≤ log n} ⌈log n / 2^i⌉ ) = O( log n · Σ_{i ≥ 1} 1/2^i ) = O( log n ) • p = n / log n • Ts / p = O( n / [n / log n] ) = O( log n ) • ⇒ the algorithm is efficient & NC & optimal

  33. Problem 4: Sorting • Input: a1, a2, …, an • Output: a1, a2, …, an permuted into sorted order

  34. Bitonic Sorting (Batcher) • Unimodal sequence: e.g. 9 10 13 17 21 19 16 15 • Bitonic sequence: a cyclic shift of a unimodal sequence, e.g. 16 15 9 10 13 17 21 19

  35. Properties of bitonic sequences • Let X = x1 x2 ... xn xn+1 xn+2 ... x2n be bitonic • Define L(X) = y1 ... yn with yi = min {xi, xn+i}, and U(X) = z1 ... zn with zi = max {xi, xn+i} • Then (1) L(X) and U(X) are bitonic, and (2) every element of L(X) is smaller than every element of U(X)

  36. Bitonic Merge: sorting a bitonic sequence • a bitonic sequence of length n can be sorted in time O(log n) using p=n processors

  37. Sorting an arbitrary sequence a1, a2, …, an • split a1, a2, …, an into two sub-sequences: a1, …, an/2 and a(n/2)+1, a(n/2)+2, …, an • recursively, in parallel, sort each sub-sequence using p/2 processors • merge the two sorted sub-sequences into one sorted sequence using bitonic merge. Note: If X and Y are sorted in increasing order, then X followed by Y^R (Y reversed) is a bitonic sequence.
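
Putting the pieces together, here is a compact C++ sketch of bitonic merge and bitonic sort (it assumes n is a power of two; the flag `up` selects the sort direction so that the two recursively sorted halves form a bitonic sequence X · Y^R before each merge):

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Bitonic merge: a[lo..lo+n) is bitonic; sort it into direction `up`.
    // On a PRAM, all compare/exchange operations at the same recursion depth
    // form one O(1) parallel step, giving O(log n) time with p = n processors.
    void bitonic_merge(std::vector<int>& a, int lo, int n, bool up) {
        if (n <= 1) return;
        int m = n / 2;
        for (int i = lo; i < lo + m; ++i)            // compare/exchange x_i with x_{i+m}
            if ((a[i] > a[i + m]) == up) std::swap(a[i], a[i + m]);
        bitonic_merge(a, lo, m, up);                 // L(X): bitonic, all smaller
        bitonic_merge(a, lo + m, m, up);             // U(X): bitonic, all larger
    }

    // Bitonic sort: sort the halves in opposite directions, then merge.
    void bitonic_sort(std::vector<int>& a, int lo, int n, bool up) {
        if (n <= 1) return;
        int m = n / 2;
        bitonic_sort(a, lo, m, true);                // X: increasing
        bitonic_sort(a, lo + m, m, false);           // Y^R: decreasing
        bitonic_merge(a, lo, n, up);                 // X . Y^R is bitonic
    }

    int main() {
        std::vector<int> a = {16, 15, 9, 10, 13, 17, 21, 19};  // bitonic example from slide 34
        bitonic_sort(a, 0, static_cast<int>(a.size()), true);
        for (int x : a) std::printf("%d ", x);       // 9 10 13 15 16 17 19 21
        std::printf("\n");
        return 0;
    }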

  38. Analysis • p = n • T(n) = T(n/2) + O(log n), T(1) = O(1) ⇒ T(n) = O(log² n) • efficient and NC but not optimal

  39. So what about an SMP machine? • PRAM? • EREW? • CREW? • CRCW? • How does OpenMP play into this?

  40. OpenMP/SMP [Diagram: fork–join execution; the master thread forks a team of threads at each parallel region] • = CREW PRAM, but coarse grained • T(p) ≥ f·Ts + (1-f)·Ts / p, for f = sequential fraction • T(n,p) = f·Ts + (sum over all parallel regions of the maximum thread time in that region)
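
For instance, a minimal OpenMP reduction looks as follows (a sketch assuming a compiler with OpenMP support, e.g. built with -fopenmp; it is not taken from the slides). The sequential initialization plays the role of the fraction f, and the parallel region's contribution to T(n,p) is bounded by its slowest thread:

    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main() {
        const int n = 1 << 20;
        std::vector<double> a(n);
        for (int i = 0; i < n; ++i) a[i] = 1.0;     // sequential part (fraction f)

        double sum = 0.0;
        // Fork: the master thread spawns a team; the loop iterations are split
        // among the threads and the partial sums are combined at the join.
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < n; ++i)
            sum += a[i];

        std::printf("sum = %.0f, threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }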

  41. Outline • Parallel Analysis of Algorithms • Models • Shared Memory (PRAM, SMP) • Distributed Memory (BSP, CGM)

  42. Distributed Memory Models

  43. Parallel Computing • p: # processors • n: problem size • Ts(n): sequential time • T(p,n): parallel time • speedup: S(p,n) = Ts(n) / T(p,n) • Goal: obtain linear speedup S(p,n) = p

  44. Parallel Computers • Beowulf cluster • Blue Gene/Q • Cray XK7 • Custom MPP (Tianhe-2) • ...

  45. Parallel Machine Models How to abstract the machine into a simplified model such that • algorithm/application design is not hampered by too many details • calculated time complexity predictions match the actually observed running times (with sufficient accuracy)

  46. Parallel Machine Models • PRAM • Fine grained networks (array, ring, mesh, hypercube) • Bulk Synchronous Parallelism (BSP), Valiant, 1990 • Coarse Grained Multicomputer (CGM), Dehne, Rau-Chaplin, 1993 • Multithreading (Cilk), Leiserson, 1995 • many more...

  47. PRAM • p = O(n) processors, massively parallel ...

  48. Example: PRAM Sort (merge sort via parallel list merge) • Bitonic sort: O(log n) per merge ⇒ O(log² n) total • Cole: O(1) per merge ⇒ O(log n) total

  49. Fine-Grained Networks • p = O(n) processors, massively parallel ...
