

  1. LECTURE 2
     • Recurrences (Review)
     • Matrix Multiplication
     • Merge Sort
     • Tableau Construction
     • Conclusion

  2. Algorithmic Complexity Measures
     T_P = execution time on P processors
     T_1 = work
     T_∞ = span*
     LOWER BOUNDS
     • T_P ≥ T_1 / P
     • T_P ≥ T_∞
     * Also called critical-path length or computational depth.

  3. Speedup
     Definition: T_1/T_P = speedup on P processors.
     If T_1/T_P = Θ(P) ≤ P, we have linear speedup;
     if T_1/T_P = P, we have perfect linear speedup;
     if T_1/T_P > P, we have superlinear speedup, which is not possible
     in our model because of the lower bound T_P ≥ T_1/P.

  4. Parallelism
     Because we have the lower bound T_P ≥ T_∞, the maximum possible
     speedup given T_1 and T_∞ is
     T_1/T_∞ = parallelism
             = the average amount of work per step along the span.

  5. The Master Method
     The Master Method for solving recurrences applies to recurrences of
     the form T(n) = a T(n/b) + f(n), where a ≥ 1, b > 1, and f is
     asymptotically positive.*
     IDEA: Compare n^(log_b a) with f(n).
     * The unstated base case is T(n) = Θ(1) for sufficiently small n.

  6. Master Method — CASE 1
     T(n) = a T(n/b) + f(n)
     n^(log_b a) ≫ f(n): specifically, f(n) = O(n^(log_b a − ε)) for
     some constant ε > 0.
     Solution: T(n) = Θ(n^(log_b a)).

  7. Master Method — CASE 2
     T(n) = a T(n/b) + f(n)
     n^(log_b a) ≈ f(n): specifically, f(n) = Θ(n^(log_b a) lg^k n) for
     some constant k ≥ 0.
     Solution: T(n) = Θ(n^(log_b a) lg^(k+1) n).

  8. Master Method — CASE 3
     T(n) = a T(n/b) + f(n)
     n^(log_b a) ≪ f(n): specifically, f(n) = Ω(n^(log_b a + ε)) for
     some constant ε > 0, and f(n) satisfies the regularity condition
     that a f(n/b) ≤ c f(n) for some constant c < 1.
     Solution: T(n) = Θ(f(n)).

  9. Master Method Summary
     T(n) = a T(n/b) + f(n)
     • CASE 1: f(n) = O(n^(log_b a − ε)), constant ε > 0
       ⇒ T(n) = Θ(n^(log_b a)).
     • CASE 2: f(n) = Θ(n^(log_b a) lg^k n), constant k ≥ 0
       ⇒ T(n) = Θ(n^(log_b a) lg^(k+1) n).
     • CASE 3: f(n) = Ω(n^(log_b a + ε)), constant ε > 0, and the
       regularity condition holds ⇒ T(n) = Θ(f(n)).

  10. Master Method Quiz
      • T(n) = 4 T(n/2) + n
        n^(log_b a) = n² ≫ n ⇒ CASE 1: T(n) = Θ(n²).
        (See the numeric check below.)
      • T(n) = 4 T(n/2) + n²
        n^(log_b a) = n² = n² lg⁰ n ⇒ CASE 2: T(n) = Θ(n² lg n).
      • T(n) = 4 T(n/2) + n³
        n^(log_b a) = n² ≪ n³ ⇒ CASE 3: T(n) = Θ(n³).
      • T(n) = 4 T(n/2) + n²/lg n
        Master method does not apply!
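      As a quick sanity check of the first quiz item (ours, not part of
      the slides), the recurrence T(n) = 4 T(n/2) + n can be evaluated
      directly and compared with its master-method solution Θ(n²).
      With the base case T(1) = 1 assumed here, the ratio T(n)/n²
      settles toward a constant (in fact 2) as n grows.  A minimal C
      sketch:

      #include <stdio.h>

      /* Evaluate T(n) = 4 T(n/2) + n exactly for n a power of 2. */
      static double T(long n) {
          if (n <= 1) return 1.0;            /* assumed base case */
          return 4.0 * T(n / 2) + (double)n;
      }

      int main(void) {
          /* The ratio T(n)/n^2 should approach a constant (CASE 1). */
          for (long n = 2; n <= (1L << 20); n <<= 2)
              printf("n = %8ld   T(n)/n^2 = %.4f\n",
                     n, T(n) / ((double)n * (double)n));
          return 0;
      }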

  11. LECTURE 2
      • Recurrences (Review)
      • Matrix Multiplication
      • Merge Sort
      • Tableau Construction
      • Conclusion

  12. Square-Matrix Multiplication
      C = A × B, where A, B, and C are n×n matrices and
      c_ij = Σ_{k=1}^{n} a_ik b_kj.
      [Figure: the matrices C, A, and B written out entrywise.]
      Assume for simplicity that n = 2^k.

  13. Recursive Matrix Multiplication
      Divide and conquer: view each n×n matrix as a 2×2 block matrix of
      (n/2)×(n/2) submatrices.
      [C11 C12]   [A11 A12]   [B11 B12]
      [C21 C22] = [A21 A22] × [B21 B22]
                  [A11B11 A11B12]   [A12B21 A12B22]
                = [A21B11 A21B12] + [A22B21 A22B22]
      8 multiplications of (n/2)×(n/2) matrices.
      1 addition of n×n matrices.

  14. Matrix Multiply in Pseudo-Cilk   (C = A · B)
      cilk void Mult(*C, *A, *B, n) {
        float *T = Cilk_alloca(n*n*sizeof(float));
        ⟨base case & partition matrices⟩
        spawn Mult(C11,A11,B11,n/2);
        spawn Mult(C12,A11,B12,n/2);
        spawn Mult(C22,A21,B12,n/2);
        spawn Mult(C21,A21,B11,n/2);
        spawn Mult(T11,A12,B21,n/2);
        spawn Mult(T12,A12,B22,n/2);
        spawn Mult(T22,A22,B22,n/2);
        spawn Mult(T21,A22,B21,n/2);
        sync;
        spawn Add(C,T,n);   // C = C + T
        sync;
        return;
      }
      Note the absence of type declarations: this is pseudo-code.

  15. Matrix Multiply in Pseudo-Cilk   (C = A · B)
      (Same code as slide 14.)
      Note: coarsen base cases for efficiency.

  16. Matrix Multiply in Pseudo-Cilk   (C = A · B)
      (Same code as slide 14.)
      Notes: the code also needs a row-size argument for array indexing,
      and submatrices are produced by pointer calculation, not by
      copying elements (see the sketch below).
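      To make the row-size remark concrete, here is a minimal C sketch
      (our names and layout, not the lecture's) of how the quadrants of
      a row-major n×n matrix can be formed by pointer arithmetic alone,
      carrying the original row size (stride) along instead of copying
      elements:

      #include <stddef.h>

      typedef struct {
          float  *p;      /* pointer to the top-left element */
          size_t  stride; /* row size of the ORIGINAL matrix */
      } Mat;

      /* Quadrant (i,j) of an n×n matrix m, with i,j in {0,1}.
         Element (r,c) of any Mat m is m.p[r*m.stride + c].   */
      static Mat quadrant(Mat m, size_t n, int i, int j) {
          Mat q = { m.p + (size_t)i*(n/2)*m.stride + (size_t)j*(n/2),
                    m.stride };
          return q;
      }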

  17. Matrix Multiply in Pseudo-Cilk
      (Mult as on slide 14, computing C = A · B.)  The helper Add
      computes C = C + T:
      cilk void Add(*C, *T, n) {
        ⟨base case & partition matrices⟩
        spawn Add(C11,T11,n/2);
        spawn Add(C12,T12,n/2);
        spawn Add(C21,T21,n/2);
        spawn Add(C22,T22,n/2);
        sync;
        return;
      }

  18. Work of Matrix Addition
      (Add code as on slide 17)
      Work: A_1(n) = 4 A_1(n/2) + Θ(1)
                   = Θ(n²)   — CASE 1:
                     n^(log_b a) = n^(log₂ 4) = n² ≫ Θ(1).

  19. Span of Matrix Addition
      (Add code as on slide 17; the four spawned branches execute in
      parallel, so the span is the maximum over them, one A_∞(n/2) term)
      Span: A_∞(n) = A_∞(n/2) + Θ(1)
                   = Θ(lg n)   — CASE 2:
                     n^(log_b a) = n^(log₂ 1) = 1
                     ⇒ f(n) = Θ(n^(log_b a) lg⁰ n).

  20. Work of Matrix Multiplication
      (Mult code as on slide 14: 8 spawned multiplications, then Add)
      Work: M_1(n) = 8 M_1(n/2) + A_1(n) + Θ(1)
                   = 8 M_1(n/2) + Θ(n²)
                   = Θ(n³)   — CASE 1:
                     n^(log_b a) = n^(log₂ 8) = n³ ≫ Θ(n²).

  21. Span of Matrix Multiplication
      (Mult code as on slide 14; the eight spawned multiplications run
      in parallel, so the span is the maximum over them, followed by
      the span of Add)
      Span: M_∞(n) = M_∞(n/2) + A_∞(n) + Θ(1)
                   = M_∞(n/2) + Θ(lg n)
                   = Θ(lg² n)   — CASE 2:
                     n^(log_b a) = n^(log₂ 1) = 1
                     ⇒ f(n) = Θ(n^(log_b a) lg¹ n).

  22. Parallelism of Matrix Multiply
      Work:        M_1(n) = Θ(n³)
      Span:        M_∞(n) = Θ(lg² n)
      Parallelism: M_1(n)/M_∞(n) = Θ(n³/lg² n)
      For 1000×1000 matrices, parallelism ≈ (10³)³/10² = 10⁷.

  23. Stack Temporaries
      (Mult code as on slide 14, highlighting the line
       float *T = Cilk_alloca(n*n*sizeof(float));)
      In hierarchical-memory machines (especially chip multiprocessors),
      memory accesses are so expensive that minimizing storage often
      yields higher performance.
      IDEA: Trade off parallelism for less storage.

  24. No-Temp Matrix Multiplication
      cilk void MultA(*C, *A, *B, n) {   // C = C + A * B
        ⟨base case & partition matrices⟩
        spawn MultA(C11,A11,B11,n/2);
        spawn MultA(C12,A11,B12,n/2);
        spawn MultA(C22,A21,B12,n/2);
        spawn MultA(C21,A21,B11,n/2);
        sync;
        spawn MultA(C21,A22,B21,n/2);
        spawn MultA(C22,A22,B22,n/2);
        spawn MultA(C12,A12,B22,n/2);
        spawn MultA(C11,A12,B21,n/2);
        sync;
        return;
      }
      The second group of spawns accumulates into the same quadrants of
      C as the first, so it must wait for the first group; hence the
      extra sync.  Saves space, but at what expense?

  25. Work of No-Temp Multiply
      (MultA code as on slide 24)
      Work: M_1(n) = 8 M_1(n/2) + Θ(1)
                   = Θ(n³)   — CASE 1

  26. Span of No-Temp Multiply
      (MultA code as on slide 24; each group of four spawns contributes
      the maximum over its branches, and the two groups are serialized
      by the sync between them)
      Span: M_∞(n) = 2 M_∞(n/2) + Θ(1)
                   = Θ(n)   — CASE 1

  27. Parallelism of No-Temp Multiply
      Work:        M_1(n) = Θ(n³)
      Span:        M_∞(n) = Θ(n)
      Parallelism: M_1(n)/M_∞(n) = Θ(n²)
      For 1000×1000 matrices, parallelism ≈ (10³)³/10³ = 10⁶.
      Faster in practice!

  28. Testing Synchronization
      Cilk language feature: a programmer can check whether a Cilk
      procedure is "synched" (without actually performing a sync) by
      testing the pseudovariable SYNCHED:
      • SYNCHED = 0 ⇒ some spawned children might not have returned.
      • SYNCHED = 1 ⇒ all spawned children have definitely returned.

  29. Best of Both Worlds
      cilk void Mult1(*C, *A, *B, n) {     // multiply & store
        ⟨base case & partition matrices⟩
        spawn Mult1(C11,A11,B11,n/2);      // multiply & store
        spawn Mult1(C12,A11,B12,n/2);
        spawn Mult1(C22,A21,B12,n/2);
        spawn Mult1(C21,A21,B11,n/2);
        if (SYNCHED) {
          spawn MultA1(C11,A12,B21,n/2);   // multiply & add
          spawn MultA1(C12,A12,B22,n/2);
          spawn MultA1(C22,A22,B22,n/2);
          spawn MultA1(C21,A22,B21,n/2);
        } else {
          float *T = Cilk_alloca(n*n*sizeof(float));
          spawn Mult1(T11,A12,B21,n/2);    // multiply & store
          spawn Mult1(T12,A12,B22,n/2);
          spawn Mult1(T22,A22,B22,n/2);
          spawn Mult1(T21,A22,B21,n/2);
          sync;
          spawn Add(C,T,n);                // C = C + T
        }
        sync;
        return;
      }
      This code is just as parallel as the original, but it only uses
      more space if run-time parallelism actually exists.

  30. Ordinary Matrix Multiplication
      c_ij = Σ_{k=1}^{n} a_ik b_kj
      IDEA: Spawn n² inner products in parallel.  Compute each inner
      product in parallel.
      Work: Θ(n³)
      Span: Θ(lg n)
      Parallelism: Θ(n³/lg n)
      BUT, this algorithm exhibits poor locality and does not exploit
      the cache hierarchy of modern microprocessors, especially CMPs.
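      For reference (our addition, not the lecture's), the definition
      above translates directly into a serial C routine; it is useful
      as a correctness oracle when testing any of the parallel
      variants:

      /* Reference matrix multiply: C = A × B, row-major n×n. */
      void matmul_ref(float *C, const float *A, const float *B, int n) {
          for (int i = 0; i < n; i++)
              for (int j = 0; j < n; j++) {
                  float sum = 0.0f;
                  for (int k = 0; k < n; k++)  /* one inner product */
                      sum += A[i*n + k] * B[k*n + j];
                  C[i*n + j] = sum;
              }
      }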

  31. LECTURE 2
      • Recurrences (Review)
      • Matrix Multiplication
      • Merge Sort
      • Tableau Construction
      • Conclusion

  32. Merging Two Sorted Arrays
      Example: merging A = [3 12 19 46] with B = [4 14 21 23].
      void Merge(int *C, int *A, int *B, int na, int nb) {
        while (na>0 && nb>0) {
          if (*A <= *B) { *C++ = *A++; na--; }
          else          { *C++ = *B++; nb--; }
        }
        while (na>0) { *C++ = *A++; na--; }
        while (nb>0) { *C++ = *B++; nb--; }
      }
      Time to merge n elements = Θ(n).

  33. Merge Sort
      cilk void MergeSort(int *B, int *A, int n) {
        if (n==1) {
          B[0] = A[0];
        } else {
          int *C;
          C = (int*) Cilk_alloca(n*sizeof(int));
          spawn MergeSort(C, A, n/2);
          spawn MergeSort(C+n/2, A+n/2, n-n/2);
          sync;
          Merge(B, C, C+n/2, n/2, n-n/2);
        }
      }
      [Figure: recursion on 19 3 12 46 33 4 21 14; the sorted halves
      3 12 19 46 and 4 14 21 33 are merged into 3 4 12 14 19 21 33 46.]
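      For experimentation, here is a hedged serial C rendering of the
      slide's MergeSort (spawn/sync dropped, Cilk_alloca replaced by
      malloc) with a tiny driver on the figure's input; Merge is the
      routine from slide 32:

      #include <stdio.h>
      #include <stdlib.h>

      void Merge(int *C, int *A, int *B, int na, int nb); /* slide 32 */

      void MergeSortSerial(int *B, int *A, int n) {
          if (n == 1) { B[0] = A[0]; return; }
          int *C = malloc(n * sizeof(int));   /* heap, not alloca */
          MergeSortSerial(C, A, n/2);
          MergeSortSerial(C + n/2, A + n/2, n - n/2);
          Merge(B, C, C + n/2, n/2, n - n/2);
          free(C);
      }

      int main(void) {
          int a[] = {19, 3, 12, 46, 33, 4, 21, 14}, b[8];
          MergeSortSerial(b, a, 8);
          for (int i = 0; i < 8; i++)
              printf("%d ", b[i]);   /* 3 4 12 14 19 21 33 46 */
          printf("\n");
          return 0;
      }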

  34. Work of Merge Sort
      (MergeSort code as on slide 33)
      Work: T_1(n) = 2 T_1(n/2) + Θ(n)
                   = Θ(n lg n)   — CASE 2:
                     n^(log_b a) = n^(log₂ 2) = n
                     ⇒ f(n) = Θ(n^(log_b a) lg⁰ n).

  35. Span of Merge Sort
      (MergeSort code as on slide 33)
      Span: T_∞(n) = T_∞(n/2) + Θ(n)
                   = Θ(n)   — CASE 3:
                     n^(log_b a) = n^(log₂ 1) = 1 ≪ Θ(n).

  36. Parallelism of Merge Sort
      Work:        T_1(n) = Θ(n lg n)
      Span:        T_∞(n) = Θ(n)
      Parallelism: T_1(n)/T_∞(n) = Θ(lg n)   PUNY!
      We need to parallelize the merge!

  37. Parallel Merge
      Assume (by swapping the arrays if necessary) that na ≥ nb.
      • Take the median A[na/2] of the larger array A.
      • Binary-search B for the index j such that B[0..j] ≤ A[na/2] and
        B[j+1..nb-1] ≥ A[na/2].
      • Recursively merge A[0..na/2-1] with B[0..j], and in parallel
        recursively merge A[na/2..na-1] with B[j+1..nb-1].
      KEY IDEA: If the total number of elements to be merged in the two
      arrays is n = na + nb, the total number of elements in the larger
      of the two recursive merges is at most (3/4)n (derivation below).
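      The (3/4)n bound follows in two lines (our derivation, ignoring
      floors as the slides do, and using the convention na ≥ nb):

      \begin{align*}
      n_a \ge n_b \;\wedge\; n_a + n_b = n
        &\implies n_a \ge n/2, \\
      \text{each recursive merge receives} \ge n_a/2
        &\ge n/4 \text{ elements (half of } A \text{, plus some of } B),\\
      \text{so the larger merge receives} \le n - n/4
        &= \tfrac{3}{4}\,n \text{ elements}.
      \end{align*}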

  38. Parallel Merge
      cilk void P_Merge(int *C, int *A, int *B, int na, int nb) {
        if (na < nb) {
          spawn P_Merge(C, B, A, nb, na);
        } else if (na==1) {
          if (nb == 0) {
            C[0] = A[0];
          } else {
            C[0] = (A[0]<B[0]) ? A[0] : B[0];  /* minimum */
            C[1] = (A[0]<B[0]) ? B[0] : A[0];  /* maximum */
          }
        } else {
          int ma = na/2;
          int mb = BinarySearch(A[ma], B, nb);
          spawn P_Merge(C, A, B, ma, mb);
          spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb);
          sync;
        }
      }
      Coarsen base cases for efficiency.
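      The slides call BinarySearch(A[ma], B, nb) but never show its
      body.  A plausible sketch (our assumption about its contract):
      return the number of elements of the sorted array B that are
      less than the key, i.e. the position at which the key would be
      inserted:

      /* Lower-bound search: index of the first element of B[0..nb-1]
         that is >= key; equals nb if no such element exists.       */
      int BinarySearch(int key, int *B, int nb) {
          int lo = 0, hi = nb;
          while (lo < hi) {
              int mid = lo + (hi - lo) / 2;
              if (B[mid] < key) lo = mid + 1;
              else              hi = mid;
          }
          return lo;   /* 0 <= lo <= nb */
      }

      This search costs Θ(lg nb), which is where the Θ(lg n) terms in
      the work and span recurrences on the next slides come from.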

  39. Span of P_Merge
      (P_Merge code as on slide 38)
      Span: T_∞(n) = T_∞(3n/4) + Θ(lg n)
                   = Θ(lg² n)   — CASE 2:
                     n^(log_b a) = n^(log_{4/3} 1) = 1
                     ⇒ f(n) = Θ(n^(log_b a) lg¹ n).

  40. Work of P_Merge — HAIRY!
      (P_Merge code as on slide 38)
      Work: T_1(n) = T_1(αn) + T_1((1−α)n) + Θ(lg n),
            where 1/4 ≤ α ≤ 3/4.
      CLAIM: T_1(n) = Θ(n).

  41. Analysis of Work Recurrence
      T_1(n) = T_1(αn) + T_1((1−α)n) + Θ(lg n), where 1/4 ≤ α ≤ 3/4.
      Substitution method: the inductive hypothesis is
      T_1(k) ≤ c_1 k − c_2 lg k, where c_1, c_2 > 0.  Prove that the
      relation holds, and solve for c_1 and c_2.

  42. Analysis of Work Recurrence
      T_1(n) = T_1(αn) + T_1((1−α)n) + Θ(lg n)
             ≤ c_1(αn) − c_2 lg(αn)
               + c_1((1−α)n) − c_2 lg((1−α)n) + Θ(lg n)

  43. Analysis of Work Recurrence
      T_1(n) ≤ c_1(αn) − c_2 lg(αn)
               + c_1((1−α)n) − c_2 lg((1−α)n) + Θ(lg n)
             = c_1 n − c_2 ( lg(αn) + lg((1−α)n) ) + Θ(lg n)
             = c_1 n − c_2 ( lg(α(1−α)) + 2 lg n ) + Θ(lg n)
             = c_1 n − c_2 lg n
               − ( c_2 ( lg n + lg(α(1−α)) ) − Θ(lg n) )
             ≤ c_1 n − c_2 lg n
      by choosing c_1 and c_2 large enough.

  44. Parallelism of P_Merge
      Work:        T_1(n) = Θ(n)
      Span:        T_∞(n) = Θ(lg² n)
      Parallelism: T_1(n)/T_∞(n) = Θ(n/lg² n)

  45. Parallel Merge Sort
      cilk void P_MergeSort(int *B, int *A, int n) {
        if (n==1) {
          B[0] = A[0];
        } else {
          int *C;
          C = (int*) Cilk_alloca(n*sizeof(int));
          spawn P_MergeSort(C, A, n/2);
          spawn P_MergeSort(C+n/2, A+n/2, n-n/2);
          sync;
          spawn P_Merge(B, C, C+n/2, n/2, n-n/2);
        }
      }

  46. Work of Parallel Merge Sort
      (P_MergeSort code as on slide 45)
      Work: T_1(n) = 2 T_1(n/2) + Θ(n)
                   = Θ(n lg n)   — CASE 2

  47. Span of Parallel Merge Sort
      (P_MergeSort code as on slide 45)
      Span: T_∞(n) = T_∞(n/2) + Θ(lg² n)
                   = Θ(lg³ n)   — CASE 2:
                     n^(log_b a) = n^(log₂ 1) = 1
                     ⇒ f(n) = Θ(n^(log_b a) lg² n).

  48. Parallelism of Merge Sort
      Work:        T_1(n) = Θ(n lg n)
      Span:        T_∞(n) = Θ(lg³ n)
      Parallelism: T_1(n)/T_∞(n) = Θ(n/lg² n)

  49. LECTURE 2
      • Recurrences (Review)
      • Matrix Multiplication
      • Merge Sort
      • Tableau Construction
      • Conclusion

  50. Tableau Construction
      Problem: Fill in an n×n tableau A, where
      A[i, j] = f( A[i, j−1], A[i−1, j], A[i−1, j−1] ).
      Applications (dynamic programming):
      • Longest common subsequence
      • Edit distance
      • Time warping
      Work: Θ(n²).
      [Figure: an 8×8 tableau with cells indexed 00–77; each cell
      depends on its left, upper, and upper-left neighbors.]
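      A minimal serial C sketch of the tableau fill (the update
      function f used here, the maximum of the three neighbors plus
      one, is a placeholder of ours; the lecture leaves f abstract):

      /* Fill an n×n row-major tableau; row 0 and column 0 are
         assumed to be initialized by the caller.              */
      void fill_tableau(int *A, int n) {
          for (int i = 1; i < n; i++)
              for (int j = 1; j < n; j++) {
                  int left = A[i*n + (j-1)];
                  int up   = A[(i-1)*n + j];
                  int diag = A[(i-1)*n + (j-1)];
                  /* A[i][j] = f(A[i][j-1], A[i-1][j], A[i-1][j-1]) */
                  int m = left > up ? left : up;
                  A[i*n + j] = (diag > m ? diag : m) + 1; /* stand-in f */
              }
      }

      Each of the n² cells is filled in Θ(1) time, giving the Θ(n²)
      work stated on the slide.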
