
Automatic Parallelization of Divide and Conquer Algorithms

Radu Rugina and Martin Rinard, Laboratory for Computer Science, Massachusetts Institute of Technology

Presentation Transcript


  1. Automatic Parallelization of Divide and Conquer Algorithms
     Radu Rugina and Martin Rinard
     Laboratory for Computer Science, Massachusetts Institute of Technology

  2. Outline
     • Example
     • Information required to parallelize divide and conquer algorithms
     • How compiler extracts parallelism
     • Key technique: constraint systems
     • Results
     • Related work
     • Conclusion

  3. Example - Divide and Conquer Sort
     Input:    7 4 6 1 3 5 8 2

  4. Example - Divide and Conquer Sort
     Input:    7 4 6 1 3 5 8 2
     Divide:   7 4 | 6 1 | 3 5 | 8 2

  5. Example - Divide and Conquer Sort
     Input:    7 4 6 1 3 5 8 2
     Divide:   7 4 | 6 1 | 3 5 | 8 2
     Conquer:  4 7 | 1 6 | 3 5 | 2 8

  6. Example - Divide and Conquer Sort
     Input:    7 4 6 1 3 5 8 2
     Divide:   7 4 | 6 1 | 3 5 | 8 2
     Conquer:  4 7 | 1 6 | 3 5 | 2 8
     Combine:  1 4 6 7 | 2 3 5 8

  7. Example - Divide and Conquer Sort
     Input:    7 4 6 1 3 5 8 2
     Divide:   7 4 | 6 1 | 3 5 | 8 2
     Conquer:  4 7 | 1 6 | 3 5 | 2 8
     Combine:  1 4 6 7 | 2 3 5 8
               1 2 3 4 5 6 7 8

  8. Divide and Conquer Algorithms
     • Lots of Generated Concurrency
       • Solve Subproblems in Parallel

  10. Divide and Conquer Algorithms
      • Lots of Recursively Generated Concurrency
        • Recursively Solve Subproblems in Parallel

  11. Divide and Conquer Algorithms
      • Lots of Recursively Generated Concurrency
        • Recursively Solve Subproblems in Parallel
        • Combine Results in Parallel

  12. Divide and Conquer Algorithms
      • Lots of Recursively Generated Concurrency
        • Recursively Solve Subproblems in Parallel
        • Combine Results in Parallel
      • Good Cache Performance
        • Problems Naturally Scale to Fit in Cache
        • No Cache Size Constants in Code

  13. Divide and Conquer Algorithms
      • Lots of Recursively Generated Concurrency
        • Recursively Solve Subproblems in Parallel
        • Combine Results in Parallel
      • Good Cache Performance
        • Problems Naturally Scale to Fit in Cache
        • No Cache Size Constants in Code
      • Lots of Programs
        • Sort Programs
        • Dense Matrix Programs

  14. “Sort n Items in d, Using t as Temporary Storage”

      void sort(int *d, int *t, int n) {
        if (n > CUTOFF) {
          sort(d,t,n/4);
          sort(d+n/4,t+n/4,n/4);
          sort(d+n/2,t+n/2,n/4);
          sort(d+3*(n/4),t+3*(n/4),n-3*(n/4));
          merge(d,d+n/4,d+n/2,t);
          merge(d+n/2,d+3*(n/4),d+n,t+n/2);
          merge(t,t+n/2,t+n,d);
        } else insertionSort(d,d+n);
      }

  15. “Recursively Sort Four Quarters of d”
      d:  7 4 | 6 1 | 3 5 | 8 2

          sort(d,t,n/4);
          sort(d+n/4,t+n/4,n/4);
          sort(d+n/2,t+n/2,n/4);
          sort(d+3*(n/4),t+3*(n/4),n-3*(n/4));

      Subproblems Identified Using Pointers Into Middle of Array: d, d+n/4, d+n/2, d+3*(n/4)

  16. “Recursively Sort Four Quarters of d”
      d:  4 7 | 1 6 | 3 5 | 2 8

          sort(d,t,n/4);
          sort(d+n/4,t+n/4,n/4);
          sort(d+n/2,t+n/2,n/4);
          sort(d+3*(n/4),t+3*(n/4),n-3*(n/4));

      Sorted Results Written Back Into Input Array: d, d+n/4, d+n/2, d+3*(n/4)

  17. “Merge Sorted Quarters of d Into Halves of t”
      d:  4 7 | 1 6 | 3 5 | 2 8
      t:  1 4 6 7 | 2 3 5 8

          merge(d,d+n/4,d+n/2,t);
          merge(d+n/2,d+3*(n/4),d+n,t+n/2);

      Pointers shown: d, t, t+n/2

  18. “Merge Sorted Halves of t Back Into d”
      t:  1 4 6 7 | 2 3 5 8
      d:  1 2 3 4 5 6 7 8

          merge(t,t+n/2,t+n,d);

      Pointers shown: t, t+n/2, d

  19. “Use a Simple Sort for Small Problem Sizes”
      d:  7 4 6 1 3 5 8 2

          if (n > CUTOFF) { ... } else insertionSort(d,d+n);

      Pointers shown: d, d+n

  20. “Use a Simple Sort for Small Problem Sizes”
      d:  7 4 1 6 3 5 8 2

          if (n > CUTOFF) { ... } else insertionSort(d,d+n);

      Pointers shown: d, d+n
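The walkthrough above can be assembled into a single compilable file. This is a minimal sketch, not the authors' code. The value of CUTOFF is an assumption (the slides do not give one, and it must be at least 3 for the recursion to terminate), the local names q and h are mine, and the start of the second half is written as 2*(n/4) rather than n/2 so that the four pieces tile the array for every n. The two offsets coincide whenever n is a multiple of 4. The inner loop of insertionSort is also rearranged slightly so that no pointer below l is formed.

```c
/* Assumed value; the slides leave CUTOFF unspecified. Must be >= 3. */
#define CUTOFF 4

/* Sort the half-open range [l, h) in place by insertion. */
static void insertionSort(int *l, int *h) {
    for (int *p = l + 1; p < h; p++) {
        int k = *p;
        int *q = p;
        for (; q > l && k < *(q - 1); q--)
            *q = *(q - 1);          /* shift larger elements right */
        *q = k;
    }
}

/* Merge sorted runs [l1, m) and [m, h2) into the buffer starting at d. */
static void merge(int *l1, int *m, int *h2, int *d) {
    int *h1 = m, *l2 = m;
    while (l1 < h1 && l2 < h2)
        *d++ = (*l1 < *l2) ? *l1++ : *l2++;
    while (l1 < h1) *d++ = *l1++;
    while (l2 < h2) *d++ = *l2++;
}

/* Sort n items in d, using t (also n items) as temporary storage. */
void sort(int *d, int *t, int n) {
    if (n > CUTOFF) {
        int q = n / 4;   /* quarter size */
        int h = 2 * q;   /* start of the second half (n/2 on the slide) */
        sort(d,         t,         q);
        sort(d + q,     t + q,     q);
        sort(d + h,     t + h,     q);
        sort(d + 3 * q, t + 3 * q, n - 3 * q);
        merge(d,     d + q,     d + h, t);       /* quarters 1,2 into t[0..h) */
        merge(d + h, d + 3 * q, d + n, t + h);   /* quarters 3,4 into t[h..n) */
        merge(t, t + h, t + n, d);               /* halves of t back into d   */
    } else {
        insertionSort(d, d + n);
    }
}
```

A caller supplies both arrays, for example `int d[8] = {7,4,6,1,3,5,8,2}; int t[8]; sort(d, t, 8);`, after which d holds the sorted values and t holds scratch data.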

  21. Parallel Execution

      void sort(int *d, int *t, int n) {
        if (n > CUTOFF) {
          spawn sort(d,t,n/4);
          spawn sort(d+n/4,t+n/4,n/4);
          spawn sort(d+n/2,t+n/2,n/4);
          spawn sort(d+3*(n/4),t+3*(n/4),n-3*(n/4));
          sync;
          spawn merge(d,d+n/4,d+n/2,t);
          spawn merge(d+n/2,d+3*(n/4),d+n,t+n/2);
          sync;
          merge(t,t+n/2,t+n,d);
        } else insertionSort(d,d+n);
      }
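The spawn and sync keywords are Cilk-style constructs and are not part of standard C. As a rough stand-in, the same structure can be sketched with OpenMP tasks, where each spawn becomes a `#pragma omp task` and each sync a `#pragma omp taskwait`. The choice of OpenMP, the CUTOFF value, the names sortTask and parallelSort, and the use of 2*(n/4) for the midpoint are all assumptions of this sketch rather than anything the slides specify.

```c
/* Assumed value; the slides leave CUTOFF unspecified. Must be >= 3. */
#define CUTOFF 4

/* Sort the half-open range [l, h) in place by insertion. */
static void insertionSort(int *l, int *h) {
    for (int *p = l + 1; p < h; p++) {
        int k = *p;
        int *q = p;
        for (; q > l && k < *(q - 1); q--)
            *q = *(q - 1);
        *q = k;
    }
}

/* Merge sorted runs [l1, m) and [m, h2) into the buffer starting at d. */
static void merge(int *l1, int *m, int *h2, int *d) {
    int *h1 = m, *l2 = m;
    while (l1 < h1 && l2 < h2)
        *d++ = (*l1 < *l2) ? *l1++ : *l2++;
    while (l1 < h1) *d++ = *l1++;
    while (l2 < h2) *d++ = *l2++;
}

/* Recursive body. Locals and parameters are firstprivate in each task
   by OpenMP's default data-sharing rules, so every task sees its own copy. */
static void sortTask(int *d, int *t, int n) {
    if (n > CUTOFF) {
        int q = n / 4, h = 2 * q;
        #pragma omp task
        sortTask(d, t, q);
        #pragma omp task
        sortTask(d + q, t + q, q);
        #pragma omp task
        sortTask(d + h, t + h, q);
        #pragma omp task
        sortTask(d + 3 * q, t + 3 * q, n - 3 * q);
        #pragma omp taskwait                 /* sync: all quarters sorted */
        #pragma omp task
        merge(d, d + q, d + h, t);
        #pragma omp task
        merge(d + h, d + 3 * q, d + n, t + h);
        #pragma omp taskwait                 /* sync: both halves merged  */
        merge(t, t + h, t + n, d);
    } else {
        insertionSort(d, d + n);
    }
}

/* Entry point: one thread creates the root task, the team executes tasks. */
void parallelSort(int *d, int *t, int n) {
    #pragma omp parallel
    #pragma omp single
    sortTask(d, t, n);
}
```

Compiled without OpenMP support the pragmas are ignored and the code runs sequentially with the same result, which is a convenient way to test the logic before measuring speedup.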

  22. What Do You Need to Know to Exploit this Form of Parallelism?

  23. What Do You Need to Know to Exploit this Parallelism?
      Calls to sort access disjoint parts of d and t
      Together, calls access [d,d+n-1] and [t,t+n-1]

          sort(d,t,n/4);
          sort(d+n/4,t+n/4,n/4);
          sort(d+n/2,t+n/2,n/4);
          sort(d+3*(n/4),t+3*(n/4),n-3*(n/4));

      (Figure: for each call, the portion of [d,d+n-1] and [t,t+n-1] it accesses)

  24. What Do You Need to Know to Exploit this Parallelism?
      First two calls to merge access disjoint parts of d and t
      Together, calls access [d,d+n-1] and [t,t+n-1]

          merge(d,d+n/4,d+n/2,t);
          merge(d+n/2,d+3*(n/4),d+n,t+n/2);
          merge(t,t+n/2,t+n,d);

      (Figure: for each call, the portion of [d,d+n-1] and [t,t+n-1] it accesses)

  25. What Do You Need to Know to Exploit this Parallelism?
      Calls to insertionSort access [d,d+n-1]

          insertionSort(d,d+n);

      (Figure: the accessed portion of [d,d+n-1]; [t,t+n-1] is shown alongside)

  26. What Do You Need to Know to Exploit this Parallelism?
      The Regions of Memory Accessed by Complete Executions of Procedures
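Once the accessed regions are known, the independence test itself is interval arithmetic. The sketch below assumes regions are represented as closed intervals of element offsets from the base of d. The Region type and the function names are hypothetical and are not taken from the talk.

```c
#include <stdbool.h>

/* Closed interval [lo, hi] of element offsets from an array base. */
typedef struct { int lo, hi; } Region;

/* Two closed intervals are disjoint when one ends before the other starts. */
static bool disjoint(Region a, Region b) {
    return a.hi < b.lo || b.hi < a.lo;
}

/* Regions of d touched by the four recursive sort calls, using the
   slide's offsets n/4, n/2 and 3*(n/4). A call sort(d+x, t+x, m)
   accesses offsets [x, x+m-1]. */
static void sortCallRegions(int n, Region r[4]) {
    r[0] = (Region){0,           n / 4 - 1};
    r[1] = (Region){n / 4,       n / 4 + n / 4 - 1};
    r[2] = (Region){n / 2,       n / 2 + n / 4 - 1};
    r[3] = (Region){3 * (n / 4), n - 1};
}

/* True when every pair of the four calls touches disjoint parts of d. */
bool sortCallsIndependent(int n) {
    Region r[4];
    sortCallRegions(n, r);
    for (int i = 0; i < 4; i++)
        for (int j = i + 1; j < 4; j++)
            if (!disjoint(r[i], r[j]))
                return false;
    return true;
}
```

For n = 16 the regions are [0,3], [4,7], [8,11] and [12,15], so the check succeeds. With truncating integer division it reports an overlap for sizes such as n = 6, where n/2 and 3*(n/4) are both 3, so the pictures on the slides correspond to n divisible by 4.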

  27. How Hard Is it to Extract these Regions?

  28. How Hard Is it to Extract these Regions? Challenging

  29. How Hard Is it to Extract these Regions?

      void insertionSort(int *l, int *h) {
        int *p, *q, k;
        for (p = l+1; p < h; p++) {
          for (k = *p, q = p-1; l <= q && k < *q; q--)
            *(q+1) = *q;
          *(q+1) = k;
        }
      }

      Not Immediately Obvious That insertionSort(l,h) Accesses [l,h-1]
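Short of static analysis, one way to build confidence in the claimed region is to instrument the accesses. The sketch below is a dynamic check on one input, not the compiler's technique. It restates insertionSort over offsets from l and routes every read and write through hypothetical helpers rd and wr that record the smallest and largest offset touched.

```c
#include <limits.h>

static int minOff, maxOff;   /* extremes of the offsets accessed so far */

static void note(int i) {
    if (i < minOff) minOff = i;
    if (i > maxOff) maxOff = i;
}

static int rd(const int *a, int i) { note(i); return a[i]; }
static void wr(int *a, int i, int v) { note(i); a[i] = v; }

/* insertionSort(l, l+n) from the slide, expressed with offsets from l.
   The test 0 <= q is evaluated before rd(l, q), so offset -1 is never read. */
void tracedInsertionSort(int *l, int n) {
    minOff = INT_MAX;
    maxOff = INT_MIN;
    for (int p = 1; p < n; p++) {
        int k = rd(l, p);
        int q = p - 1;
        for (; 0 <= q && k < rd(l, q); q--)
            wr(l, q + 1, rd(l, q));
        wr(l, q + 1, k);
    }
}
```

After a run on n elements, minOff and maxOff bound the offsets actually touched, which can be compared with the claimed region [l,h-1], that is, offsets 0 through n-1.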

  30. How Hard Is it to Extract these Regions?

      void merge(int *l1, int *m, int *h2, int *d) {
        int *h1 = m;
        int *l2 = m;
        while ((l1 < h1) && (l2 < h2))
          if (*l1 < *l2) *d++ = *l1++;
          else *d++ = *l2++;
        while (l1 < h1) *d++ = *l1++;
        while (l2 < h2) *d++ = *l2++;
      }

      Not Immediately Obvious That merge(l,m,h,d) Accesses [l,h-1] and [d,d+(h-l)-1]
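The size of the written region can be observed directly with a small variant of merge that returns the final value of d. The changed return type and the name mergeEnd are hypothetical, introduced only for this check. The loop structure is the one on the slide.

```c
/* Same loops as merge on the slide, but returns one past the last
   element written, so the caller can measure the output region. */
int *mergeEnd(int *l1, int *m, int *h2, int *d) {
    int *h1 = m;
    int *l2 = m;
    while ((l1 < h1) && (l2 < h2)) {
        if (*l1 < *l2) *d++ = *l1++;
        else           *d++ = *l2++;
    }
    while (l1 < h1) *d++ = *l1++;   /* drain the first run  */
    while (l2 < h2) *d++ = *l2++;   /* drain the second run */
    return d;
}
```

Every loop iteration writes exactly one element and consumes exactly one input element, so the returned pointer is the original d plus (h - l). This is the d+(h-l)-1 upper bound stated on the slide, seen from the one-past-the-end side.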

  31. Issues
      • Pervasive Use of Pointers
        • Pointers into Middle of Arrays
        • Pointer Arithmetic
        • Pointer Comparison
      • Multiple Procedures
        • sort(int *d, int *t, int n)
        • insertionSort(int *l, int *h)
        • merge(int *l, int *m, int *h, int *t)
      • Recursion

  32. How The Compiler Does It

  33. Structure of Compiler
      • Pointer Analysis: Disambiguate References at Granularity of Arrays
      • Bounds Analysis: Symbolic Upper and Lower Bounds for Each Memory Access in Each Procedure
      • Region Analysis: Symbolic Regions Accessed By Execution of Each Procedure
      • Parallelization: Independent Procedure Calls That Can Execute in Parallel

  34. Example

      void f(char *p, int n) {
        if (n > CUTOFF) {
          f(p, n/2);        /* initialize first half */
          f(p+n/2, n/2);    /* initialize second half */
        } else {
          /* base case: initialize small array */
          int i = 0;
          while (i < n) { *(p+i) = 0; i++; }
        }
      }

  35. Bounds Analysis
      • For each variable at each program point, derive upper and lower bounds for value
      • Bounds are symbolic expressions
        • symbolic variables in expressions represent initial values of parameters
        • linear combinations of these variables
        • multivariate polynomials

  36. Bounds Analysis
      What are upper and lower bounds for region accessed by while loop in base case?

          int i = 0;
          while (i < n) { *(p+i) = 0; i++; }

  37. Bounds Analysis, Step 1
      Build control flow graph

          i = 0  ->  i < n  ->  *(p+i) = 0; i = i+1  ->  (back to i < n)

  38. Bounds Analysis, Step 2
      Number different versions of variables

          i0 = 0  ->  i1 < n  ->  *(p+i2) = 0; i3 = i2+1  ->  (back to i1 < n)

  39. Bounds Analysis, Step 3
      Set up constraints for lower bounds

          i0 = 0              l(i0) <= 0
          i1 < n              l(i1) <= l(i0)
                              l(i1) <= l(i3)
          *(p+i2) = 0;        l(i2) <= l(i1)
          i3 = i2+1           l(i3) <= l(i2)+1

  42. Bounds Analysis, Step 4
      Set up constraints for upper bounds

          i0 = 0              l(i0) <= 0              0 <= u(i0)
          i1 < n              l(i1) <= l(i0)          u(i0) <= u(i1)
                              l(i1) <= l(i3)          u(i3) <= u(i1)
          *(p+i2) = 0;        l(i2) <= l(i1)          min(u(i1),n-1) <= u(i2)
          i3 = i2+1           l(i3) <= l(i2)+1        u(i2)+1 <= u(i3)

  44. Bounds Analysis, Step 4
      Set up constraints for upper bounds

          i0 = 0              l(i0) <= 0              0 <= u(i0)
          i1 < n              l(i1) <= l(i0)          u(i0) <= u(i1)
                              l(i1) <= l(i3)          u(i3) <= u(i1)
          *(p+i2) = 0;        l(i2) <= l(i1)          n-1 <= u(i2)
          i3 = i2+1           l(i3) <= l(i2)+1        u(i2)+1 <= u(i3)

  45. Bounds Analysis, Step 5
      Generate symbolic expressions for bounds
      Goal: express bounds in terms of parameters

          l(i0) = c1*p  + c2*n  + c3         u(i0) = c13*p + c14*n + c15
          l(i1) = c4*p  + c5*n  + c6         u(i1) = c16*p + c17*n + c18
          l(i2) = c7*p  + c8*n  + c9         u(i2) = c19*p + c20*n + c21
          l(i3) = c10*p + c11*n + c12        u(i3) = c22*p + c23*n + c24

  46. Bounds Analysis, Step 6
      Substitute expressions into constraints

          Lower bounds:
          c1*p + c2*n + c3      <= 0
          c4*p + c5*n + c6      <= c1*p + c2*n + c3
          c4*p + c5*n + c6      <= c10*p + c11*n + c12
          c7*p + c8*n + c9      <= c4*p + c5*n + c6
          c10*p + c11*n + c12   <= c7*p + c8*n + c9 + 1

          Upper bounds:
          0                         <= c13*p + c14*n + c15
          c13*p + c14*n + c15       <= c16*p + c17*n + c18
          c22*p + c23*n + c24       <= c16*p + c17*n + c18
          n - 1                     <= c19*p + c20*n + c21
          c19*p + c20*n + c21 + 1   <= c22*p + c23*n + c24

  47. Goal
      Solve Symbolic Constraint System: find values for constraint variables c1, ..., c24 that satisfy the inequality constraints
      • Maximize Lower Bounds
      • Minimize Upper Bounds

  48. Bounds Analysis, Step 7
      Apply expression ordering principle

          c1*p + c2*n + c3 <= c4*p + c5*n + c6
          if c1 <= c4, c2 <= c5, and c3 <= c6
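The ordering principle replaces a comparison between symbolic expressions with three comparisons between coefficients. The implication only holds when p and n are non-negative, which the sketch below assumes. The LinExpr type and the function names are illustrative and do not come from the talk.

```c
#include <stdbool.h>

/* Symbolic linear expression cp*p + cn*n + c0 over the parameters p and n. */
typedef struct { long cp, cn, c0; } LinExpr;

/* Value of the expression for concrete parameter values. */
long eval(LinExpr e, long p, long n) {
    return e.cp * p + e.cn * n + e.c0;
}

/* Sufficient (not necessary) test for a <= b at all p, n >= 0:
   compare the expressions coefficient by coefficient. */
bool coeffLeq(LinExpr a, LinExpr b) {
    return a.cp <= b.cp && a.cn <= b.cn && a.c0 <= b.c0;
}
```

For instance, n-1 is {0, 1, -1} and n is {0, 1, 0}, so coeffLeq accepts the pair without ever looking at a concrete value of n. This is what lets a constraint over all parameter values become a finite set of linear constraints on the coefficients.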

  49. Bounds Analysis, Step 7
      Apply expression ordering principle
      Generate a linear program

      Objective Function: max (c1 + ••• + c12) - (c13 + ••• + c24)

          Lower bounds:
          c1  <= 0       c2  <= 0       c3  <= 0
          c4  <= c1      c5  <= c2      c6  <= c3
          c4  <= c10     c5  <= c11     c6  <= c12
          c7  <= c4      c8  <= c5      c9  <= c6
          c10 <= c7      c11 <= c8      c12 <= c9+1

          Upper bounds:
          0   <= c13     0   <= c14     0     <= c15
          c13 <= c16     c14 <= c17     c15   <= c18
          c22 <= c16     c23 <= c17     c24   <= c18
          0   <= c19     1   <= c20     -1    <= c21
          c19 <= c22     c20 <= c23     c21+1 <= c24
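A candidate assignment to c1, ..., c24 can be checked against this linear program mechanically. The sketch below is a feasibility checker plus the objective value, not a solver. The array layout (index 0 unused), the restriction to integer coefficients, and the function names are assumptions made for illustration.

```c
#include <stdbool.h>

/* c[1..24] are the coefficients of Step 5; c[0] is unused. */
bool satisfiesStep7(const int c[25]) {
    return
        /* lower bounds */
        c[1]  <= 0     && c[2]  <= 0     && c[3]  <= 0        &&
        c[4]  <= c[1]  && c[5]  <= c[2]  && c[6]  <= c[3]     &&
        c[4]  <= c[10] && c[5]  <= c[11] && c[6]  <= c[12]    &&
        c[7]  <= c[4]  && c[8]  <= c[5]  && c[9]  <= c[6]     &&
        c[10] <= c[7]  && c[11] <= c[8]  && c[12] <= c[9] + 1 &&
        /* upper bounds */
        0     <= c[13] && 0     <= c[14] && 0     <= c[15]    &&
        c[13] <= c[16] && c[14] <= c[17] && c[15] <= c[18]    &&
        c[22] <= c[16] && c[23] <= c[17] && c[24] <= c[18]    &&
        0     <= c[19] && 1     <= c[20] && -1    <= c[21]    &&
        c[19] <= c[22] && c[20] <= c[23] && c[21] + 1 <= c[24];
}

/* Objective of the linear program: (c1+...+c12) - (c13+...+c24). */
int objective(const int c[25]) {
    int s = 0;
    for (int i = 1; i <= 12; i++)  s += c[i];
    for (int i = 13; i <= 24; i++) s -= c[i];
    return s;
}
```

As a usage example, the bounds l(i0) = l(i1) = l(i2) = l(i3) = 0, u(i0) = 0, u(i1) = n, u(i2) = n-1, u(i3) = n correspond to all coefficients being zero except c17 = 1, c20 = 1, c21 = -1 and c23 = 1, and that assignment passes the check.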

  50. Bounds Analysis, Step 8
      Solve linear program to extract bounds

          i0 = 0              l(i0) = 0        u(i0) = 0
          i1 < n              l(i1) = 0        u(i1) = n
          *(p+i2) = 0;        l(i2) = 0        u(i2) = n-1
          i3 = i2+1           l(i3) = 0        u(i3) = n
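These bounds can be sanity-checked by running the base-case loop and testing each numbered version of i against its interval. This is a dynamic check on one value of n, not a proof. It assumes n >= 0, and the function name boundsHold is hypothetical.

```c
#include <stdbool.h>

/* Runs the base-case loop on p[0..n-1] and reports whether every value
   of i observed at each numbered program point stayed inside the
   bounds of Step 8. Assumes n >= 0 and that p has room for n bytes. */
bool boundsHold(char *p, int n) {
    bool ok = true;
    int i = 0;                                  /* i0 */
    ok = ok && (0 <= i && i <= 0);              /* l(i0) = 0, u(i0) = 0   */
    for (;;) {
        ok = ok && (0 <= i && i <= n);          /* i1: at the loop test   */
        if (!(i < n)) break;
        ok = ok && (0 <= i && i <= n - 1);      /* i2: at the store       */
        *(p + i) = 0;
        i = i + 1;                              /* i3: after increment    */
        ok = ok && (0 <= i && i <= n);
    }
    return ok;
}
```

Since i2 is the index used in the store `*(p+i2) = 0`, its interval [0, n-1] is what yields the accessed region [p, p+n-1] for the base case.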
