
Parallel Algorithms

Parallel Algorithms. Patrick Cozzi, University of Pennsylvania, CIS 565 - Fall 2013. Announcements: Project 1 due Thursday 09/19. Reminders: commit often; make a great README.md. Philly Transit Hackathon this weekend: http://www.meetup.com/Code-for-America-Philly/events/136363492/


Presentation Transcript


  1. Parallel Algorithms Patrick Cozzi University of Pennsylvania CIS 565 - Fall 2013

  2. Announcements • Project 1 • Due Thursday 09/19 • Reminders • Commit often • Make a great README.md • Philly Transit Hackathon this weekend • http://www.meetup.com/Code-for-America-Philly/events/136363492/

  3. Review • SP, SM • Kernel, thread, warp, block, grid

  4. Agenda • Parallel Algorithms • Parallel Reduction • Scan • Stream Compaction • Summed Area Tables • Radix Sort

  5. Parallel Reduction • Given an array of numbers, design a parallel algorithm to find the sum. • Consider: • Arithmetic intensity: compute to memory access ratio

  6. Parallel Reduction • Given an array of numbers, design a parallel algorithm to find: • The sum • The maximum value • The product of values • The average value • How different are these algorithms?

  7. Parallel Reduction • Reduction: An operation that computes a single result from a set of data • Examples: • Minimum/maximum value • Average, sum, product, etc. • Parallel Reduction: Do it in parallel. Obviously

  8. Parallel Reduction • Example. Find the sum: 0 1 2 3 4 5 6 7

  9. Parallel Reduction • 0 1 2 3 4 5 6 7 • 1 5 9 13

  10. Parallel Reduction • 0 1 2 3 4 5 6 7 • 1 5 9 13 • 6 22

  11. Parallel Reduction • 0 1 2 3 4 5 6 7 • 1 5 9 13 • 6 22 • 28

  12. Parallel Reduction • Similar to brackets for a basketball tournament • log(n) passes for n elements
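The tournament-style passes above can be sketched in Python. This is a sequential simulation, not GPU code: each list comprehension stands in for one pass in which every pair would be combined by its own thread. The function name `tree_reduce` and the padding of odd-length levels with the identity are assumptions made for this illustration.

```python
import operator

def tree_reduce(values, op, identity):
    """Pairwise (tournament-style) reduction. Returns (result, passes)."""
    level = list(values)
    if not level:
        return identity, 0
    passes = 0
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(identity)  # pad odd-length levels (assumption)
        # On a GPU, each pair below could be combined by a separate thread.
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        passes += 1
    return level[0], passes

data = [0, 1, 2, 3, 4, 5, 6, 7]
total, passes = tree_reduce(data, operator.add, 0)
maximum, _ = tree_reduce(data, max, float("-inf"))
product, _ = tree_reduce(data, operator.mul, 1)
average = total / len(data)  # average = sum reduction plus one division
print(total, passes, maximum, product, average)
```

Only the operator and its identity change between sum, maximum, and product, which suggests one answer to "How different are these algorithms?": the pass structure is the same for any associative operator.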

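  13. All-Prefix-Sums • All-Prefix-Sums • Input • Array of n elements: $[a_0, a_1, \ldots, a_{n-1}]$ • Binary associative operator: $\oplus$ • Identity: $I$ • Outputs the array: $[I, a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-2})]$ • Images from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html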

  14. All-Prefix-Sums • Example • If $\oplus$ is addition, the array • [3 1 7 0 4 1 6 3] • is transformed to • [0 3 4 11 11 15 16 22] • Seems sequential, but there is an efficient parallel solution
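As a baseline, the all-prefix-sums definition can be written as a single-threaded loop that takes the operator and its identity as parameters. The name `exclusive_scan` and the choice of `max` as a second operator are illustrative assumptions, not part of the slides.

```python
import operator

def exclusive_scan(values, op, identity):
    """Sequential all-prefix-sums: out[j] combines values[0..j-1]."""
    out = []
    running = identity
    for v in values:
        out.append(running)       # result so far excludes the current element
        running = op(running, v)  # fold the current element in for the next slot
    return out

data = [3, 1, 7, 0, 4, 1, 6, 3]
sums = exclusive_scan(data, operator.add, 0)
maxes = exclusive_scan(data, max, float("-inf"))
print(sums)
print(maxes)
```

With addition this reproduces the example on the slide; any associative operator with an identity fits the same loop.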

  15. Scan • Scan: all-prefix-sums operation on an array of data • Exclusive Scan (Prescan): Element j of the result does not include element j of the input: • In: [3 1 7 0 4 1 6 3] • Out: [0 3 4 11 11 15 16 22] • Inclusive Scan: All elements up to and including j are summed • In: [3 1 7 0 4 1 6 3] • Out: [3 4 11 11 15 16 22 25]

  16. Scan • How do you generate an exclusive scan from an inclusive scan? • Input: [3 1 7 0 4 1 6 3] • Inclusive: [3 4 11 11 15 16 22 25] • Exclusive: [0 3 4 11 11 15 16 22] • // Shift right, insert identity • How do you go in the opposite direction?
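A minimal sketch of both conversions, assuming the operator is addition. The helper names are hypothetical. Going from exclusive back to inclusive needs the original input, because the last inclusive element is not present in the exclusive result.

```python
def inclusive_to_exclusive(inclusive, identity=0):
    """Shift right by one and insert the identity at the front."""
    return [identity] + inclusive[:-1]

def exclusive_to_inclusive(exclusive, values):
    """Add each input element back in: inclusive[j] = exclusive[j] + values[j]."""
    return [e + v for e, v in zip(exclusive, values)]

values = [3, 1, 7, 0, 4, 1, 6, 3]
inclusive = [3, 4, 11, 11, 15, 16, 22, 25]
exclusive = inclusive_to_exclusive(inclusive)
round_trip = exclusive_to_inclusive(exclusive, values)
print(exclusive)
print(round_trip)
```

Both directions are independent per element, so they would parallelize trivially.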

  17. Scan • Use cases • Stream compaction • Summed-area tables for variable width image processing • Radix sort • …

  18. Scan • Used to convert certain sequential computations into equivalent parallel computations • Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

  19. Scan • Design a parallel algorithm for exclusive scan • In: [3 1 7 0 4 1 6 3] • Out: [0 3 4 11 11 15 16 22] • Consider: • Total number of additions

  20. Scan • Sequential Scan: single thread, trivial • n adds for an array of length n • How many adds will our parallel version have? • Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

  21. Scan • Naive Parallel Scan • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k]; • Is this exclusive or inclusive? • Each thread • Writes one sum • Reads two values • Image from http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf

  22. Scan • Naive Parallel Scan: Input 0 1 2 3 4 5 6 7

  23. Scan • Naive Parallel Scan: d = 1, 2^(d-1) = 1 • 0 1 2 3 4 5 6 7 • 0 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  24. Scan • Naive Parallel Scan: d = 1, 2^(d-1) = 1 • 0 1 2 3 4 5 6 7 • 0 1 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  25. Scan • Naive Parallel Scan: d = 1, 2^(d-1) = 1 • 0 1 2 3 4 5 6 7 • 0 1 3 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  26. Scan • Naive Parallel Scan: d = 1, 2^(d-1) = 1 • 0 1 2 3 4 5 6 7 • 0 1 3 5 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  27. Scan • Naive Parallel Scan: d = 1, 2^(d-1) = 1 • 0 1 2 3 4 5 6 7 • 0 1 3 5 7 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  28. Scan • Naive Parallel Scan: d = 1, 2^(d-1) = 1 • 0 1 2 3 4 5 6 7 • 0 1 3 5 7 9 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  29. Scan • Naive Parallel Scan: d = 1, 2^(d-1) = 1 • 0 1 2 3 4 5 6 7 • 0 1 3 5 7 9 11 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  30. Scan • Naive Parallel Scan: d = 1, 2^(d-1) = 1 • 0 1 2 3 4 5 6 7 • 0 1 3 5 7 9 11 13 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  31. Scan • Naive Parallel Scan: d = 1, 2^(d-1) = 1 • 0 1 2 3 4 5 6 7 • Recall, it runs in parallel! • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  32. Scan • Naive Parallel Scan: d = 1, 2^(d-1) = 1 • 0 1 2 3 4 5 6 7 • 0 1 3 5 7 9 11 13 • Recall, it runs in parallel! • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  33. Scan • Naive Parallel Scan: d = 2, 2^(d-1) = 2 • Input: 0 1 2 3 4 5 6 7 • After d = 1: 0 1 3 5 7 9 11 13 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  34. Scan • Naive Parallel Scan: d = 2, 2^(d-1) = 2 • Input: 0 1 2 3 4 5 6 7 • After d = 1: 0 1 3 5 7 9 11 13 • Consider only k = 7: x[7] = x[5] + x[7] = 9 + 13 = 22 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  35. Scan • Naive Parallel Scan: d = 2, 2^(d-1) = 2 • Input: 0 1 2 3 4 5 6 7 • After d = 1: 0 1 3 5 7 9 11 13 • After d = 2: 0 1 3 6 10 14 18 22 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  36. Scan • Naive Parallel Scan: d = 3, 2^(d-1) = 4 • Input: 0 1 2 3 4 5 6 7 • After d = 1: 0 1 3 5 7 9 11 13 • After d = 2: 0 1 3 6 10 14 18 22 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  37. Scan • Naive Parallel Scan: d = 3, 2^(d-1) = 4 • Input: 0 1 2 3 4 5 6 7 • After d = 1: 0 1 3 5 7 9 11 13 • After d = 2: 0 1 3 6 10 14 18 22 • Consider only k = 7: x[7] = x[3] + x[7] = 6 + 22 = 28 • for d = 1 to log2(n): for all k in parallel: if (k >= 2^(d-1)) x[k] = x[k - 2^(d-1)] + x[k];

  38. Scan • Naive Parallel Scan: Final • Input: 0 1 2 3 4 5 6 7 • After d = 1: 0 1 3 5 7 9 11 13 • After d = 2: 0 1 3 6 10 14 18 22 • After d = 3: 0 1 3 6 10 15 21 28
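The walkthrough above can be reproduced with a short Python simulation of the naive scan. It is a sketch under two assumptions not stated on the slides: each pass reads from the previous pass's array and writes into a fresh copy (so a "thread" never reads a value that another "thread" already overwrote in the same pass), and the pass count is computed as ceil(log2(n)). The function name and the add counter are illustrative.

```python
def naive_inclusive_scan(values):
    """Simulate the naive parallel scan. Returns (result, per-pass arrays, adds)."""
    x = list(values)
    n = len(x)
    num_passes = max(n - 1, 0).bit_length()  # equals ceil(log2(n)) for n >= 1
    passes = []
    adds = 0
    for d in range(1, num_passes + 1):
        offset = 2 ** (d - 1)
        y = list(x)  # separate output buffer for this pass (assumption)
        for k in range(offset, n):  # every k would be its own thread on a GPU
            y[k] = x[k - offset] + x[k]
            adds += 1
        x = y
        passes.append(x)
    return x, passes, adds

result, passes, adds = naive_inclusive_scan([0, 1, 2, 3, 4, 5, 6, 7])
for d, row in enumerate(passes, start=1):
    print("after d =", d, row)
print("adds:", adds)
```

For the 8-element input this performs 7 + 6 + 4 = 17 additions, compared with the n adds of the sequential scan, and the final array includes each element's own value, so the result is an inclusive scan.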

  39. Stream Compaction • Stream Compaction • Given an array of elements • Create a new array with elements that meet a certain criterion, e.g. non null • Preserve order • a b c d e f g h

  40. Stream Compaction • Stream Compaction • Given an array of elements • Create a new array with elements that meet a certain criterion, e.g. non null • Preserve order • a b c d e f g h • a c d g

  41. Stream Compaction • Stream Compaction • Used in path tracing, collision detection, sparse matrix compression, etc. • Can reduce bandwidth from GPU to CPU • a b c d e f g h • a c d g

  42. Stream Compaction • Stream Compaction • Step 1: Compute temporary array containing • 1 if corresponding element meets the criterion • 0 if element does not meet the criterion • a b c d e f g h

  43. Stream Compaction • Stream Compaction • Step 1: Compute temporary array • a b c d e f g h • 1

  44. Stream Compaction • Stream Compaction • Step 1: Compute temporary array • a b c d e f g h • 1 0

  45. Stream Compaction • Stream Compaction • Step 1: Compute temporary array • a b c d e f g h • 1 0 1

  46. Stream Compaction • Stream Compaction • Step 1: Compute temporary array • a b c d e f g h • 1 0 1 1 0 0 1 0

  47. Stream Compaction • Stream Compaction • Step 1: Compute temporary array • a b c d e f g h • It runs in parallel!

  48. Stream Compaction • Stream Compaction • Step 1: Compute temporary array • a b c d e f g h • 1 0 1 1 0 0 1 0 • It runs in parallel!

  49. Stream Compaction • Stream Compaction • Step 2: Run exclusive scan on temporary array • a b c d e f g h • 1 0 1 1 0 0 1 0 • Scan result:

  50. Stream Compaction • Stream Compaction • Step 2: Run exclusive scan on temporary array • Scan runs in parallel • What can we do with the results? • a b c d e f g h • 1 0 1 1 0 0 1 0 • Scan result: 0 1 1 2 3 3 3 4
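One common use of the scan result, sketched here as an assumption about where the walkthrough is heading: for every element whose flag is 1, the scan value is that element's index in the compacted output, so each element can be written to its final position independently (a scatter). The names `compact` and `keep`, and the predicate that keeps a, c, d and g, are hypothetical and chosen to match the slide's example.

```python
def exclusive_scan(flags):
    """Sequential stand-in for the parallel exclusive scan of Step 2."""
    out, running = [], 0
    for f in flags:
        out.append(running)
        running += f
    return out

def compact(elements, keep):
    # Step 1: 1 if the element meets the criterion, 0 otherwise.
    flags = [1 if keep(e) else 0 for e in elements]
    # Step 2: exclusive scan of the flags.
    indices = exclusive_scan(flags)
    # Output length = last scan value + last flag.
    count = indices[-1] + flags[-1] if elements else 0
    result = [None] * count
    # Scatter: each iteration is independent, so it could run one thread per element.
    for i, e in enumerate(elements):
        if flags[i]:
            result[indices[i]] = e
    return flags, indices, result

elements = ["a", "b", "c", "d", "e", "f", "g", "h"]
wanted = {"a", "c", "d", "g"}  # hypothetical criterion matching the slide
flags, indices, result = compact(elements, lambda e: e in wanted)
print(flags)
print(indices)
print(result)
```

Because the scan is exclusive, kept elements receive consecutive indices starting at 0, which preserves their original order.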
