
Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder






Presentation Transcript


  1. Chapter 4: First Steps Toward Parallel Programming. Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder

  2. Toward writing parallel programs
  • Build intuition for parallelism
  • When to parallelize, and when the overhead is too great
  • Consider: data allocation, work allocation, data structure design, algorithms

  3. Three ways to formulate parallel computations
  • Unlimited parallelism
  • Fixed parallelism
  • Scalable parallelism

  4. Two classes of parallel algorithms
  • Data parallel
  • Task parallel

  5. Data parallel
  • Perform the same computation on different data items at the same time
  • Parallelism grows as the data grows
  • Example: P chefs preparing N meals, each chef preparing N/P meals
  • As N increases, P can also increase, limited only by practical constraints (see the sketch below)
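
A minimal sketch of the chef example in the Peril-L notation introduced on the following slides; prepare() and meal are illustrative names, not from the book:

      int meal[N];                      // global: N meals to prepare
      forall (chef in (0..P-1)) {       // one thread per chef
        for (i = chef*(N/P); i < (chef+1)*(N/P); i++)
          prepare(meal[i]);             // same computation, different data items
      }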

  6. Task parallel
  • Perform distinct computations at the same time
  • The number of tasks is typically fixed, so the approach is not scalable
  • Example: one chef for the salad, one for the dessert, one for the appetizer
  • There are dependencies among tasks, which enables pipelining
  • A hybrid of data and task parallelism is often used (see the sketch below)
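
A matching task-parallel sketch, again in Peril-L notation; the three task functions are hypothetical:

      forall (t in (0..2)) {            // one thread per task; the task count is fixed
        if (t == 0) makeSalad();        // distinct computations...
        if (t == 1) makeDessert();      // ...running at the same time
        if (t == 2) makeAppetizer();
      }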

  7. Pseudocode – Peril-L
  • Minimal and easy to learn
  • Not tied to any particular language
  • Allows reasoning about performance
  • Extends C

  8. Peril-L: threads
  • forall spawns one thread per index value:

      forall (i in (1..12))
        printf("Hello %i\n", i);

  • Prints 12 Hello's in an unpredictable order
  • Threads compete and execute in parallel

  9. Peril-L: exclusive
  • Only one thread executes the body at a time:

      forall (i in (1..12)) {
        exclusive {
          printf("Hello %i\n", i);
        }
      }

  • barrier: forces every thread to stop at the barrier until all threads arrive, at which point they all continue

  10. Peril-L: barrier
  • All threads wait for all to arrive, then continue:

      forall (i in (1..12)) {
        printf("tweedle dee\n");
        barrier;
        printf("tweedle dum\n");
      }

  • All tweedle dee's print before any tweedle dum's

  11. Peril-L memory model
  • Global variables: visible to all threads, declared outside a forall, and written underlined in the book's notation
  • Local variables: visible only to the declaring thread, declared inside a forall, and not underlined
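
A small sketch of the two scopes; plain text cannot show the book's underlining, so comments mark the global:

      int total = 0;            // declared outside the forall: global, visible to
                                // all threads (underlined in the book's notation)
      forall (t in (0..P-1)) {
        int mine = t;           // declared inside the forall: local to this thread
      }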

  12. Peril-L memory model (cont.)
  • Multiple threads may read a global location concurrently, but only one may write it
  • Race conditions are allowed; the last write wins
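
An illustration of the last-write-wins rule under these semantics:

      int x = 0;                // global
      forall (t in (0..P-1)) {
        x = t;                  // P concurrent writes race on x
      }
      // afterward x holds the ID of whichever thread wrote last --
      // some value in 0..P-1, but which one is unpredictable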

  13. Connecting global and local memory
  • Global memory is distributed across the threads' local memories
  • localize() gives each thread a local view of its portion of a global array:

      int allData[n];                             // global
      forall (thdID in (0..P-1)) {                // spawn P threads
        int size = n/P;                           // size of each allocation
        int locData[size] = localize(allData[]);  // map this thread's portion
                                                  // of the global to locals
      }

  14. Connecting global and local memory (cont.)
  • Modifying localized data is the same as modifying the underlying global data, but without the λ delay of accessing nonlocal memory

  15. Issues in localizing global memory
  • Localized portions of global arrays use local indices, which start at 0
  • With multiple threads per processor, data is kept local to each thread
  • There is no local copy: the local and global names refer to the same memory locations (see the sketch below)
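
A sketch of the aliasing: a write through the localized view updates the underlying global element (names illustrative):

      int allData[n];                             // global array
      forall (t in (0..P-1)) {
        int size = mySize(allData, 0);            // helper from the next slide
        int locData[size] = localize(allData[]);  // an alias, not a copy
        locData[0] = 42;     // also updates the corresponding global element,
                             // without the λ delay of a nonlocal reference
      }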

  16. Handy functions
  • size = mySize(global, i): returns the size of the ith dimension of the local portion of the global array
  • localToGlobal(locData, i, j): returns the global index corresponding to the ith index in the jth dimension of the local array locData (usage sketch below)
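
A usage sketch combining the two helpers; each thread walks its local portion while recovering each element's global index:

      int allData[n];                             // global, distributed over threads
      forall (t in (0..P-1)) {
        int size = mySize(allData, 0);            // local elements in dimension 0
        int locData[size] = localize(allData[]);
        for (i = 0; i < size; i++) {
          int g = localToGlobal(locData, i, 0);   // global index of local element i
          locData[i] = g;                         // e.g., stamp each element with it
        }
      }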

  17. Full/empty variables – synchronization
  • A full/empty (FE) variable is either full or empty, and reads and writes synchronize on that state (semantics in Table 4.1, next slide)
  • Accessing one incurs overhead like a global memory reference (λ)
  • Declared with a trailing prime:

      int t' = 0;   // declare the empty FE variable t' and fill it with 0
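
A minimal producer/consumer sketch, assuming the classic full/empty semantics of Table 4.1 (a read stalls until the variable is full and empties it; a write stalls until it is empty and fills it):

      int t';                    // no initializer: the FE variable starts empty
      forall (i in (0..1)) {
        if (i == 0) {
          t' = 42;               // producer: the write fills t'
        } else {
          int v = t';            // consumer: stalls until t' is full, then
        }                        // reads the value and empties it again
      }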

  18. Table 4.1 Semantics of full/empty variables.

  19. Reduce/Scan
  • Reduce combines a set of values into a single value; written with /:

      +/count       // add the elements of count

  • Scan is a parallel-prefix computation: it embodies the logic of a sequential operation performed in parts, carrying along the intermediate results; written with \:

      min\items     // for each position, the smallest element of that prefix of items

  20. Additional example

      least = min/dataArray;   // scalar result stored in each thread's local least

  • Reduce and scan can combine values across multiple threads

  21. More examples – reduce
  • count is local in each thread:

      total = +/count;

  • The values are combined into a single result, stored in each thread's total

  22. More examples – scan
  • count is local to each thread:

      beforeMe = +\count;

  • The count values are accumulated so that the ith thread's beforeMe is assigned the sum of the first i count values

  23. Implied reduce/scan synchronization
  • Consider:

      largest = max/localTotal;

  • All threads must arrive at this statement before the reduction can be performed, and threads proceed only after the assignment, so a reduce or scan acts as a barrier

  24. Programming consideration

      exclusive { total += priv_count; }   // combined serially

  versus

      total = +/priv_count;                // combined with a tree structure

  • Using reduce converts the combining step from O(P) to O(lg P)

  25. Figure 4.1 The Count 3s computation (Try 3) written in the Peril-L notation.

  26. Formulating parallelism
  • Fixed parallelism: write code designed for a particular machine; improving the machine may not increase the parallelism
  • Unlimited parallelism: use forall (i in (0..n-1)) over all n items; it uses whatever resources are available but requires substantial thread communication

  27. Figure 4.2 Fixed Parallelism solution to Count 3s (t=4).
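
The figure itself is not reproduced in this transcript; the following is a sketch in the spirit of the fixed t=4 solution, with illustrative variable names that are not necessarily the book's:

      int array[length];               // global input
      int total = 0;                   // global result
      forall (t in (0..3)) {           // exactly 4 threads: fixed parallelism
        int priv_count = 0;            // local tally
        for (i = t*(length/4); i < (t+1)*(length/4); i++)
          if (array[i] == 3)
            priv_count++;              // count 3s in this thread's quarter
        exclusive { total += priv_count; }   // combine serially
      }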

  28. Formulating parallelism (cont.)
  • Scalable parallelism proceeds as follows:
  • Determine how the components (data structures, work load, etc.) grow as n increases
  • Formulate a set S of substantial subproblems, assigning natural units of the solution to each subproblem
  • Solve each subproblem independently
  • This approach exploits locality

  29. Figure 4.3 Scalable Parallelism solution to Count 3s. Notice that the array segment has been localized.
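
Again a sketch, not the book's figure: the scalable solution localizes each thread's segment and combines the tallies with a reduce, so the code adapts as P grows:

      int array[length];                        // global input
      int total;                                // global result
      forall (t in (0..P-1)) {                  // P scales with the machine
        int size = mySize(array, 0);
        int myData[size] = localize(array[]);   // local view of this segment
        int priv_count = 0;
        for (i = 0; i < size; i++)
          if (myData[i] == 3)
            priv_count++;
        total = +/priv_count;                   // tree combine: O(lg P)
      }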

  30. Table 4.2 Helper functions.

  31. Figure 4.4 Odd/Even Interchange to alphabetize a list L of records on field x.
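
The figure is not reproduced here; as a stand-in, this simplified sketch performs an odd/even interchange over an integer array rather than records keyed on field x, spawning one thread per pair in each phase:

      int A[n];                                  // global array to sort
      for (phase = 0; phase < n; phase++) {      // n phases guarantee sortedness
        int start = phase % 2;                   // even phases pair (0,1),(2,3),...
        forall (p in (0 .. n/2 - 1)) {           // odd phases pair (1,2),(3,4),...
          int j = 2*p + start;
          if (j+1 < n && A[j] > A[j+1]) {        // compare-exchange the pair
            int tmp = A[j]; A[j] = A[j+1]; A[j+1] = tmp;
          }
        }
      }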

  32. Figure 4.5 Fixed 26-way parallel solution to alphabetizing. The function letRank(x) returns the 0-origin rank of the Latin letter x.

  33. Figure 4.6

  34. Table 4.3 Merge operations.

  35. Figure 4.7 Peril-L program using Batcher’s sort to alphabetize records in L.

  36. Figure 4.7 Peril-L program using Batcher’s sort to alphabetize records in L. (cont.)
