Chapter 4: First Steps Toward Parallel Programming
Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder
Toward writing parallel programs
• Build intuition about parallelism
  • When to parallelize
  • When overhead is too great
• Consider
  • Data allocation
  • Work allocation
  • Data structure design
  • Algorithms
3 ways to formulate parallel computations
• Unlimited Parallelism
• Fixed Parallelism
• Scalable Parallelism
2 classes of parallel algorithms
• Data parallel
• Task parallel
Data parallel
• Perform the same computation on different data items at the same time
• Parallelism grows as the data grows
• Example (see the sketch below)
  • P chefs preparing N meals
  • Each chef prepares N/P meals
  • As N increases, P can also increase, limited only by resource constraints
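A minimal Peril-L sketch of this pattern (the meals array and the prepare() function are hypothetical, for illustration only):

    int meals[n];                           // global: the N meals (globals are underlined in the book)
    forall (chef in (0..P-1)) {             // one thread per chef
        int i;
        for (i = chef*n/P; i < (chef+1)*n/P; i++) {
            prepare(meals[i]);              // each chef handles a block of N/P meals
        }
    }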
Task parallel
• Perform distinct computations at the same time
• Number of tasks is typically fixed
• Not scalable
• Example (see the sketch below)
  • Chef for salad, chef for dessert, chef for appetizer
• There are dependencies among tasks
• Utilizes pipelining
• A hybrid of data and task parallelism is often used
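A sketch of the chef example (makeSalad, makeDessert, and makeAppetizer are hypothetical task functions):

    forall (task in (0..2)) {               // a fixed number of distinct tasks
        if (task == 0) makeSalad();
        if (task == 1) makeDessert();
        if (task == 2) makeAppetizer();     // adding threads beyond the task count gains nothing
    }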
Pseudocode – Peril-L
• Minimal, easy to learn
• Universal, not tied to any one language
• Allows reasoning about performance
• Extends C
Peril-L
• Threads
    forall (i in (1..12))
        printf("Hello %i\n", i);
• Prints 12 Hellos in an arbitrary order
• Threads compete and execute in parallel
Peril-L
• exclusive
  • Only one thread executes the body at a time
    forall (i in (1..12)) {
        exclusive {
            printf("Hello %i\n", i);
        }
    }
• barrier
  • Forces each thread to stop at the barrier until all threads arrive, at which point they continue
Peril-L
• barrier
  • All threads wait for all to arrive, then continue
    forall (i in (1..12)) {
        printf("tweedle dee\n");
        barrier;
        printf("tweedle dum\n");
    }
• All tweedle dees print before any tweedle dum
Peril-L memory model
• Global
  • Variables visible to all threads
  • Declared outside a forall
  • Written underlined
• Local
  • Variables visible only to the local thread
  • Declared inside a forall
  • Written without underlining
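A small sketch of the two scopes; since plain text cannot show the book's underlining, globals are flagged by comments:

    int total = 0;                          // global: declared outside the forall (underlined in the book)
    forall (t in (0..P-1)) {
        int mine = t * t;                   // local: visible only to thread t
        exclusive { total += mine; }        // updates to a global must be synchronized
    }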
Peril-L
• Multiple concurrent reads are allowed
• Only one write at a time
• Race conditions are allowed; the last write wins
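A sketch of such a race; which value x holds afterward is unspecified beyond last-write-wins:

    int x = 0;                              // global
    forall (t in (1..2)) {
        x = t;                              // both threads write concurrently
    }                                       // afterward x is 1 or 2, whichever write landed last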
Connecting global and local memory
• Global memory is distributed across local memories
• localize() takes a thread's portion of global memory and makes it local

    int allData[n];                             // global
    forall (thdID in (0..P-1)) {                // spawn threads
        int size = n/P;                         // size of this thread's allocation
        int locData[size] = localize(allData[]); // map globals to this thread's locals
    }
Connecting global and local memory (cont)
• Modifying local data is the same as modifying the corresponding global data, but without the λ delay of accessing nonlocal memory
Issues of localization of global memory
• Localized global arrays use local indices, which start at 0
• Multiple threads on a processor keep data local to the thread
• There is no local copy; local and global references name the same memory location
Handy functions
• size = mySize(global, i)
  • Returns the size of the ith dimension of the local portion of the global array
• localToGlobal(locData, i, j)
  • Returns the global index corresponding to the ith index in the jth dimension of the local array locData
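A sketch combining these helpers with localize(); the data array is illustrative:

    int data[n];                            // global array
    forall (t in (0..P-1)) {
        int size = mySize(data, 0);         // extent of this thread's portion
        int locData[size] = localize(data[]);
        int i;
        for (i = 0; i < size; i++) {
            // store each element's global index, mapping local index i back to global
            locData[i] = localToGlobal(locData, i, 0);
        }
    }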
Full/empty variables – synchronization
• Each variable carries a full or empty state: a read of an empty variable stalls until it is filled; a write to a full variable stalls until it is emptied (example on next slide)
• Incurs overhead like global memory access, λ
    int t' = 0;   // declare FE variable t and fill it
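A minimal producer/consumer sketch, assuming a FE variable declared without an initializer starts empty:

    int t';                                 // FE variable, initially empty
    forall (i in (0..1)) {
        if (i == 0) {
            t' = 42;                        // producer: the write fills t'
        } else {
            int v = t';                     // consumer: the read stalls until t' is full
            printf("got %i\n", v);
        }
    }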
Reduce/Scan
• Reduce – combines a set of values to produce a single value
  • Written with /
  • +/count   // add the elements of count
• Scan – parallel prefix computation; embodies logic that performs a sequential operation in parts and carries along the intermediate results
  • Written with \
  • min\items   // scan, i.e., find the smallest of each prefix of items
Additional example
    least = min/dataArray;   // scalar stored in the local least of each thread
• Reduce/scan combine values across multiple threads
More examples – reduce
• count is local to each thread
    total = +/count;
• Combined into a single result, stored in each thread
More examples – scan
• count is local to each thread
    beforeMe = +\count;
• The count values are accumulated, so the ith thread's beforeMe variable is assigned the sum of the first i count values
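A common use of this idiom, sketched below, is computing each thread's starting offset into a shared output array; the scan is taken, per the description above, as the sum of the counts of the earlier threads:

    int items[n];                           // global input
    int result[n];                          // global output
    forall (t in (0..P-1)) {
        int size = mySize(items, 0);
        int locItems[size] = localize(items[]);
        int count = size;                   // items this thread will write
        int beforeMe = +\count;             // items written by lower-numbered threads
        int i;
        for (i = 0; i < size; i++) {
            result[beforeMe + i] = locItems[i];   // write into this thread's slice
        }
    }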
Implied reduce/scan synchronization
Consider
    largest = max/localTotal;
• All threads must arrive at this statement before the reduction can be performed
• Threads proceed only after the assignment
Programming consideration
    exclusive { total += priv_count; }   // done serially
versus
    total = +/priv_count;                // done with a tree structure
• Converts O(P) into O(log P)
Figure 4.1 The Count 3s computation (Try 3) written in the Peril-L notation.
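The figure body did not survive extraction; what follows is a hedged reconstruction of such a Count 3s program from the constructs above, not the book's exact Figure 4.1:

    int array[n];                           // global input
    int count = 0;                          // global result
    forall (t in (0..P-1)) {
        int i, priv_count = 0;              // local tally
        for (i = t*n/P; i < (t+1)*n/P; i++) {
            if (array[i] == 3)
                priv_count++;
        }
        exclusive { count += priv_count; }  // serialize the global update
    }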
Formulating Parallelism
• Fixed Parallelism
  • Write code designed for a particular machine
  • Improving the machine may not increase parallelism
• Unlimited Parallelism
  • Use forall (i in (0..n-1))
  • Will use available resources
  • Will require substantial thread communication
Formulating Parallelism (cont)
• Scalable
  • Determine how the components (data structures, work load, etc.) grow as n increases
  • Formulate a set S of substantial subproblems, assigning the natural units of the solution to each
  • Solve each subproblem independently
  • Utilizes locality
Figure 4.3 Scalable Parallelism solution to Count 3s. Notice that the array segment has been localized.
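The figure body is likewise missing; a hedged sketch of a scalable, localized Count 3s consistent with the caption, not the book's exact code:

    int allData[n];                               // global input
    int count = 0;                                // global result
    forall (t in (0..P-1)) {
        int size = mySize(allData, 0);
        int locData[size] = localize(allData[]);  // this thread's segment, now local
        int i, priv_count = 0;
        for (i = 0; i < size; i++) {
            if (locData[i] == 3)
                priv_count++;
        }
        count = +/priv_count;                     // tree-structured reduce, O(log P)
    }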
Figure 4.4 Odd/Even Interchange to alphabetize a list L of records on field x.
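The figure body is missing here as well; a hedged sketch of odd/even interchange, with the records reduced to their key field x for brevity (one thread per element, the unlimited formulation):

    int L[n];                               // global: the keys (field x of each record)
    forall (t in (0..n-1)) {
        int step;
        for (step = 0; step < n; step++) {
            if (t % 2 == step % 2 && t+1 < n) {
                if (L[t] > L[t+1]) {        // compare-exchange with right neighbor
                    int tmp = L[t];
                    L[t] = L[t+1];
                    L[t+1] = tmp;
                }
            }
            barrier;                        // finish this step's exchanges before the next
        }
    }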
Figure 4.5 Fixed 26-way parallel solution to alphabetizing. The function letRank(x) returns the 0-origin rank of the Latin letter x.
Figure 4.7 Peril-L program using Batcher’s sort to alphabetize records in L.
Figure 4.7 Peril-L program using Batcher’s sort to alphabetize records in L. (cont.)