Selections from the CSE332 Data Abstractions course at the University of Washington
• Introduction to Multithreading & Fork-Join Parallelism
  • Hashtable, vector sum
• Analysis of Fork-Join Parallel Programs
  • Maps, vector addition, linked lists versus trees for parallelism, work and span
• Parallel Prefix, Pack, and Sorting
• Shared-Memory Concurrency & Mutual Exclusion
  • Concurrent bank account, OpenMP nested locks, OpenMP critical section
• Programming with Locks and Critical Sections
  • Simple-minded concurrent stack, hashtable revisited
• Data Races and Memory Reordering, Deadlock, Reader/Writer Locks, Condition Variables
A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 1: Introduction to Multithreading & Fork-Join Parallelism
Dan Grossman
Last Updated: August 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/
What to do with multiple processors?
• Next computer you buy will likely have 4 processors
  • Wait a few years and it will be 8, 16, 32, …
  • The chip companies have decided to do this (not a “law”)
• What can you do with them?
  • Run multiple totally different programs at the same time
    • Already do that? Yes, but with time-slicing
  • Do multiple things at once in one program
    • Our focus – more difficult
    • Requires rethinking everything from asymptotic complexity to how to implement data-structure operations
Sophomoric Parallelism and Concurrency, Lecture 1
Parallelism vs. Concurrency
Note: Terms not yet standard, but the perspective is essential
• Many programmers confuse these concepts
Parallelism: Use extra resources to solve a problem faster
Concurrency: Correctly and efficiently manage access to shared resources
[Figure labels: work, resources (parallelism); requests, resource (concurrency)]
There is some connection:
• Common to use threads for both
• If parallel computations need access to shared resources, then the concurrency needs to be managed
First 3ish lectures on parallelism, then 3ish lectures on concurrency
An analogy
CS1 idea: A program is like a recipe for a cook
• One cook who does one thing at a time! (Sequential)
Parallelism:
• Have lots of potatoes to slice?
• Hire helpers, hand out potatoes and knives
• But too many chefs and you spend all your time coordinating
Concurrency:
• Lots of cooks making different things, but only 4 stove burners
• Want to allow access to all 4 burners, but not cause spills or incorrect burner settings
Parallelism Example
Parallelism: Use extra computational resources to solve a problem faster (increasing throughput via simultaneous execution)
Pseudocode for array sum
• Bad style for reasons we’ll see, but may get roughly 4x speedup
  int sum(int arr[], int len) {
    int res[4];
    FORALL(i=0; i < 4; i++) { // parallel iterations
      res[i] = sumRange(arr, i*len/4, (i+1)*len/4);
    }
    return res[0]+res[1]+res[2]+res[3];
  }
  int sumRange(int arr[], int lo, int hi) {
    int result = 0;
    for(int j=lo; j < hi; j++)
      result += arr[j];
    return result;
  }
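The FORALL pseudocode above can be approximated with an OpenMP loop. This is a minimal sketch, not the course's demo code: the names sum_range and sum_four_ways are made up for illustration, and a long long accumulator is assumed so that large sums do not overflow. If the compiler is run without OpenMP support, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

// Serial sum of arr[lo..hi), mirroring sumRange in the pseudocode.
long long sum_range(const std::vector<int>& arr, int lo, int hi) {
    assert(lo <= hi);
    long long result = 0;
    for (int j = lo; j < hi; ++j) result += arr[j];
    return result;
}

// Four independent iterations, each summing one quarter of the array.
// Assumes the array is small enough that 4 * len fits in an int.
long long sum_four_ways(const std::vector<int>& arr) {
    const int len = static_cast<int>(arr.size());
    long long res[4] = {0, 0, 0, 0};
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 4; ++i) {
        // Each iteration writes only its own slot of res, so no two
        // iterations touch the same memory location.
        res[i] = sum_range(arr, i * len / 4, (i + 1) * len / 4);
    }
    return res[0] + res[1] + res[2] + res[3];
}
```

As the slide warns, hard-coding four pieces is poor style; it is shown only because it maps directly onto the pseudocode.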
Concurrency Example
Concurrency: Correctly and efficiently manage access to shared resources (from multiple possibly-simultaneous clients)
Pseudocode for a shared chaining hashtable
• Prevent bad interleavings (correctness)
• But allow some concurrent access (performance)
  class Hashtable<K,V> {
    …
    void insert(K key, V value) {
      int bucket = …;
      prevent-other-inserts/lookups in table[bucket]
      do the insertion
      re-enable access to table[bucket]
    }
    V lookup(K key) {
      (like insert, but can allow concurrent lookups to same bucket)
    }
  }
Shared memory
The model we will assume is shared memory with explicit threads
Old story: A running program has
• One call stack (with each stack frame holding local variables)
• One program counter (current statement executing)
• Static fields
• Objects (created by new) in the heap (nothing to do with heap data structure)
New story:
• A set of threads, each with its own call stack & program counter
  • No access to another thread’s local variables
• Threads can (implicitly) share static fields / objects
  • To communicate, write somewhere another thread reads
Shared memory
• Threads each have own unshared call stack and current statement
  • (pc for “program counter”)
  • Local variables are numbers, null, or heap references
• Any objects can be shared, but most are not
[Figure: several threads, each with its own pc=… and call stack (unshared: locals and control), referring to one heap (shared: objects and static fields)]
First attempt, Sum class - serial
  class Sum {
    int ans;
    int len;
    int *arr;
  public:
    Sum(int a[], int num) {           // constructor
      arr = a; len = num; ans = 0;
      for (int i = 0; i < num; ++i) {
        arr[i] = i + 1;               // initialize array to 1..num
      }
    }
    int sum(int lo, int hi) {         // serial sum of arr[lo..hi)
      int ans = 0;                    // local accumulator (shadows the field)
      for (int i = lo; i < hi; ++i) {
        ans += arr[i];
      }
      return ans;
    }
  };
Parallelism idea
• Example: Sum elements of a large array
• Idea: Have 4 threads simultaneously sum 1/4 of the array
  • Warning: Inferior first approach – explicitly using a fixed number of threads
[Figure: array split into four quarters producing ans0, ans1, ans2, ans3, which are added to give ans]
• Create 4 threads using OpenMP parallel
• Determine thread ID, assign work based on thread ID
• Accumulate partial sums for each thread
• Add together their 4 answers for the final result
OpenMP basics
First learn some basics of OpenMP
• Pragma-based approach to parallelism
• Create a pool of threads using #pragma omp parallel
• Use worksharing and tasking constructs to divide the work among the threads
• To get started, we will explore these constructs and clauses:
  • parallel for
  • reduction
  • single
  • task
  • taskwait
Most of these constructs have an analog in Intel® Cilk™ Plus as well
Demo
Walk through of parallel sum code – SPMD version
Discussion points:
• #pragma omp parallel
• num_threads clause
Demo
Walk through of OpenMP parallel for with reduction code
Discussion points:
• #pragma omp parallel for
• reduction(+:ans) clause
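The demo code itself is not reproduced in these slides. As a rough sketch of what a parallel for with a reduction(+:ans) clause looks like, assuming a free function named sum_reduction (a name chosen here, not taken from the course), each thread accumulates into a private copy of ans and OpenMP adds the copies together when the loop ends:

```cpp
#include <cassert>
#include <vector>

// Sums the vector with an OpenMP reduction. Without OpenMP support the
// pragma is ignored and this is an ordinary serial loop.
long long sum_reduction(const std::vector<int>& arr) {
    long long ans = 0;
    const int n = static_cast<int>(arr.size());
    #pragma omp parallel for reduction(+:ans)
    for (int i = 0; i < n; ++i) {
        ans += arr[i];  // updates this thread's private copy of ans
    }
    return ans;         // private copies have been combined with +
}
```

Unlike the four-thread version, nothing here fixes the number of threads; the runtime decides how to split the iterations.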
Shared memory?
• Fork-join programs (thankfully) don’t require much focus on sharing memory among threads
• But in languages like C++, there is memory being shared. In our example:
  • lo, hi, arr fields written by “main” thread, read by helper thread
  • ans field written by helper thread, read by “main” thread
• When using shared memory, you must avoid race conditions
  • Later, with concurrency, we’ll learn other ways to synchronize, such as atomics, locks, and critical sections
Divide and Conquer Approach
  // numThreads == numProcessors is bad
  // if some are needed for other things
  int sum(int arr[], int numThreads) { … }
Want to use (only) processors “available to you now”
• Not used by other programs or threads in your program
  • Maybe caller is also using parallelism
  • Available cores can change even while your threads run
• If you have 3 processors available and using 3 threads would take time X, then creating 4 threads would take time 1.5X
Divide and Conquer Approach
Though unlikely for sum, in general sub-problems may take significantly different amounts of time
• Example: Apply method f to every array element, but maybe f is much slower for some data items
  • Example: Is a large integer prime?
• If we create 4 threads and all the slow data is processed by 1 of them, we won’t get nearly a 4x speedup
  • Example of a load imbalance
Divide and Conquer Approach
The counterintuitive (?) solution to all these problems is to use lots of threads, far more than the number of processors
• But this will require changing our algorithm
[Figure: array split into many pieces producing ans0, ans1, …, ansN, which are combined into ans]
• Forward-portable: Lots of helpers each doing a small piece
• Processors available: Hand out “work chunks” as you go
  • If 3 processors are available and you have 100 threads, then ignoring constant-factor overheads, extra time is < 3%
• Load imbalance: No problem if slow thread scheduled early enough
  • Variation probably small anyway if pieces of work are small
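The 1.5X and < 3% figures on these slides can be checked with a little arithmetic. The sketch below assumes a simplified model that is not stated in the slides: equal-sized chunks, each processor runs one chunk at a time to completion, and chunks are handed out in rounds. The function name slowdown_vs_ideal is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Running time relative to a perfect split across `procs` processors,
// when the work is cut into `chunks` equal pieces (1.0 means ideal).
double slowdown_vs_ideal(int chunks, int procs) {
    assert(chunks > 0 && procs > 0);
    // Number of rounds needed: ceil(chunks / procs).
    const int rounds = (chunks + procs - 1) / procs;
    // Each chunk takes 1/chunks of the total sequential time.
    const double actual = static_cast<double>(rounds) / chunks;
    const double ideal = 1.0 / procs;
    return actual / ideal;
}
```

Under this model, 4 chunks on 3 processors need 2 rounds of 1/4 of the work each, which is 1.5 times the ideal 1/3. With 100 chunks on 3 processors, 34 rounds of 1/100 each give 0.34 against an ideal of about 0.333, roughly 2% extra.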
Divide-and-Conquer idea
This is straightforward to implement using divide-and-conquer
• Parallelism for the recursive calls
[Figure: binary tree of + nodes combining the array pieces pairwise up to a single result]
Demo
Walk through of Divide and Conquer code
Discussion points:
• #pragma omp task
• shared clause
• firstprivate clause
• #pragma omp taskwait
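Since the demo code is not included here, the following is only a sketch of how the listed constructs typically fit together for a divide-and-conquer sum. The names sum_tasks and parallel_sum and the CUTOFF value are assumptions made for illustration. Each half becomes a task; shared lets the task write its result back to the parent's variable, firstprivate copies the range bounds into the task, and taskwait blocks until both child tasks finish.

```cpp
#include <cassert>
#include <vector>

const int CUTOFF = 1000;  // hypothetical sequential cut-off

// Recursively sums arr[lo..hi), forking one task per half.
long long sum_tasks(const int* arr, int lo, int hi) {
    if (hi - lo <= CUTOFF) {  // small range: plain serial loop
        long long s = 0;
        for (int i = lo; i < hi; ++i) s += arr[i];
        return s;
    }
    const int mid = lo + (hi - lo) / 2;
    long long left = 0, right = 0;
    #pragma omp task shared(left) firstprivate(arr, lo, mid)
    left = sum_tasks(arr, lo, mid);
    #pragma omp task shared(right) firstprivate(arr, mid, hi)
    right = sum_tasks(arr, mid, hi);
    #pragma omp taskwait      // wait for both halves before combining
    return left + right;
}

// Creates the thread pool once; a single thread starts the recursion
// and the other threads pick up the tasks it generates.
long long parallel_sum(const std::vector<int>& v) {
    long long total = 0;
    #pragma omp parallel
    {
        #pragma omp single
        total = sum_tasks(v.data(), 0, static_cast<int>(v.size()));
    }
    return total;
}
```

The number of tasks depends on the array size and the cut-off, not on the number of processors, which is the point of the "lots of threads" approach.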
Demo
Walk through of load balanced OpenMP parallel for
Discussion points:
• schedule(dynamic) clause
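To make the load-balancing point concrete, here is a small sketch using the primality example from the earlier slide. It is an illustration under assumed names (is_prime, count_primes) and an arbitrarily chosen chunk size of 64, not the course demo. Testing large numbers takes longer than testing small ones, so an even static split would leave the thread with the largest numbers finishing last; schedule(dynamic) hands out chunks of iterations as threads become free.

```cpp
#include <cassert>

// Trial division; cost grows with n, so iterations are uneven.
bool is_prime(long long n) {
    if (n < 2) return false;
    for (long long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Counts primes in [lo, hi). Chunks of 64 iterations are assigned to
// whichever thread is idle, instead of one fixed block per thread.
int count_primes(int lo, int hi) {
    int count = 0;
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:count)
    for (int n = lo; n < hi; ++n) {
        if (is_prime(n)) ++count;
    }
    return count;
}
```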
A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 2: Analysis of Fork-Join Parallel Programs
Dan Grossman
Last Updated: August 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/
Even easier: Maps (Data Parallelism)
• A map operates on each element of a collection independently to create a new collection of the same size
  • No combining results
  • For arrays, this is so trivial some hardware has direct support
• Canonical example: Vector addition
  int vector_add(int lo, int hi) {
    FORALL(i=lo; i < hi; i++) {
      result[i] = arr1[i] + arr2[i];
    }
    return 1;
  }
Sophomoric Parallelism and Concurrency, Lecture 2
Maps with OpenMP
  class VecAdd {
    int *arr1, *arr2, *res, len;
  public:
    VecAdd(int _arr1[], int _arr2[], int _res[], int _num) {
      arr1 = _arr1; arr2 = _arr2; res = _res; len = _num;
    }
    int vector_add(int lo, int hi) {
      #pragma omp parallel for
      for (int i = lo; i < hi; ++i) {
        res[i] = arr1[i] + arr2[i];   // each iteration writes its own element
      }
      return 1;
    }
  }; // end class definition
• Even though there is no result-combining, it still helps with load balancing to create many small tasks
  • Maybe not for vector-add but for more compute-intensive maps
  • The forking is O(log n), whereas theoretically other approaches to vector-add are O(1)
Maps and reductions
Maps and reductions: the “workhorses” of parallel programming
• By far the two most important and common patterns
  • Two more-advanced patterns in next lecture
• Learn to recognize when an algorithm can be written in terms of maps and reductions
• Use maps and reductions to describe (parallel) algorithms
• Programming them becomes “trivial” with a little practice
  • Exactly like sequential for-loops seem second-nature
Trees
• Maps and reductions work just fine on balanced trees
  • Divide-and-conquer each child rather than array subranges
  • Correct for unbalanced trees, but won’t get much speed-up
• Example: minimum element in an unsorted but balanced binary tree in O(log n) time given enough processors
• How to do the sequential cut-off?
  • Store number-of-descendants at each node (easy to maintain)
  • Or could approximate it with, e.g., AVL-tree height
Linked lists
[Figure: linked list with nodes b, c, d, e, f and front and back pointers]
• Can you parallelize maps or reduces over linked lists?
  • Example: Increment all elements of a linked list
  • Example: Sum all elements of a linked list
• Once again, data structures matter!
• For parallelism, balanced trees are generally better than lists, so that we can get to all the data exponentially faster: O(log n) vs. O(n)
  • Trees have the same flexibility as lists compared to arrays
Work and Span
Let T_P be the running time if there are P processors available
Two key measures of run-time:
• Work: How long it would take 1 processor = T_1
  • Just “sequentialize” the recursive forking
• Span: How long it would take infinitely many processors = T_∞
  • The longest dependence chain
  • Example: O(log n) for summing an array, since more than n/2 processors is no additional help
  • Also called “critical path length” or “computational depth”
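As a worked instance of these definitions, consider the divide-and-conquer array sum over n elements. The first two lines restate the slide's example; the parallelism ratio and the lower bound on T_P are standard consequences of the definitions rather than something stated on this slide.

```latex
\begin{align*}
T_1 &= \Theta(n) && \text{(work: every element is added once)}\\
T_\infty &= \Theta(\log n) && \text{(span: height of the tree of additions)}\\
\frac{T_1}{T_\infty} &= \Theta\!\left(\frac{n}{\log n}\right) && \text{(parallelism: the most speed-up one can hope for)}\\
T_P &\ge \max\!\left(\frac{T_1}{P},\; T_\infty\right) && \text{(no schedule on } P \text{ processors can beat either bound)}
\end{align*}
```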
A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 4: Shared-Memory Concurrency & Mutual Exclusion
Dan Grossman
Last Updated: May 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/
Canonical Bank Account Example
Correct code in a single-threaded world
  class BankAccount {
    private int balance = 0;
    int getBalance() { return balance; }
    void setBalance(int x) { balance = x; }
    void withdraw(int amount) {
      int b = getBalance();
      if(amount > b)
        throw WithdrawTooLargeException();
      setBalance(b - amount);
    }
    … // other operations like deposit, etc.
  }
Sophomoric Parallelism & Concurrency, Lecture 4
A bad interleaving
Interleaved withdraw(100) calls on the same account
• Assume initial balance == 150
  Time  Thread 1                       Thread 2
   |    int b = getBalance();
   |                                   int b = getBalance();
   |                                   if(amount > b) throw …;
   |                                   setBalance(b - amount);
   |    if(amount > b) throw …;
   v    setBalance(b - amount);
“Lost withdraw” – unhappy bank
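A real data race is not reproducible on demand, so the sketch below replays one bad interleaving deterministically on a single thread: both threads read the balance before either one writes it. The Account struct and replay_lost_withdraw function are hypothetical names, and the variables b1 and b2 stand in for each thread's local b.

```cpp
#include <cassert>

// Unsynchronized account, as in the single-threaded example.
struct Account {
    int balance = 150;
    int getBalance() const { return balance; }
    void setBalance(int x) { balance = x; }
};

// Replays two interleaved withdraw(100) calls step by step.
int replay_lost_withdraw() {
    Account acct;
    const int amount = 100;
    int b1 = acct.getBalance();                      // Thread 1 reads 150
    int b2 = acct.getBalance();                      // Thread 2 reads 150
    if (amount <= b2) acct.setBalance(b2 - amount);  // Thread 2 writes 50
    if (amount <= b1) acct.setBalance(b1 - amount);  // Thread 1 writes 50 from its stale read
    return acct.getBalance();
}
```

Both withdrawals succeed and 200 leaves the bank, yet the balance only drops from 150 to 50. Had the calls run one after the other, the second would have been rejected.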
What we need – an abstract data type for mutual exclusion
• There are many ways out of this conundrum, but we need help from the language
• One basic solution: locks, or critical sections
• OpenMP implements locks as well as critical sections. They do similar things, but there are subtle differences between them
• Critical section:
    #pragma omp critical
    {
      // allows only one thread at a time into the code block
      // code block must have one entrance and one exit
    }
• Nested lock:
    omp_nest_lock_t myLock;
  • OpenMP locks have to be initialized before use (omp_init_nest_lock)
  • Can be locked with omp_set_nest_lock, and unlocked with omp_unset_nest_lock
  • Only one thread may hold the lock at a time
  • Allows for exception handling and non-structured jumps
Almost-correct pseudocode
  class BankAccount {
    private int balance = 0;
    private Lock lk = new Lock();
    …
    void withdraw(int amount) {
      lk.acquire(); /* may block */
      int b = getBalance();
      if(amount > b)
        throw WithdrawTooLargeException();
      setBalance(b - amount);
      lk.release();
    }
    // deposit would also acquire/release lk
  }
Some mistakes
• Locks and critical sections are very primitive mechanisms
  • Still up to you to use them correctly to implement critical sections
• Incorrect: Use different locks or differently named critical sections for withdraw and deposit
  • Mutual exclusion works only when using the same lock
• Poor performance: Use the same lock for every bank account
  • No simultaneous operations on different accounts
• Incorrect: Forget to release a lock (blocks other threads forever!)
  • Previous slide is wrong because of the exception possibility!
    if(amount > b) {
      lk.release(); // hard to remember!
      throw WithdrawTooLargeException();
    }
Other operations
• If withdraw and deposit use the same lock, then simultaneous calls to these methods are properly synchronized
• But what about getBalance and setBalance?
  • Assume they’re public, which may be reasonable
• If they don’t acquire the same lock, then a race between setBalance and withdraw could produce a wrong result
• If they do acquire the same lock, then withdraw would block forever because it tries to acquire a lock it already has
Re-acquiring locks?
  void setBalance1(int x) {
    balance = x;
  }
  void setBalance2(int x) {
    lk.acquire();
    balance = x;
    lk.release();
  }
  void withdraw(int amount) {
    lk.acquire();
    …
    setBalanceX(b - amount);
    lk.release();
  }
• Can’t let the outside world call setBalance1: it is not protected by the lock
• Can’t have withdraw call setBalance2, because the lock acquisitions would be nested
• We could use an intricate re-entrant locking scheme, or better yet, restructure the code. Nested locking is not recommended
This code is easier to lock
  void setBalance(int x) {
    lk.acquire();
    balance = x;
    lk.release();
  }
  void withdraw(int amount) {
    lk.acquire();
    …
    balance = (b - amount);
    lk.release();
  }
You may provide a setBalance() method for external use, but do NOT call it from the withdraw() member function. Instead, protect the direct modification of the data as shown here, avoiding nested locks
Demo
Walk through of load balanced BankAccount code
Discussion points:
• #pragma omp critical
OpenMP Critical Section method
• Works great & is simple to implement
• Can’t be used in nested function calls where each function enters a critical section
  • For example, foo() contains a critical section that calls bar(), while bar() also contains a critical section
  • OpenMP does not allow this when both sections have the same name (or are both unnamed); in practice the thread deadlocks
  • Generally better to avoid nested calls such as the foo/bar example above, or the Canonical Bank Account Example, where withdraw() calls setBalance()
• Can’t be used with try/catch exception handling that leaves the block (due to multiple exits from the code block)
  • If try/catch is required, consider using scoped locking – example to follow
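A minimal sketch of the critical-section approach applied to the bank account follows. It is not the demo code: the critical-section name bank_account and the helper run_deposits are invented for illustration, and withdraw reports failure through a bool return value because an exception must not escape a critical block.

```cpp
#include <cassert>

class BankAccount {
    int balance = 0;
public:
    void deposit(int amount) {
        #pragma omp critical(bank_account)
        { balance += amount; }
    }
    // Returns false instead of throwing when the amount is too large,
    // so the critical block keeps a single entrance and a single exit.
    bool withdraw(int amount) {
        bool ok = false;
        #pragma omp critical(bank_account)
        {
            if (amount <= balance) {  // touches balance directly: no nested section
                balance -= amount;
                ok = true;
            }
        }
        return ok;
    }
    int getBalance() {
        int b = 0;
        #pragma omp critical(bank_account)
        { b = balance; }
        return b;
    }
};

// Many threads depositing 1 each; the total must equal n.
int run_deposits(int n) {
    BankAccount acct;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) acct.deposit(1);
    return acct.getBalance();
}
```

A named critical section is global to the program, so every BankAccount object shares this one lock. That is simple and correct but serializes operations on unrelated accounts.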
Scoped Locking method
• Works great & is fairly simple to implement
  • More complicated than the critical section method
• Can be used in nested function calls where each function acquires its lock
  • Generally it is still better to avoid nested calls such as the foo/bar example on the previous slide, or the Canonical Bank Account Example, where withdraw() calls setBalance()
• Can be used with try/catch exception handling
• The lock is released when the object or function goes out of scope
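The course demo builds scoped locking on OpenMP nested locks, and that code is not shown here. To illustrate the same idea with calls that are certain to exist, this sketch swaps in the C++ standard library's std::mutex and std::lock_guard instead; treat it as an analogous technique, not the course's implementation. The guard object acquires the lock in its constructor and releases it in its destructor, so the lock is released on every exit path, including a thrown exception.

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>

class BankAccount {
    int balance = 0;
    std::mutex lk;  // one lock per account
public:
    void deposit(int amount) {
        std::lock_guard<std::mutex> guard(lk);  // locked until guard is destroyed
        balance += amount;
    }
    void withdraw(int amount) {
        std::lock_guard<std::mutex> guard(lk);
        if (amount > balance) {
            // Stack unwinding destroys guard, which releases the lock.
            throw std::runtime_error("withdraw too large");
        }
        balance -= amount;  // direct update, no nested lock acquisition
    }
    int getBalance() {
        std::lock_guard<std::mutex> guard(lk);
        return balance;
    }
};
```

A caller can wrap withdraw in try/catch and keep using the account afterwards, which is exactly the case the "hard to remember!" manual release was trying to handle.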
Demo
Walk through of load balanced BankAccount code
Discussion points:
• Scoped locking using OpenMP nested locks
A Sophomoric Introduction to Shared-Memory Parallelism and Concurrency
Lecture 5: Programming with Locks and Critical Sections
Dan Grossman
Last Updated: May 2011
For more information, see http://www.cs.washington.edu/homes/djg/teachingMaterials/
Demo
Walk through of load balanced stack code
Discussion points:
• Using OpenMP nested locks
• Using OpenMP critical section
Example, using critical sections
  class Stack<E> {
    private E[] array = (E[])new Object[SIZE];
    int index = -1;
    bool isEmpty() { // unsynchronized: wrong?!
      return index==-1;
    }
    void push(E val) {
      #pragma omp critical
      { array[++index] = val; }
    }
    E pop() {
      E temp;
      #pragma omp critical
      { temp = array[index--]; }
      return temp;
    }
    E peek() { // unsynchronized: wrong!
      return array[index];
    }
  }
Sophomoric Parallelism & Concurrency, Lecture 5
Why wrong?
• It looks like isEmpty and peek can “get away with this” since push and pop adjust the state “in one tiny step”
• But this code is still wrong and depends on language-implementation details you cannot assume
  • Even “tiny steps” may require multiple steps in the implementation: array[++index] = val probably takes at least two steps
  • Code has a data race, allowing very strange behavior
    • Important discussion in next lecture
• Moral: Don’t introduce a data race, even if every interleaving you can think of is correct
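One way to remove the data race is to put every operation, including the read-only ones, inside the same named critical section. The sketch below assumes an int-only stack backed by std::vector; the class name IntStack, the section name int_stack, and the bool-returning pop and peek are all choices made here for illustration. Results are copied into locals because a return statement may not jump out of a critical block.

```cpp
#include <cassert>
#include <vector>

class IntStack {
    std::vector<int> items;
public:
    bool isEmpty() {
        bool empty = true;
        #pragma omp critical(int_stack)
        { empty = items.empty(); }
        return empty;
    }
    void push(int val) {
        #pragma omp critical(int_stack)
        { items.push_back(val); }
    }
    // Stores the top element in out and removes it; false if empty.
    bool pop(int& out) {
        bool ok = false;
        #pragma omp critical(int_stack)
        {
            if (!items.empty()) {
                out = items.back();
                items.pop_back();
                ok = true;
            }
        }
        return ok;
    }
    // Stores the top element in out without removing it; false if empty.
    bool peek(int& out) {
        bool ok = false;
        #pragma omp critical(int_stack)
        {
            if (!items.empty()) {
                out = items.back();
                ok = true;
            }
        }
        return ok;
    }
};
```

Folding the emptiness check into pop and peek also avoids the separate problem of a caller testing isEmpty and then popping after another thread has emptied the stack.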
3 choices
For every memory location (e.g., object field) in your program, you must obey at least one of the following:
• Thread-local: Don’t use the location in > 1 thread
• Immutable: Don’t write to the memory location
• Synchronized: Use synchronization to control access to the location
[Figure: all memory, containing thread-local memory and immutable memory; the remainder needs synchronization]
Example
Suppose we want to change the value for a key in a hashtable without removing it from the table
• Assume critical section guards the whole table
  #pragma omp critical
  {
    v1 = table.lookup(k);
    v2 = expensive(v1);
    table.remove(k);
    table.insert(k,v2);
  }
Papa Bear’s critical section was too long (table locked during expensive call)
Example
Suppose we want to change the value for a key in a hashtable without removing it from the table
• Assume critical section guards the whole table
  #pragma omp critical
  {
    v1 = table.lookup(k);
  }
  v2 = expensive(v1);
  #pragma omp critical
  {
    table.remove(k);
    table.insert(k,v2);
  }
Mama Bear’s critical section was too short (if another thread updated the entry, we will lose an update)
Example
Suppose we want to change the value for a key in a hashtable without removing it from the table
• Assume critical section guards the whole table
  done = false;
  while(!done) {
    #pragma omp critical
    {
      v1 = table.lookup(k);
    }
    v2 = expensive(v1);
    #pragma omp critical
    {
      if(table.lookup(k)==v1) {
        done = true;
        table.remove(k);
        table.insert(k,v2);
      }
    }
  }
Baby Bear’s critical section was just right (if another update occurred, try our update again)
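The Baby Bear pattern can be written as compilable C++ along the following lines. This is a sketch under several assumptions: std::map stands in for the hashtable, expensive is a made-up stand-in for the slow computation, the section name table_lock is arbitrary, and a missing key is treated as value 0 by operator[].

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the expensive computation.
int expensive(int v) { return v * 2 + 1; }

// Replaces table[k] with expensive(table[k]), retrying if another
// thread changed the entry while expensive() was running.
void update_value(std::map<std::string, int>& table, const std::string& k) {
    bool done = false;
    while (!done) {
        int v1 = 0;
        #pragma omp critical(table_lock)
        { v1 = table[k]; }               // short section: read only
        const int v2 = expensive(v1);    // slow work done outside the lock
        #pragma omp critical(table_lock)
        {
            if (table[k] == v1) {        // entry unchanged since our read?
                table[k] = v2;           // commit the update in place
                done = true;
            }                            // otherwise loop and recompute
        }
    }
}
```

The table is locked only for the two short accesses, and an update made by another thread in between is detected instead of being silently overwritten.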
Don’t roll your own
• It is rare that you should write your own data structure
  • Provided in standard libraries
  • Point of these lectures is to understand the key trade-offs and abstractions
• Especially true for concurrent data structures
  • Far too difficult to provide fine-grained synchronization without race conditions
  • Standard thread-safe libraries, like TBB’s concurrent_hash_map, are written by world experts
Guideline #5: Use built-in libraries whenever they meet your needs