Load Balancing and Multithreaded Programming. Nir Shavit, Multiprocessor Synchronization, Spring 2003
How to write Parallel Apps? • Multithreaded Programming • Programming model • Programming language (Cilk) • Well-developed theory • Successful practice M. Herlihy & N. Shavit (c) 2003
Why We Care • Interesting in its own right • The scheduler is an ideal application for lock-free data structures M. Herlihy & N. Shavit (c) 2003
Multithreaded Fibonacci

int fib(int n) {
  if (n < 2) {
    return n;
  } else {
    int x = spawn fib(n-1);
    int y = spawn fib(n-2);
    sync();
    return x + y;
  }
}

*Cilk Code (Java Code in Notes) M. Herlihy & N. Shavit (c) 2003
In the code above, spawn fib(n-1) is a parallel method call, sync() waits for the spawned children to complete, and only after sync() is it safe to use the children's values x and y. M. Herlihy & N. Shavit (c) 2003
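The slide points to a Java version in the notes; a minimal sketch using Java's fork/join framework might look like the following (the class names Fib and FibDemo are illustrative, not taken from the notes); fork() plays the role of spawn and join() the role of sync:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical fork/join counterpart of the Cilk fib() above.
class Fib extends RecursiveTask<Integer> {
    private final int n;
    Fib(int n) { this.n = n; }

    @Override
    protected Integer compute() {
        if (n < 2) return n;
        Fib f1 = new Fib(n - 1);
        f1.fork();                          // like "spawn fib(n-1)": may run on another worker
        int y = new Fib(n - 2).compute();   // compute the second call in this thread
        int x = f1.join();                  // like "sync": wait for the spawned child
        return x + y;
    }
}

public class FibDemo {
    public static void main(String[] args) {
        System.out.println(new ForkJoinPool().invoke(new Fib(10)));  // prints 55
    }
}

Forking only one of the two calls and computing the other directly is the usual fork/join idiom; forking both, as the Cilk code spawns both, also works but creates one extra task per call.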
Note • Spawn & synch operators • Like Israeli traffic signs • Are purely advisory in nature • The scheduler • Like the Israeli driver • Has complete freedom to decide M. Herlihy & N. Shavit (c) 2003
Dynamic Behavior • Multithreaded program is • A directed acyclic graph (DAG) • That unfolds dynamically • A thread is • Maximal sequence of instructions • Without spawn, sync, or return M. Herlihy & N. Shavit (c) 2003
Fib DAG (figure): the dynamically unfolding DAG for fib(4), with nodes for fib(4), fib(3), fib(2), and fib(1) connected by spawn and sync edges. M. Herlihy & N. Shavit (c) 2003
Arrows Reflect Dependencies (figure): the same fib(4) DAG; the spawn and sync arrows show the dependencies between threads. M. Herlihy & N. Shavit (c) 2003
How Parallel is That? • Define work: • Total time on one processor • Define critical-path length: • Longest dependency path • Can’t beat that! M. Herlihy & N. Shavit (c) 2003
Fib Work (figure): the fib(4) DAG with its nodes numbered 1 through 17; the work is 17. M. Herlihy & N. Shavit (c) 2003
Fib Critical Path (figure): the longest dependency path through the fib(4) DAG, numbered 1 through 8; the critical-path length is 8. M. Herlihy & N. Shavit (c) 2003
Notation Watch • TP = time on P processors • T1 = work (time on 1 processor) • T∞ = critical path length (time on ∞ processors) M. Herlihy & N. Shavit (c) 2003
Simple Bounds • TP ≥ T1/P • In one step, can’t do more than P work • TP ≥ T∞ • Can’t beat infinite resources M. Herlihy & N. Shavit (c) 2003
More Notation Watch • Speedup on P processors • Ratio T1/TP • How much faster with P processors • Linear speedup • T1/TP = Θ(P) • Max speedup (average parallelism) • T1/T∞ M. Herlihy & N. Shavit (c) 2003
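For example, plugging in the fib(4) numbers above: T1 = 17 and T∞ = 8, so the average parallelism is T1/T∞ = 17/8 ≈ 2.1; no number of processors can speed up fib(4) by more than roughly a factor of two.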
Remarks • Graph nodes have out-degree ≤ 2 • Unique starting node • Unique ending node M. Herlihy & N. Shavit (c) 2003
Matrix Multiplication M. Herlihy & N. Shavit (c) 2003
Matrix Multiplication • Each n-by-n matrix multiplication • 8 multiplications • 4 additions • Of n/2-by-n/2 submatrices M. Herlihy & N. Shavit (c) 2003
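Concretely, partitioning A, B, and C into quadrants gives the standard block decomposition (not spelled out on the slide): C11 = A11·B11 + A12·B21, C12 = A11·B12 + A12·B22, C21 = A21·B11 + A22·B21, C22 = A21·B12 + A22·B22. Each quadrant of C needs two half-size multiplications and one half-size addition, hence 8 multiplications and 4 additions in total.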
Addition

void add(Matrix C, Matrix T, int n) {
  if (n == 1) {
    C[1,1] = C[1,1] + T[1,1];
  } else {
    partition C, T into half-size submatrices;
    spawn add(C11,T11,n/2);
    spawn add(C12,T12,n/2);
    spawn add(C21,T21,n/2);
    spawn add(C22,T22,n/2);
    sync();
  }
}

M. Herlihy & N. Shavit (c) 2003
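A corresponding Java fork/join sketch of this recursive addition (the MatrixAdd class, the double[][] representation, and the assumption that n is a power of two are illustrative choices, not from the notes):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical fork/join counterpart of add(): C += T on an n-by-n block
// whose top-left corner is (row, col); n is assumed to be a power of two.
class MatrixAdd extends RecursiveAction {
    private final double[][] c, t;
    private final int row, col, n;

    MatrixAdd(double[][] c, double[][] t, int row, int col, int n) {
        this.c = c; this.t = t; this.row = row; this.col = col; this.n = n;
    }

    @Override
    protected void compute() {
        if (n == 1) {
            c[row][col] += t[row][col];            // base case: one element
        } else {
            int h = n / 2;
            invokeAll(                             // "spawn" the four quadrant additions ...
                new MatrixAdd(c, t, row,     col,     h),
                new MatrixAdd(c, t, row,     col + h, h),
                new MatrixAdd(c, t, row + h, col,     h),
                new MatrixAdd(c, t, row + h, col + h, h));
            // ... invokeAll returns only when all four are done: the "sync"
        }
    }
}

Invoking it as new ForkJoinPool().invoke(new MatrixAdd(c, t, 0, 0, c.length)) adds T into C in place.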
Addition • Let AP(n) be running time • For n x n matrix • on P processors • For example • A1(n) is work • A∞(n) is critical path length M. Herlihy & N. Shavit (c) 2003
Addition • Work is A1(n) = 4 A1(n/2) + Θ(1): the 4 spawned additions, plus Θ(1) to partition, sync, etc. M. Herlihy & N. Shavit (c) 2003
Addition • Work is A1(n) = 4 A1(n/2) + Θ(1) = Θ(n²) • Same as the double-loop summation M. Herlihy & N. Shavit (c) 2003
Addition • Critical path length is A∞(n) = A∞(n/2) + Θ(1): the spawned additions run in parallel, so only one half-size addition contributes, plus Θ(1) to partition, sync, etc. M. Herlihy & N. Shavit (c) 2003
Addition • Critical Path length is A∞(n) = A∞(n/2) + Θ(1) = Θ(log n) M. Herlihy & N. Shavit (c) 2003
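Unrolling the recurrence (a standard expansion, spelled out here for clarity): A∞(n) = Θ(1) + A∞(n/2) = Θ(1) + Θ(1) + A∞(n/4) = …, and since n can be halved only log₂ n times before reaching the base case, the Θ(1) terms sum to Θ(log n).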
Multiplication

void mult(Matrix C, Matrix A, Matrix B, int n) {
  if (n == 1) {
    C[1,1] = A[1,1]·B[1,1];
  } else {
    allocate temporary n·n matrix T;
    partition A, B, C, T into half-size submatrices;
    …

M. Herlihy & N. Shavit (c) 2003
Multiplication (con't)

    spawn mult(C11,A11,B11,n/2);
    spawn mult(C12,A11,B12,n/2);
    spawn mult(C21,A21,B11,n/2);
    spawn mult(C22,A21,B12,n/2);
    spawn mult(T11,A12,B21,n/2);
    spawn mult(T12,A12,B22,n/2);
    spawn mult(T21,A22,B21,n/2);
    spawn mult(T22,A22,B22,n/2);
    sync();
    spawn add(C,T,n);
  }
}

M. Herlihy & N. Shavit (c) 2003
Multiplication • Work is M1(n) = 8 M1(n/2) + A1(n): the 8 spawned multiplications, plus the final addition M. Herlihy & N. Shavit (c) 2003
Multiplication • Work is M1(n) = 8 M1(n/2) + Θ(n²) = Θ(n³) • Same as the serial triple-nested loop M. Herlihy & N. Shavit (c) 2003
Multiplication • Critical path length is M∞(n) = M∞(n/2) + A∞(n): the half-size multiplications run in parallel (so only one counts), plus the final addition M. Herlihy & N. Shavit (c) 2003
Multiplication • Critical path length is M∞(n) = M∞(n/2) + A∞(n) = M∞(n/2) + Θ(log n) = Θ(log² n) M. Herlihy & N. Shavit (c) 2003
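Unrolling this recurrence in the same way: M∞(n) = Θ(log n) + Θ(log(n/2)) + … + Θ(log 2), a sum of log₂ n terms each at most Θ(log n), which works out to Θ(log² n).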
Parallelism • M1(n)/M∞(n) = Θ(n³/log² n) • To multiply two 1000 x 1000 matrices: 1000³/log²(1000) ≈ 10⁹/10² = 10⁷ • Much more than the number of processors on any real machine M. Herlihy & N. Shavit (c) 2003
Shared-Memory Multiprocessors • Parallel applications • Java • Cilk, etc. • Mix of other jobs • All run together • Come & go dynamically M. Herlihy & N. Shavit (c) 2003
Scheduling • Ideally, • User-level scheduler • Maps threads to dedicated processors • In real life, • User-level scheduler • Maps threads to fixed number of processes • Kernel-level scheduler • Maps processes to dynamic pool of processors M. Herlihy & N. Shavit (c) 2003
For Example • Initially, • All P processors available for application • Serial computation • Takes over one processor • Leaving P-1 for us • Waits for I/O • We get that processor back …. M. Herlihy & N. Shavit (c) 2003
Speedup • Map threads onto P processes • Cannot get P-fold speedup • What if the kernel doesn’t cooperate? • Can try for PA-fold speedup • PA is time-averaged number of processors the kernel gives us M. Herlihy & N. Shavit (c) 2003
Static Load Balancing (figure): speedup vs. number of processes on an 8-processor Sun Ultra Enterprise 5000, for ideal, mm(1024), lu(2048), barnes(16K,10), and heat(4K,512,100). M. Herlihy & N. Shavit (c) 2003
Dynamic Load Balancing (figure): speedup vs. number of processes on the same 8-processor Sun Ultra Enterprise 5000, for ideal, mm(1024), lu(2048), barnes(16K,10), heat(4K,512,100), msort(32M), and ray(). M. Herlihy & N. Shavit (c) 2003
Scheduling Hierarchy • User-level scheduler • Tells kernel which processes are ready • Kernel-level scheduler • Synchronous (for analysis, not correctness!) • Picks p_i threads to schedule at step i • Time-weighted average is PA = (p_1 + p_2 + … + p_T)/T M. Herlihy & N. Shavit (c) 2003
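For instance (illustrative numbers, not from the slides): if the kernel gives the application all 8 processors for half of the steps and only 2 for the other half, then PA = (8 + 2)/2 = 5, and the best we can hope for is 5-fold, not 8-fold, speedup.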
Greed is Good • Greedy scheduler • Schedules as much as it can • At each time step M. Herlihy & N. Shavit (c) 2003
Theorem • Greedy scheduler ensures actual time T ≤ T1/PA + T∞(P-1)/PA M. Herlihy & N. Shavit (c) 2003
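As a sanity check with the fib(4) numbers from earlier (T1 = 17, T∞ = 8) and, say, P = PA = 4: T ≤ 17/4 + 8·3/4 = 10.25 steps, comfortably above the simple lower bounds T ≥ 17/4 and T ≥ 8.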
Proof Strategy • In the bound T ≤ T1/PA + T∞(P-1)/PA, the term to bound is the second, idle term T∞(P-1)/PA ("Bound this!") M. Herlihy & N. Shavit (c) 2003
Put Tokens in Buckets (figure): two buckets, work and idle; a token goes in the work bucket when a thread is scheduled and executed, and in the idle bucket when a thread is scheduled but not executed. M. Herlihy & N. Shavit (c) 2003
At the end …. Total #tokens = work tokens + idle tokens M. Herlihy & N. Shavit (c) 2003
At the end …. The work bucket holds T1 tokens M. Herlihy & N. Shavit (c) 2003
Must Show: the idle bucket holds ≤ T∞(P-1) tokens M. Herlihy & N. Shavit (c) 2003
Every Move You Make … • The scheduler is greedy, so as long as at least one node is ready, at least one process executes a node in each step • Hence the number of idle processes in one step is at most p_i - 1 ≤ P - 1 M. Herlihy & N. Shavit (c) 2003