Learn about the multithreaded programming model using the Cilk language, with emphasis on load balancing and synchronization for parallel applications. Dive into dynamic behavior, Fibonacci computations, matrix multiplication, and critical-path analysis.
Load Balancing and Multithreaded Programming Nir Shavit Multiprocessor Synchronization Spring 2003
How to write Parallel Apps? • Multithreaded Programming • Programming model • Programming language (Cilk) • Well-developed theory • Successful practice M. Herlihy & N. Shavit (c) 2003
Why We Care • Interesting in its own right • The scheduler is an ideal application for lock-free data structures M. Herlihy & N. Shavit (c) 2003
Multithreaded Fibonacci

int fib(int n) {
  if (n < 2) {
    return n;
  } else {
    int x = spawn fib(n-1);   // parallel method call
    int y = spawn fib(n-2);   // parallel method call
    sync();                   // wait for children to complete
    return x + y;             // safe to use children's values
  }
}

*Cilk Code (Java Code in Notes) M. Herlihy & N. Shavit (c) 2003
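The slide points to a Java version in the course notes; that code is not reproduced here. As a rough sketch of what a Java equivalent might look like today, using the java.util.concurrent fork/join framework (my choice of API, not the notes' code; fork() plays the role of spawn and join() the role of sync):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join analogue of the Cilk fib.
class Fib extends RecursiveTask<Integer> {
  private final int n;
  Fib(int n) { this.n = n; }

  @Override protected Integer compute() {
    if (n < 2) return n;
    Fib x = new Fib(n - 1);
    Fib y = new Fib(n - 2);
    x.fork();                  // "spawn" fib(n-1)
    int yVal = y.compute();    // compute fib(n-2) in the current worker
    int xVal = x.join();       // "sync": wait for the spawned child
    return xVal + yVal;        // safe to use children's values
  }

  public static void main(String[] args) {
    System.out.println(new ForkJoinPool().invoke(new Fib(10)));  // prints 55
  }
}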
Note • Spawn & sync operators • Like Israeli traffic signs • Are purely advisory in nature • The scheduler • Like the Israeli driver • Has complete freedom to decide M. Herlihy & N. Shavit (c) 2003
Dynamic Behavior • Multithreaded program is • A directed acyclic graph (DAG) • That unfolds dynamically • A thread is • Maximal sequence of instructions • Without spawn, sync, or return M. Herlihy & N. Shavit (c) 2003
Fib DAG (figure): the fib(4) computation unfolds into a DAG of fib(3), fib(2), and fib(1) nodes connected by spawn and sync edges; the arrows reflect dependencies. M. Herlihy & N. Shavit (c) 2003
How Parallel is That? • Define work: • Total time on one processor • Define critical-path length: • Longest dependency path • Can’t beat that! M. Herlihy & N. Shavit (c) 2003
Fib Work (figure): numbering the nodes of the fib(4) DAG from 1 to 17 shows that the work is 17. M. Herlihy & N. Shavit (c) 2003
Fib Critical Path (figure): the longest dependency path through the fib(4) DAG passes through 8 nodes, so the critical-path length is 8. M. Herlihy & N. Shavit (c) 2003
Notation Watch • TP = time on P processors • T1 = work (time on 1 processor) • T∞ = critical path length (time on ∞ processors) M. Herlihy & N. Shavit (c) 2003
Simple Bounds • TP ≥ T1/P • In one step, can’t do more than P work • TP ≥ T∞ • Can’t beat infinite resources M. Herlihy & N. Shavit (c) 2003
More Notation Watch • Speedup on P processors • Ratio T1/TP • How much faster with P processors • Linear speedup • T1/TP = Θ(P) • Max speedup (average parallelism) • T1/T∞ M. Herlihy & N. Shavit (c) 2003
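For example, for the fib(4) DAG above: T1 = 17 and T∞ = 8, so the simple bounds give TP ≥ 17/P and TP ≥ 8, and the maximum speedup (average parallelism) is T1/T∞ = 17/8 ≈ 2.1.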
Remarks • Graph nodes have out-degree ≤ 2 • Unique • Starting node • Ending node M. Herlihy & N. Shavit (c) 2003
Matrix Multiplication M. Herlihy & N. Shavit (c) 2003
Matrix Multiplication • Each n-by-n matrix multiplication • 8 multiplications • 4 additions • Of n/2-by-n/2 submatrices M. Herlihy & N. Shavit (c) 2003
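Concretely, the block decomposition that the code below implements is:

C11 = A11·B11 + A12·B21    C12 = A11·B12 + A12·B22
C21 = A21·B11 + A22·B21    C22 = A21·B12 + A22·B22

The code computes the first product of each entry into C, the second into a temporary matrix T, and then adds T into C.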
Addition

int add(Matrix C, Matrix T, int n) {
  if (n == 1) {
    C[1,1] = C[1,1] + T[1,1];
  } else {
    partition C, T into half-size submatrices;
    spawn add(C11,T11,n/2);
    spawn add(C12,T12,n/2);
    spawn add(C21,T21,n/2);
    spawn add(C22,T22,n/2);
    sync();
  }
}

M. Herlihy & N. Shavit (c) 2003
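A hedged sketch of how this recursion might look as runnable Java with the fork/join framework; the quadrant bookkeeping via row/column offsets is my own choice, not from the slides:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Adds T into C over the n-by-n block whose top-left corner is (row, col).
class MatrixAdd extends RecursiveAction {
  private final double[][] c, t;
  private final int row, col, n;

  MatrixAdd(double[][] c, double[][] t, int row, int col, int n) {
    this.c = c; this.t = t; this.row = row; this.col = col; this.n = n;
  }

  @Override protected void compute() {
    if (n == 1) {
      c[row][col] += t[row][col];
    } else {
      int h = n / 2;             // half-size submatrices
      invokeAll(                 // "spawn" the four quadrant additions, then "sync"
          new MatrixAdd(c, t, row,     col,     h),
          new MatrixAdd(c, t, row,     col + h, h),
          new MatrixAdd(c, t, row + h, col,     h),
          new MatrixAdd(c, t, row + h, col + h, h));
    }
  }
}

Usage: new ForkJoinPool().invoke(new MatrixAdd(C, T, 0, 0, n)) adds T into C in place, assuming n is a power of two (as the recurrence does).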
Addition • Let AP(n) be running time • For n x n matrix • on P processors • For example • A1(n) is work • A∞(n) is critical path length M. Herlihy & N. Shavit (c) 2003
Addition • Work is A1(n) = 4 A1(n/2) + Θ(1): 4 A1(n/2) for the 4 spawned additions, Θ(1) for partition, sync, etc. M. Herlihy & N. Shavit (c) 2003
Addition • Work is A1(n) = 4 A1(n/2) + Θ(1) = Θ(n²) Same as double-loop summation M. Herlihy & N. Shavit (c) 2003
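To see where Θ(n²) comes from (a standard recursion-tree count, added for completeness): level k of the recursion has 4^k subproblems, each costing Θ(1), and the recursion bottoms out after log₂ n levels, so the total is Θ(4^(log₂ n)) = Θ(n²), i.e. one constant-time update per matrix entry.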
Addition • Critical Path length is A∞(n) = A∞(n/2) + Θ(1): A∞(n/2) for the spawned additions (which run in parallel), Θ(1) for partition, sync, etc. M. Herlihy & N. Shavit (c) 2003
Addition • Critical Path length is A∞(n) = A∞(n/2) + Θ(1) = Θ(log n), since n is halved Θ(log n) times M. Herlihy & N. Shavit (c) 2003
Multiplication

int mult(Matrix C, Matrix A, Matrix B, int n) {
  if (n == 1) {
    C[1,1] = A[1,1]·B[1,1];
  } else {
    allocate temporary n·n matrix T;
    partition A,B,C,T into half-size submatrices;
    …

M. Herlihy & N. Shavit (c) 2003
Multiplication (cont'd)

    spawn mult(C11,A11,B11,n/2);
    spawn mult(C12,A11,B12,n/2);
    spawn mult(C21,A21,B11,n/2);
    spawn mult(C22,A21,B12,n/2);
    spawn mult(T11,A12,B21,n/2);
    spawn mult(T12,A12,B22,n/2);
    spawn mult(T21,A22,B21,n/2);
    spawn mult(T22,A22,B22,n/2);
    sync();
    spawn add(C,T,n);
  }
}

M. Herlihy & N. Shavit (c) 2003
Multiplication • Work is M1(n) = 8 M1(n/2) + A1(n): 8 spawned multiplications plus the final addition M. Herlihy & N. Shavit (c) 2003
Multiplication • Work is M1(n) = 8 M1(n/2) + Θ(n²) = Θ(n³) Same as serial triple-nested loop M. Herlihy & N. Shavit (c) 2003
Multiplication • Critical path length is M∞(n) = M∞(n/2) + A∞(n): M∞(n/2) for the half-size multiplications (all in parallel), plus A∞(n) for the final addition M. Herlihy & N. Shavit (c) 2003
Multiplication • Critical path length is M∞(n) = M∞(n/2) + A∞(n) = M∞(n/2) + Θ(log n) = Θ(log² n), a Θ(log n) addition at each of the Θ(log n) levels of halving M. Herlihy & N. Shavit (c) 2003
Parallelism • M1(n)/M∞(n) = Θ(n³/log² n) • To multiply two 1000 x 1000 matrices • 1000³/10² = 10⁷ • Much more than number of processors on any real machine M. Herlihy & N. Shavit (c) 2003
Shared-Memory Multiprocessors • Parallel applications • Java • Cilk, etc. • Mix of other jobs • All run together • Come & go dynamically M. Herlihy & N. Shavit (c) 2003
Scheduling • Ideally, • User-level scheduler • Maps threads to dedicated processors • In real life, • User-level scheduler • Maps threads to fixed number of processes • Kernel-level scheduler • Maps processes to dynamic pool of processors M. Herlihy & N. Shavit (c) 2003
For Example • Initially, • All P processors available for application • Serial computation • Takes over one processor • Leaving P-1 for us • Waits for I/O • We get that processor back …. M. Herlihy & N. Shavit (c) 2003
Speedup • Map threads onto P processes • Cannot get P-fold speedup • What if the kernel doesn’t cooperate? • Can try for PA-fold speedup • PA is time-averaged number of processors the kernel gives us M. Herlihy & N. Shavit (c) 2003
Static Load Balancing (figure): speedup vs. number of processes (1 to 32) on an 8-processor Sun Ultra Enterprise 5000, for mm(1024), lu(2048), barnes(16K,10), and heat(4K,512,100), compared against ideal speedup. M. Herlihy & N. Shavit (c) 2003
Dynamic Load Balancing (figure): speedup vs. number of processes (1 to 32) on the same 8-processor Sun Ultra Enterprise 5000, for mm(1024), lu(2048), barnes(16K,10), heat(4K,512,100), msort(32M), and ray(), compared against ideal speedup. M. Herlihy & N. Shavit (c) 2003
Scheduling Hierarchy • User-level scheduler • Tells kernel which processes are ready • Kernel-level scheduler • Synchronous (for analysis, not correctness!) • Picks pi threads to schedule at step i • Time-weighted average is: PA = (p1 + p2 + … + pT)/T M. Herlihy & N. Shavit (c) 2003
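For example (hypothetical numbers, just to exercise the formula): if the kernel grants p1 = 3, p2 = 1, and p3 = 4 processors over T = 3 steps, then PA = (3 + 1 + 4)/3 ≈ 2.7.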
Greed is Good • Greedy scheduler • Schedules as much as it can • At each time step M. Herlihy & N. Shavit (c) 2003
Theorem • Greedy scheduler ensures actual time T ≤ T1/PA + T∞(P-1)/PA M. Herlihy & N. Shavit (c) 2003
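To make the bound concrete, here is a minimal simulation sketch in Java (assumptions: a tiny hand-made DAG, a fixed P processors available at every step so that PA = P, and unit-time threads; none of this comes from the slides). It runs a greedy schedule and compares the number of steps against T1/P + T∞(P-1)/P.

import java.util.*;

// Greedy scheduling of a thread DAG on P processors.
public class GreedySim {
  public static void main(String[] args) {
    int P = 2;
    // Example DAG (hypothetical): node -> successors.
    // 0 spawns 1 and 2; 1 spawns 3 and 4; all join at 5, which precedes 6.
    int[][] succ = { {1, 2}, {3, 4}, {5}, {5}, {5}, {6}, {} };
    int n = succ.length;

    int[] indeg = new int[n];
    for (int[] s : succ) for (int v : s) indeg[v]++;

    Deque<Integer> ready = new ArrayDeque<>();
    for (int v = 0; v < n; v++) if (indeg[v] == 0) ready.add(v);

    int steps = 0, executed = 0;
    while (executed < n) {          // in a DAG, some node is always ready here
      steps++;
      List<Integer> batch = new ArrayList<>();
      while (!ready.isEmpty() && batch.size() < P)  // greedy: run as much as we can
        batch.add(ready.poll());
      for (int u : batch) {
        executed++;
        for (int v : succ[u]) if (--indeg[v] == 0) ready.add(v);
      }
    }

    int work = n;                    // T1: one unit of work per node
    int span = criticalPath(succ);   // T∞: longest path, counted in nodes
    double bound = (double) work / P + (double) span * (P - 1) / P;
    System.out.println("steps = " + steps + ", bound = " + bound);
  }

  // Longest path in the DAG via memoized depth-first search.
  static int criticalPath(int[][] succ) {
    int[] memo = new int[succ.length];
    int best = 0;
    for (int v = 0; v < succ.length; v++) best = Math.max(best, longest(v, succ, memo));
    return best;
  }

  static int longest(int v, int[][] succ, int[] memo) {
    if (memo[v] != 0) return memo[v];
    int max = 0;
    for (int w : succ[v]) max = Math.max(max, longest(w, succ, memo));
    return memo[v] = 1 + max;
  }
}

For this example DAG with P = 2 the simulation takes 5 steps, comfortably under the bound of 6.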
Proof Strategy: in the bound T ≤ T1/PA + T∞(P-1)/PA, the term to bound is T∞(P-1). M. Herlihy & N. Shavit (c) 2003
Put Tokens in Buckets (figure): two buckets, work and idle. A token goes into the work bucket for a thread that is scheduled and executed, and into the idle bucket for a thread that is scheduled but not executed. M. Herlihy & N. Shavit (c) 2003
At the end … Total #tokens = work tokens + idle tokens M. Herlihy & N. Shavit (c) 2003
At the end … the work bucket holds T1 tokens M. Herlihy & N. Shavit (c) 2003
Must Show: the idle bucket holds ≤ T∞(P-1) tokens M. Herlihy & N. Shavit (c) 2003
Every Move You Make … • Scheduler is greedy • At least one node ready • Number of idle threads in one step • At most pi-1 ≤ P-1 M. Herlihy & N. Shavit (c) 2003