240 likes | 423 Views
The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall). Alistair Dundas Department of Computer Science University of Massachusetts. Outline. What is Cilk? Cilk example: the Fibonacci algorithm. The work-first principle. Work Stealing. The T.H.E. Protocol.
E N D
The Implementation of the Cilk-5 Multithreaded Language(Frigo, Leiserson, and Randall) Alistair Dundas Department of Computer Science University of Massachusetts
Outline • What is Cilk? • Cilk example: the Fibonacci algorithm. • The work-first principle. • Work Stealing. • The T.H.E. Protocol. • Empirical results. • Summary and questions.
What Is Cilk? • Extension of C for parallel programming. • Designed for SMP machines with support for shared memory. • Benefits: • Provably efficient work stealing scheduler. • Clean programming model. • Benefits over normal thread programming: discussion topic! • Specifically: Source to source compiler generating C.
Example: Fibonacci Algorithm int main (int argc, char *argv[]){ int n, result; n = atoi(argv[1]); result = fib(n); printf(“Result:%d\n”, result); return 0;} int fib (int n){ if (n<2) return n; else { int x, y; x = fib (n-1); y = fib (n-2); return (x+y); }}
Example: Fibonacci In Parallel cilk int main (int argc, char *argv[]){ int n, result; n = atoi(argv[1]); result = spawn fib(n); sync; printf(“Result:%d\n”, result); return 0;} cilk int fib (int n){ if (n<2) return n; else { int x, y; x = spawn fib (n-1); y = spawn fib (n-2); sync; return (x+y); }}
The Work First Principle • Work is the amount of time needed to execute the computation serially. • Critical path length is the execution time on an infinite number of processors. • The Work-First Principle: Minimize scheduling overhead borne by work at the expense of increasing the critical path.
Theory: The Work First Principle • Where TP is the time on P processors: • TP = T1/P + O(T) (1) • Making critical path overhead explicit: • TP <= T1/P + cT (2) • Define average parallelism (max speedup): • PAVERAGE = T1/T • Define parallel slackness: • PAVERAGE/P
The Work First Principle (cont) • Assumption of parallel slackness: • PAVERAGE/P ≫ c • Combining these with the inequality, we get: • TP≈ T1/P • Define work overhead: • c1 = T1/TS • TP≈ c1TS/P • Conclusion: Minimize work overhead.
Work Stealing Algorithm • Each worker keeps a ready deque (double ended queue) of procedure instances waiting to run. • Workers treat the deque as a stack, pushing and popping procedure calls on to the end. • When workers have no more work, they steal from the front of another workers’ deque. • Parents are stolen before children. • This is implemented using two versions of each procedure: a fast clone, and a slow clone.
Fast Clone • Run fast clone when a procedure is spawned. • Little support for parallelism. • Whenever a call is made, save complete state, and push on to end of deque. • When call returns, check to see if procedure was stolen. • If stolen, return immediately. • If not stolen, carry on execution. • Since children are never stolen, sync is a no op.
Fast Clone Example cilk int fib (int n){ if (n<2) return n; else { int x, y; x = spawn fib (n-1); y = spawn fib (n-2); sync; return (x+y); }}
Fast Clone Example 1 int fib (int n) 2 { 3 fib.frame *f; frame pointer 4 f = alloc(sizeof(*f)); allocate frame 5 f->sig = fib.sig; initialize frame 6 if (n!2) { 7 free(f, sizeof(*f)); free frame 8 return n; 9 } 10 else { … }
Fast Clone Example 11 int x, y; 12 f->entry = 1; save PC 13 f->n = n; save live vars 14 *T = f; store frame pointer 15 push(); push frame 16 x = fib (n-1); do C call 17 if (pop(x) == FAILURE) pop frame 18 return 0; procedure stolen 19 < … > second spawn 20 ; sync is free! 21 free(f, sizeof(*f)); free frame 22 return (x+y); 23 } }
Slow Clone • Slow clone used when a procedure is stolen. • Similar to fast clone, but also supports concurrent execution. • It restores program counter and procedure state using copy stored on deque. • Calling sync makes call to runtime system for check on children’s status.
The T.H.E. Protocol • Deques held in shared memory. • Workers operate at the end, thiefs at the front. • We must prevent race conditions where a thief and victim try to access the same procedure frame. • Locking deques would be expensive for workers. • The T.H.E Protocol removes overhead of the common case, where there is no conflict.
The T.H.E. Protocol • Assumes only reads and writes are atomic. • Head of the deque is H, tail is T, and (T ≥ H) • Only thief can change H. • Only worker can change T. • To steal thiefs must get the lock L. • At most two processors operating on deque. • Three cases of interaction: • Two or more items on deque – each gets one. • One item on deque – either worker or thief gets frame, but not both. • No items on deque – both worker and thief fail.
One item on deque case • Both thief and worker assume they can get a procedure frame and change H or T. • If both thief and worker try to steal frame, one or both of them will discover (H > T), depending on instruction order. • If thief discovers (H > T) it backs off and restores H. • If worker discovers (H > T) it restores T, and then tries for the lock. Inside lock, procedure can be safely popped if still there.
Empirical Results • On an eight processor Sun SMP, achieved average speed up of 6.2 from elison (serial C non-threaded versions). • Assumptions of work-first seem sound: • Applications tested all showed high amounts of “average parallelism”. • Work overhead small for most programs. Least speed up is where overhead is greatest.
Summary • Extension of C for parallel programming. • Aims to simplify parallelization. • Main idea is to prevent overhead for workers rather than focus on critical path. • Empirical results show speed up average of 6.2 on an 8 processor machine.
My Questions • A cilk spawn is always just a C call. Who starts the threads, and how many are there? • Why use Cilk rather than use threads directly? • What about using Cilk on a bewoulf cluster? • Are their test programs representative of SMP applications?
Other Extentions • Inlets – a wrapper around spawned procedure returns. • Abort – terminates work no longer needed (e.g. in parallel search). • Locking facilities for access to shared data.
T.H.E. Protocol: The Worker/Victim pop() { T--; if (H > T) { T++; lock(L); T--; if (H > T) { T++; unlock(L); return FAILURE; } unlock(L); } return SUCCESS;} push() { T++; } steal() { lock(L); H++; if (H > T) { H--; unlock(L); return FAILURE; } unlock(L); return SUCCESS;}