CILK: An Efficient Multithreaded Runtime System
People • Project at MIT & now at UT Austin • Bobby Blumofe (now UT Austin, Akamai) • Chris Joerg • Brad Kuszmaul (now Yale) • Charles Leiserson (MIT, Akamai) • Keith Randall (Bell Labs) • Yuli Zhou (Bell Labs)
Outline • Introduction • Programming environment • The work-stealing thread scheduler • Performance of applications • Modeling performance • Proven Properties • Conclusions
Introduction • Why multithreading? To implement dynamic, asynchronous, concurrent programs. • Cilk programmer optimizes: • total work • critical path • A Cilk computation is viewed as a dynamic, directed acyclic graph (dag)
Introduction ... • A Cilk program is a set of procedures • A procedure is a sequence of threads • Cilk threads are: • represented by nodes in the dag • non-blocking: they run to completion without waiting or suspension, so they are atomic units of execution
Introduction ... • Threads can spawn child threads • Downward edges connect a parent to its children • A child & parent can run concurrently. • Because threads are non-blocking, a child cannot return a value to its parent. • Instead, the parent spawns a successor that receives values from its children
Introduction ... • A thread & its successor are parts of the same Cilk procedure. • connected by horizontal arcs • Children’s returned values are received before their successor begins: • They constitute data dependencies. • Connected by curved arcs
Introduction: Execution Time • Execution time of a Cilk program using P processors (TP) depends on: • Work (T1): time for the Cilk program to complete on 1 processor. • Critical path (T∞): the time to execute the longest directed path in the dag. • TP >= T1 / P (not true for some searches) • TP >= T∞
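The two measures above can be computed directly from a dag. The sketch below is only an illustration, not part of Cilk: the function names, the dictionary-based dag encoding, and the four-thread example are all assumptions made for this note. It sums thread costs to get the work T1, takes the longest weighted path to get the critical path T∞, and combines them into the lower bound max( T1 / P, T∞ ).

```python
from functools import lru_cache


def work_and_critical_path(cost, edges):
    """Return (T1, T_inf) for a dag.

    cost:  dict mapping each thread (node) to its execution time.
    edges: dict mapping a node to the list of nodes that depend on it.
    Both encodings are hypothetical, chosen only for this sketch.
    """
    t1 = sum(cost.values())  # work: every thread runs once on 1 processor

    @lru_cache(maxsize=None)
    def longest_from(node):
        # Longest weighted path that starts at `node`, including its own cost.
        succs = edges.get(node, ())
        return cost[node] + max((longest_from(s) for s in succs), default=0)

    t_inf = max(longest_from(n) for n in cost)
    return t1, t_inf


def lower_bound(t1, t_inf, p):
    """Lower bound on P-processor execution time: max(T1 / P, T_inf)."""
    return max(t1 / p, t_inf)


# Hypothetical diamond-shaped dag: a -> b, a -> c, b -> d, c -> d.
cost = {"a": 1, "b": 3, "c": 2, "d": 1}
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
t1, t_inf = work_and_critical_path(cost, edges)
```

For this example the work is 7 and the critical path (a, b, d) is 5, so adding a second processor cannot bring the time below 5.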
Introduction: Scheduling • Cilk uses run time scheduling called work stealing. • Works well on dynamic, asynchronous, MIMD-style programs. • For “fully strict” programs, Cilk achieves asymptotic optimality for: space, time, & communication
Introduction: language • Cilk is an extension of C • Cilk programs are: • preprocessed to C • linked with a runtime library
Programming Environment • Declaring a thread: thread T ( <args> ) { <stmts> } • T is preprocessed into a C function of 1 argument and return type void. • The 1 argument is a pointer to a closure
Environment: Closure • A closure is a data structure that has: • a pointer to the C function for T • a slot for each argument (inputs & continuations) • a join counter: count of the missing argument values • A closure is ready when join counter == 0. • A closure is waiting otherwise. • Closures are allocated from a runtime heap
Environment: Continuation • A Cilk continuation is a data type, denoted by the keyword cont, e.g.: cont int x; • It is a global reference to an empty slot of a closure. • It is implemented as 2 items: • a pointer to the closure (what thread) • an int value: the slot number (what input)
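A minimal Python sketch of the two data structures described above, under stated assumptions: the real runtime is C, and the names EMPTY, Closure, Continuation, and fill are hypothetical, invented for this note. It shows how the join counter is derived from the empty slots and how filling a slot through a continuation moves a closure from waiting to ready.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

EMPTY = object()  # hypothetical marker for an argument slot not yet filled


@dataclass
class Closure:
    fn: Callable[..., None]   # stands in for the pointer to T's C function
    slots: List[Any]          # one slot per argument
    join_counter: int = field(init=False)

    def __post_init__(self):
        # The join counter is the number of missing argument values.
        self.join_counter = sum(1 for s in self.slots if s is EMPTY)

    @property
    def ready(self) -> bool:
        return self.join_counter == 0


@dataclass(frozen=True)
class Continuation:
    closure: Closure  # which thread
    slot: int         # which input


def fill(k: Continuation, value: Any) -> None:
    """Store `value` in the slot that k refers to; decrement the join counter."""
    assert k.closure.slots[k.slot] is EMPTY, "slot already filled"
    k.closure.slots[k.slot] = value
    k.closure.join_counter -= 1


# A closure with one available argument and two missing ones is waiting.
c = Closure(fn=print, slots=[1, EMPTY, EMPTY])
waiting_at_first = not c.ready
fill(Continuation(c, 1), 10)
fill(Continuation(c, 2), 20)
```

Once both missing slots are filled the join counter reaches 0 and the closure is ready.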
Environment: spawn • To spawn a child, a thread creates its closure: spawn T (<args> ) • creates child’s closure • sets available arguments • sets join counter • To specify a missing argument, prefix with a “?” spawn T (k, ?x);
Environment: spawn_next • A successor thread is spawned the same way as a child, except the keyword spawn_next is used: spawn_next T(k, ?x) • Children typically have no missing arguments; successors do.
Explicit continuation passing • Because threads are nonblocking, a parent cannot block on its children's results. • Instead, it spawns a successor thread to receive them. • This communication paradigm is called explicit continuation passing. • Cilk provides a primitive to send a value from one closure to another.
send_argument • Cilk provides the primitive send_argument( k, value ), which sends value to the argument slot of a waiting closure specified by continuation k. • [Figure: a parent spawns a child (spawn) and a successor (spawn_next); the child passes its result to the successor with send_argument.]
Cilk procedure for computing a Fibonacci number
thread fib ( cont int k, int n )
{
    if ( n < 2 )
        send_argument( k, n );
    else {
        cont int x, y;
        spawn_next sum ( k, ?x, ?y );
        spawn fib ( x, n - 1 );
        spawn fib ( y, n - 2 );
    }
}
thread sum ( cont int k, int x, int y )
{
    send_argument( k, x + y );
}
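To see the control flow of the Fibonacci procedure, it can help to emulate it on one processor. The Python sketch below is an assumption-laden model, not the Cilk runtime: MISSING stands in for a "?x" argument, a continuation is modeled as a (closure, slot) pair, the ready pool is a plain list, and run_fib with its collect closure is a hypothetical driver added to receive the final answer.

```python
MISSING = object()  # plays the role of a "?x" argument


class Closure:
    def __init__(self, fn, args):
        self.fn = fn
        self.args = list(args)
        # Join counter: number of argument values still missing.
        self.join = sum(a is MISSING for a in self.args)


ready = []  # single-processor stand-in for the ready pool


def spawn(fn, *args):
    """Create a closure; return a continuation for each missing argument."""
    c = Closure(fn, args)
    if c.join == 0:
        ready.append(c)
    return [(c, i) for i, a in enumerate(c.args) if a is MISSING]


spawn_next = spawn  # same mechanics; a successor just has missing arguments


def send_argument(k, value):
    closure, slot = k
    closure.args[slot] = value
    closure.join -= 1
    if closure.join == 0:  # the waiting closure has become ready
        ready.append(closure)


def fib(k, n):
    if n < 2:
        send_argument(k, n)
    else:
        x, y = spawn_next(sum_, k, MISSING, MISSING)
        spawn(fib, x, n - 1)
        spawn(fib, y, n - 2)


def sum_(k, x, y):
    send_argument(k, x + y)


def run_fib(n):
    result = []
    # Hypothetical root closure that waits for the final value.
    (k,) = spawn_next(result.append, MISSING)
    spawn(fib, k, n)
    while ready:  # each thread runs to completion without blocking
        c = ready.pop()
        c.fn(*c.args)
    return result[0]
```

Calling run_fib(10) drains the ready pool and returns 55. No thread ever waits: fib returns right after spawning, and sum_ runs only once both of its slots have been filled.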
Nonblocking Threads: Advantages • Shallow call stack. (for us: fault tolerance ) • Simplify runtime system: Completed threads leave C runtime stack empty. • Portable runtime implementation
Nonblocking Threads: Disadvantages • Burdens the programmer with explicit continuation passing.
Work-Stealing Scheduler • The concept of work-stealing goes at least as far back as 1981. • Work-stealing: • a process with no work selects a victim from which to get work. • it gets the shallowest thread in the victim’s spawn tree. • In Cilk, thieves choose victims randomly.
Stealing Work: The Ready Deque • Each closure has a level: • level( child ) = level( parent ) + 1 • level( successor ) = level( parent ) • Each processor maintains a ready deque: • Contains ready closures • The Lth element contains the list of all ready closures whose level is L.
Ready deque
if ( ! readyDeque.isEmpty() )
    take deepest thread
else
    steal shallowest thread from readyDeque of randomly selected victim
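The scheduling rule above can be sketched as a small Python model. The class and function names (ReadyDeque, next_closure) and the dictionary of per-level lists are assumptions made for illustration; the actual Cilk runtime is written in C and its data layout is not shown in these slides.

```python
import random


class ReadyDeque:
    """Ready closures bucketed by level: bucket L holds closures of level L."""

    def __init__(self):
        self.levels = {}  # level -> list of ready closures

    def is_empty(self):
        return not self.levels

    def push(self, level, closure):
        self.levels.setdefault(level, []).append(closure)

    def _pop_at(self, level):
        bucket = self.levels[level]
        closure = bucket.pop()
        if not bucket:
            del self.levels[level]  # keep only non-empty levels
        return level, closure

    def take_deepest(self):
        return self._pop_at(max(self.levels))

    def steal_shallowest(self):
        return self._pop_at(min(self.levels))


def next_closure(me, deques, rng=random):
    """Return (level, closure) for processor `me`, or None if a steal fails."""
    mine = deques[me]
    if not mine.is_empty():
        return mine.take_deepest()  # local work: deepest level first
    victims = [i for i in range(len(deques)) if i != me]
    if not victims:
        return None
    victim = deques[rng.choice(victims)]  # thieves choose victims randomly
    if victim.is_empty():
        return None  # a failed request; the caller would try again
    return victim.steal_shallowest()


# Two processors: processor 0 has work at levels 0, 1, and 2.
d0, d1 = ReadyDeque(), ReadyDeque()
d0.push(0, "root")
d0.push(2, "deep")
d0.push(1, "mid")
local_pick = next_closure(0, [d0, d1])
stolen_pick = next_closure(1, [d0, d1])
```

Here processor 0 takes its deepest closure ("deep"), while the idle processor 1 steals the shallowest one ("root").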
Why Steal the Shallowest Closure? • Shallow threads probably produce more work and therefore reduce communication. • Shallow threads are more likely to be on the critical path.
Readying a Remote Closure • If a send_argument makes a remote closure ready, the closure is put on the sending processor's readyDeque. • This costs extra communication. • It is done to make the scheduler provably good. • Leaving the closure on the readyDeque of the processor where it resides works well in practice.
Performance of Applications • Tserial = time for C program • T1 = time for 1-processor Cilk program • Tserial / T1 = efficiency of the Cilk program • Efficiency is close to 1 for programs with moderately long threads: Cilk overhead is small.
Performance of Applications • T1 / TP = speedup • T1 / T∞ = average parallelism • If average parallelism is large then speedup is nearly perfect. • If average parallelism is small then speedup is much smaller.
Performance of Applications • Application speedup = efficiency × speedup = ( Tserial / T1 ) × ( T1 / TP ) = Tserial / TP
Modeling Performance • TP >= max( T∞, T1 / P ) • A good scheduler should come close to these lower bounds.
Modeling Performance • Empirical data suggests that for Cilk: TP ≈ c1 T1 / P + c∞ T∞, where c1 ≈ 1.067 & c∞ ≈ 1.042 • If T1 / T∞ > 10P then the critical path has little effect on TP.
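As a rough illustration of the empirical model, the Python sketch below plugs numbers into TP ≈ c1 T1 / P + c∞ T∞ using the two constants quoted on the slide. The function names and the sample workloads (T1 = 1000 with T∞ = 1 or 100, on P = 10 processors) are made up for this note and are not measurements.

```python
C1 = 1.067     # empirical constant on the work term (from the slide)
C_INF = 1.042  # empirical constant on the critical-path term (from the slide)


def predicted_time(t1, t_inf, p, c1=C1, c_inf=C_INF):
    """Model estimate of P-processor time: c1 * T1 / P + c_inf * T_inf."""
    return c1 * t1 / p + c_inf * t_inf


def speedup(t1, tp):
    """Speedup as defined in the slides: T1 / TP."""
    return t1 / tp


def average_parallelism(t1, t_inf):
    return t1 / t_inf


# Hypothetical workloads on P = 10 processors.
P = 10
wide = predicted_time(1000, 1, P)      # average parallelism 1000 > 10 * P
narrow = predicted_time(1000, 100, P)  # average parallelism 10 < 10 * P
wide_speedup = speedup(1000, wide)
narrow_speedup = speedup(1000, narrow)
```

With average parallelism well above 10P the work term dominates and the modeled speedup is about 9.3 on 10 processors; with average parallelism of only 10 the critical-path term is roughly as large as the work term and the modeled speedup drops to about 4.7.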
Proven Property: Time • Time: including scheduling overhead, the expected TP = O( T1 / P + T∞ ), which is asymptotically optimal
Conclusions • We can predict the performance of a Cilk program by observing machine-independent characteristics: • Work • Critical path when the program is fully-strict. • Cilk’s usefulness is unclear for other kinds of programs (e.g., iterative programs).
Conclusions ... • Explicit continuation passing is a nuisance. • It subsequently was removed (with more clever pre-processing).
Conclusions ... • Great system research has a theoretical underpinning. • Such research identifies important properties • of the systems themselves, or • of our ability to reason about them formally. • Cilk identified 3 significant system properties: • Fully strict programs • Non-blocking threads • Randomly choosing a victim.
The Cost of Spawns • A spawn is about an order of magnitude more costly than a C function call. • Spawned threads running on parent’s processor can be implemented more efficiently than remote spawns. • This usually is the case. • Compiler techniques can exploit this distinction.
Communication Efficiency • A request is an attempt to steal work (the victim may not have work). • Requests/processor & steals/processor both grow as the critical path grows.
Proven Properties: Space • In a fully strict program, each thread sends arguments only to its parent's successors. • For such programs, space, time, & communication bounds are proven. • Space: SP <= S1 P. • There exists a P-processor execution for which this is asymptotically optimal.
Proven Properties: Communication • Communication: the expected # of bits communicated in a P-processor execution is O( T∞ P SMAX ), where SMAX denotes the size of the program's largest closure. • There exists a program such that, for all P, there exists a P-processor execution that communicates k bits, where k > c T∞ P SMAX, for some constant c.