1 / 44

CILK: An Efficient Multithreaded Runtime System

CILK: An Efficient Multithreaded Runtime System. People. Project at MIT & now at UT Austin Bobby Blumofe (now UT Austin, Akamai) Chris Joerg Brad Kuszmaul (now Yale) Charles Leiserson (MIT, Akamai) Keith Randall (Bell Labs) Yuli Zhou (Bell Labs). Outline. Introduction

oke
Download Presentation

CILK: An Efficient Multithreaded Runtime System

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CILK: An Efficient Multithreaded Runtime System

  2. People • Project at MIT & now at UT Austin • Bobby Blumofe (now UT Austin, Akamai) • Chris Joerg • Brad Kuszmaul (now Yale) • Charles Leiserson (MIT, Akamai) • Keith Randall (Bell Labs) • Yuli Zhou (Bell Labs)

  3. Outline • Introduction • Programming environment • The work-stealing thread scheduler • Performance of applications • Modeling performance • Proven Properties • Conclusions

  4. Introduction • Why multithreading? To implement dynamic, asynchronous, concurrent programs. • Cilk programmer optimizes: • total work • critical path • A Cilk computation is viewed as a dynamic, directed acyclic graph (dag)

  5. Introduction ...

  6. Introduction ... • Cilk program is a set of procedures • A procedureis a sequence of threads • Cilk threads are: • represented by nodes in the dag • Non-blocking: run to completion: nowaiting or suspension: atomic units of execution

  7. Introduction ... • Threads can spawn child threads • downward edges connect a parent to its children • A child & parent can run concurrently. • Non-blocking threads  a child cannot return a value to its parent. • The parent spawns a successor that receives values from its children

  8. Introduction ... • A thread & its successor are parts of the same Cilk procedure. • connected by horizontal arcs • Children’s returned values are received before their successor begins: • They constitute data dependencies. • Connected by curved arcs

  9. Introduction ...

  10. Introduction: Execution Time • Execution time of a Cilk program using P processors depends on: • Work (T1): time for Cilk program with 1 processor to complete. • Critical path (T): the time to execute the longest directed path in the dag. • TP >= T1 / P(not true for some searches) • TP >= T

  11. Introduction: Scheduling • Cilk uses run time scheduling called work stealing. • Works well on dynamic, asynchronous, MIMD-style programs. • For “fully strict” programs, Cilk achieves asymptotic optimality for: space, time, & communication

  12. Introduction: language • Cilk is an extension of C • Cilk programs are: • preprocessed to C • linked with a runtime library

  13. Programming Environment • Declaring a thread: thread T ( <args> ) { <stmts> } • T is preprocessed into a C function of 1 argument and return type void. • The 1 argument is a pointer to a closure

  14. Environment: Closure • A closureis a data structure that has: • a pointer to the C function for T • a slot for each argument (inputs & continuations) • a join counter: count of the missing argument values • A closure is ready when join counter == 0. • A closure is waiting otherwise. • They are allocated from a runtime heap

  15. Environment: Continuation • A Cilk continuationis a data type, denoted by the keyword cont. cont int x; • It is a global reference to an empty slot of a closure. • It is implemented as 2 items: • a pointer to the closure; (what thread) • an int value: the slot number. (what input)

  16. Environment: Closure

  17. Environment: spawn • To spawn a child, a thread creates its closure: spawn T (<args> ) • creates child’s closure • sets available arguments • sets join counter • To specify a missing argument, prefix with a “?” spawn T (k, ?x);

  18. Environment: spawn_next • A successor thread is spawned the same way as a child, except the keyword spawn_next is used: spawn_next T(k, ?x) • Children typically have no missing arguments; successors do.

  19. Explicit continuation passing • Nonblocking threads  a parent cannot block on children’s results. • It spawns a successor thread. • This communication paradigm is called explicit continuation passing. • Cilk provides a primitive to send a value from one closure to another.

  20. send_argument • Cilk provides the primitive send_argument( k, value ) sends value to the argument slot of a waiting closure specified by continuation k. spawn_next successor parent spawn send_argument child

  21. Cilk Procedure for computing a Fibonacci number threadint fib( contint k, int n ) { if ( n < 2 )send_argument( k, n ); else {contint x, y; spawn_nextsum ( k, ?x, ?y ); spawnfib ( x, n - 1 ); spawnfib ( y, n - 2 ); } } thread sum ( contint k, int x, int y ) { send_argument( k, x + y ); }

  22. Nonblocking Threads: Advantages • Shallow call stack. (for us: fault tolerance ) • Simplify runtime system: Completed threads leave C runtime stack empty. • Portable runtime implementation

  23. Nonblocking Threads: Disdvantages Burdens programmer with explicit continuation passing.

  24. Work-Stealing Scheduler • The concept of work-stealing goes at least as far back as 1981. • Work-stealing: • a process with no work selects a victim from which to get work. • it gets the shallowest thread in the victim’s spawn tree. • In Cilk, thieves choose victims randomly.

  25. Thread Level

  26. Stealing Work: The Ready Deque • Each closure has a level: • level( child ) = level( parent ) + 1 • level( successor ) = level( parent ) • Each processor maintains a readydeque: • Contains ready closures • The Lth element contains the list of all ready closures whose level is L.

  27. Ready deque if ( ! readyDeque .isEmpty() ) take deepest thread else stealshallowest thread from readyDeque of randomly selected victim

  28. Why Steal Shallowest closure? • Shallow threads probably produce more work, therefore, reducecommunication. • Shallow threads more likely to be on critical path.

  29. Readying a Remote Closure • If a send_argument makes a remote closure ready, put closure on sending processor’s readyDeque • extra communication. • Done to make scheduler provably good • Putting on local readyDeque works well in practice.

  30. Performance of Application • Tserial = time for C program • T1 = time for 1-processor Cilk program • Tserial /T1 = efficiency of the Cilk program • Efficiencyisclose to 1 for programs with moderately long threads: Cilk overhead is small.

  31. Performance of Applications • T1/TP = speedup • T1/ T= average parallelism • If average parallelism is large then speedup is nearly perfect. • If average parallelism is small then speedup is much smaller.

  32. Performance Data

  33. Performance of Applications • Application speedup = efficiency X speedup • = ( Tserial /T1 ) X ( T1/TP ) = Tserial / TP

  34. Modeling Performance • TP >= max( T, T1 / P ) • A good scheduler should come close to these lower bounds.

  35. Modeling Performance Empirical data suggests that for Cilk: TPc1 T1 / P + c  T , where c1 1.067 & c  1.042 If T1 / T> 10P then critical path does not affect TP.

  36. Proven Property: Time Time: Including overhead, TP = O( T1/P + T ), which is asymptotically optimal

  37. Conclusions • We can predict the performance of a Cilk program by observing machine-independent characteristics: • Work • Critical path when the program is fully-strict. • Cilk’s usefulness is unclear for other kinds of programs (e.g., iterative programs).

  38. Conclusions ... Explicit continuation passing a nuisance. It subsequently was removed (with more clever pre-processing).

  39. Conclusions ... • Great system research has a theoretical underpinning. • Such research identifies important properties • of the systems themselves, or • of our ability to reason about them formally. • Cilk identified 3 significant system properties: • Fully strict programs • Non-blocking threads • Randomly choosing a victim.

  40. END

  41. The Cost of Spawns • A spawn is about an order of magnitude more costly than a C function call. • Spawned threads running on parent’s processor can be implemented more efficiently than remote spawns. • This usually is the case. • Compiler techniques can exploit this distinction.

  42. Communication Efficiency • A request is an attempt to steal work (the victim may not have work). • Requests/processor & steals/processor both grow as the critical path grows.

  43. Proven Properties: Space • A fully strict program’s threads send arguments only to its parent’s successors. • For such programs, space, time, & communication bounds are proven. • Space: SP <= S1 P. • There exists a P-processor execution for which this is asymptotically optimal.

  44. Proven Properties: Communication Communication: The expected # of bits communicated in a P-processor execution is: O( T P SMAX ) where SMAX denotes its largest closure. There exists a program such that, for all P, there exists a P-processor execution that communicates k bits, where k > c T P SMAX, for some constant, c.

More Related