Pillar • Jim Stichnoth, Programming Systems Lab, Intel
Outline • Overview of the Pillar project • Details on the Pillar runtime Pillar - Jim Stichnoth - 2008-12-01
Motivation • Implementing & tuning each new concurrent language is a lot of work! • Compiler • Find concurrency opportunities • Global optimizations • Register allocation, code scheduling, instruction selection • Runtime • Threading, synchronization, data parallelism, transactions, ... • Garbage collection, stack walking, exceptions, ...
Pillar • Parallel Implementation Language • Low-level implementation language for high-level concurrent & managed languages • Reusable compiler & runtime infrastructure across high-level languages
Pillar architecture (very high level) • [Figure: compile-time tool chain (parallel program, high-level language compiler, Pillar program, Pillar compiler, object code) and run-time executable (language runtime, Pillar runtime)]
The Pillar language • A set of C language extensions • Heavily influenced by C-- work at Harvard & Microsoft Research • Highly practical reason: reuse existing optimizing C compiler • Concurrency constructs • Thread creation, synchronization, data-parallel operations • Sequential constructs • Support managed languages, fix other shortcomings of C
Concurrency constructs: Parallel call • pcall(aff) func(a, b, c); • Fork a new child thread & execute func; parent & child run concurrently • Affinity/locality hint via aff • Join can be implemented via a shared synchronization object passed as a parameter
Concurrency constructs: Parallel-ready sequential call • prscall(aff) func(a, b, c); • Parallel-Ready Sequential Call (Goldstein’96) • Semantics identical to pcall • Parent thread starts eagerly executing child func • An idle thread can introduce concurrency by stealing parent’s continuation • Optimized for sequential execution
Concurrency constructs: Data parallelism • Intel’s Ct primitives • “C with throughput extensions” • Large set of nested data parallel primitives (a la NESL) • Compiler analysis & optimization • Future work
Concurrency constructs: Bulk-spawn • Efficiently spawn a number of threads with similar arguments • Useful for data-parallel operations • Join at the end • Future work
Concurrency constructs: Synchronization • Software transactions • Other common synchronization primitives
Sequential constructs: Stack walking • Pillar runtime provides a frame-by-frame iterator over a thread’s stack • Spans • span KEY value { ... } • Associate metadata with a block of code • Look up metadata during stack walking • Garbage collection • No specific GC implementation (or object model) is provided or dictated • ref obj; • New ref type allows compiler to track GC references in stack frame • Optional parameters to ref declaration allow for arbitrary language-defined reference variants • E.g. interior pointers, weak references, pinned objects, etc.
Sequential constructs: Second-class continuations • C’s setjmp/longjmp “done right” • Roughly speaking, continuation = setjmp and cut = longjmp • Directly in the language, not a library • A cut to the target continuation may pass arguments • Special source code annotations give the compiler extra control flow info • Syntax • continuation k(a, b, c): ... foo(k); ... cut to k(x, y, z); • foo() also cuts to k1, k2; • foo() also unwinds to k3, k4; • foo() never returns;
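The rough analogy on this slide (continuation = setjmp, cut = longjmp) can be sketched in standard C. This is not Pillar syntax: `cont_t`, `cut_to`, `foo`, and `run` are hypothetical names, and because `longjmp` carries only one `int`, the arguments travel through a separate buffer, which is an assumption about how argument passing could be emulated.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for a Pillar continuation (plain C, not Pillar). */
typedef struct {
    jmp_buf target;        /* where a cut resumes */
    volatile int *args;    /* argument buffer owned by the target frame */
} cont_t;

/* Emulates "cut to k(x, y, z)": copy the arguments, then jump. */
static void cut_to(cont_t *k, int x, int y, int z) {
    assert(k != NULL && k->args != NULL);
    k->args[0] = x;
    k->args[1] = y;
    k->args[2] = z;
    longjmp(k->target, 1);  /* does not return */
}

/* A callee that either returns normally or cuts to k. */
static int foo(cont_t *k, int n) {
    if (n < 0)
        cut_to(k, n, 0, 42);
    return n * 2;
}

/* Returns foo's result, or the sum of the cut arguments if foo cut to k. */
int run(int n) {
    /* volatile: written between setjmp and longjmp, read afterwards */
    volatile int argbuf[3] = {0, 0, 0};
    cont_t k;
    k.args = argbuf;
    if (setjmp(k.target) != 0) {
        /* plays the role of the body of "continuation k(a, b, c):" */
        return argbuf[0] + argbuf[1] + argbuf[2];
    }
    return foo(&k, n);
}
```

Calling `run(5)` takes the normal return path, while `run(-2)` resumes in the continuation branch. The `volatile` buffer and the library calls are exactly the kind of workaround that a language-level construct with compiler-visible control flow is meant to avoid.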
Sequential constructs: Calls • Tail calls • tailcall foo(); • Particularly for compiling functional languages • Managed/unmanaged calls • Unmanaged (legacy) code uses calling conventions like __cdecl, __stdcall, etc. • Managed (Pillar) functions implicitly add the managed attribute • Compiler recognizes mismatches, redirects through Pillar runtime routine • Allows stack unwinding past sections of unmanaged frames • #pragma managed(off) • #include <stdio.h> • #pragma managed(on)
Composable cuts • Cuts compose poorly with some operations • Example: cutting out of a transaction • A calls B • B starts a transaction • B calls C transactionally • C cuts back into A • Transaction was not ended! • Many other examples • Pillar’s solution: composable cuts • See LCPC’2007 paper for more details • function A(): ... B(k); ... continuation k: ... • function B(k): ... txn_begin(); C(k); txn_end(); ... • function C(k): ... cut to k; ...
Pillar compiler • Modification of Intel’s product compiler • Continuations: model additional control-flow edges, killing of callee-save registers during a cut • Recognize managed/unmanaged calls • GC support: track GC references through all compiler phases • Stack unwinding metadata: frame-by-frame unwinding, spans, GC roots • Implement Pillar runtime API to decode metadata at run time
Pillar runtime • Implements key Pillar services • Parallel calls, prscall continuation stealing, futures • Stack walking, root set enumeration • Composable cuts • Invokes Pillar compiler’s metadata decoder as necessary • Built on top of McRT (Intel’s “Many Core Run Time”) • Provides core services such as user-level threads, scheduling, synchronization, software transactional memory • Approximately 7,000 lines of C code • API is architecture-neutral except for machine word size & stack iterator’s set of registers
Pillar architecture • [Figure: compile-time tool chain (parallel program, language converter, Pillar program, Pillar compiler with metadata decoder, object code & metadata) and run-time executable (language runtime, Pillar runtime, GC interface, garbage collector, McRT)]
High-level languages • Java • Main motivation: throw huge volume of Pillar code at the Pillar compiler and runtime • Exercises stack iteration, spans, GC support, second-class continuations, managed/unmanaged calls • X10 • Leverages IBM’s Java-based open-source reference implementation • Hard to study performance/scalability using reference implementation! • Concurrent functional language • Lots of concurrency due to limitations on side effects & dependencies • Exercises GC support, second-class continuations, tail calls • Implements a futures package using pcall
Outline • Overview of the Pillar project • Gory details of the Pillar runtime & architecture
Stack walking • Pillar runtime interface for iterating over a thread’s stack frames • Get youngest frame, get next frame, test for last frame • Getting next frame means simulating a function return • Access metadata associated with a stack frame • Look up span metadata • Enumerate the root set to the GC • How? • Code generator registers callback functions with Pillar runtime • Perform operations like: unwind one frame, look up span metadata • Associated with a code address range • Thus code generator defines its own flexible metadata format • Or, stack walking can be sped up by using a standard metadata format
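The callback scheme described on this slide can be sketched as a small simulation in C. Everything here is hypothetical (`frame_t`, `code_info_t`, `register_code_range`, and the example callbacks are not the Pillar runtime API), and frames are modelled as a linked list rather than real machine state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simulated frame; a real iterator would hold register state. */
typedef struct frame {
    unsigned long ip;        /* code address executing in this frame */
    struct frame *caller;    /* NULL for the oldest frame */
} frame_t;

/* Callbacks a code generator registers for one code address range [lo, hi). */
typedef struct {
    unsigned long lo, hi;
    frame_t *(*unwind)(frame_t *f);                         /* simulate a return */
    const char *(*span_lookup)(const frame_t *f, int key);  /* NULL if no span */
} code_info_t;

#define MAX_RANGES 8
static code_info_t ranges[MAX_RANGES];
static int n_ranges;

static void register_code_range(code_info_t info) {
    assert(n_ranges < MAX_RANGES);
    ranges[n_ranges++] = info;
}

static const code_info_t *find_range(unsigned long ip) {
    for (int i = 0; i < n_ranges; i++)
        if (ip >= ranges[i].lo && ip < ranges[i].hi)
            return &ranges[i];
    return NULL;
}

/* Example callbacks: the metadata format is entirely up to the generator. */
static frame_t *unwind_simple(frame_t *f) { return f->caller; }
static const char *span_simple(const frame_t *f, int key) {
    return (key == 1 && f->ip >= 0x1100) ? "handler" : NULL;
}

/* Walk from the youngest frame; count frames and frames with a span for key. */
int walk_stack(frame_t *youngest, int key, int *with_span) {
    int frames = 0;
    *with_span = 0;
    for (frame_t *f = youngest; f != NULL; ) {
        const code_info_t *ci = find_range(f->ip);
        if (ci == NULL)
            break;                       /* unknown code: cannot unwind further */
        frames++;
        if (ci->span_lookup(f, key) != NULL)
            (*with_span)++;
        f = ci->unwind(f);               /* "get next frame" */
    }
    return frames;
}

int demo_walk(void) {
    frame_t f0 = {0x1010, NULL}, f1 = {0x1120, &f0}, f2 = {0x1230, &f1};
    code_info_t info = {0x1000, 0x2000, unwind_simple, span_simple};
    int with_span;
    register_code_range(info);
    int frames = walk_stack(&f2, 1, &with_span);
    return frames * 10 + with_span;      /* 3 frames, 2 with a span */
}
```

The runtime only dispatches on the code address; how `unwind` and `span_lookup` decode their metadata stays private to whoever generated the code in that range.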
Garbage collection support • Pillar language & runtime do not dictate an object model • This is a contract between the Pillar program (generated from the high-level language) and the GC implementation (provided with the HLL) • Minimal language/runtime support allows highly flexible range of GC implementations • Generalized references • ref(TAG, parameter) r; • Predefined tags • PrtGcTagDefault: r is the canonical object pointer • PrtGcTagBase: r is a (possibly interior) pointer with respect to parameter • PrtGcTagOffset: r is an interior pointer at a given offset from the canonical base • Other user-defined tags, e.g. weak roots, pinned, etc. • Predefined tags correspond to traditional compiler optimizations • Compiler metadata allows tag & parameter to be passed directly to the GC
Implementing exceptions • Iterate frame-by-frame using the stack unwinding interface • Use the span lookup interface to get handler metadata • Decide whether current frame handles the exception • Use “also unwinds to” metadata to get handler • Annotation on a function call (ordered list of continuations) • Causes compiler to produce metadata allowing lookup & instantiation of continuations during stack unwinding • Use the “cut to” mechanism to transfer control to the handler
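One way a GC could consume the (tag, parameter) pair for the three predefined tags is sketched below in C. The tag names come from the slide; the enum encoding, the `canonical_object` helper, and the reading of `PrtGcTagBase` as "parameter holds the base pointer" are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Tag names are from the slide; the numeric encoding is an assumption. */
typedef enum { PrtGcTagDefault, PrtGcTagBase, PrtGcTagOffset } PrtGcTag;

/* Hypothetical GC-side helper: recover the canonical object pointer
   for a root r reported with the given tag and parameter. */
void *canonical_object(void *r, PrtGcTag tag, uintptr_t parameter) {
    switch (tag) {
    case PrtGcTagDefault:
        return r;                          /* r already is the object */
    case PrtGcTagBase:
        return (void *)parameter;          /* assumed: parameter is the base */
    case PrtGcTagOffset:
        return (char *)r - parameter;      /* parameter is a byte offset */
    }
    assert(!"user-defined tags are not handled in this sketch");
    return r;
}

int demo_refs(void) {
    static char object[64];
    void *interior = object + 16;
    return canonical_object(object, PrtGcTagDefault, 0) == (void *)object
        && canonical_object(interior, PrtGcTagBase, (uintptr_t)object) == (void *)object
        && canonical_object(interior, PrtGcTagOffset, 16) == (void *)object;
}
```

Interior pointers of this kind typically arise from optimizations such as strength reduction, which is why the predefined tags map onto traditional compiler transformations.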
Cuts & continuations • A continuation is like a C jmp_buf, allowing a unit-time cut back to somewhere in the continuation’s stack frame • Also allows arguments to be passed back • Structure in Pillar: • Code address (initialized lazily) • Optional argument buffer space • Cut operation is simple • Copy arguments into buffer space • Load continuation address into predetermined register • Jump to code address • Continuation code needs to fix up stack frame based on predetermined register value
Composable cuts • Each thread maintains lightweight virtual stack alongside regular stack • Virtual stack head (VSH) in TLS • A virtual stack element (VSE) defines a destructor operation • The application pushes/pops VSEs in a balanced fashion • When instantiating a continuation, also capture current VSH • Fat continuation contains code address, VSH, and arguments • During a cut operation, intervening destructors are executed
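The virtual-stack mechanism can be sketched in standard C, reusing the transaction scenario in which A calls B, B begins a transaction and calls C, and C cuts back into A. All names (`vse_t`, `fat_cont_t`, `composable_cut`, the `open_txns` counter) are hypothetical, `setjmp`/`longjmp` stands in for Pillar's cut, and the VSH is a plain global here rather than a TLS field.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Virtual stack element: a destructor plus a link to the next element. */
typedef struct vse {
    struct vse *next;
    void (*destructor)(struct vse *self);
} vse_t;

static vse_t *vsh;               /* virtual stack head (TLS in Pillar) */

/* "Fat" continuation: jump target plus the VSH captured at instantiation. */
typedef struct {
    jmp_buf target;
    vse_t *vsh;
} fat_cont_t;

static void vse_push(vse_t *e, void (*d)(vse_t *)) {
    e->destructor = d;
    e->next = vsh;
    vsh = e;
}

static void vse_pop(void) {
    assert(vsh != NULL);
    vsh = vsh->next;
}

/* Run every destructor pushed since k was instantiated, then cut. */
static void composable_cut(fat_cont_t *k) {
    while (vsh != k->vsh) {
        vse_t *e = vsh;
        vsh = e->next;
        e->destructor(e);
    }
    longjmp(k->target, 1);
}

static int open_txns;            /* toy model of transaction state */
static void txn_abort(vse_t *self) { (void)self; open_txns--; }

static void C(fat_cont_t *k) { composable_cut(k); }

static void B(fat_cont_t *k) {
    vse_t txn;
    open_txns++;                 /* txn_begin() */
    vse_push(&txn, txn_abort);
    C(k);                        /* cuts past the code below */
    vse_pop();
    open_txns--;                 /* txn_end() */
}

/* Plays the role of A: returns the number of transactions left open. */
int demo_composable_cut(void) {
    fat_cont_t k;
    k.vsh = vsh;                 /* capture VSH when instantiating k */
    if (setjmp(k.target) == 0)
        B(&k);
    return open_txns;
}
```

With a plain `longjmp` from C, `open_txns` would still be 1 when control reaches A; unwinding the virtual stack first is what ends the transaction.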
Cuts & continuations example • void foo(args) { int a; ref b; … bar(k1) also cuts to k1; … continuation k1(a, b): … continuation k2(b, a): … } • Instantiated k1 within the method forces preserved registers to be saved in prolog • Stack space allocated for continuations & locals • Instantiating k1 initializes eip & vsh fields • Eventually some method cuts to k1 • Run any destructors based on TLS.vsh & k1.vsh (see earlier example) • Continuation prolog adjusts esp, copies args • Now ready to resume in k1’s code • [Figure: foo’s stack frame, with the stack growing toward low memory addresses (esp): in-args, ret IP, saved registers (ebp, ebx, esi, edi), locals a and b, and continuation records k1 and k2, each with eip, vsh, and argument slots (a’, b’); initialized and uninitialized stack space are distinguished]
Virtual stack notes • VSEs can be used as markers as well as destructors • E.g., the location of a prscall or managed-to-unmanaged transition • Markers have a trivial destructor • Pillar runtime interface for iterating over VSEs • A VSE may contain GC roots • Register an enumeration function for a VSE type
Stack limit check • Prscall continuation stealing results in two threads sharing one stack • Parent uses bottom part, child uses top part • With a long & dense enough chain of prscalls, the thread stack can become arbitrarily small • Therefore, each function prolog must begin with a limit check and conditional stack extension sequence • Special tailcall sequence to a stack-extension runtime routine that re-invokes function with a fresh stack • Observation: Every function begins with a yield check and a stack limit check. Can they be combined? • Suspending a thread installs a special stack limit value such that limit checks always fail • Explicit prolog yield check can be removed
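The combined check can be illustrated with a small C simulation. The `tls_t` layout, the `UINTPTR_MAX` sentinel, and the counters are assumptions chosen for illustration; a real prolog would compare the hardware stack pointer and tail-call into a runtime routine rather than bump counters.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread state; field names are illustrative only. */
typedef struct {
    uintptr_t stack_limit;    /* lowest usable stack address */
    uintptr_t saved_limit;    /* real limit while a suspend is pending */
    int suspend_requested;
    int yields;               /* times the thread yielded */
    int extensions;           /* times the stack had to be extended */
} tls_t;

/* Suspending installs a limit that makes every prolog check fail. */
static void request_suspend(tls_t *t) {
    t->saved_limit = t->stack_limit;
    t->stack_limit = UINTPTR_MAX;
    t->suspend_requested = 1;
}

/* Slow path: sort out whether this was a yield, a real overflow, or both. */
static void prolog_slow_path(tls_t *t, uintptr_t sp, uintptr_t frame_size) {
    if (t->suspend_requested) {
        t->yields++;                       /* stand-in for yielding */
        t->suspend_requested = 0;
        t->stack_limit = t->saved_limit;   /* restore the real limit */
    }
    if (sp - frame_size < t->stack_limit)
        t->extensions++;                   /* stand-in for stack extension */
}

/* The only check on the fast path: one compare and one branch. */
static void prolog_check(tls_t *t, uintptr_t sp, uintptr_t frame_size) {
    assert(sp >= frame_size);
    if (sp - frame_size < t->stack_limit)
        prolog_slow_path(t, sp, frame_size);
}

int demo_prolog(void) {
    tls_t t = { 0x1000, 0, 0, 0, 0 };
    prolog_check(&t, 0x8000, 64);          /* plenty of room: fast path */
    int fast_ok = (t.yields == 0 && t.extensions == 0);
    request_suspend(&t);
    prolog_check(&t, 0x8000, 64);          /* fails only because of suspend */
    prolog_check(&t, 0x1010, 64);          /* genuinely out of stack */
    return fast_ok * 100 + t.yields * 10 + t.extensions;
}
```

The fast path pays for a single comparison, and the cost of telling a suspend request apart from a real overflow moves to the rarely taken slow path.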
Managed/unmanaged calls • When calling into unmanaged (legacy) code, it’s no longer possible to reliably walk the stack frame-by-frame • Solution: push a managed-to-unmanaged (M2U) VSE before calling • Record all relevant context in M2U VSE • When unwinding from an unmanaged frame, search the virtual stack for the topmost M2U VSE • Restore context from VSE • Resume unwinding managed frames
Thread-local storage • Pillar needs several TLS fields • Current stack limit value • Yield semaphore • Language-specific TLS pointer • E.g., nursery parameters for fast allocation • Virtual stack head • Etc. • TLS accesses tend to be very frequent • Therefore, a callee-save register (ebx) is reserved to hold TLS pointer within managed code • Substantial performance gain despite loss of register • In a pcall/prscall, child inherits parent’s language-specific TLS pointer • Child may want to override
Cooperative preemption • Only a certain subset of instructions are GC-safe • I.e., the root set can be accurately determined • Compiler typically chooses the function entry and call sites as GC safepoints • Compiler generates calls to prtYield() at GC safepoints • Fast-path: check whether the TLS yield semaphore field is set • Can be inlined by compiler • Pillar runtime provides a suspend/resume interface • And an interface for iterating over threads
McRT • McRT = Many-core RunTime • Internal platform for concurrency research • Features include: • Thread creation & scheduling • Large set of synchronization primitives • Scalable malloc/free • Software transactional memory • Pillar requires a few enhancements • Pillar-provided thread-id for synchronization • Maintain appearance of separate threads for prscall parent & child • Allows blocked threads to unblock to respond to suspend requests • Thread enumeration, in the presence of thread creation and dying • Ability to enumerate GC roots of “unborn” threads • Idle-wait function that can trigger prscall continuation stealing
Private nurseries • Observation: High allocation rate of short-lived objects kills scalability • Some combination of memory & cache coherence traffic • The more we improved sequential performance, the worse scalability became! • Solution: Allocate from a thread-local “private nursery” • Invariant: No heap objects outside the private nursery point into the private nursery • If an update of an object field would break the invariant, do a private nursery collection • Move all live objects from the private nursery to the regular heap • Reset the private nursery • Doesn’t require stopping any other threads
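The invariant and the collection it triggers can be sketched as a toy model in C. The object layout, the fixed-size arrays standing in for nursery and heap, the explicit root array, and all function names are assumptions; a real collector would find roots by walking the thread's stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy object: one reference field, a forwarding pointer, and a payload. */
typedef struct obj {
    struct obj *field;
    struct obj *forward;     /* set once the object has been moved */
    int value;
} obj_t;

#define NURSERY_SIZE 16
#define HEAP_SIZE 64
static obj_t nursery[NURSERY_SIZE];   /* thread-local in a real system */
static obj_t heap[HEAP_SIZE];
static size_t nursery_top, heap_top;
static int collections;

static int in_nursery(const obj_t *o) {
    uintptr_t p = (uintptr_t)o;
    return p >= (uintptr_t)nursery && p < (uintptr_t)(nursery + NURSERY_SIZE);
}

/* Bump allocation from the private nursery. */
static obj_t *nursery_alloc(int value) {
    assert(nursery_top < NURSERY_SIZE);
    obj_t *o = &nursery[nursery_top++];
    o->field = NULL; o->forward = NULL; o->value = value;
    return o;
}

/* Move o and everything reachable from it into the regular heap. */
static obj_t *evacuate(obj_t *o) {
    if (o == NULL || !in_nursery(o)) return o;
    if (o->forward != NULL) return o->forward;
    assert(heap_top < HEAP_SIZE);
    obj_t *copy = &heap[heap_top++];
    *copy = *o;
    o->forward = copy;                 /* set before recursing: handles cycles */
    copy->field = evacuate(o->field);
    return copy;
}

/* Private nursery collection: evacuate roots, then reset the nursery. */
static obj_t *collect_nursery(obj_t **roots, size_t n, obj_t *extra) {
    for (size_t i = 0; i < n; i++) roots[i] = evacuate(roots[i]);
    extra = evacuate(extra);
    nursery_top = 0;
    collections++;
    return extra;
}

/* Write barrier: a heap -> nursery store would break the invariant. */
static void write_field(obj_t *target, obj_t *value, obj_t **roots, size_t n) {
    if (!in_nursery(target) && in_nursery(value))
        value = collect_nursery(roots, n, value);
    target->field = value;
}

int demo_nursery(void) {
    obj_t *h = &heap[heap_top++];      /* an object already in the heap */
    h->field = NULL; h->forward = NULL; h->value = 0;
    obj_t *n1 = nursery_alloc(1);
    obj_t *n2 = nursery_alloc(2);
    obj_t *roots[1] = { n1 };
    write_field(n1, n2, roots, 1);     /* nursery -> nursery: no collection */
    write_field(h, n1, roots, 1);      /* heap -> nursery: collect first */
    return collections == 1 && nursery_top == 0
        && !in_nursery(h->field) && h->field->value == 1
        && h->field->field->value == 2 && roots[0] == h->field;
}
```

Because nothing outside the nursery points into it, the collection touches only this thread's roots and objects, which is why no other thread has to stop.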
Private nurseries • Problem: Finding roots in deep stacks can be expensive • Observation: Deeper portions of stack tend to remain unchanged • Solution: “high-water marks” on stack • Each stack frame contains a high-water mark • Mark is cleared upon function entry • Stack walking interface allows mark to be set, and status to be queried • Stack walk for a private nursery collection can terminate early when a marked frame is found • Problem: “Unstable” performance with private nurseries • Overly frequent collections can kill performance • Hard to predict from static analysis of code • Experimenting with more general forms of escape analysis to reduce private nursery collections
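The early-termination idea behind high-water marks can be sketched with simulated frames in C. The frame layout and function names are hypothetical, and two simplifications are assumed: the scan leaves the youngest frame unmarked because it keeps executing, and returns into marked frames are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated frame carrying a high-water mark. */
typedef struct frame {
    struct frame *caller;
    int marked;
} frame_t;

/* Function entry: link the new frame and clear its mark. */
static void enter_function(frame_t *f, frame_t *caller) {
    f->caller = caller;
    f->marked = 0;
}

/* Root scan for a private nursery collection: stop at the first marked
   frame, since it and all older frames were scanned by an earlier
   collection and have not run since. Returns the frames visited. */
int scan_roots(frame_t *youngest) {
    int visited = 0;
    assert(youngest != NULL);
    for (frame_t *f = youngest; f != NULL && !f->marked; f = f->caller) {
        visited++;              /* a real collector enumerates f's roots here */
        if (f != youngest)
            f->marked = 1;      /* assumed: the running frame stays unmarked */
    }
    return visited;
}

int demo_high_water(void) {
    frame_t f[7];
    enter_function(&f[0], NULL);
    for (int i = 1; i < 5; i++)
        enter_function(&f[i], &f[i - 1]);
    int first = scan_roots(&f[4]);     /* visits all 5 frames */
    enter_function(&f[5], &f[4]);      /* two more calls after the collection */
    enter_function(&f[6], &f[5]);
    int second = scan_roots(&f[6]);    /* visits f[6], f[5], f[4], then stops */
    return first * 10 + second;
}
```

In this run the second scan costs three frames regardless of how deep the older, unchanged part of the stack is.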
Prscall • Leave a special prscall frame/VSE on the stack • Flag indicates whether prscall continuation has been stolen • Provide some extra space for stolen continuation to expand into • Create new thread ID for child • Call child function • VSE destructor prevents cutting into “different” thread • Idle processor/thread suspends threads, looking for prscall • May be beneficial to steal deepest prscall • Set the continuation-stolen flag • Split the stack between parent and child • Including the virtual stack • Force a private-nursery collection • Cut to a continuation found within VSE, which returns to caller • Child returns to prscall frame, finds continuation-stolen flag set, exits
Prscall challenges • Expected benefits • Dynamic load balancing • No locking in the common case • Auxiliary storage managed on stack, not heap • Difficulties/drawbacks • Stack limit check on every function entry • Inlining reduces function calls; combining with yield check helps • Possible stack extension/retraction hysteresis after stealing • What’s the best policy for where to steal? • Stealing also has to do a private nursery collection • Without high-water mark optimization • Finding the right granularity of concurrency
Concurrent functional language • Working with a game company to design & implement the language • Novel type system • Functional style restricts dependencies, eases parallelization • Compilation/execution strategy: • Create thunks/closures to be evaluated • Compiler optimizations reduce number of thunks to evaluate, objects to allocate • Some (or all!) thunks can be spawned as futures • Vectorization for Larrabee • Pure C path in addition to Pillar • Boehm-Demers-Weiser conservative collector • Performance & scalability problems • Setjmp/longjmp
Future research • Affinity • Automatic (& semi-automatic) means of scheduling threads near their data • Transactional memory • Find the right division of work between Pillar and high-level language • Interactions between transactions, thread creation, & cuts/exceptions • Bulk spawns & vectorization for data parallelism • Other task parallel models