Eliminating Read Barriers through Procrastination and Cleanliness KC Sivaramakrishnan Lukasz Ziarek Suresh Jagannathan
MultiMLton • Deterministic Parallelism • Effect Isolation • Asynchronous CML (ACML) • Parasitic Threads • GC? • MLton for many-cores • Standard ML – functional PL with side-effects • Goals – Safe and Scalable programs
MultiMLton - Runtime System • User-level threads • Preemptive scheduling • Work-pushing Asynchronous CML Scheduler Substrate SML One-shot continuations Parasitic Threads VProc VProc VProc VProc C
Stop-the-world, Serial GC • From MLton GC to MultiMLton GC, quickly • Sansom’s “Dual-mode garbage collection” • Dynamically switches between 2-space and 1-space • Cheney’s copying / Jonkers’ sliding mark-compact • No fragmentation • Bump-pointer allocation • Appel’s generational collection • Adding multicore support • Memory allocator modified for local allocation • GC is still stop-the-world and serial
Many-core architectural trends • AMD “Magny-Cours”: 48 cores • Tilera Tile64: 64 cores • Intel SCC: 48 cores • NUMA effects • Cache coherence
Local collector Shared Heap Local Heap Local Heap Local Heap Local Heap VProc VProc VProc VProc • No synchronization for local allocation/collection! • Local collection is Sansom’s dual-mode collector • Shared heap is not the nth generation
Thread-local collectors • D. Doligez et al. (POPL ’93): SML with threads • R. Jones et al. (SCAM ’05): Java • B. Steensgaard (ISMM ’00): subset of Java • T. Anderson (ISMM ’10): a variant of MIT’s pH • S. Marlow et al. (ISMM ’11): GHC • S. Auhagen et al. (MSPC ’11): Manticore
Write Barrier Shared Heap r := x r Target Exporting writes Local Heap Source x
Write Barrier Shared Heap r := x r x Transitive closure of x Local Heap
Write Barrier Shared Heap r := x r x Transitive closure of x Local Heap x
Write Barrier Shared Heap r := x r x Local Heap FWD Mutator needs read barrier Mutations <<< Reads
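The lifting step in the slides above can be sketched in C. This is a toy model, not MultiMLton's runtime: the `Obj` layout, the fixed-size `shared_heap` arena, and the name `lift` applied to this simplified signature are all assumptions made for illustration. It copies the transitive closure of a local object into the shared heap and leaves a forwarding pointer in each original, which is exactly why the mutator would then need a read barrier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NFIELDS 2
#define SHARED_CAP 64

/* Hypothetical object layout for illustration only. */
typedef struct Obj {
    bool in_shared;              /* which heap the object lives in */
    struct Obj *fwd;             /* non-NULL once the object has been lifted */
    struct Obj *field[NFIELDS];
} Obj;

/* Toy shared heap: a fixed arena with bump-pointer allocation. */
static Obj shared_heap[SHARED_CAP];
static size_t shared_top = 0;

/* Copy the transitive closure of v into the shared heap. Each local
   original is left behind with a forwarding pointer to its copy. */
static Obj *lift(Obj *v) {
    if (v == NULL || v->in_shared) return v;   /* nothing to copy */
    if (v->fwd != NULL) return v->fwd;         /* already lifted */
    assert(shared_top < SHARED_CAP);
    Obj *copy = &shared_heap[shared_top++];
    copy->in_shared = true;
    copy->fwd = NULL;
    v->fwd = copy;                 /* set before recursing so cycles terminate */
    for (size_t i = 0; i < NFIELDS; i++)
        copy->field[i] = lift(v->field[i]);
    return copy;
}
```

After `lift(x)`, any stale local pointer to `x` still reaches the old copy, so a reader must check `fwd` unless the runtime guarantees such pointers are never dereferenced.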
Read Barrier Overheads 20.1 % 15.3 % 21.3 %
Read Barrier Statistics
pointer readBarrier (pointer p) {
  if (getHeader(p) == FORWARDED)  /* object was moved to the shared heap */
    return *(pointer*)p;          /* first word holds the new address */
  return p;
}
Checks Forwarded
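A self-contained version of this conditional barrier, as a minimal sketch: the `Obj` layout, the `NORMAL`/`FORWARDED` tags, and the `forward` helper are assumptions for illustration and do not reflect MLton's actual header encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one header word followed by the payload. When an
   object is forwarded, payload[0] is overwritten with the new address. */
typedef struct Obj {
    uintptr_t header;
    struct Obj *payload[2];
} Obj;

enum { NORMAL = 0, FORWARDED = 1 };   /* assumed header tags */

/* Conditional read barrier: every read pays for the header check,
   even though forwarded objects are rarely encountered. */
static Obj *read_barrier(Obj *p) {
    assert(p != NULL);
    if (p->header == FORWARDED)
        return p->payload[0];
    return p;
}

/* Mark `from` as moved to `to`, as a lifting write barrier would. */
static void forward(Obj *from, Obj *to) {
    from->header = FORWARDED;
    from->payload[0] = to;
}
```

The check runs on every pointer read while forwarding happens only on exporting writes, which is the imbalance the "Mutations <<< Reads" slide points at.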
Eliminate read barriers? • No need for read barriers if mutator can never witness forwarded objects • Do a local GC every time you export • Slower than with read barriers • Dynamically ensure mutators never get to see forwarded objects. • Procrastination: Exploit program concurrency to delay exporting writes • Cleanliness: Object closure cleanliness
New idea: Procrastination T1 T2 Shared Heap r1 r2 r1 := x1 r2 := x2 Local Heap x1 x2 T T is running T T is suspended T T is blocked
Procrastination T1 T2 Shared Heap r1 r2 r1 := x1 r2 := x2 Control switches to T2 Local Heap x1 x2 T T is running Delayed write list T T is suspended T T is blocked
Procrastination T1 T2 Shared Heap r1 r2 r1 := x1 r2 := x2 Local Heap x1 x2 T T is running Delayed write list T T is suspended T T is blocked
Procrastination T1 T2 Shared Heap r1 x1 r2 x2 r1 := x1 r2 := x2 Local Heap x1 T T is running Delayed write list T T is suspended T T is blocked
Procrastination T1 T2 Shared Heap r1 x1 r2 x2 r1 := x1 … r2 := x2 Local Heap Force local GC T T is running Delayed write list T T is suspended T T is blocked
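The two-thread scenario in these slides can be simulated with a small C sketch. Everything here is a simplification under stated assumptions: the `VProc` struct, `procrastinate`, and `schedule` are hypothetical names, shared references are modelled as plain `int` slots, and the forced local GC is reduced to a counter plus performing the delayed writes.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DELAYED 16

typedef struct {
    int *slot;    /* shared-heap reference cell (modelled as an int) */
    int value;    /* value whose export is being delayed */
} DelayedWrite;

typedef struct {
    DelayedWrite delayed[MAX_DELAYED];
    size_t ndelayed;
    int runnable;     /* threads waiting in this VProc's run queue */
    int local_gcs;    /* number of forced local collections so far */
} VProc;

/* Called at an exporting write: record it and block the current thread. */
static void procrastinate(VProc *vp, int *slot, int value) {
    assert(vp->ndelayed < MAX_DELAYED);
    vp->delayed[vp->ndelayed].slot = slot;
    vp->delayed[vp->ndelayed].value = value;
    vp->ndelayed++;
}

/* Called when the current thread blocks. If another thread is runnable,
   switch to it. Otherwise force a local GC, which performs every delayed
   write at once and makes the blocked threads runnable again. */
static void schedule(VProc *vp) {
    if (vp->runnable > 0) { vp->runnable--; return; }
    if (vp->ndelayed == 0) return;
    vp->local_gcs++;                      /* stands in for the local GC */
    for (size_t i = 0; i < vp->ndelayed; i++)
        *vp->delayed[i].slot = vp->delayed[i].value;
    vp->runnable += (int)vp->ndelayed;
    vp->ndelayed = 0;
}
```

With T1 and T2 both performing an exporting write, one forced collection services both writes, so the cost of the collection is amortized across the delayed write list.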
Is Procrastination alone enough? • Procrastination depends on the availability of runnable threads at an exporting write • Runnable threads << total threads (thread density) • Also needed: eager exporting writes that preserve the “mutator never sees forwarding pointers” invariant
Exporting write characteristics • Sources of exporting writes • Immutable >> Mutable • Tend to be young • References rarely from outside the closure (other than stacks) • Object closure cleanliness • Heap Sessions (Young objects) • Reference counts (Safety of eager export)
Heap Session Local Heap Previous Session Current Session Free SessionStart Frontier • Sessions closed/started after a • User-level thread switch • Exporting write • Local GC
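A heap session can be modelled as the region between `SessionStart` and `Frontier` in a bump-allocated heap. The sketch below assumes a hypothetical `LocalHeap` struct and helper names; only the two pointers come from the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical local heap with bump-pointer allocation. */
typedef struct {
    char *session_start;  /* first byte of the current session */
    char *frontier;       /* next free byte */
    char *limit;          /* end of the heap */
} LocalHeap;

static void *heap_alloc(LocalHeap *h, size_t n) {
    assert((size_t)(h->limit - h->frontier) >= n);
    void *p = h->frontier;
    h->frontier += n;
    return p;
}

/* Close the current session and start a new, empty one. Per the slide,
   this happens after a user-level thread switch, an exporting write,
   or a local GC. */
static void new_session(LocalHeap *h) {
    h->session_start = h->frontier;
}

/* An object is in the current session iff it was allocated since the
   last session boundary: a two-comparison range check. */
static bool in_current_session(const LocalHeap *h, const void *p) {
    const char *c = p;
    return c >= h->session_start && c < h->frontier;
}
```

Because the session is a contiguous address range, membership costs two comparisons and needs no per-object flag.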
Reference Counting (legend: local heap allocated object; object in current session) • Counts the number of references to current-session objects • Does not consider references from stacks or registers • Count is one of ZERO, ONE, LOCAL_MANY, GLOBAL
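One plausible reading of the four-valued count is a small saturating state machine. The slide does not spell out the transitions, so the sketch below rests on an assumption: LOCAL_MANY means several references that all originate in the current session, and GLOBAL means at least one reference from outside it. The function name `add_reference` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ZERO, ONE, LOCAL_MANY, GLOBAL } RefCount;

/* New count after one more heap reference to the object is created.
   Assumed rule: a referrer outside the current session saturates the
   count at GLOBAL; otherwise the count climbs ZERO, ONE, LOCAL_MANY. */
static RefCount add_reference(RefCount c, bool referrer_in_current_session) {
    if (!referrer_in_current_session) return GLOBAL;
    switch (c) {
    case ZERO:       return ONE;
    case ONE:        return LOCAL_MANY;
    case LOCAL_MANY: return LOCAL_MANY;
    case GLOBAL:     return GLOBAL;
    default:         assert(0 && "invalid count"); return GLOBAL;
    }
}
```

The count never decreases in this sketch, so it is conservative: it may call an object GLOBAL after the outside reference has died, which costs precision but not safety.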
Cleanliness • An object closure is said to be clean, if for each object O in the closure • O is immutable or is in the shared heap. Or, • O is the root, and has ZERO references. Or, • O is not the root, and has ONE reference. Or, • O is not the root, has LOCAL_MANY references, and is in the current session. • Boils down to 2 cases: • Tree-structured closure • Arbitrary Graph
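The definition above translates fairly directly into a closure walk. The following is a sketch under assumptions, not the runtime's `isClean`: the `Obj` fields, the `visited` scratch mark, and the point at which `needs_fixup` is set (on meeting a LOCAL_MANY object, since other current-session objects may still point at it) are all choices made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NFIELDS 2

typedef enum { ZERO, ONE, LOCAL_MANY, GLOBAL } RefCount;

/* Hypothetical object layout for illustration only. */
typedef struct Obj {
    bool is_mutable;
    bool in_shared;
    bool in_current_session;
    bool visited;               /* scratch mark used only during the walk */
    RefCount count;
    struct Obj *field[NFIELDS];
} Obj;

/* The four per-object rules from the slide. */
static bool object_is_clean(const Obj *o, bool is_root, bool *needs_fixup) {
    if (!o->is_mutable || o->in_shared) return true;
    if (is_root && o->count == ZERO) return true;
    if (!is_root && o->count == ONE) return true;
    if (!is_root && o->count == LOCAL_MANY && o->in_current_session) {
        *needs_fixup = true;    /* arbitrary-graph case: session must be traced */
        return true;
    }
    return false;
}

static bool walk(Obj *o, bool is_root, bool *needs_fixup) {
    if (o == NULL || o->in_shared || o->visited) return true;
    if (!object_is_clean(o, is_root, needs_fixup)) return false;
    o->visited = true;
    for (size_t i = 0; i < NFIELDS; i++)
        if (!walk(o->field[i], false, needs_fixup)) return false;
    return true;
}

static void clear_marks(Obj *o) {
    if (o == NULL || !o->visited) return;
    o->visited = false;
    for (size_t i = 0; i < NFIELDS; i++) clear_marks(o->field[i]);
}

/* True if every object in the closure of root satisfies one of the rules. */
static bool is_clean(Obj *root, bool *needs_fixup) {
    assert(root != NULL && needs_fixup != NULL);
    *needs_fixup = false;
    bool ok = walk(root, true, needs_fixup);
    clear_marks(root);
    return ok;
}
```

A closure whose mutable objects all have count ONE (with a ZERO root) is the tree-structured case and needs no fixup; meeting a LOCAL_MANY object corresponds to the arbitrary-graph case.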
Graph – Session Based Trace current session
Write Barrier
Val writeBarrier (Ref r, Val v) {
  if (isInSharedHeap (r) && isInLocalHeap (v)) {
    needsFixup = False;
    if (isClean (v, &needsFixup))
      v = lift (v, needsFixup);      // lift eagerly
    else
      v = suspendTillGCAndLift (v);  // delay write
  }
  return v;
}
Write Barrier • Summary • Read barriers are expensive in MultiMLton • Eliminate read barriers by preventing the mutator from ever witnessing forwarding pointers
Benchmark Characteristics Lots of concurrency Low sharing
Performance on AMD • At 3X: RB+ 32%, STW 106%, BDW 584%
Performance on AZUL • At 3X: RB+ 30%
MultiMLton - SCC implementation • No cache coherence • Cluster-on-chip architecture • Private off-die DRAM regions (one per core): caches enabled, one Linux instance per core; local heaps reside here • Shared/global off-die DRAM region: caches disabled by default; shared heap resides here • Shared on-die MPB regions: cached in L1, L2 bypassed, fast L1 invalidation for MPB data; used for coordinating VProcs
Performance on SCC • At 3X: RB+ 20%
Cleanliness Impact (2) Low thread density
Conclusion • Local collectors seem to be a good choice for many-core architectures • Better Cache Behavior • Minimize NUMA effects • Overcome cache coherence issues (partially) • Read barriers in local collectors can be expensive • Eliminate them through procrastination and cleanliness
MLton Heap Layout From Space (major) Nursery Heap To Space (major) Old Gen To Space (minor) Nursery
MLton GC – Minor Collection To Space (major) Old Gen To Space (minor) Nursery To Space (major) Old Gen To Space (minor) Nursery
MLton GC – Major Copying Collection To Space (major) Old Gen Old Gen To Space (minor) Nursery To Space (major) From Space
MLton GC – Major Mark-Compact Old Gen Free Old Gen To Space (minor) Nursery
Read Barrier
• Unconditional (Brooks style): needs extra header word
pointer readBarrier (pointer p) {
  return *(pointer*)(p - IND_OFF);
}
• Conditional (Baker style): has conditional check
pointer readBarrier (pointer p) {
  if (*(Header*)(p - HD_OFF) == F)
    return *(pointer*)p;
  return p;
}
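To make the trade-off concrete, here is a minimal sketch of the unconditional (Brooks-style) variant. The `BObj` struct and the `brooks_*` helper names are assumptions for illustration: every object carries an indirection word that points to itself until the object is moved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: the indirection word is the extra header word. */
typedef struct BObj {
    struct BObj *indirection;  /* self, or the new copy after a move */
    int payload;
} BObj;

static void brooks_init(BObj *o, int payload) {
    o->indirection = o;        /* unmoved objects point to themselves */
    o->payload = payload;
}

/* Unconditional read barrier: always one extra load, never a branch. */
static BObj *brooks_read(BObj *p) {
    assert(p != NULL);
    return p->indirection;
}

/* Move `from` to `to` and redirect the old copy's indirection word. */
static void brooks_forward(BObj *from, BObj *to) {
    to->payload = from->payload;
    to->indirection = to;
    from->indirection = to;
}
```

Compared with the conditional form, this trades a branch on every read for one extra word per object and one extra load per read.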