Eliminating Read Barriers through Procrastination and Cleanliness

KC Sivaramakrishnan, Lukasz Ziarek, Suresh Jagannathan

Presentation Transcript


  1. Eliminating Read Barriers through Procrastination and Cleanliness • KC Sivaramakrishnan • Lukasz Ziarek • Suresh Jagannathan

  2. MultiMLton • Deterministic Parallelism • Effect Isolation • Asynchronous CML (ACML) • Parasitic Threads • GC? • MLton for many-cores • Standard ML – functional PL with side-effects • Goals – Safe and Scalable programs

  3. MultiMLton - Runtime System • User-level threads • Preemptive scheduling • Work-pushing • Diagram labels: Asynchronous CML; Scheduler; Substrate; SML; One-shot continuations; Parasitic Threads; VProc (×4); C

  4. Stop-the-world, Serial GC • MLton GC → MultiMLton GC quickly • Sansom’s “Dual-mode garbage collection” • Dynamically switch between 2-space and 1-space • Cheney’s copying ↔ Jonkers’ sliding mark-compact • No fragmentation • Bump-pointer allocation • Appel’s generational collection • Adding multicore support • Memory allocator modified for local allocation • GC is still stop-the-world and serial

  5. How did we do?

  6. Many-core architectural trends • AMD “Magny-Cours”: 48 cores • Tilera Tile64: 64 cores • Intel SCC: 48 cores • NUMA effects • Cache coherence

  7. Local collector • Shared Heap • Local Heap (×4) • VProc (×4)

  10. Local collector • Shared Heap • Local Heap (×4) • VProc (×4) • No synchronization for local allocation/collection! • Local collection is Sansom’s dual-mode • Shared heap is not the nth generation

  11. Thread-local collectors • D. Doligez et al. (POPL ’93): SML with threads • R. Jones et al. (SCAM ’05): Java • B. Steensgaard (ISMM ’00): subset of Java • T. Anderson (ISMM ’10): a variant of MIT’s pH • S. Marlow et al. (ISMM ’11): GHC • S. Auhagen et al. (MSPC ’11): Manticore

  12. Write Barrier • Exporting writes: r := x • Shared Heap: r (target) • Local Heap: x (source)

  13. Write Barrier • r := x • Diagram labels: Shared Heap; r; x; Transitive closure of x; Local Heap

  14. Write Barrier • r := x • Diagram labels: Shared Heap; r; x; Transitive closure of x; Local Heap; x

  15. Write Barrier • r := x • Diagram labels: Shared Heap; r; x; Local Heap; FWD • Mutator needs read barrier • Mutations <<< Reads

  16. Read Barrier Overheads 20.1 % 15.3 % 21.3 %

  17. Read Barrier Statistics pointer readBarrier (pointer *p) { if (getHeader(p) == FORWARDED) return *(pointer*)p; return p; } Checks Forwarded
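The conditional barrier on this slide can be exercised with a small stand-alone sketch. The object layout (one header word followed by the payload), the `FORWARDED` sentinel value, and the `payloadOf` helper are assumptions made for illustration and are not taken from the MultiMLton sources; the barrier also takes the object pointer by value, a small departure from the slide's signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned char *pointer;
typedef uintptr_t Header;

/* Assumed sentinel: a header with this value marks a forwarded object. */
#define FORWARDED ((Header)1)

/* Hypothetical layout: header word, then payload. For a forwarded
 * object the first payload word holds the address of the new copy. */
typedef struct {
    Header header;
    pointer payload[2];
} Object;

/* Mutator-visible pointers refer to the payload, not the header. */
static pointer payloadOf(Object *o) {
    return (pointer)o + offsetof(Object, payload);
}

static Header getHeader(pointer p) {
    return *(Header *)(p - offsetof(Object, payload));
}

/* Conditional read barrier: one header check on every dereference. */
static pointer readBarrier(pointer p) {
    if (getHeader(p) == FORWARDED)
        return *(pointer *)p;   /* follow the forwarding pointer */
    return p;                   /* object has not moved */
}
```

Every pointer read pays for the header load and the branch, even though (per the earlier slide) mutations are far rarer than reads.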

  18. Eliminate read barriers? • No need for read barriers if mutator can never witness forwarded objects • Do a local GC every time you export • Slower than with read barriers • Dynamically ensure mutators never get to see forwarded objects. • Procrastination: Exploit program concurrency to delay exporting writes • Cleanliness: Object closure cleanliness

  19. New idea: Procrastination • Threads: T1, T2 • Shared Heap: r1, r2 • Writes: r1 := x1, r2 := x2 • Local Heap: x1, x2 • Legend (thread states): running, suspended, blocked

  20. Procrastination • Threads: T1, T2 • Shared Heap: r1, r2 • Writes: r1 := x1, r2 := x2 • Control switches to T2 • Local Heap: x1, x2 • Delayed write list

  21. Procrastination • Threads: T1, T2 • Shared Heap: r1, r2 • Writes: r1 := x1, r2 := x2 • Local Heap: x1, x2 • Delayed write list

  22. Procrastination • Threads: T1, T2 • Shared Heap: r1, x1, r2, x2 • Writes: r1 := x1, r2 := x2 • Local Heap: x1 • Delayed write list

  23. Procrastination • Threads: T1, T2 • Shared Heap: r1, x1, r2, x2 • Writes: r1 := x1, …, r2 := x2 • Local Heap • Force local GC • Delayed write list
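The sequence on slides 19 to 23 can be approximated by a delayed write list. The sketch below is a toy model under stated assumptions: heap membership is a boolean flag, each object has a single field, and `writeOrDelay` and `localGCAndFlush` are hypothetical names. A real local collection would copy the transitive closure of each source and fix every reference to it, which is what keeps forwarding pointers invisible to the mutator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DELAYED 64

typedef struct Obj {
    bool inSharedHeap;    /* which heap the object currently lives in */
    struct Obj *field;    /* single mutable field, enough for r := x */
} Obj;

typedef struct {
    Obj *ref;             /* target r in the shared heap */
    Obj *val;             /* source x in the local heap */
} DelayedWrite;

static DelayedWrite delayed[MAX_DELAYED];
static size_t delayedCount = 0;

/* Returns true if the write was performed immediately. Returns false
 * if it was an exporting write that has been queued, in which case the
 * calling thread should be suspended and another thread scheduled. */
static bool writeOrDelay(Obj *r, Obj *x) {
    bool exporting = r->inSharedHeap && !x->inSharedHeap;
    if (!exporting) {
        r->field = x;
        return true;
    }
    assert(delayedCount < MAX_DELAYED);
    delayed[delayedCount].ref = r;
    delayed[delayedCount].val = x;
    delayedCount++;
    return false;
}

/* Stand-in for the forced local GC: move every delayed source to the
 * shared heap, then complete the queued writes. Returns how many
 * writes were completed. */
static size_t localGCAndFlush(void) {
    size_t n = delayedCount;
    for (size_t i = 0; i < n; i++) {
        delayed[i].val->inSharedHeap = true;     /* toy "move" */
        delayed[i].ref->field = delayed[i].val;  /* finish r := x */
    }
    delayedCount = 0;
    return n;
}
```

Batching several exporting writes behind one local collection is what amortizes the cost, provided other threads are runnable in the meantime.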

  24. Is Procrastination alone enough? • Procrastination depends on the availability of runnable threads at the time of an exporting write • Runnable threads << total threads (thread density) • Need eager exporting writes that preserve the “mutator never sees forwarding pointers” invariant

  25. Exporting write characteristics • Sources of exporting writes • Immutable >> Mutable • Tend to be young • References rarely from outside the closure (other than stacks) • Object closure cleanliness • Heap Sessions (Young objects) • Reference counts (Safety of eager export)

  26. Heap Session • Local Heap layout: Previous Session, Current Session (from SessionStart to Frontier), Free • Sessions are closed/started after a: • User-level thread switch • Exporting write • Local GC
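A heap session can be modelled as a sub-range of a bump-allocated local heap. In this minimal sketch the `LocalHeap` struct and the function names are assumptions for illustration; only `sessionStart` and `frontier` correspond to labels on the slide, and alignment is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned char *start;         /* base of the local heap */
    unsigned char *sessionStart;  /* first byte of the current session */
    unsigned char *frontier;      /* next free byte (bump pointer) */
    unsigned char *limit;         /* one past the end of the local heap */
} LocalHeap;

/* Bump-pointer allocation. NULL means the heap is full, which in a
 * real runtime would trigger a local GC instead. */
static void *allocate(LocalHeap *h, size_t bytes) {
    if ((size_t)(h->limit - h->frontier) < bytes)
        return NULL;
    void *p = h->frontier;
    h->frontier += bytes;
    return p;
}

/* Close the current session and open a new, empty one. Called after a
 * user-level thread switch, an exporting write, or a local GC. */
static void startNewSession(LocalHeap *h) {
    h->sessionStart = h->frontier;
}

/* An object is "young" if it lies between sessionStart and frontier. */
static bool isInCurrentSession(const LocalHeap *h, const void *p) {
    const unsigned char *q = p;
    return q >= h->sessionStart && q < h->frontier;
}
```

The membership test is two pointer comparisons, so bounding a trace to the current session is cheap.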

  27. Reference Counting • Diagram labels: local heap allocated object; object in current session • Count the number of references to current-session objects • Does not consider references from stacks or registers • Count is one of ZERO, ONE, LOCAL_MANY, GLOBAL
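The four count values behave like a small saturating state machine. The slide names the states but does not define the transitions, so the sketch below rests on an assumed reading: `LOCAL_MANY` means more than one reference, all from objects in the current session, and `GLOBAL` means at least one reference originates outside it. The function name `addReference` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ZERO, ONE, LOCAL_MANY, GLOBAL } RefCount;

/* New count after one more reference to the object is stored.
 * fromCurrentSession tells whether the referring object itself lives
 * in the current session (assumed meaning, see the text above). */
static RefCount addReference(RefCount c, bool fromCurrentSession) {
    if (c == GLOBAL || !fromCurrentSession)
        return GLOBAL;            /* GLOBAL is absorbing */
    return (c == ZERO) ? ONE : LOCAL_MANY;
}
```

Because the count saturates, it fits in a couple of header bits rather than a full word.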

  28. Cleanliness • An object closure is said to be clean, if for each object O in the closure • O is immutable or is in the shared heap. Or, • O is the root, and has ZERO references. Or, • O is not the root, and has ONE reference. Or, • O is not the root, has LOCAL_MANY references, and is in the current session.

  30. Cleanliness • An object closure is said to be clean, if for each object O in the closure • O is immutable or is in the shared heap. Or, • O is the root, and has ZERO references. Or, • O is not the root, and has ONE reference. Or, • O is not the root, has LOCAL_MANY references, and is in the current session. • Boils down to 2 cases: • Tree-structured closure • Arbitrary Graph
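The four cleanliness conditions translate directly into a per-object predicate plus a traversal of the closure. This is a sketch under assumptions, not the runtime's `isClean`: the `Object` fields, the fixed fan-out of two, and the `visited` flag (standing in for mark bits) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ZERO, ONE, LOCAL_MANY, GLOBAL } RefCount;

typedef struct Object {
    bool immutable;
    bool inSharedHeap;
    bool inCurrentSession;
    RefCount count;
    struct Object *fields[2];   /* assumed fixed fan-out */
    bool visited;               /* stand-in for a mark bit */
} Object;

/* The four cases from the slide, checked in order. */
static bool objectIsClean(const Object *o, bool isRoot) {
    if (o->immutable || o->inSharedHeap)
        return true;
    if (isRoot)
        return o->count == ZERO;
    if (o->count == ONE)
        return true;
    return o->count == LOCAL_MANY && o->inCurrentSession;
}

/* A closure is clean if every object reachable from the root is clean.
 * The visited flag makes the walk terminate on cyclic graphs; callers
 * are expected to clear it afterwards. */
static bool closureIsClean(Object *o, bool isRoot) {
    if (o == NULL || o->visited)
        return true;
    o->visited = true;
    if (!objectIsClean(o, isRoot))
        return false;
    for (int i = 0; i < 2; i++)
        if (!closureIsClean(o->fields[i], false))
            return false;
    return true;
}
```

A root with `ZERO` references whose children all have `ONE` reference is the tree-structured case; `LOCAL_MANY` objects inside the current session give the arbitrary-graph case.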

  31. Tree-structured closure

  32. Graph – Session Based Trace current session

  33. Write Barrier
      Val writeBarrier (Ref r, Val v) {
        if (isInSharedHeap(r) && isInLocalHeap(v)) {
          bool needsFixup = false;
          if (isClean(v, &needsFixup))
            v = lift(v, needsFixup);        // lift eagerly
          else
            v = suspendTillGCAndLift(v);    // delay write
        }
        return v;
      }

  34. Write Barrier • Summary • Read barriers are expensive in MultiMLton • Eliminate read barriers by preventing the mutator from ever witnessing forwarding pointers • Code: writeBarrier listing, as on slide 33

  35. Benchmark Characteristics Lots of concurrency Low sharing

  36. Performance on AMD • At 3X: RB+ 32%, STW 106%, BDW 584%

  37. Performance on AZUL • At 3X: RB+ 30%

  38. MultiMLton - SCC implementation • No cache coherence between cores • Cluster-on-chip architecture • Private off-die DRAM regions (one per core): caches enabled; one Linux instance per core; local heaps reside here • Shared/global off-die DRAM region: caches disabled by default; shared heap resides here • Shared on-die MPB regions: cached in L1, L2 bypassed; fast L1 invalidation for MPB data; used for coordinating VProcs

  39. Performance on SCC • At 3X: RB+ 20%

  40. Cleanliness Impact (1)

  41. Cleanliness Impact (2) Low thread density

  42. Session Impact

  43. Conclusion • Local collectors seem to be a good choice for many-core architectures • Better Cache Behavior • Minimize NUMA effects • Overcome cache coherence issues (partially) • Read barriers in local collectors can be expensive • Eliminate them through procrastination and cleanliness

  44. Backup slides

  45. MLton Heap Layout From Space (major) Nursery Heap To Space (major) Old Gen To Space (minor) Nursery

  46. MLton GC – Minor Collection To Space (major) Old Gen To Space (minor) Nursery To Space (major) Old Gen To Space (minor) Nursery

  47. MLton GC – Major Copying Collection To Space (major) Old Gen Old Gen To Space (minor) Nursery To Space (major) From Space

  48. MLton GC – Major Mark-Compact Old Gen Free Old Gen To Space (minor) Nursery

  49. Read Barrier • Unconditional (Brooks style) • Conditional (Baker style) • Diagram labels: From, To

  50. Read Barrier • Unconditional (Brooks style): pointer readBarrier (pointer *p) { return *(pointer*)(p - IND_OFF); } Needs extra header word • Conditional (Baker style): pointer readBarrier (pointer *p) { if (*(Header*)(p - HD_OFF) == F) return *(pointer*)p; return p; } Has conditional check • Diagram labels: From, To, F
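The unconditional (Brooks-style) variant can be sketched as follows. The `BrooksObject` layout and the helpers `payloadOf`, `initObject` and `forwardTo` are assumptions for illustration; `IND_OFF` is taken from the slide and defined here as the distance from the indirection word to the payload.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char *pointer;

/* Hypothetical layout: every object carries an extra indirection word
 * ahead of its payload. It points at the object's own payload until
 * the object moves, after which it points at the new copy. */
typedef struct {
    pointer indirection;
    long payload[2];
} BrooksObject;

#define IND_OFF offsetof(BrooksObject, payload)

static pointer payloadOf(BrooksObject *o) {
    return (pointer)o + IND_OFF;
}

/* A fresh object refers to itself. */
static void initObject(BrooksObject *o) {
    o->indirection = payloadOf(o);
}

/* Unconditional read barrier: always one extra load, never a branch. */
static pointer readBarrier(pointer p) {
    return *(pointer *)(p - IND_OFF);
}

/* Moving an object copies the payload and redirects the old copy. */
static void forwardTo(BrooksObject *from, BrooksObject *to) {
    to->payload[0] = from->payload[0];
    to->payload[1] = from->payload[1];
    initObject(to);
    from->indirection = payloadOf(to);
}
```

The trade-off against the conditional form is space for time: no branch on the read path, at the price of one extra word in every object.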
