Efficient Concurrent Mark-Sweep Cycle Collection

Daniel Frampton, Stephen Blackburn, Luke Quinane and John Zigman (pending submission) • Presented by Jose Joao • CS395T, Mar 23, 2009

Outline • Motivation • Backup tracing • Trial deletion • Mark-Sweep Cycle Detection (MSCD) • Results • What worked and what didn’t • Discussion

Motivation • Reference counting (RC) can identify garbage directly (i.e. locally) • Low pause times • Reasonable throughput (deferred, coalescing, ulterior RC) • But it cannot reclaim cyclic garbage • Existing general solutions are expensive: • Trace the whole heap (backup tracing) • Tentatively delete a candidate object and see if its cycle collapses (trial deletion)

Trial deletion • A partial mark-sweep (no roots required): finds objects that are alive only because they are reachable from themselves • Three phases: • Assume the candidate object is dead; mark and decrement its children recursively • Trace again from the candidate object, marking and incrementing wherever an RC is not zero, i.e. the object is externally reachable • Sweep objects left with a zero count • Bacon and Rajan: process candidates en masse, skip acyclic objects, concurrent algorithm • Usually less efficient than concurrent tracing
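The three phases above can be sketched on a toy object graph. This is a minimal, synchronous, single-candidate sketch in the style of Bacon and Rajan's algorithm, not the concurrent version; `Obj`, `link` and `trial_delete` are hypothetical names introduced for illustration, and the colors here are local to trial deletion.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Obj:
    """Toy heap object (hypothetical): a reference count plus outgoing pointers."""
    name: str
    rc: int = 0
    children: list = field(default_factory=list)
    color: str = "black"


def link(src, dst):
    """Create a pointer src -> dst and count it."""
    src.children.append(dst)
    dst.rc += 1


def mark_grey(obj):
    # Phase 1: assume obj is dead and subtract the counts due to internal pointers.
    if obj.color != "grey":
        obj.color = "grey"
        for child in obj.children:
            child.rc -= 1
            mark_grey(child)


def scan_black(obj):
    # Undo phase 1 for a subgraph that turned out to be externally reachable.
    obj.color = "black"
    for child in obj.children:
        child.rc += 1
        if child.color != "black":
            scan_black(child)


def scan(obj):
    # Phase 2: a non-zero count means some pointer from outside still exists.
    if obj.color == "grey":
        if obj.rc > 0:
            scan_black(obj)
        else:
            obj.color = "white"
            for child in obj.children:
                scan(child)


def collect_white(obj, freed):
    # Phase 3: sweep the objects whose count stayed at zero.
    if obj.color == "white":
        obj.color = "black"  # prevents revisiting
        freed.append(obj.name)
        for child in obj.children:
            collect_white(child, freed)


def trial_delete(candidate):
    """Return the names of objects reclaimed when starting from `candidate`."""
    freed = []
    mark_grey(candidate)
    scan(candidate)
    collect_white(candidate, freed)
    return freed
```

Note that the subgraph is traversed up to three times, which hints at why the slide calls this usually less efficient than concurrent tracing.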

Backup tracing • Trace all live objects and sweep the entire heap • Shortcomings: • Increases pause times • Concurrency for low pause times requires synchronization, e.g. write barrier • Visits all objects, although some cannot be part of a cycle

MSCD: base algorithm • 1. Add roots to the mark queue • 2. Mark until the mark queue is empty: • 2.1 Pop from the queue and process (mark, scan and add children to the queue) • 2.2 Enqueue objects subject to races (fixup set) • 3. Sweep
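A minimal sketch of this loop, assuming a toy heap and a hypothetical `take_fixup_set` callback that returns (and clears) the objects that may have been subject to races since it was last called. `Node`, `mark_from` and `mscd_cycle` are illustrative names, not the authors' code.

```python
from collections import deque


class Node:
    """Toy heap object (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.children = []
        self.marked = False


def mark_from(queue):
    # Pop and process: mark, scan, and enqueue children until the queue drains.
    while queue:
        obj = queue.popleft()
        if obj.marked:
            continue
        obj.marked = True
        queue.extend(obj.children)


def mscd_cycle(roots, heap, take_fixup_set):
    """Run one collection over `heap`; return the unmarked (dead) objects."""
    queue = deque(roots)                # add roots to the mark queue
    while True:
        mark_from(queue)                # mark until the queue is empty
        queue.extend(take_fixup_set())  # enqueue objects subject to races
        if not queue:                   # nothing new to fix up: marking is done
            break
    dead = [obj for obj in heap if not obj.marked]  # sweep
    for obj in heap:
        obj.marked = False              # reset mark state for the next cycle
    return dead
```

The sweep here walks the whole toy heap for simplicity; it is not the purple-set sweep discussed later in the slides.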

MSCD: concurrency • Builds on top of coalescing RC with a snapshot-at-the-beginning write barrier: • Atomic state update so that each object is processed only once • Record all pre-mutation pointers for deferred RC decrements • Record the object as mutated
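The barrier's three actions can be sketched as follows. This is a single-threaded illustration under stated assumptions: the `logged` flag stands in for the atomic state update (a real barrier would need an atomic operation), and `Obj`, `CoalescingBarrier`, `dec_buffer` and `mod_buffer` are hypothetical names.

```python
class Obj:
    """Toy heap object (hypothetical) with a fixed number of pointer fields."""
    def __init__(self, name, nfields):
        self.name = name
        self.fields = [None] * nfields
        self.logged = False
        self.rc = 0


class CoalescingBarrier:
    def __init__(self):
        self.dec_buffer = []  # pre-mutation referents, decremented later
        self.mod_buffer = []  # objects mutated since the last RC collection

    def write(self, src, index, new_target):
        # Only the first write to `src` since the last collection is logged.
        if not src.logged:  # assumption: done atomically in a real barrier
            src.logged = True
            self.dec_buffer.extend(t for t in src.fields if t is not None)
            self.mod_buffer.append(src)
        src.fields[index] = new_target

    def process(self):
        # Increments first: count the referents each mutated object holds now.
        for src in self.mod_buffer:
            for target in src.fields:
                if target is not None:
                    target.rc += 1
            src.logged = False
        # Then the deferred decrements for the snapshotted old referents.
        for target in self.dec_buffer:
            target.rc -= 1
        self.mod_buffer.clear()
        self.dec_buffer.clear()
```

Intermediate values written between two collections never touch any count, which is the coalescing effect: writing `c` and then `d` into the same field only ever increments `d`.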

MSCD: concurrency • Colors: black = marked and scanned, grey = marked, not yet scanned, white = not yet visited • Necessary conditions for a race: • Create a pointer from a black object to a white object C • Destroy the last path from a grey object to that white object C • [Figure, example 1: RC(C): 1 → 2 → 1; C is never visited and is incorrectly collected] • [Figure, example 2: RC(C): 1 → 2 → 1; again, C is never visited and is incorrectly collected] • [Figure, example 3: RC(E): 2 → 1; same outcome]
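The race can be reproduced on a three-object graph. The sketch below interleaves a marker and a mutator by hand; `Node`, `mark_step` and `race_demo` are hypothetical names, and an object that is queued but not yet popped plays the role of a grey object.

```python
from collections import deque


class Node:
    """Toy heap object (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.children = []
        self.marked = False


def mark_step(queue):
    # One marker step: pop an object, mark it, and enqueue its children.
    obj = queue.popleft()
    if not obj.marked:
        obj.marked = True
        queue.extend(obj.children)


def race_demo(use_fixup):
    """Return True if C ends up marked, False if it would be wrongly swept."""
    a, b, c = Node("A"), Node("B"), Node("C")
    a.children = [b]
    b.children = [c]
    queue = deque([a])

    mark_step(queue)      # A is now black, B is queued (grey), C is white

    # The mutator runs between two marker steps:
    a.children.append(c)  # new pointer from black A to white C
    b.children.remove(c)  # last path from grey B to C is destroyed
    fixup = [c]           # RC(C) went 1 -> 2 -> 1: decremented to non-zero

    while queue:
        mark_step(queue)  # the marker finishes without ever reaching C
    if use_fixup:
        queue.extend(fixup)
        while queue:
            mark_step(queue)
    return c.marked
```

Calling `race_demo(False)` leaves C unmarked even though A still points to it, while `race_demo(True)` marks it through the fixup set.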

MSCD: concurrency • Key insight: how to reduce the size of the fixup set? Use the set of objects whose RC was decremented to a non-zero value • These decrements are a necessary condition for cyclic garbage • These decrements are uncommon • Easy to identify while processing the decrement buffer (after increments) • Robust to coalescing of reference counts • These are the purple objects, or candidates for trial deletion (Bacon & Rajan) • It’s enough to compute this set at tracing time • Trade-offs?
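A minimal sketch of how the purple set could fall out of decrement processing, assuming increments have already been applied. `Obj` and `process_decrements` are hypothetical names; the recursive freeing of children is the usual RC behaviour, simplified here to a worklist.

```python
class Obj:
    """Toy heap object (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.rc = 0
        self.children = []


def process_decrements(dec_buffer):
    """Apply deferred decrements; return (freed, purple) lists of objects."""
    purple = {}  # id -> object, used as an insertion-ordered set
    freed = []
    pending = list(dec_buffer)
    while pending:
        obj = pending.pop()
        obj.rc -= 1
        if obj.rc == 0:
            # Plain RC garbage: free it and release what it points to.
            purple.pop(id(obj), None)  # a freed object is no longer a candidate
            freed.append(obj)
            pending.extend(obj.children)
        else:
            # Decremented to a non-zero value: possible member of a dead cycle.
            purple[id(obj)] = obj
    return freed, list(purple.values())
```

For a cycle A ↔ B that loses its only external reference, A's count drops from 2 to 1, so A lands in the purple set while ordinary acyclic garbage is freed directly.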

MSCD: marking • Statically determine acyclic classes: • No pointer fields, or • Can point only to acyclic classes • Set a green bit in the header of acyclic objects at allocation time • Ignore green objects for the fixup set (step 2.2 of the base algorithm?) • Why only step 2.2? How about step 2.1? • The sweep phase also has to treat green objects as marked • How about green objects pointed to only by non-green objects in a cycle? • Trade-offs?
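The two rules for acyclic classes amount to a least fixed point over the class graph. The sketch below assumes a hypothetical `field_types` mapping from each class name to the class names its pointer fields may refer to; it ignores subclassing, which a real analysis would have to handle conservatively.

```python
def acyclic_classes(field_types):
    """Return the set of class names that can never take part in a cycle.

    `field_types` (hypothetical input) maps a class name to the list of class
    names its pointer fields may reference; an empty list means no pointers.
    """
    green = set()
    changed = True
    while changed:
        changed = False
        for cls, targets in field_types.items():
            # A class turns green once every class it can point to is green.
            if cls not in green and all(t in green for t in targets):
                green.add(cls)
                changed = True
    return green
```

A class that can reach itself, directly or through others, never satisfies the rule and so stays non-green, as does any class pointing to a type that is missing from the mapping.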

MSCD: sweeping • Sweep only potentially cyclic objects and their children • Start with all purple objects • Trade-offs? • Much cheaper than scanning the heap • Requires keeping the set of all purple objects identified since the last cycle detection, not only during tracing • Space overhead • Time overhead of filtering RC-collected objects out of the purple set • Overhead increases with time between cycle detections!
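A minimal sketch of a sweep that starts from the purple set instead of walking the heap. `Node` and `sweep_from_purple` are hypothetical names; the sketch only identifies the dead objects and leaves out the reference-count adjustments a real collector would need for surviving children.

```python
class Node:
    """Toy heap object (hypothetical) with mark and green bits."""
    def __init__(self, name):
        self.name = name
        self.children = []
        self.marked = False
        self.green = False


def sweep_from_purple(purple):
    """Return the unmarked objects reachable from unmarked purple objects."""
    dead = []
    seen = set()
    stack = list(purple)
    while stack:
        obj = stack.pop()
        # Marked objects are live; green objects are treated as marked.
        if obj.marked or obj.green or id(obj) in seen:
            continue
        seen.add(id(obj))
        dead.append(obj)
        stack.extend(obj.children)  # children of garbage may be garbage too
    return dead
```

The work done is proportional to the purple set plus the garbage found, rather than to the heap size, which is the saving the slide refers to.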

MSCD: implementation • Interaction with the reference counter • Establish roots atomically • Add the complete fixup set to the mark queue • RC must not free objects pointed to by MSCD (mark queue and fixup queue): free buffer • Invocation heuristics • When RC is unable to free enough memory (?) • Heap fullness threshold • Size of the purple set • Can do trial deletion or backup tracing instead of MSCD

MSCD: possible timing • [Timeline figure. Labels: alternating Mutator and RC phases; New (grey); Roots; Fixup (at successive RC phases); MSCD: marking; Final marking; Sweeping]

Methodology and Results • Jikes RVM 2.3.4+CVS, MMTk • DaCapo beta050224, SPECjvm98 and pseudojbb • Stop-the-world (i.e. limit) throughput: • Trial deletion is about 70% worse than Backup MS, while MSCD is about 20% better than Backup MS • MSCD visits only 12% fewer nodes: • green objects on the fringe still have to be visited • green objects are short lived (many allocated, fewer on the heap at a given time) • MSCD has about 7% lower cost per visited node: • green objects are not scanned • sweep optimization

More Results • Concurrent throughput: • Bug in base and MSCD running on SMT (why not CMP?) • Time-slicing (i.e. single-context uniprocessor): no benefit from the concurrency optimization → fixup is too small • Overall performance (stop-the-world CD triggered by insufficient reclamation by RC): • MSCD with mark opt. is better than MSCD with both mark and sweep opt. due to the overhead of maintaining the purple set • Overhead of the grey bit and green bit • The heuristics that trigger CD matter, especially on tight heaps • Generations (e.g. ulterior RC) could reduce cycle detection load

Discussion • Main ideas: reduce the cost of backup MS by: • stopping the mark at the green-object frontier, • starting the sweep from purple objects, • reusing the concurrency mechanism from coalescing RC • Figure 6 shows about 50% of the total time is GC+CD (!) • Baseline is non-generational deferred/coalescing RC • Why not test concurrency on CMP in addition to/instead of SMT? • Synchronization is still required in the write barrier, although they claim the guard can be removed (?)

Open questions • Invocation heuristics (trade-offs?) • When running out of heap • At some heap occupancy threshold • Some form of estimating that there is enough cyclic garbage to trigger CD? • Hints from programmer/compiler? • Can we do better with CMPs?

Questions for the authors • Old version of Jikes RVM. Why? Does it matter? • For xalan and compress, green% + cycle% > 100% • Table 2 and Figure 5 don’t agree
