The Metronome: A Hard Real-time Garbage Collector David F. Bacon Perry Cheng V.T. Rajan IBM T.J. Watson Research Center
The Problem • Real-time systems growing in importance • Many CPUs in a BMW: “80% of innovation in SW” • Programmers left behind • Still using assembler and C • Lack productivity advantages of Java • Result • Complexity • Low reliability or very high validation cost
Problem Domain • Hard real-time garbage collection • Uniprocessor • Multiprocessors very rare in real-time systems • Complication: collector must be finely interleaved • Simplification: memory model easier to program • No truly concurrent operations • Sequentially consistent
Metronome Project Goals • Make GC feasible for hard real-time systems • Provide simple application interface • Develop technology that is efficient: • Throughput, Space comparable to stop-the-world • BREAK THE MILLISECOND BARRIER • While providing even CPU utilization
Outline • What is “Real-Time” Garbage Collection? • Problems with Previous Work • Overview of the Metronome • Scheduling • Empirical Results • Conclusions
3 Uniprocessor GC Types • [Figure: execution timelines for stop-the-world (STW), incremental (Inc), and real-time (RT) collectors, showing where GC #1 and GC #2 run over time]
Utilization vs Time: 2 s Window • [Figure: mutator utilization over time for the STW, Inc, and RT collectors, measured with a 2 s window]
Utilization vs Time: 0.4 s Window • [Figure: the same utilization curves measured with a 0.4 s window]
Minimum Mutator Utilization • [Figure: MMU as a function of window size for the STW, Inc, and RT collectors]
What is Real-time? It Is… • Maximum pause time < required response • CPU Utilization sufficient to accomplish task • Measured with MMU • Memory requirement < resource limit
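Since MMU is the measure used throughout the talk, a brute-force sketch of computing it may clarify the definition (the class, the step size, and the interval encoding are my own, not from the talk): given the collector's pauses as intervals, MMU(w) is the worst-case fraction of any window of length w that is left to the mutator.

```java
// Sketch: compute MMU(w) from a list of GC pauses.
// pauses[i] = {start, end} within [0, total); all times in seconds.
class Mmu {
    static double mmu(double[][] pauses, double total, double w) {
        double worst = 1.0;
        // Slide the window in small steps; a real tool would only need
        // to examine windows aligned with pause boundaries.
        for (double t = 0; t + w <= total; t += 0.001) {
            double gc = 0;
            for (double[] p : pauses) {
                double lo = Math.max(t, p[0]), hi = Math.min(t + w, p[1]);
                if (hi > lo) gc += hi - lo;   // pause time inside this window
            }
            worst = Math.min(worst, (w - gc) / w);
        }
        return worst;
    }
}
```

For example, a single 100 ms pause in a 1 s run yields MMU(0.5 s) = 0.8: the worst half-second window loses 100 ms to the collector.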
No Compaction • Hypothesis: fragmentation not a problem • Can use avoidance and coalescing [Johnstone] • Non-moving incremental collection is simpler • Problem: long-running applications • Reboot your car every 500 miles?
Copying Collection • Idea: copy into to-space concurrently • Compaction is part of basic operation • Problem: space usage • 2 semi-spaces plus space to mutate during GC • Requires 4-8 X max live data in practice
Work-Based Scheduling • The Baker fallacy: • “A real-time list processing system is one in which the time required by the elementary list processing operations…is bounded by a small constant” [Baker’78] • Implicitly assumes GC work done in mutator • What does “small constant” mean? • Typically, constant is not so small • And there is variability (fault rate analogy)
Is it real-time? Yes • Maximum pause time < 4 ms • MMU > 50% ±2% • Memory requirement < 2 X max live
Components of the Metronome • Segregated free list allocator • Geometric size progression limits internal fragmentation • Write barrier: snapshot-at-the-beginning [Yuasa] • Read barrier: to-space invariant [Brooks] • New techniques with only 4% overhead • Incremental mark-sweep collector • Mark phase fixes stale pointers • Selective incremental defragmentation • Moves < 2% of traced objects • Arraylets: bound fragmentation, large object ops • Time-based scheduling
Segregated Free List Allocator • Heap divided into fixed-size pages • Each page divided into fixed-size blocks • Objects allocated in smallest block that fits (e.g., block sizes 12, 16, 24)
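A minimal sketch of the segregated free-list idea (the class, the particular size classes, and the block encoding are illustrative, not the Metronome's): one free list per size class, and an allocation is served from the smallest class whose block fits.

```java
import java.util.ArrayDeque;

// Sketch of a segregated free list: each size class keeps its own
// list of free blocks (here just block addresses as integers).
class SegregatedFreeList {
    static final int[] SIZE_CLASSES = {12, 16, 24, 32, 48, 64};

    @SuppressWarnings("unchecked")
    final ArrayDeque<Integer>[] free = new ArrayDeque[SIZE_CLASSES.length];

    SegregatedFreeList() {
        for (int i = 0; i < free.length; i++) free[i] = new ArrayDeque<>();
    }

    // Index of the smallest size class that can hold `size` bytes.
    static int classFor(int size) {
        for (int i = 0; i < SIZE_CLASSES.length; i++)
            if (SIZE_CLASSES[i] >= size) return i;
        throw new IllegalArgumentException("use large-object path");
    }
}
```

A 13-byte request lands in the 16-byte class, wasting 3 bytes of internal fragmentation; the geometric progression on the next slide bounds that waste.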
Fragmentation on a Page • Internal: wasted space at end of object • Page-internal: wasted space at end of page • External: blocks needed for other size
Limiting Internal Fragmentation • Choose page size P and block sizes s_k such that • s_k = s_{k−1}(1 + ρ) • s_max = Pρ • Then • Internal and page-internal fragmentation < ρ • Example: • P = 16 KB, ρ = 1/8, s_max = 2 KB • Internal and page-internal fragmentation < 1/8
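The progression can be made concrete with a few lines of code (the helper is mine; rounding each size up to an integer can nudge an individual ratio slightly above ρ, which the slide's idealized bound ignores):

```java
// Generate the geometric size-class progression s_k = s_{k-1} * (1 + rho),
// starting at `first` and stopping at `max` (s_max = P * rho).
class SizeClasses {
    static int[] sizes(int first, double rho, int max) {
        java.util.List<Integer> out = new java.util.ArrayList<>();
        double s = first;
        while (s <= max) {
            out.add((int) Math.ceil(s));  // round up to a whole byte count
            s *= 1 + rho;
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

With ρ = 1/8 and s_max = 2 KB, as on the slide, the classes start 16, 18, 21, … — each roughly 12.5% larger than the last, so an object wastes at most about an eighth of its size.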
Write Barrier: Snapshot-at-start • Problem: mutator changes object graph • Solution: write barrier prevents lost objects • Logically, collector takes atomic snapshot • Objects live at snapshot will not be collected • Write barrier saves overwritten pointers [Yuasa] • Write buffer must be drained periodically
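A Yuasa-style snapshot barrier can be sketched in a few lines (the class, field, and buffer here are illustrative, not the Metronome's implementation): before a pointer store while collection is active, the overwritten pointer is logged so the snapshot-live object it referenced still gets traced.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a snapshot-at-the-beginning (Yuasa) write barrier.
class WriteBarrier {
    static class Obj { Obj field; }

    static boolean collecting = true;                 // true while GC is active
    static final Deque<Obj> writeBuffer = new ArrayDeque<>();

    // All pointer stores go through this barrier.
    static void storeField(Obj target, Obj newValue) {
        Obj old = target.field;
        if (collecting && old != null) writeBuffer.push(old); // save overwritten pointer
        target.field = newValue;
    }
}
```

The collector then treats the buffered pointers as extra roots, which is exactly why the slide notes the write buffer must be drained periodically.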
Read Barrier: To-space Invariant • Problem: Collector moves objects (defragmentation) • and mutator is finely interleaved • Solution: read barrier ensures consistency • Each object contains a forwarding pointer [Brooks] • Read barrier unconditionally forwards all pointers • Mutator never sees old versions of objects • [Figure: before/after — pointers X, Y, Z reach from-space object A; after the move, A's forwarding pointer leads to the to-space copy A′]
Read Barrier Optimizations • Barrier variants: when to redirect • Lazy: easier for collector (no fixup) • Eager: better for performance (loop over a[i]) • Standard optimizations: CSE, code motion • Problem: pointers can be null • Augment read barrier for GetField(x,offset): tmp = x[offset]; return tmp == null ? null : tmp[redirect] • Optimize by null-check combining, sinking
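The forwarding scheme behind that GetField snippet can be sketched as follows (class and method names are illustrative, not the Metronome's): every object carries a forwarding pointer that initially refers to itself, and every read goes through it.

```java
// Sketch of a Brooks-style forwarding pointer and read barrier.
class ReadBarrier {
    static class Obj {
        Obj forward = this;   // self-referential until the object is moved
        int value;
        Obj(int v) { value = v; }
    }

    // The read barrier: unconditionally follow the forwarding pointer,
    // with the null check the slide's GetField snippet adds.
    static Obj redirect(Obj x) {
        return x == null ? null : x.forward;
    }

    // The collector installs the pointer when it copies an object.
    static void move(Obj fromSpaceObj, Obj toSpaceCopy) {
        fromSpaceObj.forward = toSpaceCopy;
    }
}
```

Because the redirect is unconditional, its cost is a single dependent load, which is what makes the CSE and code-motion optimizations above worthwhile.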
Read Barrier Results • Conventional wisdom: read barriers too slow • Previous studies: 20-40% overhead [Zorn,Nielsen]
Incremental Mark-Sweep • Mark/sweep finely interleaved with mutator • Write barrier ensures no lost objects • Must treat objects in write buffer as roots • Read barrier ensures consistency • Marker always traces correct object • With barriers, interleaving is simple
Pointer Fixup During Mark • When can a moved object be freed? • When there are no more pointers to it • Mark phase updates pointers • Redirects forwarded pointers as it marks them • Object moved in collection n can be freed: • At the end of mark phase of collection n+1
Selective Defragmentation • When do we move objects? • When there is fragmentation • Usually, program exhibits locality of size • Dead objects are re-used quickly • Defragment either when • Dead objects are not re-used for a GC cycle • Free pages fall below limit for performing a GC • In practice: we move 2-3% of data traced • Major improvement over copying collector
Arraylets • Large arrays create problems • Fragment memory space • Cannot be moved in a short, bounded time • Solution: break large arrays into arraylets • Access via indirection; move one arraylet at a time • Optimizations • Type-dependent code optimized for contiguous case • Opportunistic contiguous allocation • [Figure: array A split into arraylets A1, A2, A3]
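The access path can be sketched as a spine of pointers to fixed-size chunks (the class and the chunk size are my choices for illustration): indexing costs one extra indirection, and each chunk can be relocated independently in bounded time.

```java
// Sketch of an arraylet-structured int array: a spine of pointers
// to fixed-size chunks instead of one contiguous block.
class ArrayletArray {
    static final int CHUNK = 4;          // elements per arraylet (illustrative)
    final int[][] spine;                 // spine[i] points to chunk i
    final int length;

    ArrayletArray(int length) {
        this.length = length;
        spine = new int[(length + CHUNK - 1) / CHUNK][CHUNK];
    }

    int get(int i)         { return spine[i / CHUNK][i % CHUNK]; }
    void set(int i, int v) { spine[i / CHUNK][i % CHUNK] = v; }
}
```

When a chunk must move, the collector copies CHUNK elements and updates one spine slot, so the pause contributed by a single array is bounded regardless of the array's total size.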
Work-based Scheduling • Trigger the collector to collect C_W bytes • Whenever the mutator allocates Q_W bytes • [Figure: MMU vs. window size (log scale) and space (MB) vs. time under work-based scheduling]
Time-based Scheduling • Trigger collector to run for C_T seconds • Whenever mutator runs for Q_T seconds • [Figure: MMU vs. window size (log scale) and space (MB) vs. time under time-based scheduling]
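Under this scheme the steady-state utilization follows directly from the two quanta: with the mutator running Q_T out of every Q_T + C_T seconds, utilization settles at Q_T / (Q_T + C_T) regardless of allocation bursts. A one-line sketch (the helper name is mine):

```java
// Steady-state mutator utilization under time-based scheduling:
// the mutator gets Q_T seconds out of every Q_T + C_T.
class TimeBasedSchedule {
    static double utilization(double qT, double cT) {
        return qT / (qT + cT);
    }
}
```

For example, equal 10 ms quanta yield exactly 50% utilization, matching the MMU the results slides report; a work-based trigger offers no such closed-form guarantee, because its pause pattern tracks the allocation rate.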
Parameterization • Mutator characterized by its allocation rate a*(ΔGC) and maximum live memory m • Collector characterized by its collection rate R • Tuner takes these plus a real-time interval Δt and derives the achievable CPU utilization u and maximum used memory s
Pause time distribution: javac • [Figure: pause-time histograms for time-based vs. work-based scheduling, both on a 12 ms scale]
Utilization vs Time: javac • [Figure: mutator utilization (0.0–1.0) over time for time-based vs. work-based scheduling, with a reference line at 0.45]
Conclusions • The Metronome provides true real-time GC • First collector to do so without major sacrifice • Short pauses (4 ms) • High MMU during collection (50%) • Low memory consumption (2x max live) • Critical features • Time-based scheduling • Hybrid, mostly non-copying approach • Integration w/compiler
Future Work • Two main goals: • Reduce pause time, memory requirement • Increase predictability • Pause time: • Expect sub-millisecond using current techniques • For tens of microseconds, need an interrupt-based approach • Predictability • Studying parameterization of collector • Good research area