The Metronome: A Hard Real-time Garbage Collector David F. Bacon Perry Cheng V.T. Rajan IBM T.J. Watson Research Center
The Problem • Real-time systems growing in importance • Many CPUs in a BMW: “80% of innovation in SW” • Programmers left behind • Still using assembler and C • Lack productivity advantages of Java • Result • Complexity • Low reliability or very high validation cost
Problem Domain • Hard real-time garbage collection • Uniprocessor • Multiprocessors very rare in real-time systems • Complication: collector must be finely interleaved • Simplification: memory model easier to program • No truly concurrent operations • Sequentially consistent
Metronome Project Goals • Make GC feasible for hard real-time systems • Provide simple application interface • Develop technology that is efficient: • Throughput, Space comparable to stop-the-world • BREAK THE MILLISECOND BARRIER • While providing even CPU utilization
Outline • What is “Real-Time” Garbage Collection? • Problems with Previous Work • Overview of the Metronome • Scheduling • Empirical Results • Conclusions
3 Uniprocessor GC Types • [Figure: execution timelines for stop-the-world (STW), incremental (Inc), and real-time (RT) collectors, showing where GC #1 and GC #2 run over time]
Utilization vs Time: 2 s Window • [Figure: mutator utilization over time for the STW, Inc, and RT collectors, measured with a 2 s window]
Utilization vs Time: 0.4 s Window • [Figure: the same utilization curves measured with a 0.4 s window]
Minimum Mutator Utilization • [Figure: MMU as a function of window size for the STW, Inc, and RT collectors]
What is Real-time? It Is… • Maximum pause time < required response • CPU Utilization sufficient to accomplish task • Measured with MMU • Memory requirement < resource limit
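Since MMU is the measure used throughout the talk, a brute-force sketch of computing it may clarify the definition (the class, the step size, and the interval encoding are my own, not from the talk): given the collector's pauses as intervals, MMU(w) is the worst-case fraction of any window of length w that is left to the mutator.

```java
// Sketch: compute MMU(w) from a list of GC pauses.
// pauses[i] = {start, end} within [0, total); all times in seconds.
class Mmu {
    static double mmu(double[][] pauses, double total, double w) {
        double worst = 1.0;
        // Slide the window in small steps; a real tool would only need
        // to examine windows aligned with pause boundaries.
        for (double t = 0; t + w <= total; t += 0.001) {
            double gc = 0;
            for (double[] p : pauses) {
                double lo = Math.max(t, p[0]), hi = Math.min(t + w, p[1]);
                if (hi > lo) gc += hi - lo;   // pause time inside this window
            }
            worst = Math.min(worst, (w - gc) / w);
        }
        return worst;
    }
}
```

For example, a single 100 ms pause in a 1 s run yields MMU(0.5 s) = 0.8: the worst half-second window loses 100 ms to the collector.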
No Compaction • Hypothesis: fragmentation not a problem • Can use avoidance and coalescing [Johnstone] • Non-moving incremental collection is simpler • Problem: long-running applications • Reboot your car every 500 miles?
Copying Collection • Idea: copy into to-space concurrently • Compaction is part of basic operation • Problem: space usage • 2 semi-spaces plus space to mutate during GC • Requires 4-8 X max live data in practice
Work-Based Scheduling • The Baker fallacy: • “A real-time list processing system is one in which the time required by the elementary list processing operations…is bounded by a small constant” [Baker’78] • Implicitly assumes GC work done in mutator • What does “small constant” mean? • Typically, constant is not so small • And there is variability (fault rate analogy)
Is it real-time? Yes • Maximum pause time < 4 ms • MMU > 50% ±2% • Memory requirement < 2 X max live
Components of the Metronome • Segregated free list allocator • Geometric size progression limits internal fragmentation • Write barrier: snapshot-at-the-beginning [Yuasa] • Read barrier: to-space invariant [Brooks] • New techniques with only 4% overhead • Incremental mark-sweep collector • Mark phase fixes stale pointers • Selective incremental defragmentation • Moves < 2% of traced objects • Arraylets: bound fragmentation, large object ops • Time-based scheduling
Segregated Free List Allocator • Heap divided into fixed-size pages • Each page divided into fixed-size blocks • Objects allocated in smallest block that fits (e.g., block sizes 12, 16, 24)
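A minimal sketch of the segregated free-list idea (the class, the particular size classes, and the block encoding are illustrative, not the Metronome's): one free list per size class, and an allocation is served from the smallest class whose block fits.

```java
import java.util.ArrayDeque;

// Sketch of a segregated free list: each size class keeps its own
// list of free blocks (here just block addresses as integers).
class SegregatedFreeList {
    static final int[] SIZE_CLASSES = {12, 16, 24, 32, 48, 64};

    @SuppressWarnings("unchecked")
    final ArrayDeque<Integer>[] free = new ArrayDeque[SIZE_CLASSES.length];

    SegregatedFreeList() {
        for (int i = 0; i < free.length; i++) free[i] = new ArrayDeque<>();
    }

    // Index of the smallest size class that can hold `size` bytes.
    static int classFor(int size) {
        for (int i = 0; i < SIZE_CLASSES.length; i++)
            if (SIZE_CLASSES[i] >= size) return i;
        throw new IllegalArgumentException("use large-object path");
    }
}
```

A 13-byte request lands in the 16-byte class, wasting 3 bytes of internal fragmentation; the geometric progression on the next slide bounds that waste.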
Fragmentation on a Page • Internal: wasted space at end of object • Page-internal: wasted space at end of page • External: blocks needed for other size
Limiting Internal Fragmentation • Choose page size P and block sizes s_k such that • s_k = s_{k−1}(1 + ρ) • s_max = Pρ • Then • Internal and page-internal fragmentation < ρ • Example: • P = 16 KB, ρ = 1/8, s_max = 2 KB • Internal and page-internal fragmentation < 1/8
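The progression can be made concrete with a few lines of code (the helper is mine; rounding each size up to an integer can nudge an individual ratio slightly above ρ, which the slide's idealized bound ignores):

```java
// Generate the geometric size-class progression s_k = s_{k-1} * (1 + rho),
// starting at `first` and stopping at `max` (s_max = P * rho).
class SizeClasses {
    static int[] sizes(int first, double rho, int max) {
        java.util.List<Integer> out = new java.util.ArrayList<>();
        double s = first;
        while (s <= max) {
            out.add((int) Math.ceil(s));  // round up to a whole byte count
            s *= 1 + rho;
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

With ρ = 1/8 and s_max = 2 KB, as on the slide, the classes start 16, 18, 21, … — each roughly 12.5% larger than the last, so an object wastes at most about an eighth of its size.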
Write Barrier: Snapshot-at-start • Problem: mutator changes object graph • Solution: write barrier prevents lost objects • Logically, collector takes atomic snapshot • Objects live at snapshot will not be collected • Write barrier saves overwritten pointers [Yuasa] • Write buffer must be drained periodically
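A Yuasa-style snapshot barrier can be sketched in a few lines (the class, field, and buffer here are illustrative, not the Metronome's implementation): before a pointer store while collection is active, the overwritten pointer is logged so the snapshot-live object it referenced still gets traced.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a snapshot-at-the-beginning (Yuasa) write barrier.
class WriteBarrier {
    static class Obj { Obj field; }

    static boolean collecting = true;                 // true while GC is active
    static final Deque<Obj> writeBuffer = new ArrayDeque<>();

    // All pointer stores go through this barrier.
    static void storeField(Obj target, Obj newValue) {
        Obj old = target.field;
        if (collecting && old != null) writeBuffer.push(old); // save overwritten pointer
        target.field = newValue;
    }
}
```

The collector then treats the buffered pointers as extra roots, which is exactly why the slide notes the write buffer must be drained periodically.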
Read Barrier: To-space Invariant • Problem: Collector moves objects (defragmentation) • and mutator is finely interleaved • Solution: read barrier ensures consistency • Each object contains a forwarding pointer [Brooks] • Read barrier unconditionally forwards all pointers • Mutator never sees old versions of objects • [Figure: before/after — pointers X, Y, Z reach from-space object A; after the move, A's forwarding pointer leads to the to-space copy A′]
Read Barrier Optimizations • Barrier variants: when to redirect • Lazy: easier for collector (no fixup) • Eager: better for performance (loop over a[i]) • Standard optimizations: CSE, code motion • Problem: pointers can be null • Augment read barrier for GetField(x,offset): tmp = x[offset]; return tmp == null ? null : tmp[redirect] • Optimize by null-check combining, sinking
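The forwarding scheme behind that GetField snippet can be sketched as follows (class and method names are illustrative, not the Metronome's): every object carries a forwarding pointer that initially refers to itself, and every read goes through it.

```java
// Sketch of a Brooks-style forwarding pointer and read barrier.
class ReadBarrier {
    static class Obj {
        Obj forward = this;   // self-referential until the object is moved
        int value;
        Obj(int v) { value = v; }
    }

    // The read barrier: unconditionally follow the forwarding pointer,
    // with the null check the slide's GetField snippet adds.
    static Obj redirect(Obj x) {
        return x == null ? null : x.forward;
    }

    // The collector installs the pointer when it copies an object.
    static void move(Obj fromSpaceObj, Obj toSpaceCopy) {
        fromSpaceObj.forward = toSpaceCopy;
    }
}
```

Because the redirect is unconditional, its cost is a single dependent load, which is what makes the CSE and code-motion optimizations above worthwhile.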
Read Barrier Results • Conventional wisdom: read barriers too slow • Previous studies: 20-40% overhead [Zorn,Nielsen]
Incremental Mark-Sweep • Mark/sweep finely interleaved with mutator • Write barrier ensures no lost objects • Must treat objects in write buffer as roots • Read barrier ensures consistency • Marker always traces correct object • With barriers, interleaving is simple
Pointer Fixup During Mark • When can a moved object be freed? • When there are no more pointers to it • Mark phase updates pointers • Redirects forwarded pointers as it marks them • Object moved in collection n can be freed: • At the end of mark phase of collection n+1
Selective Defragmentation • When do we move objects? • When there is fragmentation • Usually, program exhibits locality of size • Dead objects are re-used quickly • Defragment either when • Dead objects are not re-used for a GC cycle • Free pages fall below limit for performing a GC • In practice: we move 2-3% of data traced • Major improvement over copying collector
Arraylets • Large arrays create problems • Fragment memory space • Cannot be moved in a short, bounded time • Solution: break large arrays into arraylets • Access via indirection; move one arraylet at a time • Optimizations • Type-dependent code optimized for contiguous case • Opportunistic contiguous allocation • [Figure: array A split into arraylets A1, A2, A3]
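The access path can be sketched as a spine of pointers to fixed-size chunks (the class and the chunk size are my choices for illustration): indexing costs one extra indirection, and each chunk can be relocated independently in bounded time.

```java
// Sketch of an arraylet-structured int array: a spine of pointers
// to fixed-size chunks instead of one contiguous block.
class ArrayletArray {
    static final int CHUNK = 4;          // elements per arraylet (illustrative)
    final int[][] spine;                 // spine[i] points to chunk i
    final int length;

    ArrayletArray(int length) {
        this.length = length;
        spine = new int[(length + CHUNK - 1) / CHUNK][CHUNK];
    }

    int get(int i)         { return spine[i / CHUNK][i % CHUNK]; }
    void set(int i, int v) { spine[i / CHUNK][i % CHUNK] = v; }
}
```

When a chunk must move, the collector copies CHUNK elements and updates one spine slot, so the pause contributed by a single array is bounded regardless of the array's total size.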
Work-based Scheduling • Trigger the collector to collect C_W bytes • Whenever the mutator allocates Q_W bytes • [Figure: MMU vs. window size (log scale) and space (MB) vs. time under work-based scheduling]
Time-based Scheduling • Trigger collector to run for C_T seconds • Whenever mutator runs for Q_T seconds • [Figure: MMU vs. window size (log scale) and space (MB) vs. time under time-based scheduling]
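Under this scheme the steady-state utilization follows directly from the two quanta: with the mutator running Q_T out of every Q_T + C_T seconds, utilization settles at Q_T / (Q_T + C_T) regardless of allocation bursts. A one-line sketch (the helper name is mine):

```java
// Steady-state mutator utilization under time-based scheduling:
// the mutator gets Q_T seconds out of every Q_T + C_T.
class TimeBasedSchedule {
    static double utilization(double qT, double cT) {
        return qT / (qT + cT);
    }
}
```

For example, equal 10 ms quanta yield exactly 50% utilization, matching the MMU the results slides report; a work-based trigger offers no such closed-form guarantee, because its pause pattern tracks the allocation rate.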
Parameterization • Mutator characterized by its allocation rate a*(ΔGC) and maximum live memory m • Collector characterized by its collection rate R • Tuner takes these plus a real-time interval Δt and derives the achievable CPU utilization u and maximum used memory s
Pause time distribution: javac • [Figure: pause-time histograms for time-based vs. work-based scheduling, both on a 12 ms scale]
Utilization vs Time: javac • [Figure: mutator utilization (0.0–1.0) over time for time-based vs. work-based scheduling, with a reference line at 0.45]
Conclusions • The Metronome provides true real-time GC • First collector to do so without major sacrifice • Short pauses (4 ms) • High MMU during collection (50%) • Low memory consumption (2x max live) • Critical features • Time-based scheduling • Hybrid, mostly non-copying approach • Integration w/compiler
Future Work • Two main goals: • Reduce pause time, memory requirement • Increase predictability • Pause time: • Expect sub-millisecond using current techniques • For tens of microseconds, need an interrupt-based approach • Predictability • Studying parameterization of collector • Good research area