Tax-and-Spend: Democratic Scheduling for Real-time Garbage Collection Auerbach, Bacon, Cheng, Grove IBM Research Biron, Gracie, Micic, Sciampacone IBM SWG McCloskey U.C. Berkeley
Special thanks to the authors for sharing the slides they used at EMSOFT '08.

Metronome Project Manifesto
Bring the productivity, reliability, security, and portability advantages of modern object-oriented languages to the construction of complex real-time systems.

Metronome Project Activities
Real-time Garbage Collection
  – Metronome (IBM WebSphere Real-Time)
  – Metronome-TS
Performance understanding tools
  – TuningFork (sourceforge.net)
Programming Models
  – Eventrons, Exotasks, Flexotasks (Salzburg, Purdue, EPFL)
Testbed applications
  – Harmonicon; Javiator (Salzburg)

Real-time Garbage Collection
Garbage Collection
  – Automatic memory management
    • Programmer only allocates memory
    • GC automatically recovers unreachable memory
  – Productivity, Reliability, Security
  – Rich variety of GC algorithms and approaches
Real-Time Garbage Collection
  – Provides time and space bounds
  – Not just “short” GC pauses

Real-time GC in the Real World
Domains
  – Defense systems (USN Zumwalt-class destroyer)
  – Telecommunications (SIP)
  – Finance (stock trading)
Vendors
  – IBM: WebSphere Real-Time (Metronome)
  – Sun: Java RTS
  – Azul Systems
  – BEA: WebLogic Real-Time

Why More RTGC Research?
Expanding scope of real-time applications
  – Varying application characteristics
    • Classic periodic systems
    • Queue-based systems
    • Adaptive, interactive, ...
    • Question for discussion: what are these systems?
  – Varying operating environments
    • OS functionality (RTOS? RT Linux? Stock Unix?)
    • Uniprocessor vs. multiprocessor
    • Dedicated vs. multi-programmed workloads
No existing system robustly handles the entire space of combinations

Central Issue: Scheduling
Scheduling problem
  – When to do GC work?
  – How much GC work to do at a time?
Challenges
  – Complex global invariants & data structures
  – Entire GC cycle must complete before space is reclaimed
  – Work required for GC cycle can be unpredictable
  – Scheduling just enough work to ensure completion

Agenda
Tax-and-Spend Scheduling
  – Slack-based
  – Tax-based
  – Tax-and-Spend
From Metronome to Metronome-TS
Empirical results
Conclusions

Slack-based [Henriksson]
GC runs only during “slack” periods
  • No runnable “critical” application threads
Requires
  • Concurrent GC algorithm
  • Programmer identification of critical threads
Assessment
  ✓ Familiar real-time systems paradigm
  ✓ Can exploit excess capacity and SMP systems
  ✓ Critical threads run with minimal GC interference
  x Programmer must identify critical threads
  x Catastrophic failure when insufficient slack or overload

Tax-Based
Interrupt application to perform GC work
Two taxation schemes
  – Work-based [Baker]
  – Time-based [Bacon et al.] (Metronome)
Both schemes require highly incremental GC
  – GC work broken into small slices
  – 100s or 1000s of slices in a single GC cycle

Work-Based Taxation
For each N units of allocation work done by the application, perform c*N units of GC work
Assessment
  ✓ Provable space bounds: GC will complete in time
  x Highly variable effective pause times
  x Unable to exploit excess capacity to reduce tax

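The work-based rule can be sketched in a few lines. This is only an illustration, not Baker's algorithm or the Metronome code: the class name `WorkBasedTax`, the abstract "units", and the idea of returning the GC work charged per call are all assumptions made for the example.

```python
class WorkBasedTax:
    """Charge c*N units of GC work for every N units of allocation (hypothetical sketch)."""

    def __init__(self, n, c):
        self.n = n          # allocation quantum N
        self.c = c          # tax multiplier c
        self.untaxed = 0    # allocation not yet taxed

    def allocate(self, units):
        """Record an allocation and return the GC work the caller must do now."""
        self.untaxed += units
        gc_work = 0
        # One tax payment of c*N for each full quantum of N allocated.
        while self.untaxed >= self.n:
            self.untaxed -= self.n
            gc_work += self.c * self.n
        return gc_work


tax = WorkBasedTax(n=100, c=2)
print(tax.allocate(250))  # a burst of allocation is taxed heavily at once
print(tax.allocate(10))   # a small allocation may pay nothing
```

Because the tax follows the allocation pattern rather than the clock, a bursty allocator sees long effective pauses and a quiet one sees none, which is the variability noted in the assessment.
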
Time-Based Taxation (MMU)
For every N time units the application runs, do N/k time units of GC work
Requires accurate, low-overhead OS timers
Assessment
  ✓ Predictable scheduling and pause times
  ✓ Provable worst-case time/space bounds
  x Unable to exploit excess capacity to reduce tax

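Under the time-based rule the application gets N out of every N + N/k time units, so its long-run utilization is k/(k+1). MMU (minimum mutator utilization) asks for the worst case over every window of a given size. The sketch below builds an idealized schedule and measures its MMU; the function names and the perfectly regular schedule are assumptions for illustration, not how Metronome measures MMU.

```python
def time_based_schedule(n, k, total):
    """Idealized schedule: run the application for n, then GC for n/k, repeated."""
    pauses, t = [], 0
    while t + n < total:
        start = t + n
        end = min(start + n / k, total)
        pauses.append((start, end))
        t = end
    return pauses


def mmu(pauses, total, window):
    """Minimum mutator utilization over all windows of the given size.

    pauses: non-overlapping (start, end) GC intervals inside [0, total].
    """
    def gc_time(t):
        # Total overlap between the window [t, t + window] and the GC pauses.
        return sum(max(0, min(e, t + window) - max(s, t)) for s, e in pauses)

    # GC time in a window is piecewise linear in the window start, so the
    # worst case occurs where a window edge lines up with a pause edge.
    candidates = {0, total - window}
    for s, e in pauses:
        candidates.update((s, e, s - window, e - window))
    starts = [t for t in candidates if 0 <= t <= total - window]
    worst = max(gc_time(t) for t in starts)
    return (window - worst) / window


pauses = time_based_schedule(n=4, k=4, total=20)
print(mmu(pauses, total=20, window=5))  # one full period: k/(k+1)
print(mmu(pauses, total=20, window=2))  # shorter windows see lower utilization
```

The smaller the window, the harder the target: a window no longer than a single GC slice can be entirely consumed by GC, which is why MMU is always quoted together with its window size.
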
Tax-and-Spend Scheduling
Per-thread time-based taxation
  – Each application thread has a tax rate (MMU target)
  – Time is per-thread CPU time
Tax credits
  – Created by low-priority GC background threads
  – Reduce the effective application tax rate
Simple tax laws work well in practice
  – Same tax rate for all application threads
  – Tax credits shared equally among threads

Collecting Taxes
Where to collect taxes
  – Allocation slow paths (covers most applications)
  – Time-triggered yield points (when not allocating)
When an application thread owes taxes
  – Attempt to pay tax by withdrawing credit from bank
Success
  – GC is “ahead” because background threads have deposited credits
  – Immediately resume application work
Failure
  – If partial credit, do reduced GC work quantum
  – If no credit, do full GC work quantum

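The withdraw-or-work decision can be sketched as follows. This is a minimal model under stated assumptions: `CreditBank`, `pay_tax`, and the use of a lock-protected integer counter are hypothetical names and choices for this example, not the data structures used in Metronome-TS.

```python
import threading


class CreditBank:
    """Shared pool of GC work credits (hypothetical sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._credits = 0

    def deposit(self, units):
        """Called by background GC threads after doing `units` of work."""
        with self._lock:
            self._credits += units

    def withdraw(self, wanted):
        """Take up to `wanted` credits; return how many were granted."""
        with self._lock:
            granted = min(wanted, self._credits)
            self._credits -= granted
            return granted


def pay_tax(bank, quantum, do_gc_work):
    """Pay one tax quantum: use banked credit first, do the rest as GC work."""
    remaining = quantum - bank.withdraw(quantum)
    if remaining > 0:
        do_gc_work(remaining)   # partial or full quantum done by this thread
    return remaining            # 0 means the thread resumed immediately


bank = CreditBank()
bank.deposit(150)               # background threads got ahead
work_log = []
for _ in range(3):
    pay_tax(bank, 100, work_log.append)
print(work_log)                 # GC work the application thread had to do
```

The first payment is covered entirely by credit, the second only partly, and the third not at all, matching the success, partial-credit, and no-credit cases on the slide.
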
Metronome vs. Metronome-TS
Metronome’s tax-based scheduler is global
  – Monolithic policy for entire JVM
Metronome-TS schedules GC per-thread
  – Taxation concurrent and asynchronous
  – Different threads can have different tax rates
  – “Critical” threads can run with minimal GC interference
  – Background GC threads exploit excess CPU capacity

Requirements for Tax-and-Spend
Operating system
  – Accurate per-thread CPU timer
    • Standard on recent Linux kernels (e.g., RHEL 5)
GC Algorithm
  – Fully concurrent
  – Highly incremental
  – Parallel
  – GC work done on application and GC threads

Agenda
Tax-and-Spend Scheduling
From Metronome to Metronome-TS
  – Distributed Agreement
  – Ensuring Progress
Empirical results
Conclusions

Distributed Agreement
GC algorithms require global agreement
  – GC cycle started/completed
  – Enable/disable write barriers
  – Trace completed (all live objects found)
  – Other instances induced by Java semantics
Metronome
  – Not concurrent; uses synchronous agreement
Metronome-TS
  – Fully concurrent; needs asynchronous agreement

Ragged Epochs
A single monotonic global epoch number
Per-thread local epochs
  – Always less than or equal to global epoch
“Each time” a thread reaches a safe point:
  – It reads the global epoch
  – It uses the global epoch number as its new local epoch
Agreement protocol
  – A thread modifies shared global state, atomically increments the global epoch, and remembers the new value
  – All local epochs ≥ remembered value implies agreement

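The protocol above can be modeled in a short sketch. It is an illustration only: the class `RaggedEpochs`, its method names, and the lock standing in for an atomic increment are assumptions for this example, not the Metronome-TS implementation.

```python
import threading


class RaggedEpochs:
    """Toy model of ragged-epoch agreement (hypothetical sketch)."""

    def __init__(self, thread_ids):
        self._lock = threading.Lock()
        self.global_epoch = 0
        self.local_epoch = {tid: 0 for tid in thread_ids}

    def safe_point(self, tid):
        """A thread at a safe point adopts the current global epoch."""
        self.local_epoch[tid] = self.global_epoch

    def publish(self, modify_shared_state):
        """Change shared state, bump the global epoch, return the new value."""
        with self._lock:            # stands in for an atomic increment
            modify_shared_state()
            self.global_epoch += 1
            return self.global_epoch

    def agreed(self, remembered):
        """True once every thread has passed a safe point since the change."""
        return all(e >= remembered for e in self.local_epoch.values())


flags = {"write_barrier": False}
epochs = RaggedEpochs(["t1", "t2"])
target = epochs.publish(lambda: flags.update(write_barrier=True))
epochs.safe_point("t1")
print(epochs.agreed(target))   # t2 has not yet observed the change
epochs.safe_point("t2")
print(epochs.agreed(target))   # every thread has now seen it
```

No thread ever blocks waiting for another: the initiator simply polls `agreed` later, which is what makes the agreement asynchronous and the epochs "ragged".
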
Last Man Out
Agreement only between threads doing GC “right now”
  – What type of work should they be doing?
  – Can they transition from one phase of GC to the next?
  – Ragged Epoch is overkill (involves all threads in system)
GC phase & worker count kept in one machine word
  – Phase encodes what work to do
  – Worker enters: atomically increment count
  – Worker exits: atomically decrement count
  – Phase change:
    • Last worker out: atomically changes phase & count

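Packing the phase and the worker count into one word lets a single compare-and-swap update both. The sketch below models that idea under stated assumptions: the 16-bit count field, the class `PhaseWord`, and the lock-based `_cas` that stands in for a hardware compare-and-swap are all choices made for this example. It also assumes the last worker out always knows the next phase, which simplifies the real decision of whether the current phase's work is finished.

```python
import threading

COUNT_BITS = 16                       # assumed width of the worker-count field
COUNT_MASK = (1 << COUNT_BITS) - 1


class PhaseWord:
    """GC phase and active-worker count packed in one integer (hypothetical sketch)."""

    def __init__(self, phase):
        self._lock = threading.Lock()
        self._word = phase << COUNT_BITS

    def _cas(self, expected, new):
        """Simulated compare-and-swap on the packed word."""
        with self._lock:
            if self._word != expected:
                return False
            self._word = new
            return True

    def phase(self):
        return self._word >> COUNT_BITS

    def count(self):
        return self._word & COUNT_MASK

    def enter(self):
        """Join the current phase; return the phase the worker should work on."""
        while True:
            old = self._word
            if self._cas(old, old + 1):
                return old >> COUNT_BITS

    def exit(self, next_phase):
        """Leave; the last worker out switches the phase and zeroes the count."""
        while True:
            old = self._word
            last = (old & COUNT_MASK) == 1
            new = (next_phase << COUNT_BITS) if last else old - 1
            if self._cas(old, new):
                return last


word = PhaseWord(phase=1)
word.enter()
word.enter()
print(word.exit(next_phase=2), word.phase())  # not last: phase unchanged
print(word.exit(next_phase=2), word.phase())  # last man out: phase advances
```

Because the phase and count change in one atomic step, a worker entering concurrently either joins the old phase before the last exit or sees the new phase, never a mixture.
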
Last Man Out: Special Case for Marking Phase
The write barriers of different worker threads may still hold objects in their write buffers
A single thread is responsible for declaring the end of the marking phase
  – Uses the ragged epoch mechanism to detect whether all write buffers are empty
Questions for discussion
  – What other approaches are possible?
  – What are soft and weak references, string interning, and finalization in Java?
  – How do they affect garbage collection?

Ensuring Progress
Symptoms
  – Threads may not execute safe points in a timely fashion, stalling Ragged Epoch
  – Threads may get stuck while doing GC work
Cause
  – OS scheduling: multi-programming, priorities
Solution
  – Detect laggard threads and boost their priority
  – In effect, priority inheritance on a logical resource

Agenda
Tax-and-Spend Scheduling
From Metronome to Metronome-TS
Empirical results
  – Methodology
  – SPECjbb2000
  – SPECjvm98, DaCapo
  – Critical section deferral
Conclusions

Methodology
Implemented in IBM's J9 VM
  – Identical baseline for Metronome & Metronome-TS
LS-41 (8-way AMD-based blade)
  – Running RHEL5-MRG (Real-Time Linux)
  – TuningFork: instrumented application, JVM, Linux kernel
    • Enables detailed analysis of MMU, pauses, etc.
  – Taskset to segregate instrumentation to 1 CPU
SPECjbb2000, SPECjvm98, DaCapo

Metronome vs. Metronome-TS
Metronome-TS uniformly better than Metronome
  – 2.5x lower max transaction time
  – 1.6x lower 99.999% transaction time
  – 20% higher throughput

Exploiting Excess Capacity
Metronome-TS smoothly and robustly exploits excess CPU capacity
  – 15% throughput improvement with background threads
  – Experiments with normal and real-time “hamsters” in paper: under load, no degradation from background threads

Summary: SPECjvm98 & DaCapo
18 programs: with/without background threads
  – MMU@4ms: 60% or better
  – Max GC pause: <400 microseconds
    • hsqldb, lusearch, xalan have longer pauses due to OS context switch during GC quantum
  – Background threads effectively offload GC work when system has excess CPU capacity
    • Greatly reduce median & std dev of GC pauses
    • Increase application throughput

Summary of Empirical Results
Better determinism than Metronome
  – Higher MMU with smaller window size
  – Even with stricter MMU definition (includes barrier and allocation slow paths as GC work)
  – Significant improvements in SPECjbb2000
    • 2.5x lower max transaction time
    • 1.6x lower 99.999% transaction time
Better throughput than Metronome

Summary of Contributions
Tax-and-Spend Scheduling
  – Combines desirable properties of Tax-based and Slack-based approaches
  – Unified paradigm that supports wide range of application types and operating environments
Metronome-TS implementation
  – Highly incremental, fully concurrent, parallel GC that supports all Java language features
  – Applied rigorous per-thread MMU metric
  – Protocols for distributed agreement

Discussion
  – What is the performance of Metronome-TS with a small number of cores, e.g., 2 cores?
  – The experiments used a large 12 GB memory. What is the performance on memory-constrained systems?
  – The overload condition is defined only as number of processors <= number of worker threads.
    • What is the utilization of each processor under overload?
    • Is memory overloaded as well?
    • What is the performance of each phase of GC under these conditions?