Tax-and-Spend: Democratic Scheduling for Real-time Garbage Collection

Auerbach, Bacon, Cheng, Grove (IBM Research); Biron, Gracie, Micic, Sciampacone (IBM SWG); McCloskey (U.C. Berkeley)

Presentation Transcript


  1. Tax-and-Spend: Democratic Scheduling for Real-time Garbage Collection Auerbach, Bacon, Cheng, Grove (IBM Research); Biron, Gracie, Micic, Sciampacone (IBM SWG); McCloskey (U.C. Berkeley). Special thanks to the authors for sharing the slides they used at EMSOFT '08.

  2. Metronome Project Manifesto Bring the productivity, reliability, security, and portability advantages of modern object oriented languages to the construction of complex real-time systems.

  3. Metronome Project Activities Real-time Garbage Collection – Metronome (IBM WebSphere Real-Time) – Metronome-TS Performance understanding tools – TuningFork (sourceforge.net) Programming Models – Eventrons, Exotasks, Flexotasks (Salzburg, Purdue, EPFL) Testbed applications – Harmonicon; Javiator (Salzburg)

  4. Real-time Garbage Collection Garbage Collection – Automatic memory management Programmer only allocates memory GC automatically recovers unreachable memory – Productivity, Reliability, Security – Rich variety of GC algorithms and approaches Real-Time Garbage Collection – Provides time and space bounds – Not just “short” GC pauses

  5. Real-time GC in the Real World Domains – Defense systems (USN Zumwalt-class destroyer) – Telecommunications (SIP) – Finance (stock trading) Vendors – IBM: WebSphere Real-Time (Metronome) – Sun: Java RTS – Azul Systems – BEA: WebLogic Real-Time

  6. Why More RTGC Research? Expanding scope of real-time applications – Varying application characteristics • Classic periodic systems • Queue-based systems • Adaptive, interactive, ... (What are these systems?) – Varying operating environments • OS functionality (RTOS? RT Linux? Stock Unix?) • Uni-processor vs. multi-processor • Dedicated vs. multi-programmed workloads No existing system robustly handles the entire space of combinations

  7. Central Issue: Scheduling Scheduling problem – When to do GC work? – How much GC work to do at a time? Challenges – Complex global invariants & data structures – Complete entire GC cycle before space reclaimed – Work required for GC cycle can be unpredictable – Scheduling just enough work to ensure completion

  8. Agenda Tax-and-Spend Scheduling – Slack-based – Tax-based – Tax-and-Spend From Metronome to Metronome-TS Empirical results Conclusions

  9. Slack-based [Henriksson] GC runs only during “slack” periods • No runnable “critical” application threads Requires • Concurrent GC algorithm • Programmer identification of critical threads Assessment: + Familiar real-time systems paradigm + Can exploit excess capacity and SMP systems + Critical threads run with minimal GC interference x Identification of critical threads x Catastrophic failure when insufficient slack or overload

  10. Tax-Based Interrupt application to perform GC work Two taxation schemes – Work-based [Baker] – Time-based [Bacon et al] (Metronome) Both schemes require highly incremental GC – GC work broken into small slices – 100s or 1000s of slices in a single GC cycle

  11. Work-Based Taxation For each N units of allocation work done by the application, perform c*N units of GC work Assessment: + Provable space bounds: GC will complete in time x Highly variable effective pause times x Unable to exploit excess capacity to reduce tax
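
The work-based rule on this slide can be illustrated with a minimal sketch. The unit of "work", the tax ratio `c = 0.5`, and the allocation sizes are hypothetical values chosen for illustration; they are not taken from the slides.

```python
def gc_work_owed(allocated_units: int, c: float) -> float:
    """GC work owed for `allocated_units` of allocation at tax ratio c (c*N rule)."""
    return c * allocated_units


def simulate(allocations, c):
    """Return the GC work charged at each allocation in a sequence."""
    return [gc_work_owed(n, c) for n in allocations]


# Hypothetical bursty allocation pattern: the GC work charged per
# allocation swings with the allocation size, which is one way to see
# why effective pause times are highly variable under this scheme.
charges = simulate([1, 1, 64, 1, 512], c=0.5)
print(charges)
```

Because the charge is tied to allocation rather than to time, a thread that allocates in bursts pays its tax in bursts.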

  12. Time-Based Taxation (MMU) For every N time units the application runs, do N/k time units of GC work Requires accurate low-overhead OS timers Assessment: + Predictable scheduling and pause times + Provable worst-case time/space bounds x Unable to exploit excess capacity to reduce tax
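
As a rough worked example of the N and N/k rule, the sketch below computes the long-run share of time left to the application. It is only the steady-state ratio implied by the rule; a real MMU (minimum mutator utilization) figure also depends on the window size and the GC quantum, which are not modelled here. The function names are hypothetical.

```python
from fractions import Fraction


def gc_time_owed(app_time: Fraction, k: int) -> Fraction:
    """GC time owed after the application has run for `app_time` (N/k rule)."""
    return app_time / k


def steady_state_utilization(k: int) -> Fraction:
    """Long-run fraction of time spent in the application: N / (N + N/k)."""
    n = Fraction(1)
    return n / (n + gc_time_owed(n, k))


# With k = 2 the application gets two thirds of the time in the long run.
print(steady_state_utilization(2))  # prints 2/3
```

The ratio simplifies to k / (k + 1), so a larger k means a lower tax and a higher share of time for the application.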

  13. Tax-and-Spend Scheduling Per-thread time-based taxation – Each application thread has tax rate (MMU target) – Time is per-thread CPU time Tax credits – Created by low-priority GC background threads – Reduce the effective application tax rate Simple tax laws work well in practice – Same tax rate for all application threads – Tax credits shared equally among threads

  14. Collecting taxes Where to collect taxes – Allocation slow paths (covers most applications) – Time-triggered yield points (when not allocating) When an application thread owes taxes – Attempt to pay tax by withdrawing credit from bank Success – GC is “ahead” due to background threads (deposit credits) – Immediately resume application work Failure – If partial credit, do reduced GC work quantum – If no credit, do full GC work quantum
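
The success, partial-credit, and no-credit cases on this slide can be sketched as follows. This is an illustrative model, not the Metronome-TS implementation: the `TaxBank` class, the integer "work units", and the use of a lock in place of atomic operations are all assumptions.

```python
import threading


class TaxBank:
    """Shared pool of GC credits, counted in hypothetical work units."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for atomic operations
        self._credits = 0

    def deposit(self, units: int) -> None:
        """Called by a background GC thread after doing `units` of GC work."""
        with self._lock:
            self._credits += units

    def withdraw(self, wanted: int) -> int:
        """Withdraw up to `wanted` units and return the amount granted."""
        with self._lock:
            granted = min(wanted, self._credits)
            self._credits -= granted
            return granted


def pay_tax(bank: TaxBank, quantum: int) -> int:
    """Return the GC work the application thread must still do itself."""
    credit = bank.withdraw(quantum)
    return quantum - credit  # 0 = resume at once, quantum = full GC quantum


bank = TaxBank()
bank.deposit(15)  # background threads got "ahead" by 15 units
owed = [pay_tax(bank, 10) for _ in range(3)]
print(owed)  # full credit, partial credit, no credit
```

In this run the first payment is covered entirely by credit, the second is covered for half, and the third finds the bank empty and does a full quantum.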

  15. Metronome vs Metronome-TS Metronome’s tax-based scheduler is global – Monolithic policy for entire JVM Metronome-TS schedules GC per-thread – Taxation concurrent and asynchronous – Different threads can have different tax rates – “Critical” threads can run with minimal GC interference – Background GC threads exploit excess CPU capacity

  16. Requirements for Tax-and-Spend Operating system – Accurate per-thread CPU timer (standard on recent Linux kernels, e.g. RHEL-5) GC Algorithm – Fully concurrent – Highly incremental – Parallel – GC work done on application and GC threads

  17. Agenda Tax-and-Spend Scheduling From Metronome to Metronome-TS Distributed Agreement Ensuring Progress Empirical results Conclusions

  18. Distributed Agreement GC algorithms require global agreement – GC cycle started/completed – Enable/disable write barriers – Trace completed (all live objects found) – Other instances induced by Java semantics Metronome – Not concurrent; uses synchronous agreement Metronome-TS – Fully concurrent; needs asynchronous agreement

  19. Ragged Epochs A single monotonic global epoch number Per thread local epochs – Always less than or equal to global epoch “Each time” a thread reaches a safe-point: – It reads from the global epoch – Uses global epoch number as its new local epoch Agreement protocol – A thread modifies shared global state, atomically increments global epoch and remembers the new value – All local epochs ≥ remembered value implies agreement
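
The agreement protocol described on this slide can be sketched in a few lines. This is a simplified model under stated assumptions: the `RaggedEpochs` class and its method names are hypothetical, threads are identified by plain strings, and a lock stands in for the atomic increment a real VM would use.

```python
import threading


class RaggedEpochs:
    """One global epoch plus a local epoch per thread."""

    def __init__(self, thread_ids):
        self._lock = threading.Lock()  # stands in for atomic operations
        self.global_epoch = 0
        self.local_epochs = {t: 0 for t in thread_ids}

    def safe_point(self, thread_id) -> None:
        """At a safe point, a thread adopts the global epoch as its local epoch."""
        with self._lock:
            self.local_epochs[thread_id] = self.global_epoch

    def request_agreement(self) -> int:
        """Call after modifying shared state: bump the global epoch and
        return the new value, which the caller remembers as its target."""
        with self._lock:
            self.global_epoch += 1
            return self.global_epoch

    def agreed(self, target: int) -> bool:
        """Agreement holds once every local epoch has reached the target."""
        with self._lock:
            return all(e >= target for e in self.local_epochs.values())


epochs = RaggedEpochs(["a", "b"])
target = epochs.request_agreement()
epochs.safe_point("a")
before = epochs.agreed(target)  # thread "b" has not passed a safe point yet
epochs.safe_point("b")
after = epochs.agreed(target)
print(before, after)
```

The requester never blocks the other threads; it only polls until the slowest local epoch catches up, which is what makes the agreement asynchronous.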

  20. Last Man Out Agreement only between threads doing GC “right now” – What type of work should they be doing? – Can they transition from one phase of GC to next? – Ragged Epoch is overkill (involves all threads in system) GC Phase & worker count kept in one machine word – Worker enters: atomically increment count Phase encodes what work to do – Worker exits: atomically decrement count – Phase change: • Last worker out: atomically changes phase & count
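
A minimal sketch of keeping the phase and worker count in one word is given below. The 16-bit count field, the `PhaseWord` class, and the `next_phase` argument are assumptions made for illustration; the slides do not give the actual encoding. A lock emulates the atomic read-modify-write that a real implementation would perform on the machine word.

```python
import threading

COUNT_BITS = 16                       # assumed width of the worker-count field
COUNT_MASK = (1 << COUNT_BITS) - 1


class PhaseWord:
    """GC phase (high bits) and active worker count (low bits) in one integer."""

    def __init__(self, phase: int):
        self._lock = threading.Lock()  # emulates an atomic update of the word
        self._word = phase << COUNT_BITS

    def enter(self) -> int:
        """Worker enters: increment the count and return the phase to work on."""
        with self._lock:
            self._word += 1
            return self._word >> COUNT_BITS

    def leave(self, next_phase=None) -> bool:
        """Worker exits: decrement the count. If this was the last worker and
        a next phase is proposed, switch phase in the same step.
        Returns True if the phase changed."""
        with self._lock:
            self._word -= 1
            if (self._word & COUNT_MASK) == 0 and next_phase is not None:
                self._word = next_phase << COUNT_BITS
                return True
            return False

    def snapshot(self):
        """Return (phase, worker_count)."""
        with self._lock:
            return self._word >> COUNT_BITS, self._word & COUNT_MASK


word = PhaseWord(phase=1)
word.enter()
word.enter()
first = word.leave(next_phase=2)   # another worker is still active
second = word.leave(next_phase=2)  # last man out changes the phase
print(first, second, word.snapshot())
```

Packing both fields into one word means that "am I the last worker" and "advance the phase" are decided in a single atomic step, so no worker can enter under the old phase in between.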

  21. Last Man Out: Special Case for the Marking Phase Write barriers of different worker threads can still have objects in their stacks. A single thread is responsible for declaring the end of the marking phase. It uses the ragged epoch mechanism to detect whether all write buffers are empty. What other approaches are possible? What are soft and weak references, string interning, and finalization in Java? How do they affect garbage collection?

  22. Ensuring Progress Symptoms – Threads may not execute safe points in a timely fashion, stalling Ragged Epoch – Threads may get stuck while doing GC work Cause – OS scheduling: multi-programming, priorities Solution – Detect and priority boost laggard threads – In effect, priority inheritance on logical resource
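
The "detect and boost" idea can be sketched as below. This is a hypothetical model only: the slides do not say how laggards are detected or which priority they receive, so the detection criterion (local epoch below the target), the thread names, and the convention that a larger number means a higher priority are all assumptions.

```python
def find_laggards(local_epochs: dict, target_epoch: int) -> list:
    """Threads whose local epoch has not yet reached the target epoch."""
    return sorted(t for t, e in local_epochs.items() if e < target_epoch)


def boost_laggards(priorities: dict, laggards: list, boosted_priority: int) -> dict:
    """Return a new priority map with laggards raised to at least
    `boosted_priority` (larger number = higher priority, by assumption)."""
    return {
        t: (max(p, boosted_priority) if t in laggards else p)
        for t, p in priorities.items()
    }


local_epochs = {"app-1": 7, "app-2": 6, "app-3": 7}
priorities = {"app-1": 10, "app-2": 1, "app-3": 10}

laggards = find_laggards(local_epochs, target_epoch=7)
boosted = boost_laggards(priorities, laggards, boosted_priority=10)
print(laggards, boosted)
```

Raising the laggard to the priority of the threads waiting on it mirrors priority inheritance, with the pending agreement playing the role of the held resource.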

  23. Agenda Tax-and-Spend Scheduling From Metronome to Metronome-TS Empirical results – Methodology – SPECjbb2000 – SPECjvm98, DaCapo – Critical section deferral Conclusions

  24. Methodology Implemented in IBM's J9 VM – Identical baseline for Metronome & Metronome-TS LS-41 (8-way AMD-based blade) – Running RHEL5-MRG (Real-Time Linux) – TuningFork: instrumented App, JVM, Linux kernel Enable detailed analysis of MMU, pauses, etc. – Taskset to segregate instrumentation to 1 CPU SPECjbb2000, SPECjvm98, DaCapo

  25. Metronome vs. Metronome-TS Metronome-TS uniformly better than Metronome 2.5x lower max transaction time 1.6x lower 99.999% transaction time 20% higher throughput

  26. Exploiting Excess Capacity Metronome-TS smoothly and robustly exploits excess CPU capacity 15% throughput improvement with background threads Experiments with normal and real-time “hamsters” in paper: under load, no degradation from background threads

  27. Summary SPECjvm98 & DaCapo 18 programs: with/without background threads – MMU@4ms: 60% or better – Max GC pause: <400 microseconds • hsqldb, lusearch, xalan have longer pauses due to OS context switch during GC quantum – Background threads effectively offload GC work when system has excess CPU capacity • Greatly reduce median & std dev of GC pauses • Increase application throughput

  28. Summary of Empirical Results Better determinism than Metronome – Higher MMU with smaller window size – Even with stricter MMU definition (includes barrier and allocation slow paths as GC work) – Significant improvements in SPECjbb2000 2.5x lower max transaction time 1.6x lower 99.999% transaction time Better throughput than Metronome

  29. Summary of Contributions Tax-and-Spend Scheduling – Combines desirable properties of Tax-based and Slack-based approaches – Unified paradigm that supports wide range of application types and operating environments Metronome-TS implementation – Highly incremental, fully concurrent, parallel GC that supports all Java language features – Applied rigorous per-thread MMU metric – Protocols for distributed agreement

  30. Discussion What is the performance of Metronome-TS with a small number of cores, e.g. 2 cores? The experiments used a large memory of 12 GB. What is the performance on memory-constrained systems? The overload condition is defined only as number of processors <= number of worker threads. What is the utilization of each processor under overload? Do we overload memory? What is the performance of each GC phase under these conditions?
