
Garbage Collection for Large Scale Multiprocessors
Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro (Regal-LIP6/INRIA)


Presentation Transcript


1. Garbage Collection for Large Scale Multiprocessors (funded by ANR projects Prose and ConcoRDanT). Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro, Regal-LIP6/INRIA

2. Introduction. Why? • Heavy use of Managed Runtime Environments: application servers, scientific applications. • Examples: JBoss, Sunflow, etc. • Hardware has more and more cores and memory nodes. • GC performance is critical. • Existing GCs were developed for SMPs. What? • Assess GC scalability: empirical results. • Identify the factors limiting GC scalability. • Our approach to fixing them.

3. Contemporary Architecture [Figure: two NUMA nodes, each with six cores (C0–C5), per-core L2 caches, a shared L3 cache, and local DRAM; memory access latencies annotated] • Non Uniform Memory Access (NUMA): remote access >> local access. • Our machine has 8 such nodes with 6 cores each.
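The rest of the talk builds on this remote >> local asymmetry. As background, here is a minimal sketch, not from the talk, of how a runtime can place memory on the node of the current thread. It assumes Linux with libnuma (compile with -lnuma); the 64 MB size is just an example.

```cpp
#include <cstddef>
#include <cstdio>
#include <numa.h>    // libnuma: numa_available, numa_alloc_onnode, numa_free
#include <sched.h>   // sched_getcpu

int main() {
    if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }

    int cpu  = sched_getcpu();          // CPU this thread is running on
    int node = numa_node_of_cpu(cpu);   // NUMA node that owns this CPU

    // Back a buffer with pages on that node: accesses from this thread stay
    // local, which is what a NUMA-aware GC heap aims for.
    std::size_t size = 64UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(size, node);
    std::printf("allocated %zu bytes on node %d (cpu %d)\n", size, node, cpu);

    numa_free(buf, size);
    return 0;
}
```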

4. GC Scalability (Lusearch) [Figure: application time and GC pause time of the HotSpot JVM's garbage collectors as the number of application and GC threads grows] • Pause time increases with GC threads → negative scalability!

5. Trivial Bottleneck • Scalable synchronization primitives are vital. • The GC task queue uses a monitor, which unnecessarily blocks GC threads. • We replaced it with a lock-free version (see the sketch below). • No barrier for GC threads after GC completion. • Trivial but very important: up to 80% improvement.
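The talk does not show the queue code; below is a minimal sketch of the idea, using a Treiber-style lock-free stack of GC tasks in place of a monitor-protected queue. The GCTask type and the exact structure are illustrative, not HotSpot's.

```cpp
#include <atomic>

struct GCTask {
    GCTask *next;     // intrusive link used by the stack
    // ... task payload, e.g. a range of roots to scan ...
};

// Treiber-style lock-free stack: push and pop use compare-and-swap instead of
// a monitor, so a GC thread is never blocked while the task list is updated.
class LockFreeTaskStack {
    std::atomic<GCTask *> head{nullptr};
public:
    void push(GCTask *t) {
        GCTask *old = head.load(std::memory_order_relaxed);
        do {
            t->next = old;
        } while (!head.compare_exchange_weak(old, t,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }
    GCTask *pop() {
        GCTask *old = head.load(std::memory_order_acquire);
        while (old && !head.compare_exchange_weak(old, old->next,
                                                  std::memory_order_acquire,
                                                  std::memory_order_relaxed))
            ;                 // CAS failed: old was reloaded, retry
        return old;           // nullptr when the stack is empty
    }
};
```

A production version also has to handle the ABA problem (for example with a version-tagged head) and termination detection; the point here is only that push and pop never block a GC thread.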

6. Main Bottleneck • Remote access and … remote access! • 7 out of 8 accesses are remote (with 8 nodes and no placement policy, a random access is local only 1 time in 8). • When scanning an object: 87.7% remote. • When copying an object: 82.7% remote. • When stealing for load balancing: 2–4 bus operations per steal.

7. Our Approach: Big Picture • Improve GC locality: local scan, local copy, local stealing. • Tradeoff: locality vs. load balance. • We fix the young generation of HotSpot's Parallel Scavenge collector (ParallelScavenge).

8. Avoid Remote Access [Figure: from-space and to-space spread over Node 0 and Node 1; GC threads GC0 and GC1 copy objects a–f into the to-space of their own node; references that cross nodes (e.g., to object e) are handed over through a reference queue from node 0 to node 1]
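The slide itself is a diagram; below is a toy, single-threaded simulation of the idea it illustrates, with every name invented for illustration (nothing here is HotSpot code): a GC worker bound to a node copies and scans only objects on that node and forwards cross-node references through per-node queues.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

struct Obj {
    int node;                    // NUMA node holding the object
    std::vector<Obj *> fields;   // outgoing references
    bool copied;                 // stands in for "has a copy in to-space"
};

constexpr int kNodes = 2;
std::queue<Obj *> ref_queue[kNodes];      // per-node hand-over queues

// GC worker bound to `node`: copies only objects that live on its node and
// forwards cross-node references instead of performing remote accesses.
void gc_worker(int node) {
    while (!ref_queue[node].empty()) {
        Obj *o = ref_queue[node].front();
        ref_queue[node].pop();
        if (o->copied) continue;
        if (o->node != node) {            // not ours: hand over, don't touch
            ref_queue[o->node].push(o);
            continue;
        }
        o->copied = true;                 // "copy" into the local to-space
        for (Obj *ref : o->fields)        // local scan of the copy
            ref_queue[ref->node].push(ref);
    }
}

int main() {
    Obj a{0}, b{1}, c{0};
    a.fields = {&b, &c};                  // a (node 0) -> b (node 1), c (node 0)
    ref_queue[0].push(&a);                // root set
    for (int round = 0; round < 2; ++round)
        for (int n = 0; n < kNodes; ++n)
            gc_worker(n);
    std::printf("copied: a=%d b=%d c=%d\n", a.copied, b.copied, c.copied);
}
```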

9. Heap Partitioning [Figure: the baseline design is a single space of n MB, collected when full; the NUMA-aware space splits it into two chunks of n/2 MB each] • Problem: a collection is triggered as soon as any one chunk is full (e.g., chunk 1 full while chunk 0 is only ¼ full), so we collect more often.

10. Heap Partitioning: Our Approach [Figure: chunk 0 and chunk 1, each sized n MB] • Collect when the total allocated across the chunks reaches n MB, not when a single chunk fills up (see the sketch below).
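A minimal sketch of the triggering policy the two slides contrast; all names are invented and per-chunk allocation is reduced to a byte counter, so this is only the decision logic, not an allocator.

```cpp
#include <cstddef>
#include <cstdio>

struct Chunk { std::size_t allocated; };

// Illustrative NUMA-aware space: one chunk per node, but the collection is
// triggered on the *sum* of allocated bytes, so a nearly empty chunk on one
// node does not force an early collection.
struct NumaSpace {
    static constexpr int kNodes = 2;
    Chunk       chunks[kNodes];
    std::size_t heap_size;                  // logical size n of the young space

    bool should_collect_total() const {     // our approach: total >= n
        std::size_t total = 0;
        for (const Chunk &c : chunks) total += c.allocated;
        return total >= heap_size;
    }
    bool should_collect_per_chunk() const { // baseline split: any chunk >= n/2
        for (const Chunk &c : chunks)
            if (c.allocated >= heap_size / kNodes) return true;
        return false;
    }
};

int main() {
    NumaSpace s{{{10}, {50}}, 100};         // chunk 0: 10 MB, chunk 1: 50 MB, n = 100 MB
    std::printf("per-chunk policy collects: %d, total policy collects: %d\n",
                s.should_collect_per_chunk(), s.should_collect_total());
}
```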

11. Load Balancing • NUMA-aware work stealing: a GC thread only steals from threads on the same node (see the sketch below). • What about inter-node imbalance? • Applications with a master-slave design cause this. • Example: the h2 database.
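An illustrative steal loop for the node filter described above; all names are invented, and the queue is a plain deque rather than a real work-stealing deque.

```cpp
#include <deque>
#include <vector>

struct GCTask { int id; };

struct WorkQueue {
    int node;                               // node this worker is bound to
    std::deque<GCTask> tasks;
    bool try_steal(GCTask &out) {           // steal from the back of the deque
        if (tasks.empty()) return false;
        out = tasks.back();
        tasks.pop_back();
        return true;
    }
};

// NUMA-aware stealing: victims are restricted to the same node, so a
// successful steal never hands us a task whose objects live in remote memory.
bool steal_local(std::vector<WorkQueue> &queues, int my_node, GCTask &out) {
    for (WorkQueue &q : queues)
        if (q.node == my_node && q.try_steal(out)) return true;
    return false;                           // no local work left
}
```

A real implementation picks victims at random and retries before declaring termination; the only point here is the node filter on the victim set.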

12. [Figure: the master's stack and a slave's stack hold roots a–d spread across Node 0 and Node 1; as on slide 8, GC0 and GC1 copy the objects into the to-space of their own node, and cross-node references are again handed over through the reference queue from node 0 to node 1]

13. Conclusion and Future Work • Remote access hinders the scalability of GC. • Tradeoff: locality vs. load balance. • Inter-node imbalance acts as a hurdle. • Using all the cores is sub-optimal: it hits the memory wall. • Adaptive resizing of the NUMA-aware generation costs more. • Up to 65% improvement on the scalable benchmarks of DaCapo.
