Garbage Collection for Large Scale Multiprocessors (Funded by ANR projects: Prose and ConcoRDanT) Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro. Regal-LIP6/INRIA
Introduction Why? • Heavy use of Managed Runtime Environments • Application servers • Scientific applications • Examples: JBoss, Sunflow, etc. • Hardware offers more and more resources (cores, memory nodes). • GC performance is critical. • Existing GCs were developed for SMPs. What? • Assess GC scalability: empirical results. • Possible factors affecting GC scalability. • Our approach to fixing them.
Contemporary Architecture [Figure: two NUMA nodes; each node has six cores (C0-C5) with per-core L2 caches and a shared L3 in front of its local DRAM. The slide annotates the levels with access latencies of 15 (L2), 40 (L3), 125 (local DRAM), and 315 (remote DRAM).] Non Uniform Memory Access (NUMA): remote access >> local access. Our machine has 8 such nodes with 6 cores each.
GC Scalability (Lusearch) HotSpot JVM's garbage collectors. [Figure: application time and GC pause time plotted against the number of application threads and GC threads.] Pause time increases with the number of GC threads: negative scalability!
Trivial Bottleneck • Scalable synchronization primitives are vital. • The GC task queue uses a monitor • Unnecessarily blocks GC threads. • Replaced it with a lock-free version (see the sketch below). • No barrier for GC threads after GC completion. • Trivial but very important: up to 80% improvement.
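A minimal C++ sketch of the lock-free replacement, assuming a simple Treiber stack; this illustrates the idea only, it is not HotSpot's actual GCTaskManager code, and the Task payload is hypothetical.

    #include <atomic>

    struct Task {
        Task* next = nullptr;
        // payload, e.g. a range of gray objects to scan
    };

    class LockFreeTaskQueue {
        std::atomic<Task*> head_{nullptr};
    public:
        // Push never blocks: retry the CAS until the new head is installed.
        void push(Task* t) {
            Task* old = head_.load(std::memory_order_relaxed);
            do {
                t->next = old;
            } while (!head_.compare_exchange_weak(old, t,
                         std::memory_order_release,
                         std::memory_order_relaxed));
        }

        // Pop returns nullptr when empty instead of parking the caller in
        // a monitor, so an idle GC thread can immediately try to steal
        // work elsewhere. (A production queue must also handle the ABA
        // problem, e.g. with tagged pointers.)
        Task* pop() {
            Task* old = head_.load(std::memory_order_acquire);
            while (old && !head_.compare_exchange_weak(old, old->next,
                              std::memory_order_acquire,
                              std::memory_order_relaxed)) {
            }
            return old;
        }
    };

The point is only that idle GC threads keep polling or stealing instead of blocking in a monitor and waking up through the scheduler.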
Main Bottleneck • Remote access and … Remote access! • 7 out of 8 accesses are remote • When scanning an object (87.7% remote) • When copying an object (82.7% remote) • When stealing for load balancing (2-4 bus ops/steal)
Our Approach: Big Picture • Improve GC locality: • Local scan • Local copy • Local stealing • Tradeoff: locality vs. load balance. • We fix the young generation of ParallelScavenge.
Avoid Remote Access [Figure: from-space and to-space are each split between Node 0 and Node 1, holding objects a-f. GC0 copies the from-space objects that live on node 0 and GC1 those on node 1; when a thread finds a reference to an object on the other node (e.g., e), it pushes the reference into a per-node queue (here, the queue from node 0 to node 1) instead of copying remotely.]
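A self-contained C++ sketch of this copy rule; all names (Obj, ref_q, copy_to_local_space) are hypothetical stand-ins for the actual ParallelScavenge code, and the queues are shown without concurrency control.

    #include <deque>

    constexpr int kNodes = 2;

    struct Obj {
        int  node;                  // NUMA node holding this object
        Obj* forwarded = nullptr;   // set once the object has been copied
    };

    // ref_q[src][dst]: reference slots discovered by the GC thread of
    // node src that point into node dst's memory.
    std::deque<Obj**> ref_q[kNodes][kNodes];

    Obj* copy_to_local_space(Obj* o, int my_node) {
        Obj* copy = new Obj{my_node};   // stand-in for evacuation into
        o->forwarded = copy;            // this node's to-space
        return copy;
    }

    void process_reference(Obj** slot, int my_node) {
        Obj* target = *slot;
        if (target->forwarded) {                // already evacuated
            *slot = target->forwarded;
        } else if (target->node == my_node) {   // scan and copy locally
            *slot = copy_to_local_space(target, my_node);
        } else {
            // Never touch remote memory: hand the slot over to the GC
            // thread running on the target's node.
            ref_q[my_node][target->node].push_back(slot);
        }
    }

With this rule, scans and copies stay local; the reference queues are the only cross-node traffic.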
Heap Partitioning • Baseline design: one space of n MB; collect when it is full. • NUMA-aware split: two chunks of n/2 MB each, one per node. • Chunk 0: only ¼ full; chunk 1: full. • Problem: we collect more often, as soon as even one chunk is full.
Heap Partitioning: Our Approach • Chunk 0 and chunk 1 may each grow up to n MB. • Collect when the total allocated across chunks reaches n MB (sketched below).
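A hedged sketch of the resulting collection trigger; kNodes, kHeapSize, and the per-node counters are illustrative names, not HotSpot variables.

    #include <atomic>
    #include <cstddef>

    constexpr int    kNodes    = 8;
    constexpr size_t kHeapSize = 512u * 1024 * 1024;  // "n MB", e.g. 512 MB

    std::atomic<size_t> used[kNodes];  // bytes allocated in each node's chunk

    bool should_collect() {
        size_t total = 0;
        for (int n = 0; n < kNodes; ++n)
            total += used[n].load(std::memory_order_relaxed);
        // The baseline NUMA split collects as soon as ONE n/kNodes chunk
        // fills; here a collection starts only once the chunks together
        // hold n MB.
        return total >= kHeapSize;
    }

A single full chunk therefore no longer forces a collection, so skewed node-local allocation does not shorten the GC period.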
Load Balancing • NUMA-aware work stealing: a thread only steals from other threads on its own node (see the sketch after the figure below). • What about inter-node imbalance? • Applications with a master-slave design cause it. • Example: the h2 database.
[Figure: from-space and to-space again split between Node 0 and Node 1, with objects a-d. The master's stack (node 0) references most of the live objects, while a slave's stack (node 1) references few; GC0 therefore keeps feeding the reference queue from node 0 to node 1, and the copying work piles up on one node.]
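A C++ sketch of the local-only stealing policy; WorkQueue, steal_local, and the round-robin probe order are assumptions for illustration (HotSpot's real task queues are lock-free and more elaborate).

    #include <deque>

    struct Task { /* e.g. a range of objects to scan */ };

    struct WorkQueue {
        std::deque<Task*> tasks;   // concurrency control elided for brevity
        Task* steal() {
            if (tasks.empty()) return nullptr;
            Task* t = tasks.front();   // take from the cold end
            tasks.pop_front();
            return t;
        }
    };

    constexpr int kNodes = 8, kThreadsPerNode = 6;
    WorkQueue queues[kNodes][kThreadsPerNode];

    Task* steal_local(int my_node, int my_id) {
        for (int t = 0; t < kThreadsPerNode; ++t) {
            if (t == my_id) continue;
            if (Task* task = queues[my_node][t].steal())
                return task;    // stays on-node: no interconnect traffic
        }
        return nullptr;         // the whole node is out of work
    }

When steal_local returns nullptr, the node is idle; whether to then steal across nodes is exactly the locality vs. load-balance tradeoff above.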
Conclusion and Future Work • Remote access hinders the scalability of GC. • Tradeoff: locality vs. load balance. • Inter-node imbalance remains a hurdle. • Using all the cores is sub-optimal: it hits the memory wall. • Adaptive resizing of the NUMA-aware generation costs more. • Up to 65% improvement on the scalable DaCapo benchmarks.