NUMA Tuning for Java Server Applications

Mustafa M. Tikir

Introduction • Cache-coherent SMPs are widely used • High performance computing • Large-scale applications • Client-server computing • cc-NUMA is the dominant architecture • Allows construction of large servers • Data locality is an important consideration • Faster access to local memory units

Dynamic Page Migration • Effective for scientific applications • Regular memory access patterns • Large static arrays with many pages • Divided into segments • Distributed to multiple computation nodes • Each data segment is accessed mostly by a few nodes • Our earlier work • Moved pages at fixed time intervals • Profiles gathered from hardware counters • Resulted in • Up to 90% reduction in non-local accesses • Up to 16% improvement in execution times
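The interval-based migration idea above can be sketched in a few lines. This is only an illustrative model, not the implementation from the earlier work: the sample format `(page, node)`, the function names, and the "move to the most frequent accessor" rule are assumptions made for the example.

```python
from collections import Counter, defaultdict


def plan_page_migrations(samples, page_home):
    """Return {page: new_node} for pages whose most frequent accessor is remote.

    samples   -- iterable of (page, node) pairs sampled during one interval
    page_home -- dict mapping page -> node that currently holds it
    """
    per_page = defaultdict(Counter)
    for page, node in samples:
        per_page[page][node] += 1

    moves = {}
    for page, counts in per_page.items():
        best_node, _ = counts.most_common(1)[0]
        if page_home.get(page) != best_node:
            moves[page] = best_node
    return moves


def non_local(samples, page_home):
    """Count sampled accesses issued from a node other than the page's home."""
    return sum(1 for page, node in samples if page_home[page] != node)


# One interval of hypothetical samples: page 0 is used mostly by node 1.
samples = [(0, 1), (0, 1), (0, 0), (1, 0), (1, 0)]
page_home = {0: 0, 1: 0}

before = non_local(samples, page_home)
moves = plan_page_migrations(samples, page_home)
page_home.update(moves)          # apply the migrations at the interval boundary
after = non_local(samples, page_home)
```

In this toy interval only page 0 is moved, because page 1 is already local to its main accessor. A real system would also weigh the cost of each move against the expected benefit.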

Java Server Applications • Java programs • Make extensive use of heap-allocated memory • Typically have significant pointer chasing • Dynamic page migration may not be as beneficial • A page may have objects with different access patterns • Page placement is transparent to the standard allocation routines • Larger page sizes increase the likelihood of mixed access patterns within a page • cc-NUMA servers tend to use super pages • Heap objects should be allocated or moved • Local to the processor accessing them most • Migration at the object granularity

Page Migration for SPECjbb2000 • Around 25% reduction in non-local accesses • Unlike scientific applications where it is up to 90% • Around 3% reduction in throughput • Overhead due to migrations of many pages

Memory Behavior at Object Granularity • Source code instrumentation of HotSpot VM • Object allocations by the Java application • Internal heap allocations by the VM • Changes in object addresses due to garbage collection • Instrumentation using Dyninst • Additional helper thread • For address transaction sampling • Via Sun Fire Link hardware counters • Execution is divided into distinct intervals • Execution interval • Gathers information on object allocations and accesses • Garbage collection interval • Dumps allocation and transaction buffers

Experiments using SPECjbb2000 • Young Generation • Where objects are initially allocated • Objects stay here until old enough to be tenured • Survivor spaces • Tenured (old) Generation • Objects reaching a certain age are promoted here • Permanent Generation • Where the reflective data of the VM is allocated • Such as class and method objects

Potential Optimizations • Estimation study using finer-grained techniques • Based on information gathered during measurement • Heap allocations and accesses • Potential object-centric optimizations • Static-optimal placement • Has information on all object accesses • Places objects at allocation time • Prior-knowledge placement • Has information on object accesses during the next execution interval • Moves objects at garbage collection time • Object-migration placement • Gathers information since the start of execution • Moves objects at garbage collection time
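To make the difference between the three placements concrete, the sketch below estimates non-local accesses for each one on a tiny hand-made trace. It is a simplified model under stated assumptions, not the estimation tool used in the study: the trace is a list of execution intervals, each a list of `(object, node)` accesses; a garbage collection is assumed between consecutive intervals; `alloc_node` (the node where each object starts) and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def best_nodes(accesses):
    """Map each object to the node that accesses it most in `accesses`."""
    counts = defaultdict(Counter)
    for obj, node in accesses:
        counts[obj][node] += 1
    return {obj: c.most_common(1)[0][0] for obj, c in counts.items()}


def count_non_local(accesses, home):
    return sum(1 for obj, node in accesses if home[obj] != node)


def static_optimal(intervals, alloc_node):
    # Knows every access of the run and fixes the placement at allocation time.
    all_accesses = [a for interval in intervals for a in interval]
    home = {**alloc_node, **best_nodes(all_accesses)}
    return count_non_local(all_accesses, home)


def prior_knowledge(intervals, alloc_node):
    # At each collection, moves objects using the accesses of the NEXT interval.
    home, total = dict(alloc_node), 0
    for i, interval in enumerate(intervals):
        if i > 0:  # a collection precedes every interval except the first
            home.update(best_nodes(interval))
        total += count_non_local(interval, home)
    return total


def object_migration(intervals, alloc_node):
    # At each collection, moves objects using all accesses seen so far.
    home, history, total = dict(alloc_node), [], 0
    for interval in intervals:
        total += count_non_local(interval, home)
        history.extend(interval)
        home.update(best_nodes(history))
    return total


# Hypothetical trace: object "a" is allocated on node 0, then node 1 takes over.
intervals = [
    [("a", 0)] * 3,
    [("a", 1)] * 5 + [("a", 0)],
    [("a", 1)] * 4,
]
alloc_node = {"a": 0}

results = {
    "static-optimal": static_optimal(intervals, alloc_node),
    "prior-knowledge": prior_knowledge(intervals, alloc_node),
    "object-migration": object_migration(intervals, alloc_node),
}
```

On this trace the object-migration placement lags one interval behind the change in behavior, because it only has past accesses to go on, while the two oracle placements do not.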

Estimation Study Results • Migration is effective in the old generation • Many objects in the young generation die fast • One or a few processors access objects in the young generation • Majority of accesses are from the allocating processor • SPECjbb2000 has some dynamically changing memory behavior in the old generation

NUMA-Aware Java Heaps • NUMA-Aware heap layouts • NUMA-Eden • NUMA-Aware young generation • Original old generation • Focus on the objects in the young generation • NUMA-Eden+Old • NUMA-Aware young generation • NUMA-Aware old generation • Combined with dynamic object migration • Focus on the access locality to all objects

NUMA-Aware Young Generation • We divide eden space into segments • Each locality group is assigned a segment • Pages in each segment are placed local to the group • Object allocation • The requestor processor is identified • Object is placed in the segment of the processor’s group • Garbage collection • When a segment does not have enough space • Other segments are also collected even if not full • Potentially eliminates future synchronization
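The allocation and collection rules on this slide can be modeled roughly as follows. This is a minimal sketch, not HotSpot code: the class name, the bump-pointer bookkeeping, the contiguous CPU-to-group numbering, and the assumption that a collection empties eden completely are all simplifications introduced for illustration.

```python
class NumaEden:
    """Eden space split into one equally sized segment per locality group."""

    def __init__(self, num_groups, segment_bytes, cpus_per_group):
        self.segment_bytes = segment_bytes
        self.cpus_per_group = cpus_per_group
        self.used = [0] * num_groups   # bump pointer for each segment
        self.collections = 0

    def group_of(self, cpu):
        # Assumption: CPUs are numbered contiguously within a locality group.
        return cpu // self.cpus_per_group

    def allocate(self, cpu, size):
        """Allocate in the requesting CPU's segment; return (group, offset)."""
        if size > self.segment_bytes:
            raise MemoryError("object larger than an eden segment")
        group = self.group_of(cpu)
        if self.used[group] + size > self.segment_bytes:
            self.collect()
        offset = self.used[group]
        self.used[group] += size
        return group, offset

    def collect(self):
        # One segment ran out of space, so every segment is collected,
        # whether or not it is full.
        # Simplification: all eden objects are treated as dead or evacuated.
        self.used = [0] * len(self.used)
        self.collections += 1


eden = NumaEden(num_groups=2, segment_bytes=100, cpus_per_group=4)
first = eden.allocate(cpu=0, size=60)    # lands in group 0's segment
second = eden.allocate(cpu=5, size=30)   # lands in group 1's segment
third = eden.allocate(cpu=1, size=60)    # group 0 is full: collect, then retry
```

Collecting every segment at once means group 1's segment is emptied even though it was only partly used, which is the behavior the slide describes as potentially eliminating future synchronization.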

NUMA-Aware Old Generation • We divide tenured space into segments • Each locality group is assigned a segment • Pages in each segment are placed local to the group • When an object is promoted to old generation • Preferred location of the object is identified • Processor that accesses the object most • Object is moved to the segment of the processor's group • Object migrations during full garbage collection • Preferred locations of all objects are re-computed • Additional object migrations • At every fixed number of minor collections • To match the dynamically changing behavior
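A rough model of the promotion and migration rules above is sketched below. It is not the HotSpot implementation: the class and method names, the per-object access counters, and the `migrate_every` parameter (standing in for "every fixed number of minor collections") are assumptions made for the example.

```python
from collections import Counter


class NumaOldGen:
    """Tenured space with one segment per locality group."""

    def __init__(self, num_groups, migrate_every):
        self.segments = [set() for _ in range(num_groups)]
        self.location = {}        # object id -> group currently holding it
        self.access_counts = {}   # object id -> Counter of accessing groups
        self.migrate_every = migrate_every
        self.minor_collections = 0

    def record_access(self, obj, group):
        self.access_counts.setdefault(obj, Counter())[group] += 1

    def preferred_group(self, obj, default):
        """Group that has accessed obj most so far, or default if unknown."""
        counts = self.access_counts.get(obj)
        return counts.most_common(1)[0][0] if counts else default

    def _place(self, obj, group):
        old = self.location.get(obj)
        if old is not None:
            self.segments[old].discard(obj)
        self.segments[group].add(obj)
        self.location[obj] = group

    def promote(self, obj, allocating_group):
        """Move a tenured object into the segment of its preferred group."""
        group = self.preferred_group(obj, allocating_group)
        self._place(obj, group)
        return group

    def migrate_all(self):
        """Recompute preferred locations; return how many objects moved."""
        moved = 0
        for obj, current in list(self.location.items()):
            target = self.preferred_group(obj, current)
            if target != current:
                self._place(obj, target)
                moved += 1
        return moved

    def on_minor_collection(self):
        """Trigger extra migrations at every migrate_every-th minor collection."""
        self.minor_collections += 1
        if self.minor_collections % self.migrate_every == 0:
            return self.migrate_all()
        return 0


old_gen = NumaOldGen(num_groups=2, migrate_every=2)
for group in (1, 1, 0):
    old_gen.record_access("order", group)
promoted_to = old_gen.promote("order", allocating_group=0)

for _ in range(3):                       # behavior changes: group 0 takes over
    old_gen.record_access("order", 0)
moved_first = old_gen.on_minor_collection()
moved_second = old_gen.on_minor_collection()
```

Here the object is first promoted next to group 1, then moved to group 0 at the second minor collection once the accumulated counts favor group 0.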

Experimental Setup • Representative Java workloads for simulation • Generated from actual runs • Sequence of requests • To allocate or access objects by processors • Same order as the actual run • Workload Execution Machine • A hybrid execution simulator • Consumes the generated parallel workload • Issues memory allocations and accesses to the machine • Implements the underlying memory management algorithms • Original algorithms in the HotSpot VM • Algorithms for NUMA-Aware heap layouts
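The replay idea can be illustrated with a very small trace-driven loop. This is only a sketch of the general approach, not the Workload Execution Machine itself: the record format, the `place` policy callback, and the two toy policies are hypothetical, and the single-group baseline is not meant to represent the original HotSpot algorithms.

```python
def replay(workload, place, group_of):
    """Replay a trace in order; return (local, non_local) access counts.

    workload -- list of ("alloc", cpu, obj) or ("access", cpu, obj) records
    place    -- policy: place(cpu, obj) -> locality group that will hold obj
    group_of -- maps a CPU number to its locality group
    """
    home = {}
    local = non_local = 0
    for kind, cpu, obj in workload:
        if kind == "alloc":
            home[obj] = place(cpu, obj)
        elif kind == "access":
            if home[obj] == group_of(cpu):
                local += 1
            else:
                non_local += 1
        else:
            raise ValueError(f"unknown record kind: {kind!r}")
    return local, non_local


def group_of(cpu):
    # Assumption: four CPUs per locality group, numbered contiguously.
    return cpu // 4


workload = [
    ("alloc", 0, "x"),
    ("alloc", 5, "y"),
    ("access", 0, "x"),
    ("access", 5, "y"),
    ("access", 4, "x"),
]

# Hypothetical baseline: every object ends up in group 0's memory.
single_group = replay(workload, lambda cpu, obj: 0, group_of)
# NUMA-aware policy: place each object in the allocating CPU's group.
numa_aware = replay(workload, lambda cpu, obj: group_of(cpu), group_of)
```

Because the same records are replayed in the same order for every policy, differences in the counts come only from the placement algorithm being tested.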

NUMA-Aware Heap Experiments • Application • SPECjbb2000 benchmark on HotSpot VM • Run with 12 warehouses • Platform • 24-processor Sun Fire 6800 • 24 GB main memory • Sampling every 1K transactions • Partial workload from the actual run • 10M allocation records • 28M memory accesses • Generated workloads with higher memory pressure • Scaled 16 and 32 times

Reduction in Non-Local Accesses

Execution Times • NUMA-aware heaps are effective • 27% improvement for NUMA-Eden configuration • 40% improvement for NUMA-Eden+Old configuration • More effective for higher memory pressure

Conclusions • NUMA-Aware heap layouts • Up to 41% reduction in non-local accesses • Up to 40% improvement in workload execution • Dynamic object migration is beneficial • Compared to using only NUMA-aware young generation • NUMA-aware heaps are more effective • As the memory pressure increases • More effective on larger servers • Sun Fire 15K (latency ratio => 1:1.78) • SGI Altix 3000 (latency ratio => 1:4.17)
