
NUMA Tuning for Java Server Applications



  1. NUMA Tuning for Java Server Applications Mustafa M. Tikir

  2. Introduction • Cache-coherent SMPs are widely used • High performance computing • Large-scale applications • Client-server computing • cc-NUMA is the dominant architecture • Allows construction of large servers • Data locality is an important consideration • Faster access to local memory units

  3. Dynamic Page Migration • Effective for scientific applications • Regular memory access patterns • Large static arrays spanning many pages • Divided into segments • Distributed to multiple computation nodes • Each data segment is accessed mostly by a few nodes • Our earlier work • Moved pages at fixed time intervals • Profiles gathered from hardware counters • Resulted in up to • 90% reduction in non-local accesses • 16% improvement in execution times
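The interval-based migration described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `PageMigrationSketch`, the `counts[page][node]` profile layout, and the rule of keeping a page in place on a tie are all assumptions made for the example.

```java
/** Hypothetical sketch: re-home pages from per-node access counts sampled in one interval. */
final class PageMigrationSketch {

    /**
     * @param home   home[page] = node currently holding the page (updated in place)
     * @param counts counts[page][node] = sampled accesses to the page from that node
     * @return number of pages moved in this interval
     */
    static int migrate(int[] home, int[][] counts) {
        int moved = 0;
        for (int page = 0; page < home.length; page++) {
            int best = home[page];
            // Assumption: a page moves only if another node accessed it strictly more often.
            for (int node = 0; node < counts[page].length; node++) {
                if (counts[page][node] > counts[page][best]) {
                    best = node;
                }
            }
            if (best != home[page]) {
                home[page] = best; // stand-in for the actual page move
                moved++;
            }
        }
        return moved;
    }
}
```

Calling `migrate` once per fixed time interval with fresh counts mirrors the "moved pages at fixed time intervals" step; a real system would also weigh the cost of each move.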

  4. Java Server Applications • Java programs • Make extensive use of heap-allocated memory • Typically have significant pointer chasing • Dynamic page migration may not be as beneficial • A page may have objects with different access patterns • Page placement is transparent to the standard allocation routines • Larger page sizes make mixed access patterns within a page more likely • cc-NUMA servers tend to use super pages • Heap objects should be allocated or moved • Local to the processor accessing them most • Migration at the object granularity

  5. Page Migration for SPECjbb2000 • Around 25% reduction in non-local accesses • Unlike scientific applications where it is up to 90% • Around 3% reduction in throughput • Overhead due to migrations of many pages

  6. Memory Behavior at Object Granularity • Source code instrumentation of HotSpot VM • Object allocations by the Java application • Internal heap allocations by the VM • Changes in object addresses due to garbage collection • Instrumentation using dyninst • Additional helper thread • For address transaction sampling • Via Sun Fire Link hardware counters • Execution is divided into distinct intervals • Execution interval • Gathers information on object allocations and accesses • Garbage collection interval • Dumps allocation and transaction buffers

  7. Experiments using SPECjbb2000 • Young Generation • Objects are initially allocated here • Objects stay here until old enough to be tenured • Survivor spaces • Tenured (old) Generation • Objects reaching a certain age are promoted here • Permanent Generation • The reflective data of the VM are allocated here • Such as class and method objects

  8. Potential Optimizations • Estimation study using finer grained techniques • Based on information gathered during measurement • Heap allocations and accesses • Potential object centric optimizations • Static-optimal placement • Has information on all object accesses • Places objects at allocation time • Prior-knowledge placement • Has information on object accesses during the next execution interval • Moves objects at garbage collection time • Object-migration placement • Gathers information since the start of execution • Moves objects at garbage collection time
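To make the difference between the three placement strategies concrete, the sketch below counts non-local accesses for a single object under each one. It is an illustration under stated assumptions, not the study's tooling: `PlacementEstimateSketch` and the `acc[interval][node]` layout are hypothetical, prior-knowledge placement is assumed to also know the first interval, and ties go to the lowest-numbered node.

```java
/** Hypothetical sketch: non-local access counts for one object under three placements. */
final class PlacementEstimateSketch {

    private static int argmax(int[] v) {
        int best = 0;
        for (int i = 1; i < v.length; i++) {
            if (v[i] > v[best]) best = i;
        }
        return best;
    }

    private static int sum(int[] v) {
        int s = 0;
        for (int x : v) s += x;
        return s;
    }

    /** Static-optimal: one placement at allocation, chosen from all accesses of the run. */
    static int staticOptimal(int[][] acc) {
        int[] total = new int[acc[0].length];
        for (int[] interval : acc) {
            for (int n = 0; n < interval.length; n++) total[n] += interval[n];
        }
        return sum(total) - total[argmax(total)];
    }

    /** Prior-knowledge: before each interval, place on the node that will access it most. */
    static int priorKnowledge(int[][] acc) {
        int nonLocal = 0;
        for (int[] interval : acc) {
            nonLocal += sum(interval) - interval[argmax(interval)];
        }
        return nonLocal;
    }

    /** Object-migration: place on the node with the most accesses observed so far. */
    static int objectMigration(int[][] acc, int allocNode) {
        int[] seen = new int[acc[0].length];
        int node = allocNode; // assumption: the object starts on the allocating node
        int nonLocal = 0;
        for (int[] interval : acc) {
            nonLocal += sum(interval) - interval[node];
            for (int n = 0; n < interval.length; n++) seen[n] += interval[n];
            node = argmax(seen); // re-placed at the collection that ends the interval
        }
        return nonLocal;
    }
}
```

For an object whose accesses shift from node 0 to node 1 over three intervals, `{{10, 0}, {2, 8}, {0, 10}}`, the three methods give 12, 2, and 18 non-local accesses respectively, which shows how much history-based migration can lag behind a changing access pattern.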

  9. Estimation Study Results • Migration is effective in the old generation • Many objects in the young generation die fast • One or a few processors access objects in the young generation • The majority of accesses come from the allocating processor • SPECjbb2000 has some dynamically changing memory behavior in the old generation

  10. NUMA-Aware Java Heaps • NUMA-Aware heap layouts • NUMA-Eden • NUMA-Aware young generation • Original old generation • Focus on the objects in the young generation • NUMA-Eden+Old • NUMA-Aware young generation • NUMA-Aware old generation • Combined with dynamic object migration • Focus on the access locality to all objects

  11. NUMA-Aware Young Generation • We divide eden space into segments • Each locality group is assigned a segment • Pages in each segment are placed local to the group • Object allocation • The requestor processor is identified • Object is placed in the segment of the processor’s group • Garbage collection • When a segment does not have enough space • Other segments are also collected even if not full • Potentially eliminates future synchronization
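The segmented eden described on this slide can be sketched as a set of per-group bump-pointer regions. This is a simplified model, not HotSpot code: `NumaEdenSketch`, the `cpu / cpusPerGroup` mapping from processor to locality group, and the reset-everything collection are assumptions for illustration, and a real minor collection would copy survivors rather than just reset pointers.

```java
import java.util.Arrays;

/** Hypothetical sketch: eden split into one bump-pointer segment per locality group. */
final class NumaEdenSketch {
    private final long segmentSize;
    private final int cpusPerGroup;
    private final long[] top;   // top[g] = bytes already used in segment g
    private int collections = 0;

    NumaEdenSketch(int groups, long segmentSize, int cpusPerGroup) {
        this.segmentSize = segmentSize;
        this.cpusPerGroup = cpusPerGroup;
        this.top = new long[groups];
    }

    /** Assumption: processors are numbered so that consecutive ids share a group. */
    int groupOf(int cpu) {
        return cpu / cpusPerGroup;
    }

    /** Allocates in the segment of the requesting processor's group; returns an eden offset. */
    long allocate(int cpu, long size) {
        if (size > segmentSize) {
            throw new IllegalArgumentException("object larger than a segment");
        }
        int g = groupOf(cpu);
        if (top[g] + size > segmentSize) {
            collectAll(); // one full segment triggers collection of every segment
        }
        long address = g * segmentSize + top[g];
        top[g] += size;
        return address;
    }

    /** Stand-in for a minor collection: all segments are emptied, even those not full. */
    private void collectAll() {
        Arrays.fill(top, 0L);
        collections++;
    }

    int collections() {
        return collections;
    }
}
```

Collecting all segments together, as the slide notes, means the groups do not have to synchronize again shortly afterwards when the next segment fills up.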

  12. NUMA-Aware Old Generation • We divide tenured space into segments • Each locality group is assigned a segment • Pages in each segment are placed local to the group • When an object is promoted to old generation • Preferred location of the object is identified • Processor that accesses the object most • Object is moved to the segment of the processor's group • Object migrations during full garbage collection • Preferred locations of all objects are re-computed • Additional object migrations • At every fixed number of minor collections • To match the dynamically changing behavior
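A minimal sketch of the promotion and periodic migration logic follows. The names (`OldGenPlacementSketch`, `recordAccess`, `promote`, `onMinorCollection`) and the integer object ids are hypothetical, and the access counts stand in for whatever profile the VM would actually gather; treat it as an illustration of the idea rather than the described implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: place promoted objects in the segment of their heaviest accessor. */
final class OldGenPlacementSketch {
    private final int groups;
    private final int period; // migrate at every 'period' minor collections
    private final Map<Integer, int[]> accessCounts = new HashMap<>();
    private final Map<Integer, Integer> segment = new HashMap<>(); // promoted objects only
    private int minorCollections = 0;

    OldGenPlacementSketch(int groups, int period) {
        this.groups = groups;
        this.period = period;
    }

    void recordAccess(int object, int group) {
        accessCounts.computeIfAbsent(object, k -> new int[groups])[group]++;
    }

    /** Group with the most recorded accesses; 'fallback' wins ties and unknown objects. */
    int preferredGroup(int object, int fallback) {
        int[] counts = accessCounts.get(object);
        if (counts == null) return fallback;
        int best = fallback;
        for (int g = 0; g < groups; g++) {
            if (counts[g] > counts[best]) best = g;
        }
        return best;
    }

    /** On promotion, the object goes to the segment of its preferred group. */
    int promote(int object, int allocatingGroup) {
        int g = preferredGroup(object, allocatingGroup);
        segment.put(object, g);
        return g;
    }

    /** Re-computes preferred locations at every fixed number of minor collections. */
    int onMinorCollection() {
        minorCollections++;
        if (minorCollections % period != 0) return 0;
        int moved = 0;
        for (Map.Entry<Integer, Integer> e : segment.entrySet()) {
            int g = preferredGroup(e.getKey(), e.getValue());
            if (g != e.getValue()) {
                e.setValue(g); // stand-in for copying the object to another segment
                moved++;
            }
        }
        return moved;
    }

    int segmentOf(int object) {
        return segment.get(object);
    }
}
```

The `period` parameter corresponds to the "fixed number of minor collections" on the slide; a smaller value tracks changing behavior more closely at the price of more object copies.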

  13. Experimental Setup • Representative Java workloads for simulation • Generated from actual runs • Sequence of requests • To allocate or access objects by processors • Same order as the actual run • Workload Execution Machine • A hybrid execution simulator • Consumes the generated parallel workload • Issues memory allocations and accesses to the machine • Implements the underlying memory management algorithms • Original algorithms in the HotSpot VM • Algorithms for NUMA-Aware heap layouts
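The replay idea can be illustrated with a very small simulator: requests are consumed in recorded order and each access is classified as local or non-local under a pluggable placement policy. Everything here (`WorkloadReplaySketch`, `Request`, the two `IntUnaryOperator` parameters) is a hypothetical simplification; the hybrid simulator on the slide issues real memory operations to the machine, which this sketch does not attempt.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntUnaryOperator;

/** Hypothetical sketch: replay a recorded request sequence and count access locality. */
final class WorkloadReplaySketch {
    enum Kind { ALLOC, ACCESS }

    static final class Request {
        final Kind kind;
        final int cpu;
        final int object;

        Request(Kind kind, int cpu, int object) {
            this.kind = kind;
            this.cpu = cpu;
            this.object = object;
        }
    }

    /**
     * @param nodeOfCpu maps a processor to its local memory node
     * @param placement maps the allocating processor to the node that receives the object
     * @return {local accesses, non-local accesses}
     */
    static int[] replay(List<Request> workload,
                        IntUnaryOperator nodeOfCpu,
                        IntUnaryOperator placement) {
        Map<Integer, Integer> home = new HashMap<>();
        int local = 0;
        int nonLocal = 0;
        for (Request r : workload) { // same order as the recorded run
            if (r.kind == Kind.ALLOC) {
                home.put(r.object, placement.applyAsInt(r.cpu));
            } else {
                Integer node = home.get(r.object);
                if (node == null) continue; // access to an object never allocated: skipped
                if (node == nodeOfCpu.applyAsInt(r.cpu)) local++; else nonLocal++;
            }
        }
        return new int[] {local, nonLocal};
    }
}
```

Running the same workload with two different `placement` functions gives a direct comparison between heap layouts, which is the purpose the slide assigns to the workload execution machine.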

  14. NUMA-Aware Heap Experiments • Application • SPECjbb2000 benchmark on HotSpot VM • Run with 12 warehouses • Platform • 24-processor Sun Fire 6800 • 24 GB main memory • Sampling every 1K transactions • Partial workload from the actual run • 10M allocation records • 28M memory accesses • Generated workloads with higher pressure • Scaled 16 and 32 times

  15. Reduction in Non-Local Accesses

  16. Execution Times • NUMA-aware heaps are effective • 27% improvement for NUMA-Eden configuration • 40% improvement for NUMA-Eden+Old configuration • More effective for higher memory pressure

  17. Conclusions • NUMA-Aware heap layouts • Up to 41% reduction in non-local accesses • Up to 40% improvement in workload execution • Dynamic object migration is beneficial • Compared to using only NUMA-aware young generation • NUMA-aware heaps are more effective • As the memory pressure increases • More effective on larger servers • Sun Fire 15K (latency ratio => 1:1.78) • SGI Altix 3000 (latency ratio => 1:4.17)
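One way to see why a higher latency ratio would make the same locality gain more valuable is a simple average-latency model. The sketch below is an assumption-laden illustration, not a result from the slides: it reads the ratios as local:remote access latency, takes a hypothetical baseline in which half of all accesses are non-local, and models memory latency only, not execution time.

```java
/** Hypothetical sketch: average memory latency as a function of the non-local fraction. */
final class LatencyRatioSketch {

    /** Latency in units of one local access; remoteRatio is remote latency / local latency. */
    static double averageLatency(double nonLocalFraction, double remoteRatio) {
        return (1.0 - nonLocalFraction) + nonLocalFraction * remoteRatio;
    }

    /**
     * Relative drop in average latency when the non-local fraction shrinks by 'reduction'.
     * baselineNonLocal is an assumed input; the slides do not report it.
     */
    static double latencyImprovement(double baselineNonLocal, double reduction,
                                     double remoteRatio) {
        double before = averageLatency(baselineNonLocal, remoteRatio);
        double after = averageLatency(baselineNonLocal * (1.0 - reduction), remoteRatio);
        return 1.0 - after / before;
    }
}
```

With the assumed baseline of 0.5 and the reported 41% reduction in non-local accesses, this model gives roughly an 11.5% drop in average latency at a ratio of 1:1.78 and roughly 25.1% at 1:4.17, consistent in direction with the claim that larger servers benefit more.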
