250 likes | 345 Views
Using Platform-Specific Performance Counters for Dynamic Compilation. Florian Schneider and Thomas Gross ETH Zurich. Introduction & Motivation. Dynamic compilers common execution platform for OO languages (Java, C#) Properties of OO programs difficult to analyze at compile-time
E N D
Using Platform-Specific Performance Counters for Dynamic Compilation Florian Schneider and Thomas Gross ETH Zurich
Introduction & Motivation • Dynamic compilers common execution platform for OO languages (Java, C#) • Properties of OO programs difficult to analyze at compile-time • JIT compiler can immediately use information obtained at run-time
Introduction & Motivation Types of information: • Profiles: e.g. execution frequency of methods / basic blocks • Hardware-specific properties: cache misses, TLB misses, branch prediction failures
Outline • Introduction • Requirements • Related work • Implementation • Results • Conclusions
Requirements • Infrastructure flexible enough to measure different execution metrics • Hide machine-specific details from VM • Keep changes to the VM/compiler minimal • Runtime overhead of collecting information from the CPU low • Information must be precise to be useful for online optimization
Related work • Profile guided optimization • Code positioning [PettisPLDI90] • Hardware performance monitors • Relating HPM data to basic blocks [Ammons PLDI97] • “Vertical profiling” [Hauswirth OOPSLA 2004] • Dynamic optimization • Mississippi delta [Adl-Tabatabai PLDI2004] • Object reordering [Huang OOPSLA 2004] • Our work: • No instrumentation • Use profile data + hardware info • Targets fully automatic dynamic optimization
Hardware performance monitors • Sampling-based counting • CPU reports state every n events • Precision platform-dependent (pipelines, out-of-order execution) • Sampling provides method, basic block, or instruction-level information • Newer CPUs support precise sampling (e.g. P4, Itanium)
Hardware performance monitors • Way to localize performance bottlenecks • Sampling interval determines how fine-grained the information is • Smaller sampling interval more data • Trade-off: precision vs. runtime overhead • Need enough samples for a representative picture of the program behavior
Implementation Main parts • Kernel module: low level access to hardware, per process counting • User-space library: hides kernel & device driver details from VM • Java VM thread: collects samples periodically, maps samples to Java code • Implemented on top of Jikes RVM
Implementation • Supported events: • L1 and L2 cache misses • DTLB misses • Branch prediction • Parameters of the monitoring module: • Buffer size (fixed) • Polling interval (fixed) • Sampling interval (adaptive) • Keep runtime overhead constant by changing interval during run-time automatically
From raw data to Java • Determine method + bytecode instr • Build sorted method table • Map offset to bytecode 0x080485e1: mov 0x4(%esi),%esi 0x080485e4: mov $0x4,%edi 0x080485e9: mov (%esi,%edi,4),%esi 0x080485ec: mov %ebx,0x4(%esi) 0x080485ef: mov $0x4,%ebx 0x080485f4: push %ebx 0x080485f5: mov $0x0,%ebx 0x080485fa: push %ebx 0x080485fb: mov 0x8(%ebp),%ebx 0x080485fe: push %ebx 0x080485ff: mov (%ebx),%ebx 0x08048601: call *0x4(%ebx) 0x08048604: add $0xc,%esp 0x08048607: mov 0x8(%ebp),%ebx 0x0804860a: mov 0x4(%ebx),%ebx GETFIELD ARRAYLOAD INVOKEVIRTUAL
From raw data to Java • Sample gives PC + register contents • PC machine code compiled Java code bytecode instruction • For data address: use registers + machine code to calculate target address: • GETFIELD indirect load mov 12(eax), eax // 12 = offset of field
Engineering issues • Lookup of PC to get method / BC instr must be efficient • Done in parallel with user program • Use binary search / hash table • Update at recompilation, GC • Identify 100% of instructions (PCs): • Include samples from application, VM, and library code • Dealing with native parts
Infrastructure • Jikes RVM 2.3.5 on Linux 2.4 kernel as runtime platform • Pentium 4, 3 GHz, 1G RAM, 1M L2 cache • Measured data show: • Runtime overhead • Extraction of meaningful information
Runtime overhead • Experiment setup: monitor L2 cache misses
Runtime overhead: specJBB Total cost / sample: ~ 3000 cycles
Measurements • Measure which instructions produce most events (cache misses, branch mispred) • Potential for data locality and control flow optimizations • Compare different spec-benchmarks • Find “hot spots”: instructions that produce 80% of all measured events
L1/L2 Cache misses 80% quantile = 21 instructions (N=571) 80% quantile = 13 (N=295)
L1/L2 Cache misses 80% quantile = 477 (N=8526) 80% quantile = 76 (N=2361)
L1/L2 Cache misses 80% quantile = 1296 (N=3172) 80% quantile = 153 (N=672)
Branch prediction 80% quantile = 307 (N=4193) 80% quantile = 1575 (N=7478)
Summary • Distribution of events over program differ significantly between benchmarks • Challenge: Are data precise enough to guide optimizations in a dynamic compiler?
Further work • Apply information in optimizer • Data: access path expressions p.x.y • Control flow: inlining, I-cache locality • Investigate flexible sampling interval • Further optimizations of monitoring system • Replacing expensive JNI calls • Avoid copying of samples
Concluding remarks • Precise performance event monitoring is possible with low overhead (~ 2%) • Monitoring infrastructure tied into Jikes RVM compiler • Instruction level information allows optimizations to focus on “hot spots” • Good platform to study coupling compiler decisions to hardware-specific platform properties