DYNAMO vs. ADORE A Tale of Two Dynamic Optimizers

Wei Chung Hsu (徐慰中), Computer Science Department, National Chiao Tung University (交通大學). Work done at the University of Minnesota, Twin Cities. 3/05/2010

Dynamo
• Dynamo is a dynamic optimizer
• It won the best paper award at PLDI 2000 and has been cited 612 times
• Work was started by HP Labs and the HP system lab
• MIT took over, ported it to x86, and called it DynamoRIO. This group later started a new company called Determina (since acquired by VMware)
• Considered revolutionary, since optimizations had always been performed statically (i.e., at compile time)

SPEC CINT2006 for Opteron X4
Time = CPI × Instruction count × Clock period
• Very high cache miss rates
• Ideal CPI should be 0.33
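The relation on this slide can be made concrete with a short calculation. This is only a sketch: the instruction count, clock rate, and "measured" CPI below are illustrative assumptions, not SPEC measurements. Only the ideal CPI of 0.33 comes from the slide.

```python
def exec_time(cpi: float, instructions: float, clock_hz: float) -> float:
    """Time = CPI x instruction count x clock period (clock period = 1 / clock_hz)."""
    return cpi * instructions / clock_hz


INSTRUCTIONS = 1e12   # hypothetical dynamic instruction count
CLOCK_HZ = 2.5e9      # hypothetical clock rate
IDEAL_CPI = 0.33      # from the slide
MEASURED_CPI = 1.5    # hypothetical CPI inflated by cache misses

ideal_seconds = exec_time(IDEAL_CPI, INSTRUCTIONS, CLOCK_HZ)
measured_seconds = exec_time(MEASURED_CPI, INSTRUCTIONS, CLOCK_HZ)
slowdown = measured_seconds / ideal_seconds

print(f"ideal: {ideal_seconds:.1f} s, measured: {measured_seconds:.1f} s, "
      f"slowdown: {slowdown:.2f}x")
```

With the instruction count and clock fixed, the slowdown is just the ratio of the two CPI values, which is why the lost cycles are the quantity of interest.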

Where have all the cycles gone?
• Cache misses
  • Capacity, Compulsory/Cold, Conflict, Coherence
  • I-cache and D-cache
• TLB misses
• Branch mis-predictions
  • Static and dynamic prediction
• Mis-speculation
• Pipeline stalls
  • Ineffective code scheduling, often caused by memory aliasing
These events are unpredictable and hard to deal with at compile time.

Trend of Multi-cores
[Figure: Intel Core i7 die photo]
Exploiting this potential demands thread-level parallelism

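Exploiting Thread-Level Parallelism
[Figure: timelines for sequential execution, traditional parallelization, and thread-level speculation (TLS), each involving a Store *p and a Load *q]
• Traditional parallelization: the compiler cannot prove p != q, so it gives up
• Thread-Level Speculation (TLS): run in parallel, assuming p != q
  • p != q: parallel execution succeeds
  • p == q: there is a dependence. The early load reads the stale value (20), the store writes 88, the speculation fails, and the load is re-executed to read 88
• Potentially more parallelism with speculation, but the outcome is unpredictable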
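The speculation on this slide can be sketched in a few lines. This is a toy model under stated assumptions, not a description of any real TLS hardware: `run_speculative` is a hypothetical helper, memory is a plain dictionary, and the values 20 and 88 are taken from the slide's figure.

```python
def run_speculative(memory, p, q, value):
    """Model `*p = value` (earlier epoch) racing with `load *q` (later epoch).

    Returns (loaded_value, speculation_failed).
    """
    speculative_load = memory[q]   # later epoch loads early, assuming p != q
    memory[p] = value              # earlier epoch's store commits
    if p == q:
        # Dependence violated: the early load saw stale data.
        # Squash the later epoch and re-execute its load.
        return memory[q], True
    return speculative_load, False


# Addresses 0 and 1 are hypothetical; 20 and 88 follow the slide.
no_alias = run_speculative({0: 20, 1: 5}, p=0, q=1, value=88)
alias = run_speculative({0: 20, 1: 5}, p=0, q=0, value=88)
print(no_alias, alias)
```

In the non-aliasing case the load's early result is kept and the two operations overlap; in the aliasing case the result is still correct, but the time spent on the squashed load is wasted.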

Dynamic Optimizers
• Java VM (JVM) with JIT compiler (dynamic compilation or adaptive optimization)
• Dynamic Binary Optimizers (DBO)
  • Native-to-native dynamic binary optimizers (x86 → x86, x86-32 → x86-64, IA64 → IA64)
  • Non-native dynamic binary translators (e.g. x86 → IA64, ARM → MIPS, PPC → x86; QEMU, VMware, Rosetta)

More on why dynamic binary optimization
• New architecture/micro-architecture features offer more opportunity for performance, but are not effectively exploited by legacy binaries
  • x86 P5/P6/PII/PIII, x86-32/x86-64, PA 7200/8000, …
• Software evolution and ISV behaviors reduce the effectiveness of traditional static optimizers
  • DLL, middleware, binary distribution, …
• Profile-sensitive optimizations would be more effective if performed at runtime
  • predication, speculation, branch prediction, prefetching
• Multi-core environments with dynamic resource sharing make static optimization challenging
  • shared cache, off-chip bandwidth, shared FUs

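How Dynamo Works
[Figure: flowchart of the Dynamo control loop]
• Interpret until a taken branch
• Look up the branch target in the code cache; on a hit, jump to the code cache
• On a miss, check the start-of-trace condition; if it holds, increment the counter for the branch target
• If the counter exceeds the threshold, interpret and generate code until the end-of-trace condition
• Create the trace, optimize it, and emit it into the code cache
• Other components in the figure: signal handler, code cache
Dynamo is VM-based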
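The control loop on this slide can be sketched as a toy interpreter. This is a minimal illustration under stated assumptions, not Dynamo's implementation: blocks are Python functions keyed by integer labels (so a branch is "backward" when the target label is not greater than the source), `HOT_THRESHOLD` and `MAX_TRACE_LEN` are made-up values, and a cached "trace" simply replays its blocks instead of running optimized native code.

```python
from collections import defaultdict

HOT_THRESHOLD = 50   # assumed value; the slide only says "threshold"
MAX_TRACE_LEN = 16   # assumed cap on blocks per trace


def record_trace(blocks, head, state, code_cache):
    """Interpret from `head`, recording blocks until an end-of-trace condition."""
    trace, pc = [], head
    while True:
        trace.append(pc)
        nxt = blocks[pc](state)
        # End of trace: halt, backward branch, cached head, or length cap.
        if nxt is None or nxt <= pc or nxt in code_cache or len(trace) >= MAX_TRACE_LEN:
            return trace, nxt
        pc = nxt


def run(blocks, entry, state):
    """blocks maps an integer label to a function(state) -> next label or None."""
    counters = defaultdict(int)
    code_cache = {}                      # trace head -> list of block labels
    stats = {"interpreted": 0, "cached": 0}
    pc = entry
    while pc is not None:
        if pc in code_cache:             # lookup hit: jump to the code cache
            trace = code_cache[pc]
            for i, label in enumerate(trace):
                nxt = blocks[label](state)
                stats["cached"] += 1
                on_trace = i + 1 < len(trace) and nxt == trace[i + 1]
                if not on_trace:
                    break                # trace exit, or the trace ended
            pc = nxt
            continue
        nxt = blocks[pc](state)          # interpret one block
        stats["interpreted"] += 1
        if nxt is not None and nxt <= pc and nxt not in code_cache:
            counters[nxt] += 1           # backward target: start-of-trace candidate
            if counters[nxt] > HOT_THRESHOLD:
                head = nxt
                trace, nxt = record_trace(blocks, head, state, code_cache)
                code_cache[head] = trace
                stats["interpreted"] += len(trace)
        pc = nxt
    return code_cache, stats


# Hypothetical program: sum the integers below n with a simple loop.
def b_init(s):
    s["i"], s["total"] = 0, 0
    return 1

def b_test(s):
    return 2 if s["i"] < s["n"] else 3

def b_body(s):
    s["total"] += s["i"]
    s["i"] += 1
    return 1                             # backward branch to the loop header

def b_exit(s):
    return None

blocks = {0: b_init, 1: b_test, 2: b_body, 3: b_exit}
state = {"n": 1000}
code_cache, stats = run(blocks, 0, state)
print(state["total"], code_cache, stats)
```

After the loop header becomes hot, almost all block executions are served from the code cache, which mirrors the migration of execution that Dynamo relies on to amortize its interpretation overhead.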

Trace Selection
[Figure: a control-flow graph with blocks A through I, including a call and a return, next to the selected trace as laid out in the trace cache. Trace exits lead back to the runtime, to B, and to H.]

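Backpatching
When H becomes hot, a new trace is selected starting from H, and the trace exit branch in block F is backpatched to branch to the new trace.
[Figure: the original trace starting at A and the new trace starting at H in the trace cache. Remaining exits lead back to the runtime and to B; the exit to H now goes to the new trace.]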
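Backpatching can be sketched as relinking exit stubs when a new trace enters the cache. The `Trace` and `CodeCache` classes below are hypothetical, and the block sequences and the exit from the H trace back to A are only an approximate, assumed reading of the figure. A real optimizer patches branch instructions in place rather than updating a dictionary.

```python
RUNTIME = "runtime"   # sentinel: the exit returns control to the optimizer's runtime


class Trace:
    def __init__(self, head, blocks, exit_targets):
        self.head = head
        self.blocks = blocks
        # exit target label -> where that exit branch currently jumps
        self.exits = {target: RUNTIME for target in exit_targets}


class CodeCache:
    def __init__(self):
        self.traces = {}   # head label -> Trace

    def add(self, trace):
        self.traces[trace.head] = trace
        # Backpatch: existing exits that target the new head now branch to it.
        for other in self.traces.values():
            if trace.head in other.exits:
                other.exits[trace.head] = trace
        # Link the new trace's own exits to traces that are already cached.
        for target in trace.exits:
            if target in self.traces:
                trace.exits[target] = self.traces[target]


cache = CodeCache()
trace_a = Trace("A", ["A", "C", "D", "F", "G", "I", "E"], exit_targets=["B", "H"])
cache.add(trace_a)
before = trace_a.exits["H"]              # still goes back to the runtime

trace_h = Trace("H", ["H", "I", "E"], exit_targets=["A"])   # H has become hot
cache.add(trace_h)
after = trace_a.exits["H"]               # now branches directly to the H trace
print(before, after.head)
```

Once the exit is patched, control moves from one trace to the other without re-entering the runtime, which is the point of the technique.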

Execution Migrates to Code Cache
[Figure: a.out is first run by the interpreter/emulator; the trace selector and the optimizer place hot traces in the code cache, and execution gradually moves there.]

Trace-Based Optimizations
• Full and partial redundancy elimination
• Dead code elimination
• Trace scheduling
• Instruction cache locality improvement
• Dynamic procedure inlining (or procedure outlining)
• Some loop-based optimizations

Summary of Dynamo
• Dynamic binary optimization customizes performance delivery:
  • Code is optimized according to how it is used
    • Dynamic trace formation and trace-based optimizations
  • Code is optimized for the machine it runs on
  • Code is optimized when all executables are available
  • Only the parts of the code that really matter should be optimized

ADORE
• ADORE stands for ADaptive Object code RE-optimization
• It was developed at the CSE department, University of Minnesota, Twin Cities
• It applies a very different model for dynamic optimization systems
• Considered evolutionary; cited 61 times

Dynamic Binary Optimizer Models
[Figure: two system stacks, each showing Application Binaries, DBO, Operating System, and Hardware Platform]
First model:
• Translate only hot execution paths and keep them in the code cache
• Lower overhead
• ADORE (IA64, SPARC)
• COBRA (IA64; x86 ongoing)
Second model:
• Translate most execution paths and keep them in the code cache
• Easy to maintain control
• Dynamo (PA-RISC)
• DynamoRIO (x86)

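ADORE Framework
[Figure: block diagram of the ADORE framework]
• Main thread: runs the application; hot code is redirected to optimized traces in the code cache
• Dynamic optimization thread: phase detection, then trace selection (on phase change), then optimization (traces are passed to the optimizer), then deployment (traces are patched in)
• Kernel: initializes the PMU and raises an interrupt on kernel-buffer (K-buffer) overflow
• Hardware Performance Monitoring Unit (PMU): raises an interrupt on each sampled event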
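To illustrate where phase detection sits in this pipeline, here is a minimal sketch of a detector over windows of sampled program-counter values. The stability test (a low spread of recent window centroids), the `history` and `tolerance` parameters, and the sample addresses are all assumptions chosen for illustration; they are not taken from the slides and should not be read as ADORE's actual algorithm.

```python
from statistics import mean, pstdev


def detect_phases(windows, history=4, tolerance=0.01):
    """Return (window_index, centroid) for each newly detected stable phase.

    windows: iterable of lists of sampled PC values, one list per full buffer.
    A phase is called stable when the last `history` window centroids vary
    by at most `tolerance` relative to their mean.
    """
    centroids = []
    events = []
    current = None                     # centroid of the current stable phase
    for i, samples in enumerate(windows):
        centroids.append(mean(samples))
        recent = centroids[-history:]
        if len(recent) < history:
            continue                   # not enough history yet
        center = mean(recent)
        stable = pstdev(recent) <= tolerance * center
        moved = current is None or abs(center - current) > tolerance * current
        if stable and moved:
            current = center
            events.append((i, center)) # would trigger trace selection here
    return events


# Hypothetical samples: six buffers in one hot region, then six in another.
windows = [[0x4000, 0x4010, 0x4020]] * 6 + [[0x9000, 0x9010, 0x9020]] * 6
events = detect_phases(windows)
print(events)
```

Each reported event corresponds to the "on phase change" arrow in the figure: only then would trace selection and optimization run, which keeps the optimizer idle while the program's behavior is unstable or unchanged.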

Thread Level View
[Figure: timeline of Thread 1 (the application) and Thread 2 (ADORE). After ADORE is initialized, it sleeps. The K-buffer overflow handler fills the user buffer; each time the user buffer is full, ADORE is invoked, and then it sleeps again.]
The buffer is maintained for one main event, which is usually CPU_CYCLES.

Perf. of ADORE/Itanium on SPEC2000

Performance on BLAST

ADORE vs. Dynamo

ADORE on Multi-cores
• The COBRA (Continuous Object code Re-Adaptation) framework is a follow-up project, implemented on Itanium Montecito and on new x86 multi-core machines
• ADORE on SPARC Panther (UltraSPARC IV+) multi-core machines
• ADORE for TLS tuning

COBRA Framework
• Optimization thread (centralized control): initialization, trace selection, trace optimization, trace patching
• Monitor threads (localized control): per-thread profile

Startup of a 4-thread OpenMP Program

Prefetch vs. NoPrefetch
The prefetch version, when running with 4 threads, suffers significantly from L2_OZQ_FULL stalls.
[Figure labels: 26%, 34%]

Prefetch vs. Prefetch with .excl
.excl hint: prefetch a cache line in the exclusive state instead of the shared state (for an invalidation-based cache coherence protocol).
[Figure labels: 15%, 12%]

Execution time on 4-way SMP
• noprefetch: up to 15%, average 4.7% speedup
• prefetch.excl: up to 8%, average 2.7% speedup

Execution time on cc-NUMA
• noprefetch: up to 68%, average 17.5% speedup
• prefetch.excl: up to 18%, average 8.5% speedup

Summary of Results from COBRA
We showed that coherence misses caused by aggressive prefetching can limit the scalability of multithreaded programs on scalable shared-memory multiprocessors. Guided by runtime profiles, we experimented with two optimizations.
• Reducing the aggressiveness of prefetching
  • Up to 15%, average 4.7% speedup on 4-way SMP
  • Up to 68%, average 17.5% speedup on SGI Altix cc-NUMA
• Using the exclusive hint for prefetch
  • Up to 8%, average 2.7% speedup on 4-way SMP
  • Up to 18%, average 8.5% speedup on SGI Altix cc-NUMA

ADORE/SPARC
ADORE has run on the SPARC/Solaris platform since 2005.
Some porting issues:
• ADORE uses the libcpc interface on Solaris to conduct runtime profiling
• A kernel buffer enhancement was added to Solaris 10.0 to reduce profiling and phase detection overhead
• Reachability is a real problem (e.g. Oracle, Dyna3D)
• The lack of a branch trace buffer is painful (e.g. BLAST)

Performance of In-Thread Opt. (USIII+)

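Helper Thread Prefetching for Multi-Core
[Figure: timeline across two cores]
• First core (main thread): an L2 cache miss acts as the trigger that activates the helper thread (about 65 cycles of delay); later cache misses are avoided
• Second core (helper thread): spins while waiting, initiates prefetches once triggered, then spins again waiting for the next trigger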

Performance of Dynamic Helper Thread (on Sun UltraSPARC IV+)

Evaluation Environment for TLS
[Figure: four processors (P) with caches (C) connected by an interconnect]
Benchmarks
• SPEC2000 programs written in C, compiled with -O3 optimization
Underlying architecture
• 4-core chip-multiprocessor (CMP)
• Speculation supported by coherence
Simulator
• Superscalar with a detailed memory model
• Simulates communication latency
• Models bandwidth and contention
Detailed, cycle-accurate simulation

Dynamic Tuning for TLS
[Figure: chart with speedup labels 1.37x, 1.23x, and 1.17x; legend entries: Parallel Code Overhead]

Summary of ADORE
• ADORE uses Hardware Performance Monitoring (HPM) capability to implement a lightweight runtime profiling system. Efficient profiling and phase detection are the key to the success of dynamic native binary optimizers.
• ADORE can speed up large real-world applications that were already optimized by production compilers.
• ADORE works on two architectures: Itanium and SPARC. COBRA is a follow-up system to ADORE; it works on Itanium and x86.
• ADORE/COBRA can also optimize for multi-cores.
• ADORE has recently been applied to dynamic TLS tuning.

Conclusion
“It was the best of times, it was the worst of times…” (opening line of “A Tale of Two Cities”)
• Best of times for research: new areas where innovations are needed
• Worst of times for research: saturated areas where technologies are mature or well understood, and it is hard to innovate, …
