DYNAMO vs. ADORE: A Tale of Two Dynamic Optimizers Wei Chung Hsu 徐慰中 Computer Science Department 交通大學 (work was done at the University of Minnesota, Twin Cities) 3/05/2010
Dynamo • Dynamo is a dynamic optimizer • It won the best paper award at PLDI 2000 and has been cited 612 times • Work was started by the HP lab and the HP system lab • MIT took over, ported it to x86, and called it DynamoRIO. This group later started a new company called Determina (now acquired by VMware) • Considered revolutionary, since optimizations had always been performed statically (i.e. at compile time)
SPEC CINT2006 for Opteron X4 Time = CPI x Inst x Clock period Very high cache miss rates Ideal CPI should be 0.33
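The formula above can be made concrete with a minimal sketch. The function names and the sample numbers used in the usage note are illustrative assumptions; only the formula and the ideal CPI of 0.33 come from the slide.

```python
IDEAL_CPI = 0.33  # ideal CPI quoted on the slide


def cpu_time_seconds(cpi, instruction_count, clock_hz):
    """Time = CPI x Inst x Clock period, where clock period = 1 / clock_hz."""
    return cpi * instruction_count * (1.0 / clock_hz)


def slowdown_vs_ideal(measured_cpi):
    """How many times slower a program runs than it would at the ideal CPI."""
    return measured_cpi / IDEAL_CPI
```

For example, a hypothetical run of 2 billion instructions at a CPI of 1.0 on a 2 GHz clock takes about one second, roughly three times longer than the same run at the ideal CPI.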
Where have all the cycles gone? • Cache misses • Capacity, Compulsory/Cold, Conflict, Coherence • I-cache and D-cache • TLB misses • Branch mispredictions • Static and dynamic prediction • Mis-speculation • Pipeline stalls • Ineffective code scheduling, often caused by memory aliasing These events are unpredictable and hard to deal with at compile time
Trend of Multi-cores [Figure: Intel Core i7 die photo] Exploiting this potential demands thread-level parallelism
Exploiting Thread-Level Parallelism [Figure: three timelines for a Store *p and a Load *q. Sequential: the two accesses run one after the other. Traditional parallelization: the compiler cannot prove p != q, so it gives up. Thread-Level Speculation (TLS): when p != q (Load 20, Store 88), the accesses execute in parallel; when p == q (Load 88, Store 88), the dependence causes a speculation failure and the load is executed again.] Potentially more parallelism with speculation, but unpredictable
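A minimal sketch of the speculation idea, under stated assumptions: the function name is hypothetical, memory is modeled as a Python dict, and the figure's 20 and 88 are read as addresses (an interpretation, not something the slide states). Real TLS detects the conflict in hardware through the coherence protocol; here the check is a plain comparison.

```python
def speculative_load(memory, p, q, value):
    """Model two epochs: epoch 1 stores `value` to address p, epoch 2 loads from q.

    Epoch 2 runs speculatively, so its load happens before epoch 1's store.
    Returns (loaded_value, speculation_failed).
    """
    speculative = memory[q]   # epoch 2 loads early, assuming p != q
    memory[p] = value         # epoch 1's store is performed afterwards
    if p == q:
        # Dependence violated: squash epoch 2 and execute the load again.
        return memory[q], True
    return speculative, False
```

With p = 88 and q = 20 the speculative result is kept; with p = q = 88 the load is repeated and sees the stored value.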
Dynamic Optimizers • Dynamic Binary Optimizers (DBO) • Java VM (JVM) with JIT compiler (dynamic compilation or adaptive optimization) • Native-to-native dynamic binary optimizers (x86 → x86, x86-32 → x86-64, IA64 → IA64) • Non-native dynamic binary translators (e.g. x86 → IA64, ARM → MIPS, PPC → x86, QEMU, VMware, Rosetta)
More on why dynamic binary optimization • New architecture/micro-architecture features offer more opportunity for performance, but are not effectively exploited by legacy binaries (x86 P5/P6/PII/PIII, x86-32/x86-64, PA 7200/8000, …) • Software evolution and ISV behaviors reduce the effectiveness of traditional static optimizers (DLL, middleware, binary distribution, …) • Profile-sensitive optimizations would be more effective if performed at runtime (predication, speculation, branch prediction, prefetching) • Multi-core environments with dynamic resource sharing make static optimization challenging (shared cache, off-chip bandwidth, shared FUs)
How Dynamo Works [Flowchart: interpret until a taken branch; look up the branch target; on a hit, jump to the code cache; otherwise, if the start-of-trace condition holds, increment the counter for the branch target; once the counter exceeds the threshold, interpret and generate code until the end-of-trace condition, then create the trace, optimize it, and emit it into the code cache. A signal handler is attached to the code cache.] Dynamo is VM based
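The flow above can be sketched as a toy interpreter loop. This is only an illustration under several assumptions that are not on the slide: a program is a dict of hypothetical block functions that return the next label, the start-of-trace condition is taken to be "target of a taken backward branch", the end-of-trace condition is "next taken backward branch or program end", and `HOT_THRESHOLD` is an arbitrary demo value. No optimization is applied to the recorded trace.

```python
from collections import Counter

HOT_THRESHOLD = 3  # hypothetical value, chosen only for the demo


def run(program, entry, state):
    """Run `program` (dict: label -> function(state) -> next label or None)."""
    order = {label: i for i, label in enumerate(program)}  # layout order
    counters = Counter()  # counts for candidate trace heads
    code_cache = {}       # trace head -> list of block labels
    stats = Counter()
    recording = None      # trace currently being recorded, or None
    pc = entry
    while pc is not None:
        trace = code_cache.get(pc)
        if trace is not None and recording is None:
            # Jump to the code cache: stay on the trace until control leaves it.
            for i, label in enumerate(trace):
                nxt = program[label](state)
                stats["cached"] += 1
                expected = trace[i + 1] if i + 1 < len(trace) else None
                if nxt != expected:
                    break  # trace exit
            pc = nxt
            continue
        nxt = program[pc](state)  # interpret one block
        stats["interpreted"] += 1
        if recording is not None:
            recording.append(pc)
        backward = nxt is not None and order[nxt] <= order[pc]
        if recording is not None and (nxt is None or backward):
            code_cache[recording[0]] = recording  # end of trace: emit into cache
            recording = None
        elif recording is None and backward:      # start-of-trace condition
            counters[nxt] += 1
            if counters[nxt] > HOT_THRESHOLD and nxt not in code_cache:
                recording = []  # record the blocks interpreted from here on
        pc = nxt
    return stats, code_cache


def loop_head(s):
    return "body" if s["i"] < s["n"] else "done"


def loop_body(s):
    s["total"] += s["i"]
    s["i"] += 1
    return "head"


def loop_done(s):
    return None


DEMO_PROGRAM = {"head": loop_head, "body": loop_body, "done": loop_done}
```

Running `run(DEMO_PROGRAM, "head", {"i": 0, "n": 100, "total": 0})` interprets the first few iterations, records the head/body loop as a trace, and executes the remaining iterations from the code cache.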
Trace Selection [Figure: a control-flow graph of blocks A to I, including a call and a return, shown next to the selected trace as laid out in the trace cache, with trace exits leading back to the runtime, to B, and to H.]
Backpatching When H becomes hot, a new trace is selected starting from H, and the trace exit branch in block F is backpatched to branch to the new trace. [Figure: the original trace and the new trace starting at H in the trace cache, with exits leading back to the runtime, to B, and to H.]
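Backpatching can be sketched as follows. This is a minimal model, not Dynamo's implementation: the `Trace` class, the `install_trace` function, and the exit states "runtime" and "linked" are hypothetical names, and the exit targets in the usage note are made up for illustration.

```python
class Trace:
    def __init__(self, head, exit_targets):
        self.head = head
        # Every exit initially returns control to the runtime system.
        self.exits = {target: "runtime" for target in exit_targets}


def install_trace(code_cache, trace):
    """Add `trace` to the cache and backpatch exits that can now be linked.

    Returns the (trace head, exit target) pairs that were patched.
    """
    code_cache[trace.head] = trace
    patched = []
    for cached in code_cache.values():
        for target, kind in cached.exits.items():
            if kind == "runtime" and target in code_cache:
                cached.exits[target] = "linked"  # branch directly to the trace
                patched.append((cached.head, target))
    return patched
```

Installing a trace headed by A with exits to B and H patches nothing; installing a later trace headed by H patches A's exit to H, so execution no longer returns to the runtime on that path.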
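Execution Migrates to Code Cache [Figure: a.out is run by the interpreter/emulator; the trace selector and the optimizer place traces in the code cache, and execution gradually moves from the interpreter to the code cache.]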
Trace Based Optimizations • Full and partial redundancy elimination • Dead code elimination • Trace scheduling • Instruction cache locality improvement • Dynamic procedure inlining (or procedure outlining) • Some loop based optimizations
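As one illustration of the optimizations listed above, here is a hedged sketch of dead code elimination on a straight-line trace. The instruction encoding `(dest, op, sources)` and the function name are assumptions made for this example, and side exits are ignored; a real trace optimizer must also treat values that are live at each trace exit as live.

```python
def eliminate_dead_code(trace, live_out):
    """Remove instructions whose results are never used.

    trace: list of (dest, op, sources) tuples in execution order.
    live_out: names that are still needed after the trace ends.
    Stores are always kept because they have a side effect.
    """
    live = set(live_out)
    kept = []
    for dest, op, sources in reversed(trace):  # backward liveness scan
        if op == "store" or dest in live:
            live.discard(dest)      # this instruction defines dest
            live.update(sources)    # and needs its sources
            kept.append((dest, op, sources))
    kept.reverse()
    return kept
```

In a trace where `t1` is computed, overwritten, and then stored, and `t2` is computed but never read, only the second definition of `t1` and the store survive.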
Summary of Dynamo • Dynamic Binary Optimization customizes performance delivery: • Code is optimized according to how it is used • Dynamic trace formation and trace-based optimizations • Code is optimized for the machine it runs on • Code is optimized when all executables are available • Only the parts of the code that really matter should be optimized
ADORE • ADORE stands for ADaptive Object code RE-optimization • Developed at the CSE department, U. of Minnesota, Twin Cities • Applies a very different model for dynamic optimization systems • Considered evolutionary, cited 61 times
Dynamic Binary Optimizer’s Models Application Binaries Application Binaries DBO DBO Operating System Operating System Hardware Platform Hardware Platform • Translate only hot execution paths and keep in code cache • Lower overhead • ADORE (IA64, SPARC) • COBRA (IA64, x86 – ongoing) • Translate most execution paths and keep in code cache • Easy to maintain control • Dynamo (PA-RISC) • DynamoRIO (x86)
ADORE Framework [Figure: ADORE framework. Labels: Main Thread; Code Cache (initialized at startup, holds Optimized Traces); Dynamic Optimization Thread with Phase Detection, Trace Selection (on phase change), Optimization (traces are passed to the optimizer), and Deployment (patch traces); Kernel (initializes the PMU, interrupts on kernel-buffer overflow); Hardware Performance Monitoring Unit (PMU, interrupts on event).]
Thread Level View [Figure: timelines for Thread 1 (application) and Thread 2 (ADORE). After initialization ADORE sleeps; the kernel-buffer overflow handler fills the user buffer, ADORE is invoked each time the user buffer is full, and then it sleeps again.] The user-buffer-full event is maintained for one main event, usually CPU_CYCLES
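The "wake up when the user buffer is full, then look for a phase change" cycle can be sketched as below. The slides do not say how phases are detected, so the criterion here (the mean of the sampled program counters staying within a tolerance over several consecutive buffers) is an assumption for illustration, as are the class name, method name, and default parameters.

```python
from statistics import mean


class PhaseDetector:
    def __init__(self, tolerance=0.05, stable_windows=3):
        self.tolerance = tolerance            # allowed relative drift
        self.stable_windows = stable_windows  # buffers needed to call it stable
        self.history = []                     # one centroid per full buffer
        self.in_stable_phase = False

    def on_buffer_full(self, sampled_pcs):
        """Process one full sample buffer; return True on a new stable phase."""
        self.history.append(mean(sampled_pcs))
        recent = self.history[-self.stable_windows:]
        if len(recent) < self.stable_windows:
            return False
        center = mean(recent)
        stable = all(abs(c - center) <= self.tolerance * center for c in recent)
        new_phase = stable and not self.in_stable_phase
        self.in_stable_phase = stable
        return new_phase
```

In this sketch, trace selection and optimization would be triggered only when `on_buffer_full` returns True, so the optimizer thread does no work while the program stays in an already optimized phase.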
ADORE on Multi-cores • The COBRA (Continuous Object code Re-Adaptation) framework is a follow-up project, implemented on Itanium Montecito and new x86 multi-core machines. • ADORE runs on SPARC Panther (UltraSPARC IV+) multi-core machines. • ADORE is used for TLS tuning
COBRA Framework [Figure: an Optimization Thread provides centralized control (Initialization, Trace Selection, Trace Optimization, Trace Patching); Monitor Threads provide localized control with a per-thread profile.]
Prefetch vs. NoPrefetch The prefetch version, when running with 4 threads, suffers significantly from L2_OZQ_FULL stalls. 26% 34%
Prefetch vs. Prefetch with .excl .excl hint: prefetch a cache line in exclusive state instead of shared state. (Invalidation based cache coherence protocol) 15% 12%
Execution time on 4-way SMP • noprefetch: up to 15%, average 4.7% speedup • prefetch.excl: up to 8%, average 2.7% speedup
Execution time on cc-NUMA • noprefetch: up to 68%, average 17.5% speedup • prefetch.excl: up to 18%, average 8.5% speedup
Summary of Results from COBRA We showed that coherence misses caused by aggressive prefetching can limit the scalability of multithreaded programs on scalable shared-memory multiprocessors. Guided by runtime profiles, we experimented with two optimizations. • Reducing the aggressiveness of prefetching: up to 15%, average 4.7% speedup on 4-way SMP; up to 68%, average 17.5% speedup on SGI Altix cc-NUMA • Using the exclusive hint for prefetch: up to 8%, average 2.7% speedup on 4-way SMP; up to 18%, average 8.5% speedup on SGI Altix cc-NUMA
ADORE/SPARC ADORE has been available on the SPARC/Solaris platform since 2005. Some porting issues: • ADORE uses the libcpc interface on Solaris to conduct runtime profiling. • A kernel buffer enhancement was added to Solaris 10.0 to reduce profiling and phase detection overhead. • Reachability is a real problem (e.g. Oracle, Dyna3D). • The lack of a branch trace buffer is painful (e.g. Blast).
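Helper Thread Prefetching for Multi-Core [Figure: timeline across two cores. First core: an L2 cache miss in the main thread acts as the trigger to activate the helper thread (about 65 cycles delay), and later cache misses are avoided. Second core: the helper thread spins while waiting, initiates prefetches when triggered, then spins again waiting for the next trigger.]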
Evaluation Environment for TLS Benchmarks • SPEC2000 written in C, -O3 optimization Underlying architecture • 4-core chip-multiprocessor (CMP) • speculation supported by coherence Simulator • Superscalar with detailed memory model • simulates communication latency • models bandwidth and contention Detailed, cycle-accurate simulation [Figure: four processors (P), each with a cache (C), connected through an interconnect.]
Dynamic Tuning for TLS [Figure: bar chart with labeled values 1.37x, 1.23x, and 1.17x; labels: Parallel Code Overhead.]
Summary of ADORE • ADORE uses Hardware Performance Monitoring (HPM) capability to implement a lightweight runtime profiling system. Efficient profiling and phase detection are the key to the success of dynamic native binary optimizers. • ADORE can speed up real-world large applications optimized by production compilers. • ADORE works on two architectures: Itanium and SPARC. COBRA is a follow-up system to ADORE. It works on Itanium and x86. • ADORE/COBRA can also optimize for multi-cores. • ADORE has recently been applied to dynamic TLS tuning.
Conclusion “It was the best of times, it was the worst of times…” -- opening line of “A Tale of Two Cities” • Best of times for research: new areas where innovations are needed • Worst of times for research: saturated areas where technologies have matured or are well understood, and it is hard to innovate, …