DYNAMO vs. ADORE: A Tale of Two Dynamic Optimizers. Wei Chung Hsu, Computer Science and Engineering Department, University of Minnesota, Twin Cities.
Dynamic Binary Optimization
• The detection of program hot spots and the application of optimizations to native binary code at run time. Also called runtime binary optimization.
• Why is static compiler optimization insufficient?
Why Dynamic Binary Optimization
• One size does not fit all: the runtime environment may differ from the one the static binary was optimized for.
• Underlying micro-architectures, e.g. running Pentium code on a Pentium-II
• Input data sets, e.g. some data sets may not incur cache misses
• Dynamic phase behavior
• Dynamic libraries
Portable Executable [Diagram: an .EXE or .SO file carrying compile info and an intermediate representation]
Common Binary (fat binary) [Diagram: Itanium-I, Itanium-2 and Itanium-3 binaries, each with its own annotation]
Chubby Binary [Diagram: a common Itanium binary plus Itanium-I specific, Itanium-2 specific and Itanium-3 specific parts, each with its own annotation]
Using More Accurate Profiles [Diagram labels, in order: optimize from source @ ISV sites; optimize from source with profile feedback; optimize from binary with profile feedback; walk-time (or ahead-of-time) optimization @ user sites; runtime optimization]
Dynamo
• Dynamo means “Dynamic Optimization System”.
• A collaborative project between HP Labs (under Josh Fisher) and HP System Lab.
• Built on the dynamic translation technology developed for ARIES (which migrates PA binaries to the Itanium architecture).
• Considered revolutionary; won the best paper award at PLDI’2000.
• Dynamo technology was enhanced and continued by MIT and later became Dynamo/RIO.
• The Dynamo/RIO group has since started a company called Determina (http://www.determina.com/).
Migration vs. Dynamic Optimization [Diagram, two columns. Migration (e.g. Aries): an existing incompatible binary is run through an emulator/interpreter, with a translator and a code cache in memory. DynOpt (e.g. Dynamo): a native binary is run through an emulator/interpreter, with a trace selector, an optimizer and a dynamic code cache in memory. Annotations on the second build, in order: "Optional Accelerator", "Optimization is 2nd priority", "Optional Optimizer", "Optimization is critical"]
Why not Static Binary Translation?
• The Code-Discovery Problem
  • What is the target of an indirect jump?
  • No guarantee that the locations immediately following a jump contain valid instructions
  • Some compilers intersperse data with instructions
  • More challenging for ISAs with variable-length instructions
  • Padding to align instructions
• The Code-Location Problem
  • How to translate indirect jumps? The target is not known until runtime.
• Other problems
  • Self-modifying code
  • Self-referencing code
  • Precise traps
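How Dynamo Works [Flowchart: interpret until a taken branch; look up the branch target in the code cache; on a hit, jump to the code cache; on a miss, test the start-of-trace condition; if it holds, increment the counter for the branch target; once the counter exceeds the threshold, interpret and generate code until an end-of-trace condition is met, then create the trace, optimize it and emit it into the cache. A signal handler is attached to the code cache.]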
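The dispatch loop in the flowchart above can be sketched roughly as follows. This is an illustrative model, not Dynamo's implementation: the function name `dispatch`, the threshold value and the rule "a backward taken branch is a start-of-trace candidate" are assumptions made for the example, and trace recording itself is reduced to adding the target to a set.

```python
from collections import defaultdict

HOT_THRESHOLD = 50  # assumed value; the slides do not give one


def dispatch(branches, threshold=HOT_THRESHOLD):
    """Model the interpret/count/promote decision for each taken branch.

    `branches` is a sequence of (source_address, target_address) pairs.
    Returns one action per branch: "interpret", "create_trace" or "code_cache".
    """
    counters = defaultdict(int)   # per-target execution counters
    code_cache = set()            # targets that already have a trace
    actions = []
    for source, target in branches:
        if target in code_cache:
            # Lookup hit: control jumps into the code cache.
            actions.append("code_cache")
        elif target <= source:
            # Assumed start-of-trace condition: target of a backward branch.
            counters[target] += 1
            if counters[target] > threshold:
                # Hot: record a trace, optimize it, emit it into the cache.
                code_cache.add(target)
                actions.append("create_trace")
            else:
                actions.append("interpret")
        else:
            actions.append("interpret")
    return actions
```

With a threshold of 3, a loop back-edge is interpreted three times, promoted on the fourth execution, and served from the code cache afterwards.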
Trace Selection [Diagram: control-flow graph of blocks A to I in the original layout, with a call in D and a return in I]
Trace Selection [Diagram: trace layout in the trace cache: A, C, D, F, G, I, E placed contiguously, with exits back to the runtime, to B and to H]
Flow of Control on Translated Traces [Diagram: each translated trace ends in stubs that return control to the emulation manager, which then dispatches the next translated trace. Annotation: high overhead]
Translation Linking [Diagram: translated traces, their stubs and the emulation manager, with the traces linked to one another]
Backpatching/Trace Linking
• When H becomes hot, a new trace is selected starting from H, and the trace exit branch in block F is backpatched to branch to the new trace.
[Diagram: original trace A, C, D, F, G, I, E with exits back to the runtime, to B and to H; new trace H, I, E]
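A minimal sketch of backpatching, assuming a code cache keyed by the original entry address of each trace. The classes `Trace` and `CodeCache` and the `"runtime"` marker are hypothetical names for this example; a real system patches branch instructions in place rather than updating a dictionary.

```python
class Trace:
    def __init__(self, entry, exit_targets):
        self.entry = entry
        # Each exit initially goes through a stub back to the runtime.
        self.exits = {target: "runtime" for target in exit_targets}


class CodeCache:
    def __init__(self):
        self.traces = {}

    def add(self, trace):
        self.traces[trace.entry] = trace
        # Link the new trace's exits to traces that already exist.
        for target in trace.exits:
            if target in self.traces:
                trace.exits[target] = self.traces[target]
        # Backpatch exits of existing traces that target the new trace.
        for other in self.traces.values():
            if trace.entry in other.exits:
                other.exits[trace.entry] = trace


cache = CodeCache()
first = Trace("A", ["B", "H"])   # exits to B and H, as in the slide
cache.add(first)
hot_h = Trace("H", ["A"])        # H becomes hot later
cache.add(hot_h)
```

After the second `add`, the exit to H in the first trace points at the new trace, while the exit to B still returns to the runtime.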
Importance of Trace Linking [Chart: performance slowdown when linking is disabled]
• Not a small trick
Execution Migrates to Code Cache [Diagram: a.out, the interpreter/emulator, the trace selector, the optimizer, the emulation manager and the code cache, with numbered steps showing execution moving into the code cache]
Handle Indirect Branches
• Variable targets: cannot be linked
• Must map addresses in the original program to addresses in the code cache
• Hash table lookup
• Compare the dynamic target with a predicted target
    cmp real_target, predicted_target
    je predicted_target
    jmp hashtable_lookup
Handle Indirect Branches (cont.)
• Compare with a small number of predicted targets.
    cmp real_target, hot_target_1
    je hot_target_1
    cmp real_target, hot_target_2
    je hot_target_2
    call prof_routine
    jmp hashtable_lookup
• A software-based indirect-branch-target cache avoids going back to the emulation manager.
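The compare-then-lookup sequence above can be modelled in a few lines. This is a sketch under assumptions: `make_indirect_branch`, the `stats` counters and the use of a Python dictionary for the address map are illustrative, and the profiling call (`prof_routine`) is reduced to a counter.

```python
def make_indirect_branch(hot_targets, address_map):
    """Build a dispatcher for one indirect branch site.

    hot_targets: a short list of predicted original-program targets.
    address_map: original address -> code cache address (the hash table).
    """
    stats = {"inline_hits": 0, "table_lookups": 0}

    def dispatch(real_target):
        # cmp real_target, hot_target_k / je hot_target_k
        for hot in hot_targets:
            if real_target == hot:
                stats["inline_hits"] += 1
                return address_map[hot]
        # jmp hashtable_lookup (a miss here returns None, meaning
        # control would go back to the emulation manager)
        stats["table_lookups"] += 1
        return address_map.get(real_target)

    return dispatch, stats


address_map = {0x400: 0x9000, 0x480: 0x9100, 0x500: 0x9200}
dispatch, stats = make_indirect_branch([0x400, 0x480], address_map)
```

Keeping the inline comparison list short matters: every extra compare is paid on each execution of the branch, so only the hottest targets are worth checking before the table lookup.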
Performance • Trace formation – Partial procedure inline & code layout • Slowdown • Major slowdowns were avoided by early bail-out
Summary of Dynamo
• Dynamic binary optimization customizes performance delivery:
• Code is optimized according to how it is used
• Code is optimized for the machine it runs on
• Code is optimized when all executables are available
• Only the part of the code that matters is optimized
Dynamo Follow-ups
• Dynamo/RIO: Dynamo + RIO (Runtime Introspection and Optimization) for the x86 architecture
• More successful in “Introspection” than in “Optimization”.
• Started the company Determina for system security enforcement
• Similar technology can be applied to migration, fast simulation, dynamic instrumentation, program introspection, security enforcement, power management, etc.
What Happened to “Optimization”
Dynamo faces the following challenges:
• Profiling issues
  • Frequency based, not time based
  • Hard to detect really hot code; may end up with too much translation
• Code duplication issues
  • Trace generation could end up with excessive code duplication
• Code cache management issues
  • Real applications require a very large code cache
• Indirect branch handling issues
  • Indirect branch handling is expensive
ADORE
• ADORE means ADaptive Object code RE-optimization
• Developed at the CSE department, U. of Minnesota
• Applies a different model for dynamic optimization systems (after rethinking dynamic optimization)
• Considered evolutionary
ADORE Model [Diagram: the executable, the code cache and the DynOpt manager, connected by a branch/jump instruction]
ADORE Rationale
• If the executable is compatible, why should we use interpretation/emulation?
• Instrumentation-based or interpretation-based profiling does not collect important performance events. Why not use hardware performance monitoring (HPM)?
• If a program runs well, why bother to translate hot code?
• Redirection of execution can be implemented more effectively using branches.
ADORE Framework [Diagram: the kernel initializes the hardware performance monitoring unit (PMU); the PMU interrupts on events and the kernel interrupts on K-buffer overflow; alongside the main thread, the dynamic optimization thread performs phase detection, trace selection on a phase change, optimization of the traces passed to it, and deployment, which initializes the code cache, stores optimized traces in it and patches traces]
Phase Detection
• Keep a history of average PC values (M1 to M5 in the figure).
• Compute the average (E) and standard deviation (D) of the PC values in the history buffer.
• The band of tolerance is from E-D to E+D. If Mk is outside the band, a phase change is triggered.
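The band-of-tolerance test can be written down directly. In this sketch the function name `phase_changed` is hypothetical, and using the population standard deviation (`pstdev`) is an assumption, since the slides do not say which estimator is used.

```python
from statistics import mean, pstdev


def phase_changed(history, new_value):
    """Return True if new_value (Mk) falls outside the band [E-D, E+D].

    history: recent average-PC values from the history buffer.
    """
    e = mean(history)     # E: average of the history buffer
    d = pstdev(history)   # D: standard deviation (population form assumed)
    return not (e - d <= new_value <= e + d)


history = [100, 102, 98, 101, 99]   # made-up average PC values
stable = phase_changed(history, 100.5)
shifted = phase_changed(history, 500)
```

Here `stable` is `False` because 100.5 lies within one standard deviation of the mean of 100, while `shifted` is `True` because 500 lies far outside the band.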
Trace Selection
• A trace is a single-entry, multiple-exit code sequence (e.g. a superblock).
• Trace selection is guided by the path profile constructed from the branch trace samples (BTB samples).
• Traces can be stitched together to form longer traces.
• Trace end conditions: a procedure return, a backward branch that forms a loop, a branch that is not highly biased, or a trace size that exceeds a preset threshold.
• Function calls are considered fall-through.
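The end conditions listed above can be sketched as a walk over a small control-flow graph. Everything concrete here is an assumption for illustration: the dictionary layout of `cfg`, the bias threshold, the size limit, and treating "the next block is already in the trace" as the loop-forming backward branch.

```python
BIAS_THRESHOLD = 0.9    # assumed: minimum probability for a "highly biased" branch
MAX_TRACE_BLOCKS = 8    # assumed: preset trace size limit, counted in blocks


def select_trace(start, cfg, bias=BIAS_THRESHOLD, max_blocks=MAX_TRACE_BLOCKS):
    """Grow a trace from `start`; return (blocks, reason_for_stopping)."""
    trace = []
    current = start
    while current is not None:
        trace.append(current)
        if len(trace) >= max_blocks:
            return trace, "size limit"
        block = cfg[current]
        if block["kind"] == "return":
            return trace, "procedure return"
        if block["kind"] in ("call", "fall"):
            current = block.get("next")   # calls are treated as fall-through
            continue
        # Conditional branch: follow it only if the profile is highly biased.
        p = block["taken_prob"]
        if max(p, 1 - p) < bias:
            return trace, "unbiased branch"
        successor = block["taken"] if p >= bias else block.get("next")
        if successor in trace:
            return trace, "loop back edge"
        current = successor
    return trace, "end of code"


cfg = {
    "A": {"kind": "branch", "taken": "C", "next": "B", "taken_prob": 0.95},
    "B": {"kind": "return"},
    "C": {"kind": "call", "next": "D"},
    "D": {"kind": "branch", "taken": "A", "next": "E", "taken_prob": 0.99},
    "E": {"kind": "return"},
}
blocks, reason = select_trace("A", cfg)
```

In this made-up graph the trace grows through A, C and D, then stops because the branch in D goes back to A, which is already in the trace.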
Runtime D-Cache Pre-fetching
• Locate the most recent delinquent loads.
• If the load instruction is in a loop-type trace, determine the reference pattern via address dependence analysis.
• Calculate the stride if the reference has spatial or structural locality.
• If the reference is pointer-chasing, insert code to detect possible strides at runtime.
• Insert and schedule pre-fetch instructions.
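As an illustration of the stride step, the sketch below derives a dominant stride from a sequence of observed load addresses and computes a prefetch address a few iterations ahead. It is closest in spirit to runtime stride detection; the function names, the 75% agreement threshold and the prefetch distance of 4 iterations are all assumptions, not values from ADORE.

```python
from collections import Counter


def dominant_stride(addresses, min_share=0.75):
    """Return the most common address delta, or None if no delta dominates."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if not deltas:
        return None
    stride, count = Counter(deltas).most_common(1)[0]
    if stride == 0 or count / len(deltas) < min_share:
        return None
    return stride


def prefetch_address(current, stride, distance=4):
    """Address to prefetch `distance` iterations ahead of the current load."""
    return current + stride * distance


regular = dominant_stride([1000, 1008, 1016, 1024, 1032])
irregular = dominant_stride([10, 500, 7, 90])
target = prefetch_address(1032, regular)
```

For the regular sequence the stride is 8 and the prefetch target is 1064; the irregular sequence yields `None`, in which case no prefetch would be inserted.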
Identify Delinquent Loads
• Use sampled EAR information to identify the delinquent loads in a selected trace.
• Calculate the average latency and the total miss penalty of each delinquent load.
    { .mii
      ldfd f60=[r15],8   // average latency: 129 penalty ratio: 6.38%
      add r8=16,r24;;
      add r42=8,r24
    }
Determine Reference Pattern
A. direct array
    // i++; a[i++] = b;
    // b = a[i++];
    Loop: …
      add r14 = 4, r14
      st4 [r14] = r20, 4
      ld4 r20 = [r14]
      add r14 = 4, r14
      …
      br.cond Loop
B. indirect array
    // c = b[a[k++] - 1];
    Loop: …
      ld4 r20 = [r16], 4
      add r15 = r25, r20
      add r15 = -1, r15
      ld8 r15 = [r15]
      …
      br.cond Loop
C. pointer chasing
    // tail = arcin->tail;
    // arcin = tail->mark;
    Loop: …
      add r11 = 104, r34
      ld8 r11 = [r11]
      ld8 r34 = [r11]
      …
      br.cond Loop
Static Optimizations on BLAST [Charts: results at O1, O2 and O3. Annotation: static prefetching ineffective]
• Performance can often degrade at higher optimization levels in all three compilers
• The long query, which has a high fraction of stall cycles, did not benefit from static optimizations
Profile Based Optimizations • Less than 5% gain for some inputs • Large slowdown for others • Combining profiles results in moderate gain for some inputs
Slowdown from PBO
• Large increase in system time
• ECC inserts a speculative load for a future iteration in a loop, which causes TLB misses
• By default, the OS handles the TLB miss exception for a speculative load immediately
• Reconfigured the kernel to defer TLB misses on speculative loads to hardware
• On a TLB miss for a speculative load, the NaT bit is set. Recovery code will load the data if needed
PBO (Kernel Reconfigured)
• Difficult to find the right set of combined training inputs
• PBO can give performance but has limitations
Misconceptions about ADORE
• Misconception: compiler optimizations are very complex, so doing them at runtime is a bad idea.
• Current ADORE deals only with cache misses. It does not handle traditional compiler optimizations. (It is a complement to compiler optimization, not a replacement.)
• Inserting cache prefetch instructions (and/or branch prediction hints) is a safe optimization. No correctness issues.
Performance at Different Sampling Rates (based on Adore/Itanium performance on Spec2000)
Misconceptions about DynOpt
• Compilation/optimization overhead is usually amortized over thousands of executions of the binary. How can runtime optimization overhead be amortized over only one execution?
Misconceptions about ADORE
• Misconception: ADORE will be unreliable, hard to debug and difficult to maintain.
• ADORE performs simple transformations, so it could be more reliable than a static optimizer.
• Current ADORE can run large real applications:
  • Adore/Itanium on the bioinformatics application BLAST (millions of lines of code): 58% speedup on some long queries
  • Adore/Sparc on the application Fluent: 14.5% speedup on Panther
ADORE/Sparc
• ADORE has run on the Sparc/Solaris platform since 2005.
• ADORE uses the libcpc interface on Solaris to conduct runtime profiling. A kernel buffer enhancement was added to Solaris 10.0 to reduce profiling and phase detection overhead.
• Reachability is a real problem (e.g. Oracle, Dyna3D).
• The lack of a branch trace buffer is painful (e.g. Blast).