Dynamic Binary Optimization – Part 1 • 2006. 9.25 • Nam, E Hyun
Contents • Overview • Dynamic program Behavior • Profiling • Optimizing Translation blocks
Overview : Optimization • Optimization • The focus of VM design shifts from compatibility to performance • Goal • To close the gap between a guest's emulated performance and native platform performance • Type • Translation block chaining • Enlarging the translation block • Reordering translated instructions • Conventional compiler optimization techniques
Overview : Profile • Profile • Statistics regarding a program's behavior • A guide for making optimization decisions • A common optimization strategy is to use profiling to determine the paths that control flow predominantly follows • Type of profile information • Instructions (or basic blocks, BBs) that are most heavily executed • Sequences in which BBs are most commonly executed • Behavior of particular data variables and addresses
Overview : Profile • Advantage of profile information • Providing information that may not have been available when a program was originally compiled
Overview : BB rearrangement • Definition • Rearranging basic blocks so that the instructions on the predominant path occupy consecutive memory locations • Advantages • Better locality • Efficient instruction fetching • Type • Trace • Superblock • Tree group
Overview : Staged emulation • Relation between emulation and optimization • Optimization is tightly integrated with emulation • Optimization is part of an emulation framework that supports staged emulation • Staged emulation • Based on the tradeoff between start-up time and steady-state performance • Interpretation → Binary translation → Dynamic binary optimization
Overview : Staged emulation • Stages of staged emulation • Interpretation • BB translation (e.g. chaining) • Optimized translation (e.g. superblocks) • Highly optimized translation
Overview : Staged emulation strategy • Strategy decision factors • Source and target ISA • Type of VM being implemented • Design objectives • Tradeoff between the performance gained from optimization and the overhead of optimization and profiling • Example • Original HP Dynamo system, Digital FX!32 • Interpretation → optimized, translated code • DynamoRIO • Simple binary translation → optimization • Shade • Interpretation → simple binary translation
Contents • Overview • Dynamic program Behavior • Profiling • Optimizing Translation blocks
Dynamic program behavior • Goal • Optimization depends on a program's structure and dynamic behavior • By profiling, the optimization system can learn about the program's structure and dynamic behavior • Important characteristics of programs • High predictability of dynamic control flow • A branch's direction correlates with its direction on its most recent previous execution
Dynamic program behavior • Important characteristics of programs • Backward branches • Are typically taken • Predictability of indirect jumps • Switch statements • Returns from procedure calls • Predictability of data values
Contents • Overview • Dynamic program Behavior • Profiling • Overview • Role • Type • Collecting the profile data • Profile during interpretation • Profiling translated code • Overhead • Optimizing Translation blocks
Profiling : Role • Definition • The process of collecting instruction and data statistics for an executing program • Usage • Input to the code-optimization process • Principle of profiling • Predictability of programs • Past behavior often predicts future behavior
Profiling : Role • Traditional profiling & optimization procedure • Decomposing the source program into a control flow graph • Analyzing the graph and inserting probes to collect profile information • Running the program with typical input data • Generating profile data • Static analysis of the profile log • Generating optimized code • Property • The program is fully analyzed • Probes are placed optimally • The entire program is run, giving a complete profile
Profiling : Role • Difficulties, requirements and limitations in dynamic optimization • Program structure is not known when a program begins • Program structure must be discovered incrementally • Profiling probes are difficult to insert in a globally optimal manner • Optimization decisions must be made as early as possible • They rely on statistics from only a partial execution of the program
Profiling : Role • Tradeoff between overhead and benefit • Overhead : initial analysis + actual collection of profile data • Benefit : execution time reduction due to optimization • Static optimization • Overheads are paid once • Dynamic optimization • Overheads are paid every time a guest program runs • Benefits must outweigh the overheads
Profiling : Type of profile data • Frequency of execution of different code regions • Hotspots • Interpretation vs. binary translation • Profile data based on control flow (branch and jump) predictability • Can be used to determine aspects of a program's dynamic execution behavior • Used as the basis for gathering and rearranging BBs into larger units • Used to guide specific optimizations • Address • Data
Profiling : Type of profile data • Basics • Nodes : BBs • Edges : flow of control • BB profile • Numbers are counts of the corresponding BB's executions • Edge profile • A BB profile can be derived from an edge profile • Path profile • The path profile can be approximated with a heuristic based on the edge profile
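A BB profile can be derived from an edge profile because a block's execution count equals the sum of the counts on its incoming edges, plus one for the entry block's initial entry. The following is a minimal sketch of that derivation; the graph, the counts, and the function name `bb_profile_from_edges` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def bb_profile_from_edges(edge_counts, entry, entry_count=1):
    """Derive per-block execution counts from per-edge counts.

    edge_counts maps (source_block, target_block) -> times the edge was followed.
    The entry block is credited with entry_count for the initial program entry.
    """
    bb_counts = defaultdict(int)
    bb_counts[entry] += entry_count
    for (_src, dst), count in edge_counts.items():
        bb_counts[dst] += count
    return dict(bb_counts)

# Hypothetical diamond: A branches to B or C, both reach D, D loops back to A.
edges = {("A", "B"): 70, ("A", "C"): 30, ("B", "D"): 70, ("C", "D"): 30, ("D", "A"): 99}
profile = bb_profile_from_edges(edges, entry="A")
```

The reverse derivation is not possible in general: a BB profile alone does not say how often each outgoing edge of a block was followed.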
Profile : collecting the profile • Instrumentation-based profiling • Targets program-related events • Counts all instances of the event being profiled • Many different events can be monitored simultaneously • Monitoring method : HW, SW • Sampling-based profiling • The program runs in its unmodified form • The program is interrupted and an instance of a program-related event is captured • Tradeoff • Instrumentation-based • Slows the program, but collects a given amount of profile data over a much shorter period of time • Sampling-based • Fast, but requires a longer time to collect the same amount of profile information
Profile : collecting the profile • Strategy • The collection technique depends on the position in the emulation spectrum • Interpretation • SW instrumentation is about the only choice • Optimizing binary translation, dynamic optimization system • Instrumentation • Already well-optimized, longer-running programs • Sampling
Profile : profiling during interpretation • Key points • Source instructions are actually accessed as data • Profiling code must be added to the interpreter routines • Profiling is applied to an instruction type rather than to a specific instruction • It can be applied to certain classes of instructions • E.g. backward branches • Method • BB profile • Profiling code should be added to all control transfer instructions, after the PC has been updated • Edge profile • Both the PC of the control transfer instruction and the target PC are used to define a specific edge
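The placement of profiling code inside an interpreter routine can be sketched as follows. This is a toy example, not any real system: the three-instruction ISA (`dec`, `bnz`, `halt`), the register names, and the function `interpret` are all assumptions made for illustration. The point it shows is that the profiling code lives in the branch routine, runs after the PC has been updated, and keys the BB profile by the target PC and the edge profile by the (branch PC, target PC) pair.

```python
from collections import Counter

def interpret(program, regs, bb_profile, edge_profile):
    """Toy interpreter for a hypothetical ISA.

    Instructions: ("dec", reg), ("bnz", reg, target_pc), ("halt",).
    Source instructions are plain data that the interpreter reads.
    """
    pc = 0
    while True:
        op = program[pc]
        if op[0] == "dec":
            regs[op[1]] -= 1
            pc += 1
        elif op[0] == "bnz":
            branch_pc = pc
            pc = op[2] if regs[op[1]] != 0 else pc + 1
            # Profiling code, placed after the PC has been updated:
            bb_profile[pc] += 1                   # BB profile, keyed by destination
            edge_profile[(branch_pc, pc)] += 1    # edge profile, keyed by (source, target)
        elif op[0] == "halt":
            return regs

program = [("dec", "r1"), ("bnz", "r1", 0), ("halt",)]
bb_profile, edge_profile = Counter(), Counter()
final_regs = interpret(program, {"r1": 3}, bb_profile, edge_profile)
```

Because the profiling code sits in the `bnz` routine, every branch in the guest program is profiled by the same few lines; no per-instruction instrumentation is needed.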
Profile : profiling during interpretation • Profile table • Access method • BB profile : via the PC value of the control transfer destination • Edge profile : via the PC values that define an edge • Hash function • Contents of an entry • Basic block or edge count • For a conditional branch, a taken count and a not-taken count
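A profile table of this kind might look like the sketch below. The class name `ProfileTable`, the bucket count, the chained-bucket collision handling, and the hash (dropping two low bits on the assumption of 4-byte-aligned instructions) are illustrative assumptions, not details taken from the slides.

```python
class ProfileTable:
    """Hashed profile table holding taken / not-taken counts per branch PC."""

    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, branch_pc):
        # Assumed hash: drop the low 2 bits (4-byte instructions), then take the modulus.
        return self.buckets[(branch_pc >> 2) % len(self.buckets)]

    def record(self, branch_pc, taken):
        bucket = self._bucket(branch_pc)
        for entry in bucket:
            if entry["pc"] == branch_pc:   # tag compare
                break
        else:
            # No matching entry in this bucket: allocate a new one.
            entry = {"pc": branch_pc, "taken": 0, "not_taken": 0}
            bucket.append(entry)
        entry["taken" if taken else "not_taken"] += 1

    def counts(self, branch_pc):
        for entry in self._bucket(branch_pc):
            if entry["pc"] == branch_pc:
                return entry["taken"], entry["not_taken"]
        return 0, 0

table = ProfileTable(num_buckets=4)
for outcome in (True, True, True, False):
    table.record(0x1000, outcome)
table.record(0x1010, True)   # hashes to the same bucket as 0x1000
```

The tag compare is what keeps two branches that hash to the same bucket from sharing counters.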
Profile : profiling during interpretation • Profile count decay • Problem of the profile table • A count field can overflow • Solution • Key point • Optimization methods rely on relative frequencies, not absolute counts • Recent program event history is more valuable than older history • Decay process • Periodically divide all the profile counts by 2
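The decay process can be sketched in a few lines. The saturating increment and the 8-bit limit `MAX_COUNT` are assumptions added here to show one way of keeping a count field from wrapping between decay steps; the slides only specify the periodic division by 2.

```python
MAX_COUNT = 255  # hypothetical 8-bit counter field

def increment(counts, key):
    """Saturating increment: the counter never exceeds MAX_COUNT, so it cannot wrap."""
    counts[key] = min(counts.get(key, 0) + 1, MAX_COUNT)

def decay(counts):
    """Halve every counter in place with a right shift."""
    for key in counts:
        counts[key] >>= 1

counts = {"A": 200, "B": 50, "C": 1}
decay(counts)
```

After the decay step the ratio between A and B is unchanged (4 to 1), which is what a frequency-based optimizer needs, while rarely executed blocks such as C fade to zero.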
Profile : profiling during interpretation • Profiling jump instructions • Difficulties of indirect jumps compared with conditional branches • Switch statement : targets change frequently • Return from procedure call : many target addresses • Solution • Key point • Profile-driven optimization of indirect jumps tends to focus on those jumps that very frequently have the same target • Maintain a profile table with a small number of target addresses and track only the more recently used targets
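One way to keep only a small number of recently used targets per indirect jump is a tiny least-recently-used table, sketched below. The class name `JumpTargetProfile`, the table size, and the `min_fraction` cutoff used to decide whether one target dominates are hypothetical choices, not values from the slides.

```python
from collections import OrderedDict

class JumpTargetProfile:
    """Per-jump table tracking only the most recently used target addresses."""

    def __init__(self, max_targets=2):
        self.max_targets = max_targets
        self.targets = OrderedDict()   # target_pc -> count, least recently used first

    def record(self, target_pc):
        if target_pc in self.targets:
            self.targets[target_pc] += 1
            self.targets.move_to_end(target_pc)       # mark as most recently used
        else:
            if len(self.targets) >= self.max_targets:
                self.targets.popitem(last=False)      # evict the least recently used target
            self.targets[target_pc] = 1

    def dominant_target(self, min_fraction=0.9):
        """Return the target if it accounts for at least min_fraction of tracked jumps."""
        total = sum(self.targets.values())
        if total == 0:
            return None
        target, count = max(self.targets.items(), key=lambda item: item[1])
        return target if count / total >= min_fraction else None

jump = JumpTargetProfile(max_targets=2)
for _ in range(19):
    jump.record(0x400)
jump.record(0x500)
jump.record(0x400)
hot_target = jump.dominant_target()
jump.record(0x600)   # table is full, so the least recently used target (0x500) is evicted
```

A jump whose `dominant_target` is not `None` is a candidate for optimizations that assume a fixed target; jumps with scattered targets never accumulate a dominant entry and are simply left alone.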
Profile : profiling translated code • Instrumenting individual instructions • Each individual instruction can have its own custom profiling code • = Profiling can be selectively applied • = Profile counters can be assigned to each static instruction • Profile counters can be directly addressed without hashing • Profile code can be easily inserted and removed as needed
Profiling : Overhead • Performance overhead • Example • To access the hash table : hash function + 1 load + 1 compare • To increment the proper count : 1 load + 1 store + 1 add • Profiling during interpretation vs. profiling translated code • Absolute overhead vs. relative overhead • Memory overhead • Profile table • Overhead reduction methods • Reducing the number of instrumentation points • Heuristics + using collected data • Code duplication • Attractive for same-ISA optimization (4.7)
Contents • Overview • Dynamic program Behavior • Profiling • Optimizing Translation blocks • Overview • Improving locality • Traces • Superblocks • Dynamic superblock formation • Tree group
Optimizing translation blocks : Overview • Two strategies • Improving locality • Optimization on enlarged translation blocks
Optimizing translation blocks : Improving locality • Locality • Temporal • Spatial • Problem • Cache space • Performance • Low instruction fetch bandwidth
Optimizing translation blocks : Improving locality • Rearrange the layout of the blocks in memory • Conditional branch tests are reversed • Unconditional branches are removed or added • Instruction fetch efficiency is improved
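The effect of rearranging blocks can be illustrated with a small layout routine. This is a sketch under assumptions: the block tuple format, the textual "assembly", and the function `relayout` are invented for the example. When the hot successor of a conditional branch is placed directly after it, the test is reversed so the hot path falls through; an unconditional branch is emitted only when a block's fall-through successor is no longer the next block in memory.

```python
def relayout(blocks, order):
    """Emit pseudo-assembly for blocks placed in the given memory order.

    blocks maps name -> (condition or None, taken_target or None, fallthrough or None).
    """
    lines = []
    for i, name in enumerate(order):
        cond, taken, fall = blocks[name]
        next_name = order[i + 1] if i + 1 < len(order) else None
        lines.append(f"{name}:")
        if cond is not None and taken == next_name:
            # The taken target now follows directly: reverse the branch test.
            cond, taken, fall = f"not ({cond})", fall, taken
        if cond is not None:
            lines.append(f"  if {cond} goto {taken}")
        if fall is not None and fall != next_name:
            lines.append(f"  goto {fall}")   # unconditional branch added
    return lines

# Hypothetical code: A usually branches to D, so the hot path is A, D, E.
blocks = {
    "A": ("r1 == 0", "D", "B"),
    "B": (None, None, "E"),
    "D": (None, None, "E"),
    "E": (None, None, None),
}
original = relayout(blocks, ["A", "B", "D", "E"])
hot_first = relayout(blocks, ["A", "D", "E", "B"])
```

In the `hot_first` layout the frequent path A, D, E is straight-line code with no taken branches, while the rarely executed block B is moved out of the way.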
Optimizing translation blocks : Improving locality • Procedure inlining
Optimizing translation blocks : Improving locality • Partial procedure inlining • In dynamic optimization system
Optimizing translation blocks : Improving locality • Pros and cons of procedure inlining • Pros • Increases spatial locality • Removes overhead • Call and return instructions are removed • Save/restore instructions are removed • Cons • Increases code size • Increases register "pressure" • Inlined code needs more registers than a procedure call • Consequently, procedure inlining is typically used only for procedures that are very frequently called and very small
Optimizing translation blocks • Three ways of rearranging basic blocks according to control flow • Trace formation • Superblock formation • Most widely used in VM implementations • Tree group • Useful when control flow is difficult to predict • Provides a wider scope for optimization
Optimizing translation blocks : Traces • Traces • Chunks of contiguous instructions containing multiple BBs • Traces > Superblock • Static trace formation steps • 1. Collect a profile using test data • 2. Begin with a start point • The most frequently executed BB that is not already part of a trace • 3. Collect BBs along the most common control path until a stopping condition is met • A block already belonging to another trace is reached • A procedure call/return boundary is reached • 4. Collect the BBs into a trace • Reverse branch tests • Remove/add unconditional branches • 5. Stop if no BBs remain outside a trace; otherwise go to step 2 • In a dynamic environment, traces are not commonly used as translation blocks
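The static trace formation steps can be sketched as a greedy loop over profile data. This is a simplified illustration, not a definitive implementation: the function `form_traces` and the profile counts are hypothetical, the procedure call/return stopping condition is omitted, and the branch fix-up of step 4 is not modeled.

```python
def form_traces(bb_counts, edge_counts):
    """Greedy trace formation from a BB profile and an edge profile.

    bb_counts maps block -> execution count.
    edge_counts maps (source_block, target_block) -> count.
    """
    in_trace = set()
    traces = []
    # Step 2: take start points in order of decreasing execution frequency.
    for seed in sorted(bb_counts, key=bb_counts.get, reverse=True):
        if seed in in_trace:
            continue
        trace = [seed]
        in_trace.add(seed)
        current = seed
        # Step 3: follow the most common control path.
        while True:
            successors = [(count, dst) for (src, dst), count in edge_counts.items()
                          if src == current]
            if not successors:
                break
            _, best = max(successors)
            if best in in_trace:   # stopping condition: block already belongs to a trace
                break
            trace.append(best)
            in_trace.add(best)
            current = best
        traces.append(trace)       # step 4
    return traces                  # step 5: the loop ends when every block is placed

# Hypothetical loop body: A branches to B or D, D branches to E or F, all paths reach G.
bb_counts = {"A": 100, "B": 30, "C": 30, "D": 70, "E": 45, "F": 25, "G": 100}
edge_counts = {("A", "B"): 30, ("A", "D"): 70, ("B", "C"): 30, ("C", "G"): 30,
               ("D", "E"): 45, ("D", "F"): 25, ("E", "G"): 45, ("F", "G"): 25,
               ("G", "A"): 99}
traces = form_traces(bb_counts, edge_counts)
```

Each block lands in exactly one trace, so no code is duplicated; the price is that the first trace has side entrances at G from the C and F blocks.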
Optimizing translation blocks : Superblocks • Superblocks vs. traces • A superblock has a single entry and no side entrances; a trace may have side entrances • Problems in forming superblocks • Many small superblocks • Too small to provide many opportunities for optimization • Tail duplication • The process of replicating code that appears at the end of a superblock in order to form other superblocks
Optimizing translation blocks : Dynamic superblock formation : Overview • Dynamic • Formed incrementally as the source code is being emulated • Complication • BB replication leads to more choices • Key question • Starting point • Continuation • Stopping point
Optimizing translation blocks : Dynamic superblock formation : Starting point • Heavily used blocks • Identified by using profile information • Methods for determining profile points • All basic blocks • Heuristics • Targets of backward branches are candidate starting points • Exit arcs from an existing superblock • Start threshold • When a profiled BB's execution frequency reaches this value, a new superblock is started • Depends on the emulation tradeoff • A few tens to hundreds of executions is typical
Optimizing translation blocks : Dynamic superblock formation : Continuation • Continuation • Which subsequent blocks should be collected and added as the superblock is grown • Most frequently used approach • Node profile information is used to identify the most likely successor BB • Continuation threshold • A relatively complete set of profile data must be collected for all BBs • Typically half of the start threshold • Continuation set • At the time superblock formation begins, the set of all BBs that have reached the continuation threshold is collected
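The most frequently used approach can be sketched as follows. The function `grow_superblock`, the successor map, the counts, and the `max_len` limit are assumptions for illustration; the continuation threshold is set to half of the start threshold as stated above, and growth stops when no successor is in the continuation set or when a block already in the superblock would be revisited.

```python
def grow_superblock(start, successors, bb_counts, start_threshold=50, max_len=8):
    """Grow a superblock from start using node (BB) profile counts."""
    continuation_threshold = start_threshold // 2
    # Continuation set: collected once, at the time formation begins.
    continuation_set = {block for block, count in bb_counts.items()
                        if count >= continuation_threshold}
    superblock = [start]
    current = start
    while len(superblock) < max_len:
        candidates = [b for b in successors.get(current, []) if b in continuation_set]
        if not candidates:
            break                                   # no candidate reached the threshold
        next_block = max(candidates, key=bb_counts.get)   # most likely successor
        if next_block in superblock:
            break                                   # looped back into this superblock
        superblock.append(next_block)
        current = next_block
    return superblock

# Hypothetical counts at the moment block A reaches the start threshold of 50.
successors = {"A": ["B", "D"], "B": ["C"], "C": ["G"], "D": ["E", "F"],
              "E": ["G"], "F": ["G"], "G": ["A"]}
bb_counts = {"A": 50, "B": 15, "C": 15, "D": 35, "E": 26, "F": 9, "G": 49}
superblock = grow_superblock("A", successors, bb_counts)
```

Note the cost implied by this approach: counts must be maintained for every basic block, not only for candidate start points, so that the continuation set is meaningful when formation begins.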
Optimizing translation blocks : Dynamic superblock formation : Continuation • Most frequently used procedure
Optimizing translation blocks : Dynamic superblock formation : Continuation • Most recently used approach • Edge profile information • Algorithm • Assumption • The very next sequence of blocks following a start point is also likely to be a common path • Simply follows the actual dynamic control flow path, one edge at a time • Advantage • Only candidate start points need to be profiled • = No need to use profiling for continuation blocks • = Profile overhead is substantially reduced
Optimizing translation blocks : Dynamic superblock formation : Stopping point • Types of heuristics to determine the stop condition • The start point of the same superblock is reached • A start point of some other superblock is reached • The superblock has reached some maximum length • A BB can be used in more than one superblock → there may be multiple copies of a given BB → explosion of code size • When using the most frequently used heuristic, there are no more candidate BBs that have reached the candidate threshold • An indirect jump is reached, or there is a procedure call
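The most recently used approach can be sketched by replaying a sequence of executed blocks. Everything named here is an assumption for illustration: the function `mru_superblocks`, the use of small integers as block addresses, the test `block <= prev` as the definition of a backward branch, and the threshold values. Counters exist only for targets of backward branches; once a counter reaches the start threshold, the blocks that execute next are recorded as the superblock. Exit arcs of existing superblocks as start candidates, and stops at indirect jumps or calls, are not modeled.

```python
from collections import Counter

def mru_superblocks(executed_blocks, start_threshold=3, max_len=8):
    """Form superblocks by following the actual path after a hot start point.

    executed_blocks is the sequence of block start addresses in execution order.
    """
    counters = Counter()    # only backward-branch targets are ever counted
    superblocks = {}        # start address -> list of block addresses
    recording = None
    prev = None
    for block in executed_blocks:
        backward = prev is not None and block <= prev
        if recording is not None:
            start = recording[0]
            if block == start or block in superblocks or len(recording) >= max_len:
                superblocks[start] = recording      # stopping condition met
                recording = None
            else:
                recording.append(block)             # follow the dynamic path
        if recording is None and backward and block not in superblocks:
            counters[block] += 1
            if counters[block] >= start_threshold:
                recording = [block]
        prev = block
    return superblocks

# Hypothetical loop with blocks A..G at addresses 0..6; G branches back to A.
common_path = [0, 3, 4, 6] * 2 + [0, 1, 2, 6] + [0, 3, 4, 6] + [0]
unlucky_path = [0, 3, 4, 6] * 3 + [0, 1, 2, 6] + [0]
formed_common = mru_superblocks(common_path)
formed_unlucky = mru_superblocks(unlucky_path)
```

In the second run the iteration that happens to follow the threshold goes through the less common blocks, so the recorded superblock is not the most frequent path. That is the risk accepted in exchange for profiling only the start points.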
Optimizing translation blocks : Dynamic superblock formation : Example • Most frequently used
Optimizing translation blocks : Dynamic superblock formation : Example • Most recently used • The only profile point is A, because A is the target of a backward branch • Most likely result • ADEG, BCG, FG • However • There is about a 30% chance of • ABCG, DEG, FG • There are cases where the most recently used method does not select superblocks quite as well as the most frequently used method
Optimizing translation blocks : Tree group • Background • Problems when applying superblocks to branches that split their decisions almost evenly • Side exits are frequently taken → compensation code overhead • Optimizations are typically not done along side exits → lost opportunities for performance improvement • Traces and superblocks vs. tree groups • Tree groups • Suit conditional branches whose outcomes are more evenly balanced • Generalization of the superblock • Multiple flows of control • Superblocks • Suit conditional branches that are predominantly decided one way • Single flow of control