Dynamic Binary Optimization Presenter Kim Jin Chul
Contents
1. Overview of Applying Optimization on VMs
2. Dynamic Program Behavior
3. Profiling
4. Optimizing Translation Blocks
Classical Optimizations
• IA-32 source code
    addl %edx, 4(%eax)
    movl 4(%eax), %edx
• Translation from IA-32 to PowerPC code
    addi r16, r4, 4      ; add 4 to %eax
    lwzx r17, r2, r16    ; load operand from memory
    add  r7, r17, r7     ; perform add of %edx
    addi r16, r4, 4      ; add 4 to %eax
    stwx r7, r2, r16     ; store %edx value into memory
• After applying common subexpression elimination (the second addi recomputes a value already in r16, so it is removed)
    addi r16, r4, 4      ; add 4 to %eax
    lwzx r17, r2, r16    ; load operand from memory
    add  r7, r17, r7     ; perform add of %edx
    stwx r7, r2, r16     ; store %edx value into memory
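The removal of the repeated address computation can be sketched in a few lines of Python. This is a minimal illustration, not the translator's actual algorithm: the tuple encoding `(op, dest, srcs)`, the `PURE_OPS` set, and the function name are assumptions made for the example, and the sketch only drops a recomputation when the value is still held in the same destination register (a full CSE pass would also rewrite later uses).

```python
PURE_OPS = {"addi", "add"}  # assumed: ops without memory side effects

def eliminate_common_subexpressions(block):
    """Drop pure instructions that recompute a value still held in dest."""
    available = {}  # (op, srcs) -> register currently holding that value
    out = []
    for op, dest, srcs in block:
        key = (op, srcs)
        if op in PURE_OPS and available.get(key) == dest:
            continue  # same value is already in dest: instruction is redundant
        out.append((op, dest, srcs))
        if dest is not None:
            # Writing dest kills expressions that read dest or live in dest.
            available = {k: r for k, r in available.items()
                         if r != dest and dest not in k[1]}
            if op in PURE_OPS and dest not in srcs:
                available[key] = dest
    return out

# The PowerPC translation from the slide, in the hypothetical encoding.
block = [
    ("addi", "r16", ("r4", "4")),
    ("lwzx", "r17", ("r2", "r16")),
    ("add",  "r7",  ("r17", "r7")),
    ("addi", "r16", ("r4", "4")),
    ("stwx", None,  ("r7", "r2", "r16")),
]
optimized = eliminate_common_subexpressions(block)
```

Only pure arithmetic is considered here; loads are deliberately excluded because an intervening store could change the loaded value.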
Optimization Based on Profiling
• Original code
    Basic Block A:      ...   R3 ← ...   R7 ← ...   R1 ← R2 + R3   Br L1 if R3 == 0
    Basic Block B:      ...   R6 ← R1 + R6   ...                   (use of R1)
    Basic Block C:  L1: R1 ← 0   ...                               (def of R1)
• R1 ← R2 + R3 removed from Basic Block A
    Basic Block A:      ...   R3 ← ...   R7 ← ...   Br L1 if R3 == 0
    Basic Block B:      ...   R6 ← R1 + R6   ...
    Basic Block C:  L1: R1 ← 0   ...
• Compensation code added on the path to Basic Block B
    Basic Block A:      ...   R3 ← ...   R7 ← ...   Br L1 if R3 == 0
    Compensation code:  R1 ← R2 + R3
    Basic Block B:      ...   R6 ← R1 + R6   ...
    Basic Block C:  L1: R1 ← 0   ...
Optimization Based on Profiling
• Original code
    Basic Block A:      ...   R3 ← ...   R7 ← ...   R1 ← R2 + R3   Br L1 if R3 == 0
    Basic Block B:      ...   R6 ← R1 + R6   ...
    Basic Block C:  L1: R1 ← 0   ...
• After forming a superblock
    Superblock:         ...   R3 ← ...   R7 ← ...   Br L2 if R3 != 0   R1 ← 0   ...
    Compensation code:  R1 ← R2 + R3
    Basic Block B:  L2: ...   R6 ← R1 + R6   ...
A staged optimization system
• Stages: Interpret, Basic translation, Optimized blocks, Highly optimized blocks
• Startup: from fast (interpretation) to very slow (highly optimized blocks)
• Steady state: from slow (interpretation) to fast (highly optimized blocks)
• Profiling: from simple to extensive
• [Figure: system components: Interpreter, Binary memory image, Basic block cache, Code cache, Profile data, Optimizer, Translator, Emulation manager]
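One way to picture the staged approach is an emulation manager that promotes a block to a more expensive stage once it has been executed often enough. The sketch below is an assumption-laden illustration: the class, its method names, and both thresholds are hypothetical, since the slides give no concrete values.

```python
from collections import Counter

# Hypothetical thresholds; the slides do not specify any numbers.
TRANSLATE_AFTER = 2   # executions before basic translation pays off
OPTIMIZE_AFTER = 5    # executions before optimization pays off

class EmulationManager:
    """Tracks per-block execution counts and picks an emulation stage."""

    def __init__(self):
        self.exec_counts = Counter()

    def execute(self, block_addr):
        """Return the stage used for this execution of the block."""
        self.exec_counts[block_addr] += 1
        n = self.exec_counts[block_addr]
        if n > OPTIMIZE_AFTER:
            return "optimized"
        if n > TRANSLATE_AFTER:
            return "translated"
        return "interpreted"

manager = EmulationManager()
stages = [manager.execute(0x100) for _ in range(7)]
```

Rarely executed code never leaves the cheap stage, so startup stays fast, while hot code eventually reaches the stage with the best steady-state speed.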
Dynamic Program Behavior
• Dynamic control flow is highly predictable
           ...
           R3 ← 100
    loop:  R1 ← mem(R2)
           Br found if R1 == -1
           R2 ← R2 + 4
           R3 ← R3 - 1
           Br loop if R3 != 0
           ...
    found: ...
Dynamic Program Behavior
• Distribution of taken conditional branches
• [Figure: histogram. x-axis: Percent taken (0-10% through >90%); y-axis: Fraction of static conditional branches (0% to 50%)]
• Predominantly not taken: 28%
• Predominantly taken: 42%
Dynamic Program Behavior
• Consistency of conditional branches
• The high percentage is attributable to backward branches
• [Figure: bar chart. x-axis: SPEC benchmark (176.gcc, 181.mcf, 197.parser, 252.eon, 256.bzip2, 171.swim, 173.applu, 177.mesa, 187.facerec, 189.lucas); y-axis: Dynamic branches decided same as previous time (0% to 100%)]
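The "decided same as previous time" metric can be computed from a branch trace. The sketch below applies it to the search loop shown earlier in the deck, under the assumption that the sentinel value is never found, so the loop runs all 100 iterations. The trace format of `(branch_id, taken)` pairs and the function name are illustrative choices.

```python
def same_as_previous_fraction(trace):
    """Fraction of dynamic branches whose outcome matches the previous
    outcome of the same static branch (first occurrences are skipped)."""
    last = {}
    same = total = 0
    for branch_id, taken in trace:
        if branch_id in last:
            total += 1
            same += (last[branch_id] == taken)
        last[branch_id] = taken
    return same / total if total else 0.0

# Trace of the loop: "Br found" is never taken (assumption), and the
# backward "Br loop" is taken 99 times, then falls through once.
trace = []
for i in range(100):
    trace.append(("br_found", False))
    trace.append(("br_loop", i != 99))

fraction = same_as_previous_fraction(trace)
```

Only the final loop exit breaks the pattern, which is why backward branches contribute so strongly to the consistency figures.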
Dynamic Program Behavior
• The predictability of indirect jumps
• Some jump destination addresses seldom change
• [Figure: histogram. x-axis: Number of different destinations (1 through 9, >9); y-axis: Percent of indirect jumps (0% to 25%)]
Dynamic Program Behavior
• The predictability of data values
• [Figure: bar chart. x-axis: Instruction type (All, Add/Sub, Load, Logic, Shift, Set); y-axis: Fraction with constant value (0 to 0.7)]
• Static: static instructions that always compute the same value
• Dynamic: dynamic instructions that execute those static instructions
Profiling
• The process of collecting instruction and data statistics for an executing program
• How optimization based on profiling works
• [Figure: system components: Interpreter, Binary memory image, Basic block cache, Code cache, Profile data, Optimizer, Translator, Emulation manager]
The Role of Profiling
• Traditional profiling
• [Figure: HLL Program, Compiler Frontend, Compiler Backend, Instrumented Code, Test Data, Program Execution, Program Statistics, Optimizing Compiler, Optimized Binary]
The Role of Profiling
• On-the-fly profiling in a dynamic optimizing VM
• [Figure: Program Binary, Program Data, Interpreter, Partial Program Statistics, Translator/Optimizer]
Types of Profiles
• Several types of profile data
• How frequently are different code regions executed?
  • This can be used to decide the level of optimization
• Is control flow predictable?
  • This may be used as the basis for gathering and rearranging basic blocks
  • Rearranged basic blocks can then be merged into superblocks
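Types of Profiles
• [Figure: a basic block profile: control flow graph of blocks A through F annotated with per-block execution counts]
• [Figure: an edge profile: the same control flow graph annotated with per-edge execution counts]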
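An edge profile carries at least as much information as a basic block profile: block counts follow from the edge counts by summing incoming edges, while the reverse derivation is not possible in general. The sketch below shows that derivation. The graph and all counts are hypothetical and are not the numbers from the figure.

```python
from collections import Counter

def block_profile_from_edges(edge_counts, entry, entry_count):
    """Derive per-block execution counts from per-edge counts.

    edge_counts maps (source, destination) to a count; the entry block
    has no incoming edge here, so its count is supplied separately.
    """
    blocks = Counter({entry: entry_count})
    for (_src, dst), count in edge_counts.items():
        blocks[dst] += count
    return dict(blocks)

# Hypothetical diamond-shaped control flow: A branches to B or C, both join at D.
edges = {("A", "B"): 7, ("A", "C"): 3, ("B", "D"): 7, ("C", "D"): 3}
profile = block_profile_from_edges(edges, entry="A", entry_count=10)
```

With only the block profile, an optimizer could see that D runs 10 times but not whether it is usually reached from B or from C.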
Collecting Profiles
• Instrumentation-based profiling
  • Probes target specific program-related events and count all instances of the events being profiled
  • Software-based vs. hardware-based: speed? support? flexibility?
• Sampling-based profiling
  • The program runs in its unmodified form; at intervals it is interrupted and an instance of an event is captured
• Instrumentation vs. sampling
  • Overhead per collected data point: instrumentation < sampling
  • Each sample causes a trap!
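The difference between the two collection methods can be sketched on a recorded stream of events. This is a simplified model under stated assumptions: the event stream, the fixed sampling interval, and both function names are made up for illustration, and a real sampler is driven by timer or counter interrupts rather than list slicing.

```python
from collections import Counter

def instrumented_profile(events):
    """Instrumentation: every instance of the profiled event is counted."""
    return Counter(events)

def sampled_profile(events, interval):
    """Sampling: the unmodified program is interrupted every `interval`
    events and only the event seen at that moment is recorded."""
    return Counter(events[::interval])

# Hypothetical stream: block "B" runs on every tenth event, "A" otherwise.
events = ["B" if i % 10 == 0 else "A" for i in range(100)]

full = instrumented_profile(events)
sampled = sampled_profile(events, interval=7)
```

Instrumentation yields exact counts from a short run, whereas sampling collects far fewer data points over the same run and needs more time to reach comparable accuracy.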
Profiling During Interpretation
• Profile Table for Collecting an Edge Profile During Interpretation
  • The branch PC is hashed to select a table entry; each entry holds the PC, a taken count, and a not-taken count
• PowerPC Branch Conditional Interpreter Routine (an entry in the instruction function list)

    branch_conditional(inst) {
        BO = extract(inst, 25, 5);
        BI = extract(inst, 20, 5);
        displacement = extract(inst, 15, 14) * 4;
        ...
        // code to compute whether branch should be taken
        ...
        profile_addr = lookup(PC);          // find the profile table entry for this branch
        if (branch_taken) {
            profile_cnt(profile_addr, taken);
            PC = PC + displacement;
        } else {
            profile_cnt(profile_addr, nottaken);
            PC = PC + 4;
        }
    }
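The profile table and its update can be mimicked with a dictionary, which plays the role of the hash lookup on the branch PC. The sketch below is an illustration only: `BranchProfile` and `interpret_conditional_branch` are hypothetical names, and the branch outcome is passed in as an argument because the slide elides the code that computes it.

```python
class BranchProfile:
    """Edge profile for conditional branches, keyed by branch PC."""

    def __init__(self):
        self.table = {}  # branch PC -> [taken_count, not_taken_count]

    def record(self, pc, taken):
        counts = self.table.setdefault(pc, [0, 0])
        counts[0 if taken else 1] += 1

def interpret_conditional_branch(profile, pc, displacement, branch_taken):
    """Update the profile, then return the next PC (4-byte instructions)."""
    profile.record(pc, branch_taken)
    return pc + displacement if branch_taken else pc + 4

profile = BranchProfile()
next_taken = interpret_conditional_branch(profile, 0x1000, -16, True)
next_fallthrough = interpret_conditional_branch(profile, 0x1000, -16, False)
```

Because each branch has exactly two outgoing edges, the taken and not-taken counters together form the edge profile for that branch.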
Profiling Translated Code
• Edge Profiling Code Inserted into Stubs of a Binary Translated Basic Block
  • Fall-through stub:
        increment edge counter (i)
        if (counter (i) > trigger) then invoke optimizer
        else branch to fall-through basic block
  • Target stub:
        increment edge counter (j)
        if (counter (j) > trigger) then invoke optimizer
        else branch to target basic block
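The stub logic can be modeled as one counter per exit edge that is compared against a trigger value. In the sketch below the trigger value, the class name, and the returned action tuples are assumptions for illustration; real stubs are a few machine instructions placed at each exit of the translated block.

```python
TRIGGER = 3  # hypothetical threshold; the slides give no value

class TranslatedBlockStubs:
    """Edge counters attached to the two exits of a translated basic block."""

    def __init__(self, fall_through, target):
        self.successors = {"fall_through": fall_through, "target": target}
        self.counters = {"fall_through": 0, "target": 0}

    def exit_via(self, edge):
        """Count the edge; hand over to the optimizer once it is hot."""
        self.counters[edge] += 1
        if self.counters[edge] > TRIGGER:
            return ("invoke_optimizer", self.successors[edge])
        return ("branch", self.successors[edge])

stubs = TranslatedBlockStubs(fall_through="B", target="C")
results = [stubs.exit_via("target") for _ in range(4)]
```

The two edges are counted independently, so the optimizer learns not only that the block is hot but also which successor dominates.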
Profiling Overhead
• Profiling during interpretation incurs roughly 10-20% overhead
• Profiling overhead can be reduced
  • Reduce the number of instrumentation points by selecting a smaller set of key points
Optimizing Translation Blocks
• Two-part strategy for optimizing
  • Using dominant control flow to enhance memory locality
  • Making translation blocks larger
    • Traces, superblocks, tree groups
• The two parts of the strategy are actually relatively independent
Improving Locality
• Two kinds of memory locality
• Spatial locality
  • An access to a memory location is soon followed by an access to an adjacent memory location
• Temporal locality
  • A memory location that has been accessed is accessed again in the near future
Improving Locality
• Example code sequence
• [Figure: control flow graph of blocks A through G annotated with edge counts]
    A
    Br cond1 == true
    B
    Br cond2 == false
    C
    Br uncond
    D
    Br cond3 == true
    E
    Br uncond
    F
    G
    Br cond4 == true
Improving Locality
• Rearrange the blocks in memory
• [Figure: control flow graph of blocks A through G annotated with edge counts]
    A
    Br cond1 == false
    D
    Br cond3 == true
    E
    G
    Br cond4 == true
    Br uncond
    B
    Br cond2 == false
    C
    Br uncond
    F
    Br uncond
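A simple way to arrive at such a layout is to start at the entry block and keep following the most frequently taken edge to a block that has not been placed yet, then start a new chain from the next unplaced block. The sketch below is one such greedy heuristic, not necessarily the method the slides have in mind. The edge counts are hypothetical stand-ins, chosen so that the result matches the block order shown on the slide.

```python
def greedy_layout(blocks, edge_counts):
    """Order blocks so that each is followed by its hottest unplaced successor."""
    placed, layout = set(), []
    for seed in blocks:  # blocks in original program order
        current = seed
        while current is not None and current not in placed:
            placed.add(current)
            layout.append(current)
            candidates = [(count, dst)
                          for (src, dst), count in edge_counts.items()
                          if src == current and dst not in placed]
            current = max(candidates)[1] if candidates else None
    return layout

# Hypothetical edge profile for the control flow graph A through G.
edge_counts = {
    ("A", "D"): 70, ("A", "B"): 30,
    ("D", "E"): 68, ("D", "F"): 2,
    ("B", "C"): 29, ("B", "F"): 1,
    ("E", "G"): 68, ("C", "G"): 29, ("F", "G"): 3,
    ("G", "A"): 97,
}
layout = greedy_layout(["A", "B", "C", "D", "E", "F", "G"], edge_counts)
```

The hot path A, D, E, G ends up contiguous in memory, so most executions run straight through without taken branches, which improves spatial locality in the instruction cache.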
Improving Locality
• Procedure inlining
• Positive and negative effects?
• [Figure: code layout before and after inlining procedure xyz (blocks X, Y, Z) at its call sites, shown with the surrounding blocks A, B, K, L]
Traces
• Trace
  • A contiguous sequence of basic blocks
  • May have both side entrances and side exits
• [Figure: control flow graph of blocks A through G with edge counts, partitioned into Trace 1, Trace 2, and Trace 3]
• [Figure: Relations between Superblocks and Traces]
Superblocks
• Superblocks
  • Regions of code with only one entry and one or more exit points
• [Figure: the control flow graph of blocks A through G with edge counts, before and after duplicating block G so that each superblock has a single entry]
Superblocks
• Rearranged code
    A
    Br cond1 == false
    D
    Br cond3 == true
    E
    G
    Br cond4 == true
    Br uncond
    B
    Br cond2 == false
    C
    Br uncond
    F
    Br uncond
• Superblocks, with block G duplicated
    A
    Br cond1 == false
    D
    Br cond3 == true
    E
    G
    Br cond4 == true
    Br uncond
    B
    Br cond2 == false
    C
    G
    Br cond4 == true
    Br uncond
    F
    G
    Br cond4 == true
    Br uncond
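The duplication of G is an instance of tail duplication: when a chain ends in an unconditional transfer to a block that sits in the middle of another chain, a copy of that block is appended instead, which removes the side entrance. The sketch below is a minimal illustration under assumptions. The function name, the chain representation, and the successor table (including the pseudo-block `"exit"`) are invented for the example.

```python
def form_superblocks(chains, successors):
    """Extend each chain by copying its single successor while that
    successor is not itself the head of a chain (tail duplication)."""
    heads = {chain[0] for chain in chains}
    result = []
    for chain in chains:
        superblock = list(chain)
        while True:
            succ = successors.get(superblock[-1], [])
            if len(succ) != 1 or succ[0] in heads or succ[0] in superblock:
                break
            superblock.append(succ[0])  # a copy of the shared tail block
        result.append(superblock)
    return result

# Assumed control flow for blocks A through G.
successors = {
    "A": ["B", "D"], "B": ["C", "F"], "D": ["E", "F"],
    "C": ["G"], "E": ["G"], "F": ["G"], "G": ["A", "exit"],
}
chains = [["A", "D", "E", "G"], ["B", "C"], ["F"]]
superblocks = form_superblocks(chains, successors)
```

Each resulting superblock is entered only at its first block, at the cost of extra code space for the copies of G.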
Tree Groups
• Tree groups
  • Regions of code with only one entry and one or more exit points
• [Figure 4.7: blocks A through G arranged as a tree rooted at A, with block G appearing three times]
SPEC benchmarks
• Integer SPEC benchmarks
  • 176.gcc – GNU Compiler
  • 181.mcf – Combinatorial Optimization
  • 197.parser – Word Processing
  • 252.eon – Computer Visualization
  • 256.bzip2 – Compression
• Floating-point SPEC benchmarks
  • 171.swim – Shallow Water Modeling
  • 173.applu – Parabolic / Elliptic Partial Differential Equations
  • 187.facerec – Image Processing
  • 189.lucas – Number Theory