Fully Dynamic Specialization AJ Shankar OSQ Lunch 9 December 2003
“That’s Why They Play the Game” • Programs are executed because we can’t determine their behavior statically! • Idea: Optimize programs dynamically to take advantage of runtime information we can’t get statically • Look at portions of the program for predictable inputs that we can optimize for
Specialization • Recompile portions of the program, using known runtime values as constants • Possibly many variants of the same code • Allow for fallback to original code when assumptions are not met • Predictable == recurrent
[Figure: a generic version G and specialized variants P2, P3, P4 handle predictable inputs; unpredictable inputs fall back to the generic code]
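To make the idea concrete, here is a minimal sketch (not from the talk; the names scale_generic and scale_spec_2 are invented): a generic routine, a variant recompiled as if a recurrent runtime value were a compile-time constant, and a guard that falls back to the generic code when the assumption does not hold.

  #include <stdio.h>

  /* Generic code G: scale every element by a runtime-supplied factor. */
  void scale_generic(int *a, int n, int factor) {
      for (int i = 0; i < n; i++)
          a[i] *= factor;
  }

  /* Specialized variant: recompiled as if factor == 2 were a constant,
   * so the multiply strength-reduces to a shift. */
  void scale_spec_2(int *a, int n) {
      for (int i = 0; i < n; i++)
          a[i] <<= 1;
  }

  /* Dispatch with fallback: use the variant when the predictable value
   * recurs, otherwise run the original code. */
  void scale(int *a, int n, int factor) {
      if (factor == 2)
          scale_spec_2(a, n);
      else
          scale_generic(a, n, factor);
  }

  int main(void) {
      int a[4] = {1, 2, 3, 4};
      scale(a, 4, 2);                      /* hits the specialized variant */
      printf("%d %d %d %d\n", a[0], a[1], a[2], a[3]);
      return 0;
  }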
How It Works • Choose a good region of code to specialize: after a good predictable instruction • Insert a dispatch that checks the result of the chosen instruction • Recompile code for different results of the instruction • During execution, jump to the appropriate specialized code
[Figure: a trigger such as LOAD pc or X = … feeds Dispatch(X), which jumps to Spec1, Spec2, or Default, then rejoins the rest of the code]
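A compilable sketch of this dispatch structure (not the talk's code): the trigger produces X, and the dispatch routes hot results to specialized variants and everything else to the original code. The hot values 15 and 27 and the functions spec_for_15, spec_for_27, original_body are invented stand-ins.

  #include <stdio.h>

  static void spec_for_15(void)    { puts("specialized body compiled for X == 15"); }
  static void spec_for_27(void)    { puts("specialized body compiled for X == 27"); }
  static void original_body(int x) { printf("generic body, X = %d\n", x); }

  static void run_region(int x) {           /* x = result of the trigger instruction */
      switch (x) {                          /* Dispatch(X) inserted after the trigger */
      case 15: spec_for_15(); break;        /* Spec1 */
      case 27: spec_for_27(); break;        /* Spec2 */
      default: original_body(x); break;     /* Default: fall back to original code */
      }
      /* ... rest of code ... */
  }

  int main(void) {
      run_region(15);
      run_region(27);
      run_region(42);
      return 0;
  }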
When Is This a Good Idea? • Any app whose execution is heavily dependent on input • For instance • Interpreters • Raytracers • Dynamic content producers (CGI scripts, etc.)
Specialization Is Hard! • Specializing code at runtime is costly • Can even slow the program down • Existing specializers rely on static annotations to clue them in about profitable areas • Difficult to get right • Limits specialization potential
Existing: DyC, Cyclone, etc. • Explicitly annotate static data • No support for automatic specialization of frequently-executed code • Could compile lots of useless stuff • No concrete store information • Doesn’t take advantage of the fact that memory location X is constant for the lifetime of the program
Existing: Calpa • Mock et al., 2000; an extension to DyC • Profile execution on a sample input to derive annotations automatically • But converting a concrete profile to an abstract annotation loses information: • Still unable to detect concrete memory constants • Is code that is frequent on the sample input frequent for arbitrary inputs? • Still needs source, is offline!
Motivating Example: Interpreter
Sample interpreted program:
X = 10;
…
WHILE (Z != 0) {
  Y = X + Z;
  …
}
The interpreter's main loop:
while (1) {
  i = instrs[pc];
  switch (i.opcode) {
  case ADD:
    env[i.res] = env[i.op1] + env[i.op2];
    pc++;
    break;
  case BNEQ:
    if (env[i.op1] != 0) pc = i.op2;
    else pc++;
    break;
  ...
  }
}
• X is constant after initialization (a concrete memory location) • Y = X + Z is executed frequently
Motivating Example: Interpreter
Sample interpreted program:
X = 10;
…
WHILE (Z != 0) {
  Y = X + Z;
  …
}
The specialized interpreter: a tight loop handles the hot value pc == 15, and everything else falls back to the normal interpreter loop:
while (1) {
  while (pc == 15) {
    // Y = X + Z
    env[3] = 10 + env[2];
    …
    // Z != 0 ?
    if (env[2] == 0) pc = 19;
  }
  // pc != 15: the normal interpreter loop
  i = instrs[pc];
  switch (i.opcode) {
  case ADD: env[i.res] = env[i.op1] + env[i.op2]; pc++; break;
  case BNEQ: if (env[i.op1] != 0) pc = i.op2; else pc++; break;
  ...
  }
}
A More Concrete Approach • Do everything at runtime! • Specialize on execution-time hot values • Know which concrete memory locations are constant • Other benefits of this approach: • Specialize temporally, as execution progresses • Specialize dynamically loaded libraries as well • No annotations or source code necessary
A Quick Recap • Choose a good region of code to specialize • Insert a dispatch that checks the result of the chosen instruction (the “trigger”) • Recompile code for different values of a hot instruction • During execution, jump to the appropriate specialized code
[Figure: the same picture, instantiated for the interpreter; the trigger is LOAD pc, and Dispatch(pc) jumps to Spec1 for pc=15, Spec2 for pc=27, or the default while(1) loop]
The Details • Need to identify the best predictable instruction • Specializing on its result should provide the greatest benefit • To find it, gather profile information about all instructions • Need to actually do the specializing
Instrumentation: Hot Values • What’s a hot value? One that occurs frequently as the result of an instruction • x % 2 has two very hot values, 0 and 1 • Good candidate instructions are predictable: result in (only) a few hot values • For instance, small_constant_table[x], but not rand(x) • Case study: Interpreter • Predictable instructions: LOAD pc, instr.opcode instr = instrs[pc]; switch(instr.opcode) { … }
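One plausible way to gather this profile (a sketch, not the talk's implementation; struct value_profile and profile_value are invented names): count the values produced by each instrumented instruction in a small fixed-size table, and treat the instruction as predictable if a few entries dominate.

  #include <stdio.h>

  #define SLOTS 4   /* track only a few candidate hot values per instruction */

  struct value_profile {
      long value[SLOTS];
      long count[SLOTS];
      long other;           /* results that did not fit in the table */
  };

  /* Record one result of the instrumented instruction. */
  static void profile_value(struct value_profile *p, long v) {
      for (int i = 0; i < SLOTS; i++) {
          if (p->count[i] != 0 && p->value[i] == v) { p->count[i]++; return; }
          if (p->count[i] == 0) { p->value[i] = v; p->count[i] = 1; return; }
      }
      p->other++;           /* table full: lump into "other" */
  }

  int main(void) {
      struct value_profile p = { {0}, {0}, 0 };
      for (long x = 0; x < 1000; x++)
          profile_value(&p, x % 2);        /* x % 2 has two very hot values */
      for (int i = 0; i < SLOTS; i++)
          if (p.count[i])
              printf("value %ld seen %ld times\n", p.value[i], p.count[i]);
      printf("other results: %ld\n", p.other);
      return 0;
  }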
Instrumentation: Store Profile • Keep track of memory locations that have been written to • Idea: if a location hasn’t been written to yet, it probably won’t be later, either • Case study: Interpreter • Store profile says env[Y] is written to a lot, but env[X] and instrs[] are never written to: env[instr.res] = env[instr.op1] + env[instr.op2];
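A possible store-profile mechanism (again only a sketch; record_write and ever_written are invented names): instrument every store with a barrier that marks the written address in a coarse bitmap, so any location that was never marked is a candidate runtime constant. Hash collisions only make the answer conservatively "written".

  #include <stdint.h>
  #include <stdio.h>

  /* One bit per 8-byte chunk of address space, hashed into a fixed bitmap. */
  #define BITMAP_BYTES (1 << 16)
  static unsigned char written[BITMAP_BYTES];

  static size_t slot(const void *addr) {
      uintptr_t a = (uintptr_t)addr >> 3;              /* 8-byte granularity */
      return (size_t)(a % (BITMAP_BYTES * 8));
  }

  static void record_write(const void *addr) {         /* instrumented store barrier */
      size_t s = slot(addr);
      written[s >> 3] |= (unsigned char)(1u << (s & 7));
  }

  static int ever_written(const void *addr) {
      size_t s = slot(addr);
      return (written[s >> 3] >> (s & 7)) & 1;
  }

  int main(void) {
      int env[4] = {0};
      int instrs[4] = {1, 2, 3, 4};

      env[1] = 42;               /* the interpreter writes env[Y] often ... */
      record_write(&env[1]);
      /* ... but never writes instrs[] or env[X] after initialization */

      printf("env[1] written: %d, instrs[0] written: %d\n",
             ever_written(&env[1]), ever_written(&instrs[0]));
      return 0;
  }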
Invalidating Specialized Code • Memory locations may not really be constant • When ‘constant’ memory is overwritten, must invalidate or modify specializations that depended on it • How does Calpa handle invalidation? • Computes points-to set • Inserts invalidation calls at all appropriate points (offline) • Too costly an approach, without modification
Invalidation Options
class Interpreter {
  private Instruction[] instrs;
  void SetInstrs(Instruction[] is) { instrs = is; }
}
• Write barrier • Still feasible if the field is private • On-entry checks • Feasible if the specialization depends on a small number of memory locations • e.g. Factor(BigInt x) • Hardware support • e.g. Mondrian • Ideal solution • Possible to simulate?
[Figure: the hot instruction is followed by a CheckMem guard and the Dispatch to Spec1 or Default; writes to depended-on memory invalidate the specialization]
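A toy illustration of the two software options above, with invented names (write_table, compute_spec): the write barrier clears a validity flag when the depended-on location is overwritten, and an on-entry check re-validates the one location the specialization assumed constant before running it.

  #include <stdio.h>

  static int table[4] = {10, 20, 30, 40};
  static int spec_valid = 1;                /* cleared by the write barrier */

  static void write_table(int i, int v) {   /* the only writer (e.g. a private field) */
      table[i] = v;
      spec_valid = 0;                       /* write barrier: invalidate specializations */
  }

  static int compute_spec(int z)    { return 10 + z; }         /* assumes table[0] == 10 */
  static int compute_generic(int z) { return table[0] + z; }

  static int compute(int z) {
      /* On-entry check: cheap because the specialization depends on one location. */
      if (spec_valid && table[0] == 10)
          return compute_spec(z);
      return compute_generic(z);
  }

  int main(void) {
      printf("%d\n", compute(5));   /* uses the specialized path */
      write_table(0, 99);
      printf("%d\n", compute(5));   /* falls back to generic code */
      return 0;
  }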
Specialization Algorithm • Find good candidate instructions • Predictable • Frequently executed • For each candidate instruction • Simultaneously evaluate the enclosing method using constant propagation for some of its hot values • Compute the overall cost/benefit • Choose the best instruction
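A hypothetical sketch of the selection step (all numbers and names are invented for illustration): estimate each candidate's payoff by weighting the per-execution saving that constant propagation finds for each hot value by that value's frequency, subtract an estimated compile-and-dispatch cost, and pick the best candidate.

  #include <stdio.h>

  struct hot_value { long value; long freq; long saving; };  /* saving per execution */

  struct candidate {
      const char *name;
      long spec_cost;               /* estimated cost of compiling + dispatching */
      int nvals;
      struct hot_value vals[4];
  };

  static long score(const struct candidate *c) {
      long total = 0;
      for (int i = 0; i < c->nvals; i++)
          total += c->vals[i].freq * c->vals[i].saving;
      return total - c->spec_cost;
  }

  int main(void) {
      /* Rough stand-ins for the interpreter example: opcode folds only the
       * switch; pc also folds operands, memory constants, and the loop itself. */
      struct candidate cands[] = {
          { "instr.opcode", 5000, 2, { { /*ADD*/ 0, 60000, 3 }, { /*BNEQ*/ 1, 40000, 3 } } },
          { "pc",           8000, 2, { { 15, 60000, 10 },       { 16, 40000, 10 } } },
      };
      const struct candidate *best = &cands[0];
      for (int i = 1; i < 2; i++)
          if (score(&cands[i]) > score(best))
              best = &cands[i];
      printf("specialize on %s (score %ld)\n", best->name, score(best));
      return 0;
  }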
Specializing the Interpreter
while (1) {
  i = instrs[pc];
  switch (i.opcode) {
  case ADD: env[i.res] = env[i.op1] + env[i.op2]; pc++; break;
  case BNEQ: if (env[i.op1] != 0) pc = i.op2; else pc++; break;
  ...
  }
}
Candidates:
• instr.opcode: executed very frequently; a small handful of values
• pc: executed very frequently; more values, but still reasonable
Specializing on instr.opcode
Dispatch(opcode); constant-propagation trace for the hot value i.opcode = ADD, with the known state in brackets and the running benefit where the slide gives it:
  LOOP: i = instrs[pc]                                      → i = instrs[pc]
  switch (i.opcode)      [i.opcode = ADD]                   → switch (ADD)                benefit = 1
  case ADD:              [i.opcode = ADD]                   → (case test folded away)     benefit = 2
  env[i.res] = env[i.op1] + env[i.op2]   [i.opcode = ADD]   → env[i.res] = env[i.op1] + env[i.op2]
  pc = pc + 1            [i.opcode = ADD]                   → pc = pc + 1
  goto LOOP              [i.opcode = ADD]                   → goto LOOP                   benefit = 3
  LOOP: i = instrs[pc]   [{}: i is reloaded, nothing known]
Other values of opcode have similar results…
Specializing on pc
Dispatch(pc); constant-propagation trace for the hot value pc = 15 (the instruction Y = X + Z), with the known state in brackets and the running benefit where the slide gives it:
  LOOP: i = instrs[pc]     [pc = 15]                        → i = instrs[15]              benefit = 1
  switch (i.opcode)        [pc = 15; i = ADD Y, X, Z]       → switch (ADD)                benefit = 2
  case ADD:                [pc = 15; i = ADD Y, X, Z]       → (case test folded away)     benefit = 3
  env[i.res] = env[i.op1] + env[i.op2]   [pc = 15; i = ADD Y, X, Z]   → env[Y] = 10 + env[Z]   benefit = 6
  pc = pc + 1              [pc = 15; i = ADD Y, X, Z]       → pc = 15 + 1                 benefit = 7
  goto LOOP                [pc = 16; i = ADD Y, X, Z]       → goto LOOP                   benefit = 8
  LOOP: i = instrs[pc]     [pc = 16]                        → i = instrs[16]              benefit = 9
  switch (i.opcode)        [pc = 16; i = BNEQ Z, 15]        → switch (BNEQ)               benefit = 10
  if (env[i.op1] != 0) …   [pc = 16; i = BNEQ Z, 15]        → if (env[Z] != 0) …          benefit = …
  pc++; …
Final Result • Choose to specialize on pc because benefit is far greater than for instr.opcode • Generate different versions for each of the hottest values of pc • Terminate loop unrolling either naturally (when we don’t know what pc is anymore) or with a simple heuristic
Implementation Ideas • Use Dynamo • Hot trace as basis for specialization • Intuitively, follow the lifetime of an object as it travels through the program across function boundaries • Unfortunately, closed-source, and API isn’t expressive enough
Implementation Ideas • JikesRVM • Java VM written in Java • Has a primitive framework for sampling • Has a fairly sophisticated framework for dynamic recompilation • Does aggressive inlining • Only instrument hot traces (but compiler is slow…)