Online partial evaluation of bytecodes (3). rei@415.
Run-time Optimization • Overview of Specialization • The DyC System • The Dynamo System • Run-time Specialization for ML
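Specialization • ProgramInput1 is a specialized program • Specializer performs specialization of Program with respect to Input1 • [Diagram] Program takes Input1 and Input2 and produces Output; the Specializer takes Program and Input1 and produces ProgramInput1, which takes Input2 and produces the same Output • [Example] add applied to 1 and 2 gives 3; the Specializer turns add and the input 1 into succ, and succ applied to 2 gives 3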
Flow-Chart Language • Program with inputs n and m and entry point (init):
    init: result := 1; goto test;
    test: if n < 1 then end else loop
    loop: result := result * m; n := n - 1; goto test;
    end:  return result;
• With n = 2 and m = 5, the output is 5^2 = 25
• With n = 2 and m unknown (?), the output is ?^2
• The Specializer takes the program and n = 2 and produces the residual program Program2
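Flow-Chart Language • Program2, the program specialized with respect to n = 2, with input m and entry point (init):
    init:  goto test0;
    test0: goto loop0;
    loop0: result := 1 * m; goto test1;
    test1: goto loop1;
    loop1: result := result * m; goto test2;
    test2: goto end;
    end:   return result;
• With m unknown (?), the output is ?^2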
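The step from the power program to Program2 can be sketched in Python. This is only an illustration under stated assumptions: the names `specialize_power` and `compile_residual` are hypothetical, and the residual program is emitted as Python source rather than as Flow-Chart Language. Statements that depend only on the static input n are executed at specialization time; statements that depend on the dynamic input m are emitted into the residual program.

```python
def specialize_power(n):
    """Return Python source for the power program specialized w.r.t. n."""
    body = ["result = 1"]
    while not n < 1:                         # static test: decided now
        body.append("result = result * m")   # dynamic statement: residualized
        n = n - 1                            # static statement: executed now
    body.append("return result")
    return "def power_n(m):\n" + "".join("    " + line + "\n" for line in body)


def compile_residual(source):
    """Turn the residual source text into a callable function."""
    namespace = {}
    exec(source, namespace)
    return namespace["power_n"]


program2_source = specialize_power(2)
program2 = compile_residual(program2_source)
```

Here `program2(5)` computes the same result as the original program on n = 2 and m = 5, but its body contains no test on n and no loop.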
What is Inside the Specializer? • A specializer is a program processor • Originally seen as a static source-to-source transformation • Two families of specializers: • Online specialization (one pass), • Offline specialization (two passes)
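Online Specialization vs. Offline Specialization • Online: Program and Input1 go to the Specializer, which produces ProgramInput1 (the specialized program); ProgramInput1 applied to Input2 gives Output • Offline: Program, together with the hint “It’s Input1 that will be given first”, goes through Analysis, which produces an Annotated Program; the Specializer takes the Annotated Program and Input1 and produces ProgramInput1 (the specialized program); ProgramInput1 applied to Input2 gives Output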
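The analysis pass of an offline specializer is commonly a binding-time analysis: knowing only which inputs will arrive first, it classifies every variable as static or dynamic. The following is a minimal sketch, assuming a hypothetical program representation in which each assignment is a pair (target, variables read); it is not taken from any particular system.

```python
def binding_times(assignments, dynamic_inputs):
    """Return the set of dynamic variables (fixpoint over the assignments)."""
    dynamic = set(dynamic_inputs)
    changed = True
    while changed:
        changed = False
        for target, used in assignments:
            # A variable becomes dynamic once it is computed from a dynamic one.
            if target not in dynamic and dynamic & set(used):
                dynamic.add(target)
                changed = True
    return dynamic


# The power program: result := 1; result := result * m; n := n - 1
power_assignments = [
    ("result", []),
    ("result", ["result", "m"]),
    ("n", ["n"]),
]
dynamic_vars = binding_times(power_assignments, dynamic_inputs={"m"})
```

With m dynamic, `result` becomes dynamic while `n` stays static, so the test on n and the decrement of n can be executed at specialization time.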
DyC’s Run-time Specialization • Specialization that occurs at run time • Run time is defined by the availability of the input • The optimization entails an obvious cost/performance trade-off: the time spent specializing has to be paid back by the specialized code • Typical situation for a pair of variables: • The static variable is constant and becomes available earlier, and • The dynamic variable becomes available later and is not constant • The DyC system is a run-time specialization system based on offline specialization of FCL
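Online Specialization vs. Offline Specialization (at run time) • Online: Program and the run-time Input1 go to the Run-time Specializer, which produces ProgramInput1 (the specialized program); ProgramInput1 applied to Input2 gives Output • Offline: Program, together with the hint “It’s Input1 that will be given first”, goes through Analysis, which produces an Annotated Program; the Run-time Specializer takes the Annotated Program and the run-time Input1 and produces ProgramInput1 (the specialized program); ProgramInput1 applied to Input2 gives Output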
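DyC (Generating Extension) vs. Offline Specialization • Offline: Program, together with the hint “It’s Input1 that will be given first”, goes through Analysis, which produces an Annotated Program; the Run-time Specializer takes the Annotated Program and the run-time Input1 and produces ProgramInput1 (the specialized program); ProgramInput1 applied to Input2 gives Output • DyC: Program and the same hint go through Analysis, which produces an Annotated Program; Cogen turns the Annotated Program into a Generating Extension (a custom specializer, i.e. a version of the specializer specialized with respect to the annotated program); the Generating Extension takes the run-time Input1 and produces ProgramInput1 (the specialized program); ProgramInput1 applied to Input2 gives Output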
‘It’s Input1 that will be given first’ • The hint is written as a make_static (n) annotation at the top of the program (inputs n and m, entry point (init)):
    make_static (n)
    init: result := 1; goto test;
    test: if n < 1 then end else loop
    loop: result := result * m; n := n - 1; goto test;
    end:  return result;
• At run time, with n = 2 and m = 5, the output is 5^2 = 25
• make_static (n) means: specialize the program w.r.t. n if the specialized version is not in the cache
• The cache already holds Program4 and Program3; after this run it also holds Program2
Annotation-directed Run-time Optimization • The underlying language is a version of the Flow-Chart Language • Annotations help avoid non-termination and unneeded specialization: • Eager: aggressive speculative specialization • Lazy: demand-driven specialization • Annotations guide cache policy: • CacheAllUnchecked variable: disposable code • CacheOne variable: the current version is cached
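The effect of a cache policy can be sketched as follows. This is a hedged illustration, not DyC's implementation: the classes `KeepAllVersions` and `KeepOneVersion` and the function `make_power` are hypothetical names, and a Python closure stands in for real code generation. The first class keeps every version keyed by the static value, like the cache holding Program4, Program3 and Program2; the second is loosely modeled on the CacheOne idea of keeping only the current version.

```python
def make_power(n):
    """Stand-in for the specializer: returns the program specialized w.r.t. n."""
    def power_n(m):
        result = 1
        for _ in range(n):
            result = result * m
        return result
    return power_n


class KeepAllVersions:
    """Keep every specialized version, keyed by the static value."""
    def __init__(self, specialize):
        self.specialize = specialize
        self.versions = {}
        self.specializations = 0

    def lookup(self, static_value):
        if static_value not in self.versions:      # specialize if not in cache
            self.versions[static_value] = self.specialize(static_value)
            self.specializations += 1
        return self.versions[static_value]


class KeepOneVersion:
    """Keep only the current version; respecialize when the static value changes."""
    def __init__(self, specialize):
        self.specialize = specialize
        self.key = None
        self.version = None
        self.specializations = 0

    def lookup(self, static_value):
        if self.version is None or static_value != self.key:
            self.key = static_value
            self.version = self.specialize(static_value)
            self.specializations += 1
        return self.version
```

Looking up n = 2, then 3, then 2 again costs two specializations with the first policy and three with the second, which trades memory for specialization time.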
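DyC (Generating Extension) • The Program is annotated C containing make_static (…) • Analysis produces the Annotated Program, an annotated intermediate representation • Cogen produces the Generating Extension (custom specializer) as native code • At run time, the Generating Extension takes Input1 and produces ProgramInput1 (the specialized program) as native code • ProgramInput1 applied to Input2 gives Output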
The Dynamo System • A six-year project • Transparent operation (custom crt0.o) • Dynamo is a PA-8000 code interpreter • Assumption: “Most of the time is spent in a small portion of the code” • Performance opportunities: • Redundancies that cross program boundaries • Cache utilization • It interprets until a trace is detected, for which it generates a fragment
Most Recently Executed Tail • A trace is delimited by start-of-trace and end-of-trace conditions • Start-of-trace condition: • Target of backward-taken branches • loop header • Taken branches from fragment code exit • End-of-trace condition: • Backward-taken branches • only loops whose header is the start-of-trace are allowed to appear in the trace • Taken branches to fragment code entry • Each branch is associated with a counter
[Flowchart] Dynamo at run time, driven by the native instruction stream of the native code • The Interpreter interprets until a taken branch • It looks up the branch target in the fragment cache • Hit: context switch and jump to the top of the fragment in the cache • Miss: check the start-of-trace condition; if it does not hold, go back to interpreting • If it holds: increment the counter associated with the branch target address • If the counter value does not exceed the hot threshold, go back to interpreting • If it does: interpret and generate code (intermediate representation) until a taken branch, and repeat until the end-of-trace condition holds • Then the Optimizer creates a new fragment and optimizes it • The fragment is emitted into the cache and linked with other fragments, and the associated counter is recycled
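The counter-based trace selection can be sketched on a recorded stream of executed addresses. This is a simplified model under loud assumptions: the function `select_traces` is hypothetical, a branch counts as backward-taken when the next address is not greater than the current one, the threshold value is arbitrary, and execution from the fragment cache, linking and optimization are not modeled.

```python
from collections import Counter


def select_traces(pcs, hot_threshold=2):
    """Return {start address: trace} for loop headers that became hot."""
    counters = Counter()   # one counter per candidate start-of-trace address
    fragments = {}         # stands in for the fragment cache
    recording = None       # (start, trace) while in "interpret + codegen" mode
    for prev, pc in zip(pcs, pcs[1:]):
        backward = pc <= prev
        if recording is not None:
            start, trace = recording
            if not backward:
                trace.append(pc)           # keep growing the trace
                continue
            fragments[start] = trace       # end-of-trace: emit the fragment
            del counters[start]            # recycle the associated counter
            recording = None
        if backward and pc not in fragments:   # start-of-trace condition
            counters[pc] += 1
            if counters[pc] > hot_threshold:
                recording = (pc, [pc])
    return fragments


# A loop over addresses 10, 11, 12 executed five times, then an exit to 13.
stream = [10, 11, 12] * 5 + [13]
hot_fragments = select_traces(stream)
```

In this stream the loop header 10 is the target of a backward-taken branch four times; the third occurrence pushes its counter past the threshold, and the next iteration is recorded as the trace.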
Dynamo: Notes • The total overhead is less than 1.5% • The optimizer’s share of it is negligible • The average overall speedup is about 9% • Dynamic branches are treated lazily: • We may loop back to the start-of-trace • If the target is in the cache, we’re done • If not, we return to the interpreter • The optimizer actually performs a partial evaluation similar to DyC’s
Run-time Specialization for ML • The ML virtual machine features partial application by means of closures • We may see currying as an annotation that guides run-time specialization • By means of a pe annotation, we would like to perform on-demand specialization • merge list1 list2 • (pe merge list1) list2 • In the context of a virtual machine, specialization is a bytecode-to-bytecode transformation
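Online Specialization + JIT • The Program is ML bytecode carrying a pe annotation • At run time, the Run-time Specializer takes the Program and Input1 and produces ProgramInput1 (the specialized program) as ML bytecode • The JIT compiles it into ProgramInput1 as native code • ProgramInput1 applied to Input2 gives Output
Online Specialization + JIT vs. DyC (Generating Extension) • Online specialization + JIT: Program with pe; at run time the Run-time Specializer takes the Program and Input1 and produces ProgramInput1 (the specialized program); the JIT compiles it; ProgramInput1 applied to Input2 gives Output • DyC: Program with make_static (…); Analysis produces the Annotated Program; Cogen produces the Generating Extension (custom specializer); at run time the Generating Extension takes Input1 and produces ProgramInput1 (the specialized program); ProgramInput1 applied to Input2 gives Output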
DyC (Generating Extension) vs. Online Specialization + JIT • Same two pipelines as on the previous slide: DyC goes from Program with make_static (…) through Analysis, Annotated Program, Cogen and the Generating Extension (custom specializer) to ProgramInput1; online specialization + JIT goes from Program with pe through the Run-time Specializer and the JIT to ProgramInput1; in both, ProgramInput1 applied to Input2 gives Output • Labels added on the diagram: Standard Compilation, Non-Standard Compilation, Online Strategy, Offline Strategy, Portability (twice), Reusable
Interpreter, Threaded Interpreter, Just-in-time Compilation • Interpreter (switch dispatch):
    switch (*code_ptr) {
    case ACC: code_ptr++; accu = stack[*code_ptr++]; break;
    case POP: code_ptr++; stack += *code_ptr++; break;
    …
• Threaded interpreter (GCC labels as values):
    void* array[] = {&&lbl_ACC, &&lbl_POP, …};
    lbl_ACC: code_ptr++; accu = stack[*code_ptr++]; goto *array[*code_ptr];
    lbl_POP: code_ptr++; stack += *code_ptr++; goto *array[*code_ptr];
    …
• The next step is just-in-time compilation
Just-in-time Compilation • Just-in-time compilation is the natural step following threaded code • Also involves inlining of the virtual machine • Implemented via GCC’s asm statement • ACC arg = 0x8B 0x44 0x24 0x4*arg (movl 4*arg(%esp,1),%eax) • POP arg = 0x59 0x59 … (popl %ecx; popl %ecx; …) • CONSTINT arg = 0xB8 0xarg 0x00 0x00 0x00 0xD1 0xE0 0x40 (movl $arg,%eax; shl %eax; inc %eax)
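The template-based code generation can be sketched as a function that maps each bytecode to its machine-code bytes. This is a minimal sketch under assumptions: the names `emit` and `jit` are hypothetical, arguments are assumed to be small non-negative integers (so that 4*arg fits in the one-byte displacement), and the bytes are only produced, not executed. For CONSTINT it uses the standard IA-32 encoding of movl $imm32,%eax, which is opcode 0xB8 followed by the immediate in little-endian order.

```python
def emit(op, arg):
    """Return the IA-32 bytes of the template for one bytecode."""
    if op == "ACC":
        # movl 4*arg(%esp), %eax
        return bytes([0x8B, 0x44, 0x24, 4 * arg])
    if op == "POP":
        # popl %ecx, repeated arg times
        return bytes([0x59]) * arg
    if op == "CONSTINT":
        # movl $arg, %eax ; shl %eax ; inc %eax  (yields the tagged integer 2*arg + 1)
        return bytes([0xB8]) + arg.to_bytes(4, "little") + bytes([0xD1, 0xE0, 0x40])
    raise ValueError("no template for " + op)


def jit(code):
    """Concatenate the templates of a bytecode sequence."""
    return b"".join(emit(op, arg) for op, arg in code)


native = jit([("CONSTINT", 3), ("ACC", 2), ("POP", 2)])
```

A real implementation would also have to copy these bytes into executable memory and handle the remaining bytecodes, which is outside the scope of this sketch.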
Specialization Algorithm • Implemented as an interpreter/[JIT-]compiler • Performs aggressive: • Constant propagation • Unfolding of [recursive] function calls • Manipulates symbolic values: • <Static, Integer> • <Static, Pointer> • <Dynamic, ?> • <Dynamic, Integer, Greatest Lower Value> • <Dynamic, Pointer, Greatest Lower Length>
Mixed Computation • Interpreting static expressions • Residualizing dynamic expressions • Subject program: ACC 1 PUSH ACC 1 ADDINT • The Specializer takes the subject program and a context and produces a specialized program: • Context 0: <S,3> 1: <S,4> gives CONSTINT 7 • Context 0: <S,3> 1: <D,?> gives ACC 0 PUSH CONSTINT 3 ADDINT • Context 0: <D,?> 1: <S,4> gives ACC 0 PUSH CONSTINT 4 PUSH ACC 1 ADDINT POP 1 • Context 0: <D,?> 1: <D,?> gives ACC 1 PUSH ACC 1 ADDINT
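Mixed computation over this small bytecode can be sketched with a symbolic accumulator and stack. The code below is an illustration under assumptions, not the specializer described in the slides: `run` and `specialize` are hypothetical names, only ACC, PUSH, CONSTINT, ADDINT and POP are handled, and the residual stack is assumed to hold only the dynamic slots of the context. The residual code it emits may differ from the one shown on the slide while computing the same value.

```python
def run(code, stack):
    """Reference interpreter; stack[0] is the top of the stack."""
    stack, accu = list(stack), 0
    for op, *args in code:
        if op == "ACC":
            accu = stack[args[0]]
        elif op == "PUSH":
            stack.insert(0, accu)
        elif op == "CONSTINT":
            accu = args[0]
        elif op == "ADDINT":
            accu = accu + stack.pop(0)
        elif op == "POP":
            del stack[:args[0]]
    return accu


DYN = ("D", None)


def specialize(code, context):
    """Interpret static operations, emit residual code for dynamic ones."""
    stack, accu, out = list(context), ("S", 0), []
    for op, *args in code:
        if op == "ACC":
            entry = stack[args[0]]
            if entry[0] == "D":
                # Residual index: number of dynamic slots above this one.
                depth = sum(1 for e in stack[:args[0]] if e[0] == "D")
                out.append(("ACC", depth))
            accu = entry
        elif op == "PUSH":
            if accu[0] == "D":
                out.append(("PUSH",))
            stack.insert(0, accu)
        elif op == "CONSTINT":
            accu = ("S", args[0])
        elif op == "ADDINT":
            top = stack.pop(0)
            if accu[0] == "S" and top[0] == "S":
                accu = ("S", accu[1] + top[1])        # computed now
            else:
                if top[0] == "S":                     # dynamic accu, static operand
                    out += [("PUSH",), ("CONSTINT", top[1])]
                elif accu[0] == "S":                  # static accu, dynamic operand
                    out.append(("CONSTINT", accu[1]))
                out.append(("ADDINT",))
                accu = DYN
    if accu[0] == "S":
        out.append(("CONSTINT", accu[1]))             # lift the static result
    return out


subject = [("ACC", 1), ("PUSH",), ("ACC", 1), ("ADDINT",)]
all_static = specialize(subject, [("S", 3), ("S", 4)])
slot1_dynamic = specialize(subject, [("S", 3), DYN])
slot0_dynamic = specialize(subject, [DYN, ("S", 4)])
```

Running a residual program on the dynamic values alone should give the same result as running the subject program on the full stack.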
Program Point Specialization • We define a program point as either: • the entry of the program, • the else-branch of a dynamic conditional branch, • the application of a recursive function • We define a context as the state of the stack when we enter a program point • Each specialized program point is associated with its residual code • We maintain a cache of recursive function applications together with their contexts
Algorithm (Specializer) • Input: A program point and a context • Output: The corresponding specialized program
    processed := {}
    pending := {pp0, cx0}
    while pending != {} do
        {pp, cx} := one element of pending
        if {pp, cx} notin processed then
            specialize pp w.r.t. cx
            processed := processed + {pp, cx}
        pending := pending - {pp, cx}
    arrange processed
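The worklist loop can be sketched in Python as follows. Two points are assumptions of this sketch and are not stated in the algorithm above: the `specialize` callback returns, besides the residual code, the new (program point, context) pairs it discovered, and these are added to the pending set. The function names and the toy `countdown` specializer are hypothetical, and contexts are assumed to be hashable.

```python
def specialize_program(pp0, cx0, specialize):
    """Specialize every reachable (program point, context) pair exactly once."""
    processed = {}                 # (pp, cx) -> residual code
    pending = [(pp0, cx0)]
    while pending:
        pp, cx = pending.pop()     # take (and remove) one element of pending
        if (pp, cx) not in processed:
            residual, discovered = specialize(pp, cx)
            processed[(pp, cx)] = residual
            pending.extend(discovered)   # assumption: new points are queued here
    return processed


def countdown(pp, cx):
    """Toy specializer: a loop whose context is a static counter."""
    if cx > 0:
        return "step_%d" % cx, [(pp, cx - 1)]
    return "done", []


residuals = specialize_program("loop", 2, countdown)
```

Because a pair already in `processed` is skipped, a program point that is reached again with the same context does not trigger another specialization.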
Notes • Non-termination is avoided by means of the cache of recursive function applications • A BRANCHEND bytecode is added to make lifting easier in a dynamic context • Branches with side effects are ruled dynamic • The actual work is performed by the specialize function
Conclusion • We’ve seen two recent run-time optimizing systems (PLDI’99 & PLDI’00) • DyC performs accurate optimizations thanks to: • advanced specialization • programmer annotations • Dynamo is faster than native code execution: • despite the interpretive overhead • while staying completely transparent • In a similar vein, we would like to: • speed up ML interpretation • while retaining portability