How to select superinstructions for Ruby ZAKIROV Salikh*, CHIBA Shigeru*, and SHIBAYAMA Etsuya** * Tokyo Institute of Technology, dept. of Mathematical and Computing Sciences ** Tokyo University, Information Technology Center
Ruby • Dynamic language • Becoming popular recently • Numeric benchmarks run 100-1000 times slower than the equivalent programs in C [Figure: benchmark comparison, numeric benchmarks marked in red] * http://shootout.alioth.debian.org/
Interpreter optimization efforts • Many techniques for optimizing interpreters have been proposed • Threaded interpretation • Stack top caching • Pipelining • Superinstructions (focus of this presentation) • Superinstructions • Merge the code of operations executed in sequence
Superinstructions (contrived example)
PUSH: // put <imm> argument on stack
    stack[sp++] = *pc++;
    goto **pc++;
ADD: // add two topmost values on stack
    sp--; stack[sp-1] += stack[sp];
    goto **pc++;
Concatenated, dispatch eliminated:
PUSH_ADD: // add <imm> to stack top
    stack[sp++] = *pc++;
    //goto **pc++;
    sp--; stack[sp-1] += stack[sp];
    goto **pc++;
Optimizations applied:
PUSH_ADD: // add <imm> to stack top
    stack[sp-1] += *pc++;
    goto **pc++;
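The saving can be made visible by counting dispatches. The sketch below is not the code from the slide: it assumes a portable switch-based loop instead of threaded `goto **pc++` dispatch, and the opcode names, `run`, and `run_result` are hypothetical. It runs the same computation with and without a `PUSH_ADD` superinstruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for this sketch; not taken from any real VM. */
enum { OP_PUSH, OP_ADD, OP_PUSH_ADD, OP_HALT };

#define STACK_MAX 16

/* Value left on top of the stack, and how many dispatches it took. */
typedef struct { long top; int dispatches; } run_result;

run_result run(const long *pc)
{
    long stack[STACK_MAX];
    size_t sp = 0;
    run_result r = { 0, 0 };

    for (;;) {
        r.dispatches++;               /* one trip through the switch */
        switch (*pc++) {
        case OP_PUSH:                 /* put <imm> on the stack */
            assert(sp < STACK_MAX);
            stack[sp++] = *pc++;
            break;
        case OP_ADD:                  /* add two topmost values */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_PUSH_ADD:             /* superinstruction: add <imm> to stack top */
            assert(sp >= 1);
            stack[sp - 1] += *pc++;
            break;
        case OP_HALT:
            assert(sp >= 1);
            r.top = stack[sp - 1];
            return r;
        default:
            assert(0 && "unknown opcode");
            return r;
        }
    }
}

/* PUSH 2; PUSH 3; ADD; HALT */
static const long plain_prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
/* PUSH 2; PUSH_ADD 3; HALT */
static const long fused_prog[] = { OP_PUSH, 2, OP_PUSH_ADD, 3, OP_HALT };
```

Calling `run(plain_prog)` and `run(fused_prog)` yields the same top-of-stack value, with one dispatch fewer for the fused program.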
Superinstructions (effects) • Effects • Reduce dispatch overhead • Eliminate some jumps • Provide more context for the indirect branch predictor by replicating indirect jump instructions • Allow more optimizations within a VM op
Good for reducing dispatch overhead Superinstructions help when: • VM operations are small (~10 hardware ops per VM op) • Dispatch overhead is high (~50%) Examples of successful use in prior research • ANSI C interpreter: 2-3 times improvement (Proebsting 1995) • OCaml: more than 50% improvement (Piumarta 1998) • Forth: 20-80% improvement (Ertl 2003)
Ruby does not fit well Superinstructions help when: • VM operations are small (~10 hardware ops per VM op), BUT Ruby has 60-140 hardware ops per VM op • Dispatch overhead is high (~50%), BUT Ruby has only 1-3% misprediction overhead on interpreter dispatch (Hardware profiling data on Intel Core 2 Duo)
Superinstructions for Ruby • We experimentally evaluated the effect of “naive” superinstructions on Ruby • Superinstructions are selected statically • Combinations of length 2 that occur frequently in a training run are selected as superinstructions • The training run uses the same benchmark • Superinstructions are constructed by concatenating C source code, with C compiler optimizations applied
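The selection step can be sketched as a pair-frequency count over a training trace. This is a minimal illustration, assuming the trace is an array of small integer opcodes; `NUM_OPS`, `op_pair`, and `most_frequent_pair` are hypothetical names, and the slides do not say how the actual profile was collected.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_OPS 8   /* hypothetical opcode count for this sketch */

typedef struct { int first, second; unsigned long count; } op_pair;

/* Count every adjacent opcode pair in a training trace and return the
   most frequent one. On a tie, the pair found first in (first, second)
   order wins. An empty or one-element trace yields count == 0. */
op_pair most_frequent_pair(const int *trace, size_t len)
{
    unsigned long counts[NUM_OPS][NUM_OPS] = { { 0 } };
    op_pair best = { -1, -1, 0 };

    for (size_t i = 0; i + 1 < len; i++) {
        assert(trace[i] >= 0 && trace[i] < NUM_OPS);
        assert(trace[i + 1] >= 0 && trace[i + 1] < NUM_OPS);
        counts[trace[i]][trace[i + 1]]++;
    }
    for (int a = 0; a < NUM_OPS; a++) {
        for (int b = 0; b < NUM_OPS; b++) {
            if (counts[a][b] > best.count) {
                best.first = a;
                best.second = b;
                best.count = counts[a][b];
            }
        }
    }
    return best;
}
```

Selecting the top N pairs instead of just one would only require sorting the count table; the single-winner version keeps the sketch short.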
Effect of naive superinstructions on Ruby • Limited benefit • Unpredictable effects [Figure: normalized execution time vs. number of superinstructions used, 4 benchmarks]
Branch mispredictions [Figure: normalized execution time vs. number of superinstructions used, 2 benchmarks: mandelbrot and spectral_norm]
Branch mispredictions, reordered [Figure: normalized execution time vs. number of superinstructions used, reordered by execution time, 2 benchmarks: mandelbrot and spectral_norm]
So why is Ruby slow? • Profile of numeric benchmarks • Garbage collection takes significant time • Boxed floating point values dominate allocation
Floating point value boxing
Typical Ruby 1.9 VM operation:
OPT_PLUS:
    VALUE a = *(sp-2);
    VALUE b = *(sp-1);
    /* ... */
    if (CLASS_OF(a) == Float && CLASS_OF(b) == Float) {
        sp--;
        *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b));
    } else {
        CALL(1/*argnum*/, PLUS, a);
    }
    goto **pc++;
A new “box” object is allocated on each operation
Proposal: use superinstructions for boxing optimization
• 2 operations per allocation instead of 1
• Boxing of the intermediate result is eliminated
OPT_MULT_OPT_PLUS:
    VALUE a = *(sp-3);
    VALUE b = *(sp-2);
    VALUE c = *(sp-1);
    /* ... */
    if (CLASS_OF(a) == Float && CLASS_OF(b) == Float && CLASS_OF(c) == Float) {
        sp-=2;
        *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b)*DOUBLE_VALUE(c));
    } else {
        CALL(1/*argnum*/, MULT/*method*/, b/*receiver*/);
        CALL(1/*argnum*/, PLUS/*method*/, a/*receiver*/);
    }
    goto **pc++;
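The allocation saving can be checked outside the VM with a stand-in box type. The sketch below is an illustration only: `boxed_float`, `new_float`, and the two functions are hypothetical and use `malloc` in place of Ruby's object allocator and garbage collector.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a boxed Ruby Float; the real VALUE layout differs. */
typedef struct { double value; } boxed_float;

unsigned long allocations = 0;   /* number of boxes created so far */

boxed_float *new_float(double d)
{
    boxed_float *b = malloc(sizeof *b);
    assert(b != NULL);
    b->value = d;
    allocations++;
    return b;
}

/* Separate opt_mult and opt_plus: the product is boxed, then unboxed again. */
boxed_float *mult_then_plus(const boxed_float *a, const boxed_float *b,
                            const boxed_float *c)
{
    boxed_float *tmp = new_float(b->value * c->value);
    boxed_float *res = new_float(a->value + tmp->value);
    free(tmp);   /* a real VM would leave this to the garbage collector */
    return res;
}

/* Fused superinstruction: the product stays an unboxed double. */
boxed_float *mult_plus_fused(const boxed_float *a, const boxed_float *b,
                             const boxed_float *c)
{
    return new_float(a->value + b->value * c->value);
}
```

Both functions return the same value; the fused one creates one box where the separate operations create two.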
Implementation • VM operations that handle floating point values directly: • opt_plus • opt_minus • opt_mult • opt_div • opt_mod • We implemented all 25 combinations of length 2 • Based on Ruby 1.9.1 • Using existing Ruby infrastructure for superinstructions with some modifications
Limitations • Coding style-sensitive • Not applicable to other types (e.g. Fixnum, Bignum, String) • Fixnum is already unboxed • Bignum and String cannot be unboxed • Sequences of 3 arithmetic instructions or longer virtually non-existent • No occurrences in the benchmarks
Evaluation • Methodology • median time of 30 runs • Reduction in allocation
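For reference, the median of a set of timings can be computed as below. This is a generic sketch, not the authors' measurement harness; `median` and `cmp_double` are hypothetical names. The median is less sensitive to a few unusually slow runs than the mean.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles in ascending order. */
static int cmp_double(const void *x, const void *y)
{
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

/* Median of n timings. Works on a sorted copy, so the caller's array
   is left untouched. For even n, averages the two middle values. */
double median(const double *times, size_t n)
{
    assert(n > 0);
    double *sorted = malloc(n * sizeof *sorted);
    assert(sorted != NULL);
    memcpy(sorted, times, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_double);
    double m = (n % 2) ? sorted[n / 2]
                       : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    free(sorted);
    return m;
}
```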
Results • Up to 22% benefit on numeric benchmarks • No slowdown on other benchmarks
Example: mandelbrot tweak
• Slight modification produces 20% difference in performance
• 4 of 9 arithmetic instructions get merged into 2 superinstructions
• 24% reduction in float allocation
[Figure: normalized execution time]
ITER.times do
- tr = zrzr - zizi + cr
+ tr = cr + (zrzr - zizi)
- ti = 2.0*zr*zi + ci
+ ti = ci + 2.0*zr*zi
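Why the reordering matters can be sketched with a peephole scan over stack code. The opcode sequences below are an assumption: they are what a straightforward left-to-right stack compiler would emit for each expression, not bytecode dumped from Ruby, and `OP_LOAD` stands for any operand push (local variable or literal). `count_superinstructions` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified opcodes; OP_LOAD stands for any operand push. */
typedef enum { OP_LOAD, OP_PLUS, OP_MINUS, OP_MULT, OP_DIV, OP_MOD } opcode;

static int is_arith(opcode op) { return op != OP_LOAD; }

/* Greedy left-to-right scan: each pair of adjacent arithmetic ops
   becomes one superinstruction, and neither op is reused. */
size_t count_superinstructions(const opcode *ops, size_t len)
{
    size_t merged = 0;
    assert(ops != NULL || len == 0);
    for (size_t i = 0; i + 1 < len; ) {
        if (is_arith(ops[i]) && is_arith(ops[i + 1])) {
            merged++;
            i += 2;
        } else {
            i++;
        }
    }
    return merged;
}

/* tr = zrzr - zizi + cr */
static const opcode tr_before[] = { OP_LOAD, OP_LOAD, OP_MINUS, OP_LOAD, OP_PLUS };
/* tr = cr + (zrzr - zizi) */
static const opcode tr_after[]  = { OP_LOAD, OP_LOAD, OP_LOAD, OP_MINUS, OP_PLUS };
/* ti = 2.0*zr*zi + ci */
static const opcode ti_before[] = { OP_LOAD, OP_LOAD, OP_MULT, OP_LOAD, OP_MULT,
                                    OP_LOAD, OP_PLUS };
/* ti = ci + 2.0*zr*zi */
static const opcode ti_after[]  = { OP_LOAD, OP_LOAD, OP_LOAD, OP_MULT, OP_LOAD,
                                    OP_MULT, OP_PLUS };
```

Under these assumed sequences, the original forms never place two arithmetic operations next to each other, while each tweaked form ends in one adjacent pair, giving the 2 superinstructions mentioned above.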
Discussion of alternative approaches • Faster GC would improve performance as well • Superinstructions still apply, but with reduced benefit • Type inference • Would allow expressions to be specialized and boxing to be eliminated • Interoperability with dynamic code is an issue • Dynamic specialization • Topic for further research
Related work: Tagged values • Use lower bits of pointers to trigger alternative handling • Embed floating point value into higher bits • Limited to 64-bit platforms, as Ruby uses double precision 64-bit floating point arithmetic • Our approach has the same effect on 32-bit and 64-bit platforms • Eliminates the majority of boxed floats • Provides 28-35% benefit (on the same benchmarks) * Sasada 2008
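The general idea of a tagged value can be sketched as follows. This is a deliberately simplified, lossy scheme chosen for illustration and is not the encoding used in the cited work: it assumes a 64-bit `VALUE`, a hypothetical 2-bit tag `FLOAT_TAG`, and it overwrites the two lowest mantissa bits of the double.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t VALUE;               /* 64-bit word, as on a 64-bit platform */

#define TAG_MASK  ((VALUE)0x3)        /* two low bits carry the tag */
#define FLOAT_TAG ((VALUE)0x2)        /* hypothetical tag for an embedded float */

/* Aligned heap pointers have zero low bits, so they never match the tag. */
int is_tagged_float(VALUE v) { return (v & TAG_MASK) == FLOAT_TAG; }

/* Keep the upper 62 bits of the double and store the tag in the low bits.
   Assumption of this sketch: losing two mantissa bits is acceptable. */
VALUE float_to_value(double d)
{
    uint64_t bits;
    assert(sizeof d == sizeof bits);
    memcpy(&bits, &d, sizeof bits);
    return (bits & ~TAG_MASK) | FLOAT_TAG;
}

/* Clear the tag and reinterpret the remaining bits as a double. */
double value_to_float(VALUE v)
{
    uint64_t bits = v & ~TAG_MASK;
    double d;
    assert(is_tagged_float(v));
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

No heap object is created for such a value, which is why tagging removes float allocation altogether rather than halving it.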
Related work: Lazy boxing • Java-like language with generics over value-types • Boxing needed to avoid duplication of template instantiation code for primitive types • Lazy optimization works by allocating boxed objects in the stack frame, and moving to heap as needed • Relies on static compiler analysis for escape path detection, and runtime checks * Owen 2004
Related work: Superinstructions Superinstructions used for code compression • ANSI C hybrid compiler-interpreter (Proebsting 1995) • Trimedia code compression system (Hoogerbrugge 1999) • Superinstructions chosen statically to minimize code size Superinstructions used to reduce dispatch overhead • Forth (Ertl 2003), OCaml (Piumarta 1998) • Superinstructions chosen dynamically
Conclusion • The naive approach to superinstructions does not produce substantial benefit for Ruby • Boxing of floating point values is a significant overhead in Ruby • Superinstructions provide some help (up to 22%) Future work • Eliminate float boxing further • Specialize the computation loop