Binary Translation Using Peephole Superoptimizers

Sorav Bansal, Alex Aiken, Stanford University

Binary Translation
• Allow one ISA to run on another
• Applications
  • Portability (e.g., running legacy software)
  • Virtualization
  • Backward and Forward Compatibility
  • On-chip binary translation
  • Java Virtual Machines

Binary Translation
[Figure: software stacks on x86 hardware running powerpc apps alongside x86 apps, with the Binary Translator placed at different levels (application, OS, Hypervisor).]

Binary Translation Wish-list
• Performance
• Large Complex ISAs
• Retargetability
• OS Compatibility

Talk Outline
• Superoptimization
• Peephole Superoptimization
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion

Superoptimization
• A superoptimizer is a code generator that uses brute-force search to try to find the optimal code for a given instruction sequence

Superoptimization
• Enumerate all sequences up to a certain length
• Compare each enumerated sequence with the target function for equivalence
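The enumerate-and-compare loop can be sketched on a toy machine. The single 8-bit register and the five instructions below are invented for this illustration and are not the instruction set used in the talk; equivalence is checked by trying all 256 inputs, which is only feasible because the machine is so small.

```python
from itertools import product

MASK = 0xFF  # toy machine: one 8-bit register

# Hypothetical single-register instruction set, invented for this sketch.
OPS = {
    "inc": lambda x: (x + 1) & MASK,
    "dec": lambda x: (x - 1) & MASK,
    "neg": lambda x: (-x) & MASK,
    "not": lambda x: (~x) & MASK,
    "shl": lambda x: (x << 1) & MASK,
}


def run(seq, x):
    """Execute a sequence of toy instructions on input x."""
    for name in seq:
        x = OPS[name](x)
    return x


def superoptimize(target, max_len=3):
    """Return the shortest sequence equivalent to `target`, or None."""
    expected = [run(target, x) for x in range(MASK + 1)]
    # Enumerate candidates shortest-first, so the first match is a shortest one.
    for length in range(max_len + 1):
        for cand in product(OPS, repeat=length):
            if all(run(cand, x) == expected[x] for x in range(MASK + 1)):
                return cand
    return None


print(superoptimize(["not", "inc"]))  # ('neg',)
print(superoptimize(["inc", "dec"]))  # ()
```

The search space grows exponentially with `max_len`, which is why the technique is applied to short sequences only.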

Peephole Superoptimization
• Use a superoptimizer to automatically infer peephole optimizations
• Result: a Table of Peephole Optimizations with columns "pattern" and "replace-with"
[S. Bansal, A. Aiken. Automatic Generation of Peephole Superoptimizers, ASPLOS 2006]

Peephole Superoptimizer Step 1
• Harvest instruction sequences that can potentially be optimized.
• Canonicalize and store them.
[Figure: a.out binary → Target Sequences]

Peephole Superoptimization Step 2
• Brute-force optimization
[Figure: Target Sequences → Optimized Sequences]

Equivalence Test
• Input: two sequences
• Execution Test: fail → not-equivalent; pass → Boolean Test
• Boolean Test: fail → not-equivalent; pass → equivalent
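A minimal sketch of this two-stage test, again on an invented 8-bit toy machine. The execution test runs both sequences on a handful of inputs and rejects most non-equivalent pairs cheaply. The talk's implementation uses a SAT solver for the boolean test; exhaustive enumeration over all 256 inputs stands in for it here, which is an assumption of this sketch and would not scale to real 32-bit state.

```python
MASK = 0xFF  # toy machine: one 8-bit register

# Hypothetical instruction set, invented for this sketch.
OPS = {
    "inc": lambda x: (x + 1) & MASK,
    "dec": lambda x: (x - 1) & MASK,
    "neg": lambda x: (-x) & MASK,
    "not": lambda x: (~x) & MASK,
}

# A few fixed inputs, chosen to include boundary values.
TEST_VECTORS = [0x00, 0x01, 0x7F, 0x80, 0xFF]


def run(seq, x):
    """Execute a sequence of toy instructions on input x."""
    for name in seq:
        x = OPS[name](x)
    return x


def execution_test(a, b):
    """Cheap filter: compare outputs on a few test vectors."""
    return all(run(a, x) == run(b, x) for x in TEST_VECTORS)


def boolean_test(a, b):
    """Full check; exhaustive enumeration stands in for a SAT query."""
    return all(run(a, x) == run(b, x) for x in range(MASK + 1))


def equivalent(a, b):
    """Only pairs that pass the execution test reach the boolean test."""
    if not execution_test(a, b):
        return False
    return boolean_test(a, b)


print(equivalent(["not", "inc"], ["neg"]))  # True
print(equivalent(["inc"], ["dec"]))         # False
```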

Peephole Superoptimization Step 3
• Table of Peephole Optimizations

ppcx86 register map r1eax r1eax r1eax; r2ecx Application to Binary Translation • Our approach: Use lots of peephole transformations pattern (ppc) translate-to (x86) addi r1,r1,1 inc %eax mullw r1,r1,2 add r1,r1,r2

Peephole Binary Translation
[Figure: source arch. (ppc) → register map → destination arch. (x86)]
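One way to picture rule lookup under a register map is the sketch below. The table layout, the placeholder names `A`/`B`, and the `translate` helper are assumptions made for illustration. Only the `addi` rule comes from the slide; the `add` rule is an illustrative addition and is not taken from the talk.

```python
import re

# Rules keyed by a canonical pattern: register names are replaced by
# placeholders in order of first appearance.
RULES = {
    ("addi", ("A", "A", "1")): "inc {A}",       # rule shown on the slide
    ("add", ("A", "A", "B")): "addl {B}, {A}",  # illustrative, not from the talk
}


def canonicalize(instr):
    """Split 'op a,b,c' into a canonical key and a register-to-placeholder map."""
    opcode, rest = instr.split(None, 1)
    names = {}
    operands = []
    for operand in (o.strip() for o in rest.split(",")):
        if re.fullmatch(r"r\d+", operand):
            if operand not in names:
                names[operand] = "ABC"[len(names)]
            operands.append(names[operand])
        else:
            operands.append(operand)  # immediates stay literal
    return (opcode, tuple(operands)), names


def translate(instr, reg_map):
    """Return the x86 text for one PowerPC instruction, or None if no rule matches."""
    key, names = canonicalize(instr)
    template = RULES.get(key)
    if template is None:
        return None
    return template.format(**{ph: reg_map[ppc] for ppc, ph in names.items()})


print(translate("addi r1,r1,1", {"r1": "%eax"}))                 # inc %eax
print(translate("add r3,r3,r4", {"r3": "%eax", "r4": "%ecx"}))  # addl %ecx, %eax
```

Canonicalizing register names lets one rule serve any concrete register assignment, which keeps the table small.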

Register Map Selection
• The best code may require changing the register map from one code point to another
• The choice of register maps affects instruction selection and vice versa

Register Map Selection Example
• At entry: r1→Mr1 ; r2→Mr2
• At exit: r1→Mr1 ; r2→Mr2
• Cost Model
  • Instruction costs: if accesses memory, 10; else, 1
  • Switching costs: R→M or M→R: 10
• powerpc sequence:
  • P0: li r1, 123
  • P1: addi r2, r2, 1
  • P2: subf r2, r1, r2
  • P3: ori r1, r1, 31
  • exit
• x86 sequence: ?

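Register Map Selection: Greedy Strategy
• Cost model: instruction costs 10 if it accesses memory, else 1; switching costs R→M or M→R: 10
• entry: r1→Mr1 ; r2→Mr2
• P0: li r1, 123 ; map r1→Mr1 ; switching cost 0 ; movl $123, Mr1 ; instruction cost 10
• P1: addi r2,r2,1 ; map r2→Mr2 ; switching cost 0 ; incl Mr2 ; instruction cost 10
• P2: subf r2,r1,r2 ; map r1→Mr1 ; r2→eax ; switching cost 10 ; subl Mr1, eax ; instruction cost 10
• P3: ori r1,r1,31 ; map r1→Mr1 ; switching cost 0 ; orl $31, Mr1 ; instruction cost 10
• exit: r1→Mr1 ; r2→Mr2 ; switching cost 10
• Total switching cost 20 ; total instruction cost 40 ; Grand Total 60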

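Register Map Selection: Optimal Solution
• Cost model: instruction costs 10 if it accesses memory, else 1; switching costs R→M or M→R: 10
• entry: r1→Mr1 ; r2→Mr2
• li r1, 123 ; map r1→eax ; switching cost 10 ; movl $123, eax ; instruction cost 1
• addi r2,r2,1 ; map r2→ecx ; switching cost 10 ; incl ecx ; instruction cost 1
• subf r2,r1,r2 ; map r1→eax; r2→ecx ; switching cost 0 ; subl eax, ecx ; instruction cost 1
• ori r1,r1,31 ; map r1→eax ; switching cost 0 ; orl $31, eax ; instruction cost 1
• exit: r1→Mr1 ; r2→Mr2 ; switching cost 20
• Total switching cost 40 ; total instruction cost 4 ; Grand Total 44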
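The two grand totals can be recomputed mechanically from the cost model. The sketch below assumes that a register not listed on a slide row keeps its previous location, and that each register changing location costs 10; the function and variable names are invented for the illustration.

```python
MEM, REG = "M", "R"

# PowerPC registers read or written by each instruction of the example.
PROGRAM = [
    ("li r1, 123", ["r1"]),
    ("addi r2, r2, 1", ["r2"]),
    ("subf r2, r1, r2", ["r1", "r2"]),
    ("ori r1, r1, 31", ["r1"]),
]
ENTRY = {"r1": MEM, "r2": MEM}
EXIT = {"r1": MEM, "r2": MEM}


def switch_cost(prev, cur):
    """10 for every register that moves between memory and an x86 register."""
    return 10 * sum(prev[r] != cur[r] for r in prev)


def instr_cost(used, reg_map):
    """10 if any operand lives in memory, else 1."""
    return 10 if any(reg_map[r] == MEM for r in used) else 1


def total_cost(maps):
    """maps[i] is the register map in effect while instruction i runs."""
    cost, prev = 0, ENTRY
    for (_, used), cur in zip(PROGRAM, maps):
        cost += switch_cost(prev, cur) + instr_cost(used, cur)
        prev = cur
    return cost + switch_cost(prev, EXIT)


# Register maps read off the greedy slide (assumption: r2 stays in eax until exit).
GREEDY = [
    {"r1": MEM, "r2": MEM},
    {"r1": MEM, "r2": MEM},
    {"r1": MEM, "r2": REG},
    {"r1": MEM, "r2": REG},
]
# Register maps read off the second slide.
SECOND_SLIDE = [
    {"r1": REG, "r2": MEM},
    {"r1": REG, "r2": REG},
    {"r1": REG, "r2": REG},
    {"r1": REG, "r2": REG},
]

print(total_cost(GREEDY), total_cost(SECOND_SLIDE))  # 60 44
```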

Register Map Selection
• Use Dynamic Programming
  • near-optimal solution
  • account for translations spanning multiple instructions
  • simultaneously perform instruction-selection and register-mapping
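A minimal dynamic-programming sketch of register map selection follows. It is a simplification and not the talk's algorithm: it handles only one-instruction translations, takes the candidate maps as a fixed list, and uses a hypothetical single-register program with the same style of costs as the example slides.

```python
def select_maps(n_instrs, states, instr_cost, switch_cost, entry, exit_):
    """Return (min_cost, maps), with one register map chosen per instruction.

    best[s] holds the cheapest plan for the instructions seen so far
    that ends with map s in effect.
    """
    best = {s: (switch_cost(entry, s) + instr_cost(0, s), [s]) for s in states}
    for i in range(1, n_instrs):
        nxt = {}
        for s in states:
            # Cheapest predecessor map, including the cost of switching to s.
            prev = min(best, key=lambda p: best[p][0] + switch_cost(p, s))
            cost, path = best[prev]
            nxt[s] = (cost + switch_cost(prev, s) + instr_cost(i, s), path + [s])
        best = nxt
    last = min(best, key=lambda s: best[s][0] + switch_cost(s, exit_))
    cost, path = best[last]
    return cost + switch_cost(last, exit_), path


# Hypothetical program: every instruction uses one PowerPC register, which
# lives either in memory ("M") or in an x86 register ("R").
def toy_instr_cost(i, state):
    return 10 if state == "M" else 1


def toy_switch_cost(a, b):
    return 0 if a == b else 10


print(select_maps(3, ["M", "R"], toy_instr_cost, toy_switch_cost, "M", "M"))
print(select_maps(2, ["M", "R"], toy_instr_cost, toy_switch_cost, "M", "M"))
```

With three instructions the switch into a register pays for itself (23 versus 30); with two it does not (20 versus 22), so the plan stays in memory. The work is linear in the number of instructions and quadratic in the number of candidate maps.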

PowerPC → x86 Translator Implementation
• Superoptimizer
  • Use a PPC emulator (Qemu) for execution test
  • Use a SAT solver (zChaff) for boolean test
• Static user-level translator
  • ELF 32-bit ppc/Linux binary → ELF 32-bit x86/Linux binary
  • Translate most (but not all) system calls

Implementation
• Many Issues
  • Condition Codes, Endianness, System Calls, Stack and Heap, Indirect Jumps, Function Calls and Returns, Register Name Constraints, Untranslated Opcodes, Compiler Optimizations
• Compiler Optimizations
  • Problem: PowerPC optimizer staggers data-dependent instructions to reduce pipeline stalls
  • Solution: Cluster data-dependent instructions in basic block before translation
• Endianness: ppc big-endian ; x86 little-endian
  • Convert all memory writes to big-endian (source)
  • Convert all memory reads to little-endian (dest)
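The endianness rule can be illustrated with a small model of guest memory. The `GuestMemory` class and its method names are invented for this sketch; it only shows the byte-swap on each store and load that keeps memory in PowerPC byte order while values in registers stay in host order.

```python
def bswap32(value):
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


class GuestMemory:
    """Byte array that keeps big-endian (PowerPC) layout on a little-endian host."""

    def __init__(self, size):
        self.data = bytearray(size)

    def store_word(self, addr, value):
        # Swap first, then do a host-order (little-endian) store,
        # so the bytes in memory end up in big-endian order.
        self.data[addr:addr + 4] = bswap32(value).to_bytes(4, "little")

    def load_word(self, addr):
        # Host-order load, then swap to recover the value the program stored.
        return bswap32(int.from_bytes(self.data[addr:addr + 4], "little"))


mem = GuestMemory(8)
mem.store_word(0, 0x11223344)
print(mem.data[0:4].hex())      # 11223344
print(hex(mem.load_word(0)))    # 0x11223344
```

Keeping memory in source byte order means byte-granularity accesses and data shared through files or system calls see the layout the PowerPC program expects.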

Experimental Results
• Setup
  • Pentium4 3.0 GHz, 1MB Cache, 4GB Memory
  • gcc 4.0.1, glibc 2.3.6
  • Use soft-float library
  • Statically-linked input executables
• Benchmarks
  • Microbenchmarks, SPEC CINT2000
• Metrics
  • Compare against natively-compiled code
  • Compare against other binary translators
    • Qemu, Apple’s Rosetta

Experimental Setup
• For our experiments
  • there are around 750 translation rules in the peephole table
  • the translation table is computed offline, and computing the peephole rules can take up to a week

Experimental Results: Setup
• C source → gcc <options> -arch=ppc → PowerPC executable → Peephole Binary Translation → x86 executable
• C source → gcc <options> -arch=x86 → x86 executable
• Compare the two x86 executables

Microbenchmarks

Microbenchmarks
[Chart: percentage of native (%) per microbenchmark]
• avg: 90% of native

Experimental Results: Microbenchmarks
• We sometimes outperform native performance on these small benchmarks!
  • gcc generates better code for PowerPC primarily because it has the luxury of many registers
  • Our register-mapping algorithm performs an efficient “re-allocation” of the PowerPC registers to x86 registers.

Experimental Results: SPEC CINT2000
[Chart: percentage of native (%) per benchmark]

Comparisons with Qemu and Rosetta
• Qemu
  • Use same PowerPC and x86 executables as used for our own translator
• Rosetta
  • Runs on Mac OS X and hence supports only Mac executables
  • Recompiled the benchmarks on Mac using the same compiler version (gcc 4.0.1)
  • Mac Hardware: Intel Core 2 Duo 1.83GHz processor, 32KB L1-cache, 2MB L2-cache and 2GB memory

Comparisons with Qemu and Rosetta
[Charts: qemu, rosetta and peep compared at -O0 and at -O2]
• avg: 3% faster than rosetta
• avg: 12% faster than rosetta

Translation Time
• Takes 2-6 minutes to translate a 650KB executable (around 100K instructions)
  • majority of time spent in optimal register map computation
• It is possible to reduce this to <10 seconds
  • For 98K instructions (<0.01% of time), use any register map. Fast (<1 second)
  • For other 2K, use optimal computation

Conclusions and Future Work
• A scheme to perform efficient binary translation using a superoptimizer
  • Competitive performance
  • Simplified Design
• Other applications
  • Just-in-time compilation
  • Machine virtualization

Q&A Thank you.

Backup Slides
