370 likes | 613 Views
Binary Translation Using Peephole Superoptimizers. Sorav Bansal, Alex Aiken Stanford University. Binary Translation. Allow one ISA to run on another Applications Portability (e.g., running legacy software) Virtualization Backward and Forward Compatibility On-chip binary translation
E N D
Binary Translation Using Peephole Superoptimizers Sorav Bansal, Alex Aiken Stanford University
Binary Translation • Allow one ISA to run on another • Applications • Portability (e.g., running legacy software) • Virtualization • Backward and Forward Compatibility • On-chip binary translation • Java Virtual Machines
powerpc app x86 app x86 app powerpc app x86 app x86 app Binary Translator powerpc OS x86 OS OS Binary Translator Hypervisor x86 hardware x86 hardware x86 app x86 app powerpc app Binary Translator OS x86 hardware Binary Translation
Performance Large Complex ISAs Retargetability OS Compatibility Binary Translation Wish-list
Talk Outline Superoptimization Peephole Superoptimization Application to Binary Translation Implementation & Experimental Results Conclusion
Superoptimization • Superoptimizer is a unique code generator that uses brute-force search to attempt to find the optimal code Eg.
Superoptimization • Enumerate all sequences up to a certain length and • Compare each enumerated sequence with target function for equivalence
Talk Outline Superoptimization Peephole Superoptimization Application to Binary Translation Implementation & Experimental Results Conclusion
Peephole Superoptimization Use a superoptimizer to automatically infer peephole optimizations pattern replace-with Table of Peephole Optimizations [S. Bansal, A. Aiken. Automatic Generation of Peephole Superoptimizers, ASPLOS 2006]
a.out 01000100101111010001110110101110101010001010101000101010001000101010100101010010101010100101000010101111110110010101010110111101001010100101010010101001010100111001111101001000110111101101110101000100110101010101010101010101010101010101010011010010010101010101010101010100001111110101011110101000111101010101110111011011101110111011101010011011001010101101101… 01100101 Peephole SuperoptimizerStep 1 Harvest instruction sequences that can potentially be optimized. Canonicalize and store them. Target Sequences
Peephole SuperoptimizationStep 2 Brute force Optimization Optimized Sequences Target Sequences
Equivalence Test Two sequences Execution Test pass Boolean Test equivalent fail fail not-equivalent not-equivalent
Peephole SuperoptimizationStep 3 Table of Peephole Optimizations
Talk Outline Superoptimization Peephole Superoptimizers Application to Binary Translation Implementation & Experimental Results Conclusion
ppcx86 register map r1eax r1eax r1eax; r2ecx Application to Binary Translation • Our approach: Use lots of peephole transformations pattern (ppc) translate-to (x86) addi r1,r1,1 inc %eax mullw r1,r1,2 add r1,r1,r2
Peephole Binary Translation source arch. (ppc) register map destination arch. (x86)
Register Map Selection • The best code may require changing the register map from one code point to another • The choice of register maps affects the choice of instruction selection and vice-versa
At entry: r1Mr1 ; r2Mr2 At exit: r1Mr1 ; r2Mr2 Cost Model Instruction costs If accesses memory, 10 Else, 1 Switching Costs RM or MR : 10 Register Map Selection Example ? x86 sequence: powerpc sequence: P0 li r1, 123 addi r2, r2, 1 subf r2, r1, r2 ori r1, r1, 31 P1 P2 P3 exit
0 r1 Mr1 movl $123, Mr1 10 0 r2 Mr2 incl Mr2 10 10 r1 Mr1 ; r2 eax subl Mr1, eax 10 0 r1 Mr1 orl $31, Mr1 10 10 Total 20 Total 40 Grand Total 60 Register Map Selection Greedy Strategy entry r1 Mr1 ; r2 Mr2 P0: li r1, 123 P1: addi r2,r2,1 P2: subf r2,r1,r2 P3: ori r1,r1,31 r1 Mr1 ; r2 Mr2 exit Instruction costs If accesses memory, 10 Else, 1 Switching Costs RM or MR : 10
10 r1 eax movl $123, eax 1 10 r2 ecx incl ecx 1 0 r1 eax; r2 ecx subl eax, ecx 1 0 r1 eax orl $31, eax 1 20 r1 Mr1 ; r2 Mr2 Total 40 Total 4 Grand Total 44 Register Map Selection Optimal Solution entry r1 Mr1 ; r2 Mr2 li r1, 123 addi r2,r2,1 subf r2,r1,r2 ori r1,r1,31 exit Switching Costs RM or MR : 10 Instruction costs If accesses memory, 10 Else, 1
Register Map Selection • Use Dynamic Programming • near-optimal solution • account for translations spanning multiple instructions • simultaneously perform instruction-selection and register-mapping
Talk Outline Superoptimization Peephole Superoptimizers Application to Binary Translation Implementation & Experimental Results Conclusion
Powerpc X86 Translator Implementation • Superoptimizer • Use a PPC emulator (Qemu) for execution test • Use a SAT solver (zChaff) for boolean test • Static user-level translator • ELF 32-bit ppc/Linux binary ELF 32-bit x86/Linux binary • Translate most (but not all) system calls
Compiler Optimizations • Problem:PowerPC optimizer staggers data-dependent instructions to reduce pipeline stalls • Solution: Cluster data-dependent instructions in basic block before translation • Endianness: ppc big-endian ; x86 little-endian • Convert all memory writes to big-endian (source) • Convert all memory reads to little-endian (dest) Implementation • Many Issues • Condition Codes, Endianness, System Calls, Stack and Heap, Indirect Jumps, Function Calls and Returns, Register Name Constraints, Untranslated Opcodes, Compiler Optimizations
Experimental Results • Setup • Pentium4 3.0 GHz, 1MB Cache, 4GB Memory • gcc 4.0.1, glibc 2.3.6 • Use soft-float library • Statically-linked input executables • Benchmarks • Microbenchmarks, SPEC CINT2000 • Metrics • Compare against natively-compiled code • Compare against other binary translators • Qemu, Apple’s Rosetta
Experimental Setup • For our experiments • there are around 750 translation rules in the peephole table • the translation table is computed offline and it can take up to a week to compute the peephole rules
Experimental Results:Setup C source gcc <options> -arch=ppc gcc <options> -arch=x86 PowerPC executable x86 executable Peephole Binary Translation Compare x86 executable
Microbenchmarks Percentage of native (%) avg: 90% of native
Experimental Results: Microbenchmarks • We sometimes outperform native performance on these small benchmarks! • gcc generates better code for powerpc primarily because it has the luxury of many registers • Our register-mapping algorithm performs an efficient “re-allocation” of the PowerPC registers to x86 registers.
Experimental Results:SPEC CINT2000 Percentage of native (%)
Comparisons with Qemu and Rosetta • Qemu • Use same PowerPC and x86 executables as used for our own translator • Rosetta • Runs on Mac OS X and hence supports on Mac executables • Recompiled the benchmarks on Mac using the same compiler version (gcc 4.0.1) • Mac Hardware: Intel Core 2 Duo 1.83GHz processor, 32KB L1-cache, 2MB L2-cache and 2GB memory
Comparisons with Qemu and Rosetta -O0 -O2 qemu rosetta peep avg: 3% faster than rosetta avg: 12% faster than rosetta
Translation Time • Takes 2-6 minutes to translate a 650KB executable (around 100K instructions) • majority of time spent in optimal register map computation • It is possible to reduce this to <10 seconds • For 98K instructions (<0.01% of time), use any register map. Fast (<1second) • For other 2K, use optimal computation
Conclusions and Future Work • A scheme to perform efficient binary translation using a superoptimizer • Competitive performance • Simplified Design • Other applications • Just-in-time compilation • Machine virtualization