
Binary Translation Using Peephole Superoptimizers

Sorav Bansal, Alex Aiken (Stanford University). Binary translation allows software written for one ISA to run on another. Applications: portability (e.g., running legacy software), virtualization, backward and forward compatibility, and on-chip binary translation.


Presentation Transcript


  1. Binary Translation Using Peephole Superoptimizers • Sorav Bansal, Alex Aiken • Stanford University

  2. Binary Translation • Allow one ISA to run on another • Applications • Portability (e.g., running legacy software) • Virtualization • Backward and Forward Compatibility • On-chip binary translation • Java Virtual Machines

  3. Binary Translation • [Figure: three configurations of a binary translator on x86 hardware; box labels: powerpc app, x86 app, Binary Translator, OS, powerpc OS, x86 OS, Hypervisor, x86 hardware]

  4. Binary Translation Wish-list • Performance • Large Complex ISAs • Retargetability • OS Compatibility

  5. Talk Outline • Superoptimization • Peephole Superoptimization • Application to Binary Translation • Implementation & Experimental Results • Conclusion

  6. Superoptimization • A superoptimizer is a code generator that uses brute-force search to try to find the optimal code for a target function • [Example on the slide not captured in the transcript]

  7. Superoptimization • Enumerate all sequences up to a certain length • Compare each enumerated sequence with the target function for equivalence
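The enumerate-and-compare loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-register 8-bit machine, the opcode set `OPS`, and the names `run` and `superoptimize` are hypothetical and do not come from the talk. Equivalence is checked by trying all 256 inputs, which is feasible only because the toy machine is so small.

```python
from itertools import product

MASK = 0xFF  # hypothetical toy machine: a single 8-bit register

# Hypothetical opcode set; each opcode maps the register value to a new value.
OPS = {
    "inc": lambda x: (x + 1) & MASK,
    "dec": lambda x: (x - 1) & MASK,
    "neg": lambda x: (-x) & MASK,
    "not": lambda x: (~x) & MASK,
    "shl": lambda x: (x << 1) & MASK,
}

def run(seq, x):
    """Execute a sequence of opcodes on register value x."""
    for op in seq:
        x = OPS[op](x)
    return x

def superoptimize(target, max_len):
    """Return the first shortest opcode sequence that matches target on every input."""
    for length in range(max_len + 1):              # shortest sequences first
        for seq in product(OPS, repeat=length):    # enumerate all sequences of this length
            if all(run(seq, x) == target(x) for x in range(MASK + 1)):
                return list(seq)
    return None  # nothing found within max_len instructions

# -x - 1 is the same function as bitwise NOT, so one instruction suffices.
print(superoptimize(lambda x: (-x - 1) & MASK, 2))  # ['not']
```

The search space grows exponentially with sequence length, which is why the enumeration is bounded by `max_len`.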

  8. Talk Outline • Superoptimization • Peephole Superoptimization • Application to Binary Translation • Implementation & Experimental Results • Conclusion

  9. Peephole Superoptimization • Use a superoptimizer to automatically infer peephole optimizations • Table of Peephole Optimizations, with columns: pattern, replace-with • [S. Bansal, A. Aiken. Automatic Generation of Peephole Superoptimizers, ASPLOS 2006]

  10. Peephole Superoptimizer, Step 1 • Harvest instruction sequences (from binaries such as a.out) that can potentially be optimized • Canonicalize and store them • Result: Target Sequences

  11. Peephole Superoptimization, Step 2 • Brute-force optimization: Target Sequences → Optimized Sequences

  12. Equivalence Test • Input: two sequences • Execution Test: fail → not-equivalent; pass → Boolean Test • Boolean Test: fail → not-equivalent; pass → equivalent
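The two-stage structure of the equivalence test (a cheap execution test that rejects most candidates, followed by a precise boolean test) can be sketched as below. This is a sketch under assumptions: sequences are modelled as Python functions on 8-bit values, and the boolean test is replaced here by exhaustive enumeration as a stand-in for a SAT query. The names `execution_test`, `boolean_test`, and `equivalent` are illustrative only.

```python
import random

MASK = 0xFF  # assumption: sequences are modelled as functions on 8-bit values

def execution_test(f, g, trials=16, seed=0):
    """Fast filter: run both sequences on a few pseudo-random inputs."""
    rng = random.Random(seed)
    inputs = [rng.randrange(MASK + 1) for _ in range(trials)]
    return all(f(x) == g(x) for x in inputs)

def boolean_test(f, g):
    """Precise check. Stand-in for a SAT query: exhaustive over the 8-bit domain."""
    return all(f(x) == g(x) for x in range(MASK + 1))

def equivalent(f, g):
    """Two-stage test: only candidates that pass the cheap test reach the precise one."""
    return execution_test(f, g) and boolean_test(f, g)

double = lambda x: (x * 2) & MASK
add_self = lambda x: (x + x) & MASK
almost = lambda x: 1 if x == 200 else (x * 2) & MASK  # differs on one input only

print(equivalent(double, add_self))  # True
print(equivalent(double, almost))    # False
```

The point of the ordering is cost: the execution test is cheap and discards most non-equivalent pairs, so the expensive precise check runs only on the few survivors.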

  13. Peephole Superoptimization, Step 3 • Table of Peephole Optimizations

  14. Talk Outline • Superoptimization • Peephole Superoptimizers • Application to Binary Translation • Implementation & Experimental Results • Conclusion

  15. ppcx86 register map r1eax r1eax r1eax; r2ecx Application to Binary Translation • Our approach: Use lots of peephole transformations pattern (ppc) translate-to (x86) addi r1,r1,1 inc %eax mullw r1,r1,2 add r1,r1,r2

  16. Peephole Binary Translation • source arch. (ppc) • register map • destination arch. (x86)

  17. Register Map Selection • The best code may require changing the register map from one code point to another • The choice of register maps affects the choice of instruction selection and vice-versa

  18. At entry: r1Mr1 ; r2Mr2 At exit: r1Mr1 ; r2Mr2 Cost Model Instruction costs If accesses memory, 10 Else, 1 Switching Costs RM or MR : 10 Register Map Selection Example ? x86 sequence: powerpc sequence: P0 li r1, 123 addi r2, r2, 1 subf r2, r1, r2 ori r1, r1, 31 P1 P2 P3 exit

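  19. Register Map Selection: Greedy Strategy • entry: r1→Mr1 ; r2→Mr2 • P0: li r1, 123 | map r1→Mr1 | switching cost 0 | movl $123, Mr1 | instruction cost 10 • P1: addi r2,r2,1 | map r2→Mr2 | switching cost 0 | incl Mr2 | instruction cost 10 • P2: subf r2,r1,r2 | map r1→Mr1 ; r2→eax | switching cost 10 | subl Mr1, eax | instruction cost 10 • P3: ori r1,r1,31 | map r1→Mr1 | switching cost 0 | orl $31, Mr1 | instruction cost 10 • exit: r1→Mr1 ; r2→Mr2 | switching cost 10 • Total switching cost 20 • Total instruction cost 40 • Grand total 60 • Instruction costs: if the instruction accesses memory, 10; else, 1 • Switching costs: R→M or M→R: 10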

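  20. Register Map Selection: Optimal Solution • entry: r1→Mr1 ; r2→Mr2 • li r1, 123 | map r1→eax | switching cost 10 | movl $123, eax | instruction cost 1 • addi r2,r2,1 | map r2→ecx | switching cost 10 | incl ecx | instruction cost 1 • subf r2,r1,r2 | map r1→eax ; r2→ecx | switching cost 0 | subl eax, ecx | instruction cost 1 • ori r1,r1,31 | map r1→eax | switching cost 0 | orl $31, eax | instruction cost 1 • exit: r1→Mr1 ; r2→Mr2 | switching cost 20 • Total switching cost 40 • Total instruction cost 4 • Grand total 44 • Switching costs: R→M or M→R: 10 • Instruction costs: if the instruction accesses memory, 10; else, 1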
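The totals on the greedy and optimal slides (60 and 44) can be re-derived from the stated cost model. The sketch below assumes a simplified encoding that is not from the talk: each register is tracked only as being in memory (`"M"`) or in some x86 register (`"R"`), an instruction costs 10 if any register it touches is in memory and 1 otherwise, and each register that moves between memory and a register costs 10. The names `USES`, `plan_cost`, and the two plans are illustrative.

```python
MEM, REG = "M", "R"

# PowerPC registers touched by each instruction of the example: li, addi, subf, ori.
USES = [("r1",), ("r2",), ("r1", "r2"), ("r1",)]

def instr_cost(regmap, used):
    """10 if the instruction touches a memory-mapped register, else 1."""
    return 10 if any(regmap[r] == MEM for r in used) else 1

def switch_cost(a, b):
    """10 for every register that moves between memory and a register."""
    return 10 * sum(a[r] != b[r] for r in a)

def plan_cost(plan, boundary):
    """Total cost of a plan: one register map per instruction, fixed entry/exit map."""
    total, prev = 0, boundary
    for regmap, used in zip(plan, USES):
        total += switch_cost(prev, regmap) + instr_cost(regmap, used)
        prev = regmap
    return total + switch_cost(prev, boundary)  # restore the exit map

boundary = {"r1": MEM, "r2": MEM}
greedy = [
    {"r1": MEM, "r2": MEM},  # movl $123, Mr1
    {"r1": MEM, "r2": MEM},  # incl Mr2
    {"r1": MEM, "r2": REG},  # subl Mr1, eax
    {"r1": MEM, "r2": REG},  # orl $31, Mr1
]
optimal = [
    {"r1": REG, "r2": MEM},  # movl $123, eax
    {"r1": REG, "r2": REG},  # incl ecx
    {"r1": REG, "r2": REG},  # subl eax, ecx
    {"r1": REG, "r2": REG},  # orl $31, eax
]
print(plan_cost(greedy, boundary), plan_cost(optimal, boundary))  # 60 44
```

The greedy plan pays little for switching but 10 per instruction, while the second plan pays 40 up front for switching and then only 1 per instruction.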

  21. Register Map Selection • Use Dynamic Programming • near-optimal solution • account for translations spanning multiple instructions • simultaneously perform instruction-selection and register-mapping
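A dynamic program over register maps can be sketched as a shortest-path computation: for each instruction and each candidate map, keep the cheapest way to reach that point. The version below is a simplified illustration, not the authors' algorithm. It assumes per-instruction costs are given as a table, it does not handle translations spanning multiple instructions, and it does not perform instruction selection. The name `best_plan` and the one-register toy input are hypothetical.

```python
def best_plan(states, instr_costs, switch_cost, boundary):
    """Return (cost, maps) minimising instruction plus switching costs.

    instr_costs[i][s] is the cost of instruction i under register map s;
    boundary is the register map required at entry and at exit.
    """
    # best[s] = (cheapest cost after the instructions so far, ending in map s; path)
    best = {s: (switch_cost(boundary, s) + instr_costs[0][s], [s]) for s in states}
    for costs in instr_costs[1:]:
        nxt = {}
        for s in states:
            # Cheapest predecessor map, including the cost of switching into s.
            cost, path = min(
                ((c + switch_cost(p, s), pth) for p, (c, pth) in best.items()),
                key=lambda t: t[0],
            )
            nxt[s] = (cost + costs[s], path + [s])
        best = nxt
    # Add the cost of switching back to the required exit map.
    return min(
        ((c + switch_cost(s, boundary), pth) for s, (c, pth) in best.items()),
        key=lambda t: t[0],
    )

# Hypothetical toy input: one register, kept in memory ("M") or in a register ("R").
switch = lambda a, b: 0 if a == b else 10
per_instr = {"M": 10, "R": 1}

print(best_plan(["M", "R"], [per_instr] * 3, switch, "M"))  # (23, ['R', 'R', 'R'])
print(best_plan(["M", "R"], [per_instr] * 2, switch, "M"))  # (20, ['M', 'M'])
```

With three instructions the 20 units spent on switching pay off, with two they do not, which is the kind of trade-off a purely local greedy choice can miss.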

  22. Talk Outline • Superoptimization • Peephole Superoptimizers • Application to Binary Translation • Implementation & Experimental Results • Conclusion

  23. PowerPC → x86 Translator Implementation • Superoptimizer • Use a PPC emulator (Qemu) for execution test • Use a SAT solver (zChaff) for boolean test • Static user-level translator • ELF 32-bit ppc/Linux binary → ELF 32-bit x86/Linux binary • Translate most (but not all) system calls

  24. Implementation • Many Issues • Condition Codes, Endianness, System Calls, Stack and Heap, Indirect Jumps, Function Calls and Returns, Register Name Constraints, Untranslated Opcodes, Compiler Optimizations • Compiler Optimizations • Problem: the PowerPC optimizer staggers data-dependent instructions to reduce pipeline stalls • Solution: cluster data-dependent instructions in each basic block before translation • Endianness: ppc is big-endian; x86 is little-endian • Convert all memory writes to big-endian (source) • Convert all memory reads to little-endian (dest)
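The endianness handling can be illustrated with a small simulation: values are byte-swapped on every store so that memory keeps the big-endian layout the PowerPC program expects, and swapped back on every load. This is a sketch assuming 32-bit words and a `bytearray` standing in for guest memory; `bswap32`, `store_word`, and `load_word` are hypothetical helper names, and the `"<I"` format simulates what a little-endian x86 host would do natively.

```python
import struct

def bswap32(x):
    """Reverse the byte order of a 32-bit value."""
    return struct.unpack("<I", struct.pack(">I", x))[0]

def store_word(mem, addr, value):
    """Little-endian host store of a swapped value: memory ends up big-endian."""
    struct.pack_into("<I", mem, addr, bswap32(value))

def load_word(mem, addr):
    """Little-endian host load followed by a swap: recovers the original value."""
    return bswap32(struct.unpack_from("<I", mem, addr)[0])

mem = bytearray(8)
store_word(mem, 4, 0x11223344)
print(mem[4:8].hex())          # 11223344 (big-endian layout in memory)
print(hex(load_word(mem, 4)))  # 0x11223344
```

Keeping memory in the source layout means byte-level accesses and data shared with untranslated structures still see what the original program would have seen.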

  25. Experimental Results • Setup • Pentium4 3.0 GHz, 1MB Cache, 4GB Memory • gcc 4.0.1, glibc 2.3.6 • Use soft-float library • Statically-linked input executables • Benchmarks • Microbenchmarks, SPEC CINT2000 • Metrics • Compare against natively-compiled code • Compare against other binary translators • Qemu, Apple’s Rosetta

  26. Experimental Setup • For our experiments • there are around 750 translation rules in the peephole table • the translation table is computed offline and it can take up to a week to compute the peephole rules

  27. Experimental Results: Setup • C source → gcc <options> -arch=ppc → PowerPC executable → Peephole Binary Translation → x86 executable • C source → gcc <options> -arch=x86 → x86 executable • Compare the two x86 executables

  28. Microbenchmarks

  29. Microbenchmarks • [Chart: percentage of native (%)] • avg: 90% of native

  30. Experimental Results: Microbenchmarks • We sometimes outperform native performance on these small benchmarks! • gcc generates better code for powerpc primarily because it has the luxury of many registers • Our register-mapping algorithm performs an efficient “re-allocation” of the PowerPC registers to x86 registers.

  31. Experimental Results: SPEC CINT2000 • [Chart: percentage of native (%)]

  32. Comparisons with Qemu and Rosetta • Qemu • Use the same PowerPC and x86 executables as used for our own translator • Rosetta • Runs on Mac OS X and hence supports only Mac executables • Recompiled the benchmarks on Mac using the same compiler version (gcc 4.0.1) • Mac Hardware: Intel Core 2 Duo 1.83GHz processor, 32KB L1-cache, 2MB L2-cache and 2GB memory

  33. Comparisons with Qemu and Rosetta • [Chart: qemu, rosetta, and peep at -O0 and -O2] • avg: 3% faster than rosetta • avg: 12% faster than rosetta

  34. Translation Time • Takes 2-6 minutes to translate a 650KB executable (around 100K instructions) • Majority of time spent in optimal register map computation • It is possible to reduce this to <10 seconds • For 98K instructions (<0.01% of time), use any register map. Fast (<1 second) • For the other 2K, use optimal computation

  35. Conclusions and Future Work • A scheme to perform efficient binary translation using a superoptimizer • Competitive performance • Simplified Design • Other applications • Just-in-time compilation • Machine virtualization

  36. Q&A Thank you.

  37. Backup Slides
