Studying the performance of the FX!32 binary translation system

Explore the history, features, and performance of the FX!32 binary translation system, as presented in October 1999 by researchers from Compaq Computer, Northeastern University, and the University of Colorado. The slides cover the system's transparency goals, the flow of information between its components, and the tools used to measure its performance.


Presentation Transcript


  1. Studying the performance of the FX!32 binary translation system
     Paul Drongowski, David Hunter (Compaq Computer)
     Morteza Fayyazi, David Kaeli (Northeastern University)
     Jason Casmira (University of Colorado)
     16 October 1999

  2. History and goals
     • Run x86 architecture WIN32 applications
     • History
       • First released in October 1996; v1.5 shipped in July 1999
       • Over 13,000 copies downloaded from FX!32 web site
       • Factory-installed software on Alpha/NT workstations
     • Transparency
       • Applications install in the expected way
       • Applications launch in the expected way
       • Applications interoperate with Alpha components
     • Good performance relative to contemporary x86 machines

  3. FX!32 information flow
     [Diagram labels: Transparency Agent, Runtime, Translator, FX!32 Server (components); x86 Images, Translated Images, Execution Profiles (data)]

  4. Translation
     [Pipeline diagram; labels in slide order]
     • CALL targets
     • Indirect flow edges
     • x86 semantics
     • Discover code
     • Parse x86 instructions
     • Expand condition code semantics
     • Expand overlaid register semantics
     • Fold x86 address modes
     • Unaligned references
     • Improve and lower
     • Select Alpha code sequences
     • Allocate registers
     • Schedule instructions
     • Alpha semantics
     • Assemble into translated image
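Two of the steps named on this slide, expanding overlaid register semantics and expanding condition code semantics, can be illustrated with a small sketch. On x86, AL, AH and AX are views of EAX, and SUB sets flags implicitly; a translator targeting a machine without either feature has to make both explicit. The function names below are hypothetical and the code is only an illustration of the idea, not the FX!32 Translator's actual output.

```python
MASK32 = 0xFFFFFFFF

def write_al(eax: int, value: int) -> int:
    """Write AL: replace bits 0..7, preserve bits 8..31 of EAX."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)

def write_ah(eax: int, value: int) -> int:
    """Write AH: replace bits 8..15, preserve the rest of EAX."""
    return (eax & 0xFFFF00FF) | ((value & 0xFF) << 8)

def write_ax(eax: int, value: int) -> int:
    """Write AX: replace bits 0..15, preserve bits 16..31 of EAX."""
    return (eax & 0xFFFF0000) | (value & 0xFFFF)

def sub32_flags(a: int, b: int):
    """32-bit SUB with the zero, sign and carry flags computed explicitly."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    zf = int(result == 0)   # zero flag: result is zero
    sf = result >> 31       # sign flag: top bit of the result
    cf = int(a < b)         # carry flag: unsigned borrow
    return result, zf, sf, cf
```

A translator would normally emit the flag computations only when a later instruction actually reads them, which is one reason an "improve and lower" phase follows the expansion.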

  5. “DCPI” Continuous Profiling Infrastructure
     • System-wide profiling
       • Samples Alpha performance counters (cycles, cache misses, stalls, etc.)
       • Coarse grain and code-level
     • Analyzes and displays performance information
       • Summarize by image
       • Generate annotated disassembly
     • Analyze WRT a hardware model
       • Suggest likely causes of problem
     • Example: Cache conflict with Emulator service

     Cycles  D-miss   CPI  Address    Instruction
        227      75   2.0  0050022b8  ldl t3, 0(s5)
       2384     223   2.1  0050022bc  ldah t3, -1(t3)
       1072     217   1.0  0050022c0  lda t3, 393c(t3)
       1066       1   1.0  0050022c4  beq t3, 05002380
       1060       3        0050022c8  beq t11, 05002300
         33       0   0.0  0050022cc  ldq t0, 0(t11)
       2146     441   2.0  0050022d0  subl t0, s5, at
       1055       4   1.0  0050022d4  bne at, 05002300
       1047       2   1.0  0050022d8  srl t0, #20, t10
       1049       2   1.0  0050022dc  beq t10, 05002300
       1150      24   1.0  0050022e0  ldq t0, 0(t10)
      42269    2854  38.6  0050022e4  ldl t1, 1584(t4)
         52      12   0.0  0050022e8  subl t0, s5, at
       2121     192   1.9  0050022ec  bis at, t1, at
       1041       4   0.9  0050022f0  bne at, 05002300
          0       0        0050022f4  sra t0, #20, t2
        992       2        0050022f8  bic t2, #1, t2
          0       0        0050022fc  bne t2, 050027a0
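As a rough illustration of how an annotated disassembly like the one on this slide is read, the sketch below copies the cycle samples from the slide and finds the instruction that accounts for most of them. The `hottest` helper is hypothetical and is not part of DCPI.

```python
# (cycle samples, address, instruction) as listed on the slide
SAMPLES = [
    (227, "0050022b8", "ldl t3, 0(s5)"),
    (2384, "0050022bc", "ldah t3, -1(t3)"),
    (1072, "0050022c0", "lda t3, 393c(t3)"),
    (1066, "0050022c4", "beq t3, 05002380"),
    (1060, "0050022c8", "beq t11, 05002300"),
    (33, "0050022cc", "ldq t0, 0(t11)"),
    (2146, "0050022d0", "subl t0, s5, at"),
    (1055, "0050022d4", "bne at, 05002300"),
    (1047, "0050022d8", "srl t0, #20, t10"),
    (1049, "0050022dc", "beq t10, 05002300"),
    (1150, "0050022e0", "ldq t0, 0(t10)"),
    (42269, "0050022e4", "ldl t1, 1584(t4)"),
    (52, "0050022e8", "subl t0, s5, at"),
    (2121, "0050022ec", "bis at, t1, at"),
    (1041, "0050022f0", "bne at, 05002300"),
    (0, "0050022f4", "sra t0, #20, t2"),
    (992, "0050022f8", "bic t2, #1, t2"),
    (0, "0050022fc", "bne t2, 050027a0"),
]

def hottest(samples):
    """Return (address, instruction, share of sampled cycles) for the hottest row."""
    total = sum(cycles for cycles, _, _ in samples)
    cycles, addr, insn = max(samples, key=lambda row: row[0])
    return addr, insn, cycles / total

addr, insn, share = hottest(SAMPLES)
print(f"{addr}  {insn}  {share:.1%} of sampled cycles in this fragment")
```

The load at 0050022e4 dominates the fragment, which fits the cache conflict the slide gives as its example.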

  6. Sysmark32 Excel workload

      Cycles    Cum   D-miss  I-miss  Image
     4913764  40.9%   380397  461276  EXCEL.OPT
     1961649  57.3%   178329  297983  win32k.sys
     1436843  69.2%   103842  177213  ntoskrnl.exe
      695633  75.0%    39039   49348  mga.dll
      611960  80.1%     2367    4459  hal.dll
      408758  83.5%    69384   45325  wx86cpu.dll
      326307  86.3%    21719   35684  ntdll.dll
      278700  88.6%    28840   49635  GDI32.dll
      221205  90.4%    17176    7949  rasdd.dll
      192651  92.0%    16989   29623  Ntfs.sys
      185204  93.6%    35126    3648  dcpisvc.exe
      150015  94.8%    11351   20066  jacket.dll
      147607  96.1%    10371   13203  KERNEL32.dll
      139328  97.2%    14534   17779  USER32.dll
      115245  98.2%     8208   18563  MSVCRT.dll
       99084  99.0%     7368   13899  MSO95.OPT
       22399  99.2%     1509    2771  RPCRT4.dll
       13625  99.3%     1248    1413  MSTEST40.DLL
       11050  99.4%      383    1008  ANALYSIS32.OPT
        7639  99.5%      439     510  tcpip.sys
        6881  99.5%      563     939  SHELL32.dll
        6633  99.6%      627     507  dec_malmd_ns.dll
        5210  99.6%      356     444  fx32agnt.dll
        5207  99.7%      563     299  loader.dll

     • FX!32 components
       • Translated images (*.OPT)
       • Emulator (wx86cpu.dll)
       • API jackets (jacket.dll)
       • Loader (loader.dll)
       • Transparency agent (fx32agnt.dll)
     • Measurement overhead
       • DCPI (dcpisvc.exe)
       • Script driver (MSTEST40.DLL)
     • Display/user interface (27%)
     • Emulator breakdown (3.4%)
       • Emulation (9%)
       • Control (48%)
       • String support (37%)
       • FP support (6%)
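The per-image summary on this slide can be rolled up by FX!32 component. The sketch below uses only the cycle counts shown on the slide for the images it identifies as FX!32 components; the `component` helper and the category names are assumptions made for illustration.

```python
# Cycle samples for the images the slide lists as FX!32 components
CYCLES = {
    "EXCEL.OPT": 4913764,
    "MSO95.OPT": 99084,
    "ANALYSIS32.OPT": 11050,
    "wx86cpu.dll": 408758,
    "jacket.dll": 150015,
    "loader.dll": 5207,
    "fx32agnt.dll": 5210,
}

# Image name -> component, following the slide's legend
NAMED = {
    "wx86cpu.dll": "emulator",
    "jacket.dll": "API jackets",
    "loader.dll": "loader",
    "fx32agnt.dll": "transparency agent",
}

def component(image: str) -> str:
    """Classify an image: *.OPT files are translated images."""
    if image.upper().endswith(".OPT"):
        return "translated images"
    return NAMED.get(image, "other")

totals = {}
for image, cycles in CYCLES.items():
    name = component(image)
    totals[name] = totals.get(name, 0) + cycles

for name, cycles in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {cycles:>9d}")
```

Among these components, nearly all sampled cycles fall in translated code rather than in the Emulator, which is the outcome a translate-then-run design aims for.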

  7. MMX: Approach
     • Approach
       • Assess benefit of MMX on x86
       • Identify key MMX operations
       • Develop emulation routines
       • Add code generation to Translator
       • Measure, evaluate, iterate
     • 21164 vs. 21264
       • 64-bit logical instructions (Alpha)
       • Multi-media instructions (21264)
       • ITOF / FTOI instructions (21264)
     • Assessment / investigation
       • Begin with code templates
       • Dual entry subroutines
       • Pass arguments (results) to (from) translated code via registers
     [Diagram labels: Emulator, Dual-entry MMX Routines, Translated Code]
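To show how an MMX operation can be emulated with ordinary 64-bit logical instructions, here is a minimal sketch of PADDB (wraparound add of eight packed bytes) using only add, and and xor on a 64-bit value. This is a well-known masking technique chosen for illustration; the slides do not say which code sequences the FX!32 routines actually used.

```python
LOW7 = 0x7F7F7F7F7F7F7F7F   # low seven bits of every byte
HIGH1 = 0x8080808080808080  # top bit of every byte

def paddb(a: int, b: int) -> int:
    """Add eight packed bytes with wraparound, no carry between bytes."""
    # Adding only the low 7 bits keeps each byte sum <= 0xFE, so nothing
    # carries into the neighbouring byte; bit 7 now holds the carry from bit 6.
    low = (a & LOW7) + (b & LOW7)
    # The true bit 7 is a7 XOR b7 XOR carry-from-bit-6.
    return low ^ ((a ^ b) & HIGH1)

def paddb_reference(a: int, b: int) -> int:
    """Byte-at-a-time reference used to check the packed version."""
    out = 0
    for i in range(8):
        byte = (((a >> (8 * i)) & 0xFF) + ((b >> (8 * i)) & 0xFF)) & 0xFF
        out |= byte << (8 * i)
    return out
```

Both inputs are assumed to be unsigned 64-bit values. The packed version costs a handful of logical operations regardless of how many bytes overflow.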

  8. MMX: Value representation
     • Difficult trade-off to make in legacy system
     • Constraints
       • No free registers in Emulator
       • Store / load penalty on 21164 hosts; ITOF / FTOI on 21264 hosts
     • Trade-off
       • MMX values in Alpha FP registers: Move to integer side with penalty on 21164
       • MMX values in Alpha integer registers: Fewer registers for allocation
       • MMX values in memory: Higher memory traffic, potentially slower due to D-cache misses
     • Represent MMX values in Alpha FP registers
       • More registers for allocation in translated code
       • Remove store / load through memory analysis

  9. MMX: Measurement
     • FACET operation (500 MHz 21264 faster than 266 MHz PII)
     • MMX enabled on 21264 hosts, but not 21164 hosts (v1.5)
     • Eliminate store/load penalty (planned for v1.6)
     • MMX in Emulator wins on both 21164 and 21264

  10. Tracing and instrumentation
      • PatchWrx
        • Static binary rewriting tool for capturing full (application, DLL and OS) instruction and data address traces on Alpha Windows NT
        • Traces of FX!32 used to perform trade-off analysis during architectural exploration
      • NT-ATOM
        • Based on the Tru64 UNIX ATOM tool
        • Allows selective instrumentation of executables and dynamic link libraries on Alpha Windows NT
        • Provides a set of API functions for efficient execution-driven simulation

  11. Predictability of selected branches in the Emulator

  12. Tracing FX!32 with PatchWrx
      • Application: Sample 3-D graphics program arm2.exe
      • After translation
        • Greater than 99% of the instructions are in HAL, s2, OpenGL
        • High branch prediction rate (97.2%)
        • Average basic block length (5.8 instructions)
        • Jacketing OpenGL benefits execution time
      • Jacket strategy (choice of interfaces to jacket)
        • Minimal approach: Only OS interface is jacketed
        • FX!32 approach: Jacket support libraries as well as OS
          • Makes full use of native Alpha libraries to gain speed
          • However, more jackets to design, implement, test and maintain
          • Reduce cost through tooling to generate jackets automatically
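The jacket idea on this slide can be sketched as a small simulation: a jacket reads arguments from the emulated x86 stack, calls a native routine, and returns the result in the emulated EAX. The sketch assumes the x86 stdcall convention (arguments pushed right to left, callee removes them, result in EAX). All class and function names are hypothetical, and real FX!32 jackets are native Alpha code rather than Python.

```python
import struct

class X86State:
    """Tiny emulated x86 context: a stack in memory, ESP and EAX."""
    def __init__(self, stack_size: int = 64):
        self.mem = bytearray(stack_size)
        self.esp = stack_size
        self.eax = 0

    def push32(self, value: int) -> None:
        self.esp -= 4
        struct.pack_into("<I", self.mem, self.esp, value & 0xFFFFFFFF)

    def read32(self, addr: int) -> int:
        return struct.unpack_from("<I", self.mem, addr)[0]

def make_jacket(native_fn, nargs: int):
    """Build a jacket that forwards an emulated stdcall to native_fn."""
    def jacket(state: X86State) -> int:
        # On entry [esp] is the return address; arguments start at esp+4.
        args = [state.read32(state.esp + 4 + 4 * i) for i in range(nargs)]
        state.eax = native_fn(*args) & 0xFFFFFFFF   # result goes back in EAX
        return_addr = state.read32(state.esp)
        state.esp += 4 + 4 * nargs                  # pop return address and arguments
        return return_addr
    return jacket

def native_add(a: int, b: int) -> int:
    """Stand-in for a native library routine."""
    return a + b

state = X86State()
state.push32(7)          # second argument (pushed first)
state.push32(35)         # first argument
state.push32(0x401000)   # return address, as pushed by CALL
ret = make_jacket(native_add, 2)(state)
```

Because a jacket of this shape depends only on the argument count and calling convention, it is the sort of code that tooling could generate from interface descriptions, which is the cost reduction the slide proposes.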

  13. Conclusions
      • Need tools for program understanding
        • Binary translation operates on stripped images
        • Code analysis and debugging is quite difficult
      • Need visualization tools
        • Translated images are not separated by procedure descriptors
        • DCPI produces a large volume of detailed information
      • Integrate and interpret data from multiple tools
        • Sampling and instrumentation are complementary techniques
        • Possibilities for improved analysis and new kinds of analysis (e.g. debugging the Emulator, multithreaded applications)
      • Feedback-directed optimization
