Hardware and Software Tracing David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu
Trace Collection Methodologies • Hardware • Monitors and instrumentation • Microcode • Software • Trap-based system • Emulators • Code annotation (source, object, executable) • Direct execution
Metrics for Evaluating Trace Collection Methodologies • Speed – trace capture rate • Memory – extra memory used • Accuracy – address perturbation • Intrusiveness – tracing overhead • Completeness – OS, interrupts, libraries • Granularity – smallest traceable unit • Flexibility – ease of use • Portability – platform dependence • Capacity – trace storage space • Cost - $$, time
Hardware Monitors • Capture trace at peak execution rates • Challenge - match storage media speed to tracing needs utilizing interleaving and multiplexing • Pros: • Non-intrusive • Accurate • Complete • Cons: • Expensive • Limited probeability • Limited trace length
Examples of Hardware Monitors • Monster (U. of Michigan 1992) – R2000 traces using a DAS9200 • BACH (BYU 1992) – i486, Pentium, SPARC, 68K – developed a customized pod – being used by Intel today • Real-time Tracer (IBM 1992) – Customized SRAM array • National Instruments (2006) – provides a family of programmable instrumentation monitors
Microcode-based Tracing • Places hooks in microcode to capture machine state • Pros: • Complete (OS, application) • Minimal slowdown (2-10x) • Cons: • Microcode is dated technology • Nonportable
Example Microcode-based Tracing • ATUM (Stanford 1986) – VAX traces • PatchWrx (DEC WRL 1995, NU 1996) – Complete OS-rich traces on Alpha running NT
Participants • Chakib Ouarraoui – EMC • Jason Casmira – Intel • John Fraser – US Air Force • David Hunter – VMWare • Sharon Smith – HP • Richard Sites – Adobe Systems
Tracing tools that capture OS activity
OS Rich and NT-based Instrumentation Tools • SimOS • UNIX-based platforms – (basis for VMWare) • OS, memory, I/O activity • High overhead (10X - 50,000X) • Etch • Intel x86-based platform • No OS activity • 35X slowdown
PatchWrx Overview • Dynamic execution tracing tool suite • Captures full system workloads • Traces branches executed by the processor • Reconstructs full instruction stream • DEC Alpha 21064 Windows NT 4.0 platforms • Low overhead with minimum slowdown • 2X while running • 4X while tracing
PatchWrx Components • PALcode – Alpha Privileged Architecture Library • Reserves trace buffer upon boot • Captures trace info • Facilitates long branches • Patch – instrument all NT images • Trace – collect runtime information • Reconstruct – reconstitute the information
Patching an Image • Instrument all WinNT binary image types • COM, EXE, DLL, SYS, DRV • Replace branch-type instructions with branches to PatchWrx PAL calls • Log trace entry of branch type into buffer • Branch to original target
Patching an Image • [Figure: original image (blocks A, B) and patched image (blocks A’, B) with a patch section containing the PWX PAL call and a BR; control-flow steps numbered 1–4]
Patching Large Images • Normal Alpha ISA branch instruction • (PC+4) + SEXT(disp21) * 4 • New PatchWrx long branches • LBR (PC+4) + SEXT(disp25) * 4 • LBSR (PC+4) + ZEXT(disp20) * 32
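The displacement formulas above can be checked with a short calculation. The following is a minimal Python sketch, not PatchWrx code: the helper names (`sext`, `zext`, `br_target`, `lbr_target`, `lbsr_target`) are hypothetical, and the functions simply evaluate the three expressions as written on the slide.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of `value` to a Python int."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def zext(value, bits):
    """Zero-extend: keep only the low `bits` bits of `value`."""
    return value & ((1 << bits) - 1)


def br_target(pc, disp21):
    """Normal Alpha branch: (PC+4) + SEXT(disp21) * 4."""
    return (pc + 4) + sext(disp21, 21) * 4


def lbr_target(pc, disp25):
    """PatchWrx long branch (LBR): (PC+4) + SEXT(disp25) * 4."""
    return (pc + 4) + sext(disp25, 25) * 4


def lbsr_target(pc, disp20):
    """PatchWrx long branch (LBSR): (PC+4) + ZEXT(disp20) * 32."""
    return (pc + 4) + zext(disp20, 20) * 32


if __name__ == "__main__":
    pc = 0x1000
    print(hex(br_target(pc, 1)))           # one instruction past PC+4
    print(hex(br_target(pc, 0x1FFFFF)))    # disp21 = -1, back to PC
    print(hex(lbsr_target(pc, 1)))         # forward only, 32-byte granules
```

Under these formulas a normal branch reaches roughly 4 MB in either direction, LBR roughly 64 MB in either direction, and LBSR roughly 32 MB forward only, which is why large images need the long forms.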
Patching Large Images • [Figure: long patched image (blocks A’, B) with a patch section; labels CAPTURE, PWX PAL, and BR; control-flow steps numbered 1–6]
Tracing with PatchWrx • Trace • User controlled start/stop/dump • Dumps captured trace to binary file • Captures VA mapping snapshot of active processes during trace capture
Reconstructing Execution • [Figure: the reconstruct tool takes the raw trace, the VA map, images 0..n, and symbol tables 0..n, and produces the I-stream and/or D-stream]
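The idea of rebuilding a full instruction stream from a branch-only trace can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the PatchWrx reconstruct tool: instructions are a fixed 4 bytes, the trace holds one hypothetical `(branch_pc, target_pc)` record per taken branch, and the image, VA map, and symbol table lookups are omitted.

```python
def reconstruct(start_pc, branch_trace, end_pc, insn_size=4):
    """Expand a taken-branch trace into the full list of executed PCs.

    Between two trace records execution is sequential, so every PC from the
    current position up to the next branch can be filled in without having
    been logged.
    """
    stream = []
    pc = start_pc
    for branch_pc, target_pc in branch_trace:
        if branch_pc < pc or (branch_pc - pc) % insn_size:
            raise ValueError(f"trace record {branch_pc:#x} not reachable from {pc:#x}")
        # Sequential run up to and including the branch itself.
        stream.extend(range(pc, branch_pc + insn_size, insn_size))
        pc = target_pc
    if end_pc < pc or (end_pc - pc) % insn_size:
        raise ValueError(f"end {end_pc:#x} not reachable from {pc:#x}")
    # Tail of the execution after the last recorded branch.
    stream.extend(range(pc, end_pc + insn_size, insn_size))
    return stream


if __name__ == "__main__":
    trace = [(0x108, 0x200), (0x204, 0x100)]
    for pc in reconstruct(0x100, trace, 0x104):
        print(hex(pc))
```

Logging only branches keeps the trace buffer small; the cost is this offline pass, which in a real tool also has to map each virtual address back to the right image.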
OS-Rich Workload Characterization • Execution domain analysis • Hot EXEs / DLLs (system resources) • Instruction mix • Application-only • Full system • Branching behavior • Branch frequency (average basic block size) • Branch prediction in presence of OS
Five most frequently used images in each benchmark or application
Conditional Branch Prediction 2-level BTB, 12-bit PHR, 4096 entries, gshare
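For readers unfamiliar with gshare, the following Python sketch shows one common formulation with the parameters named on the slide (12-bit pattern history register, 4096 entries). The 2-bit saturating counters, their initial value, and the use of PC bits above the low two are assumptions; the slide does not specify them, and the class and function names are hypothetical.

```python
class Gshare:
    """gshare: index a table of 2-bit counters with (PC XOR global history)."""

    def __init__(self, history_bits=12):
        self.mask = (1 << history_bits) - 1
        self.table = [1] * (1 << history_bits)  # 1 = weakly not-taken (assumed)
        self.history = 0                        # pattern history register

    def _index(self, pc):
        # Drop the two low PC bits (4-byte instructions), then hash with history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(trace, predictor=None):
    """Fraction of (pc, taken) records predicted correctly."""
    if predictor is None:
        predictor = Gshare()
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


if __name__ == "__main__":
    loop_branch = [(0x1000, True)] * 1000
    print(f"always-taken loop branch: {accuracy(loop_branch):.3f}")
```

Because the history register is shared by every branch, interleaving OS branches with application branches changes the table indices the application sees, which is one way OS activity can move prediction accuracy in either direction.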
Summary of Results • Benchmarks execute almost entirely within the application domain • Desktop applications execute across many images and interact with the kernel and system DLLs • Branch prediction accuracy can change drastically (sometimes it can even improve) when the operating system interaction is considered • The instruction mix in desktop applications changes significantly in the presence of OS • Increased number of indirect branches and privileged instructions (e.g., PALcalls)
For Further Information 1. “Tracing and Characterization of Windows NT-based System Workloads,” J.P. Casmira, D.P. Hunter and D.R. Kaeli, Digital Technical Journal, Vol. 10, No. 1, 1998, pp. 6-21 (www.digital.com/info/DTJ01/DTJ01HM.HTM). 2. “Operating System Impact on Trace-Driven Simulation,” J.P. Casmira, J. Fraser and D.R. Kaeli, Proceedings of the 31st Simulation Symposium, Boston, MA, April 1998, pp. 76-82. 3. “A Code Annotation Tool for Capturing Operating System Execution,” J. Fraser, Northeastern University Technical Report, NUCAR_6-97-1, June 1997 (on the NUCAR website: http://www.ece.neu.edu/groups/nucar).
Trap Based • Interrupt the application at selected points in order to save trace records • Pros: • Available on many CPUs • Portable • Inexpensive • Cons: • Considerable slowdown (1000x) • Intrusive (ISR), especially when considering real-time events • How do we decide where to interrupt the processor and still maintain a representative trace?
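As a software analogy for trap-based tracing, Python's own `sys.settrace` hook lets a handler run at selected points of a program and save a record each time. The sketch below is only an analogy to a hardware or OS trap handler; `collect_trace` and `sum_to` are hypothetical names introduced for the example.

```python
import sys


def collect_trace(func, *args):
    """Run func(*args) with a trace handler installed; return (result, records)."""
    records = []

    def handler(frame, event, arg):
        # The interpreter "traps" into this handler on call, line, and return
        # events; only line events are saved as trace records here.
        if event == "line":
            records.append((frame.f_code.co_name, frame.f_lineno))
        return handler

    previous = sys.gettrace()
    sys.settrace(handler)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous handler
    return result, records


def sum_to(n):
    total = 0
    for i in range(n + 1):
        total += i
    return total


if __name__ == "__main__":
    result, records = collect_trace(sum_to, 4)
    print(result, len(records), "trace records")
```

The handler runs on every traced event, which is the same source of slowdown and intrusiveness listed under the cons above; recording only a subset of events reduces the cost but raises the representativeness question.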
Example Trap Based Systems • VAX-Tracer – Clark & Emer study on VAX • OS2-Tracer – Intel 386 • Wisconsin Wind Tunnel – ECC error trapping – CM5 (SPARC) • Tapeworm II system – ECC error trapping – OS trap handler
Emulators • Simulate each target ISA instruction using one or more machine instructions of the host ISA • Pros: • Minimal slowdown (10-100x) • Opportunity for JIT compilation • Portable • Flexible – software controlled • Cons: • Serious programming effort needed • Extra memory needed • Typically single process tracing
Emulators • Shade (UW 1994) – dynamic translation • Compiles emulated instructions to native instructions (many elements of Shade have shown up in Transmeta products) • Host – SPARC-V8 • Targets – SPARC-V8, SPARC-V9, MIPS • Spa (Sun 1993) – Iterative interpretation • Reinterprets instructions on each occurrence • Host – MIPS-1 • Targets – MIPS-1, MIPS-2 • SPIM (U of Wisc 1991) – predecoded interpretation • Provides pointers to instruction handler and operands to speed decoding • Hosts – SPARC, 680x0, MIPS, HP-PA • Target – MIPS-1
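The difference between iterative and predecoded interpretation can be shown on a toy machine. The Python sketch below assumes a made-up 16-bit encoding (4-bit opcode, 4-bit register, 8-bit immediate) and five invented opcodes; it does not model Spa, SPIM, or any real ISA.

```python
LOADI, ADD, DEC, JNZ, HALT = range(5)


def enc(op, rd=0, imm=0):
    """Pack a toy instruction: opcode[15:12] | rd[11:8] | imm[7:0]."""
    return (op << 12) | (rd << 8) | imm


def decode(word):
    return (word >> 12) & 0xF, (word >> 8) & 0xF, word & 0xFF


# Each handler returns the next PC (an index into the program), or -1 to stop.
def op_loadi(regs, pc, rd, imm): regs[rd] = imm; return pc + 1
def op_add(regs, pc, rd, imm): regs[rd] += regs[imm]; return pc + 1
def op_dec(regs, pc, rd, imm): regs[rd] -= 1; return pc + 1
def op_jnz(regs, pc, rd, imm): return imm if regs[rd] != 0 else pc + 1
def op_halt(regs, pc, rd, imm): return -1


HANDLERS = [op_loadi, op_add, op_dec, op_jnz, op_halt]


def run_iterative(program):
    """Re-decode every instruction each time it executes."""
    regs, pc, trace = [0] * 16, 0, []
    while pc >= 0:
        trace.append(pc)
        op, rd, imm = decode(program[pc])
        pc = HANDLERS[op](regs, pc, rd, imm)
    return regs, trace


def run_predecoded(program):
    """Decode once into (handler, operands) tuples, then just dispatch."""
    decoded = [(HANDLERS[op], rd, imm) for op, rd, imm in map(decode, program)]
    regs, pc, trace = [0] * 16, 0, []
    while pc >= 0:
        trace.append(pc)
        handler, rd, imm = decoded[pc]
        pc = handler(regs, pc, rd, imm)
    return regs, trace


# r1 = 5 (loop count), r2 = 0 (sum), r3 = 3; loop: r2 += r3; r1 -= 1; repeat.
PROGRAM = [enc(LOADI, 1, 5), enc(LOADI, 2, 0), enc(LOADI, 3, 3),
           enc(ADD, 2, 3), enc(DEC, 1), enc(JNZ, 1, 3), enc(HALT)]

if __name__ == "__main__":
    regs, trace = run_predecoded(PROGRAM)
    print(regs[2], len(trace))
```

Both loops produce the same register state and the same PC trace; predecoding trades extra memory for the decoded table against repeated decode work inside hot loops.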
More Recent Emulators • VisualDSP (Analog Devices 1995-present) • Simulator for SHARC and BlackFin DSPs that runs on WinTel and Linux-x86 • Provides C/C++ compilation environment • Statistical profiling • Cycle-accurate simulator • Provides a full visualization environment for machine performance • AMD Opteron X86-64 (2003) • Simulator for the new 64-bit X86 from AMD • Runs on 32-bit Linux-x86 • Comes complete with a X86-64 version of gcc • http://www.x86-64.org/
MP Emulators • MINT (University of Rochester 1994) • Predecoded interpretation – memory references • Host – R3000 (SGI, DECstations) • Target – R3000, (an Alpha-based derivative was developed called AINT) • RSim (Rice Univ 1997) – Simulator for high-ILP Multiprocessors • Detailed cycle-based emulation • Host – SPARC, SGI PowerChallenge • Target – MIPS R10K
Machine Emulators • Simics (1996-present) Virtutech • Developed out of research work at SICS • Provides a large number of CPU targets • Alpha, ARM, Itanium, MIPS, Pentium, PowerPC, SPARC, X86-64 • Provides both detailed simulation/emulation and high throughput • http://www.simics.com/ • SimOS (1997) Stanford University • Originally designed to run on an SGI platform • Actually boots a full operating system (SGI IRIX and DEC UNIX) • Implementations on Alpha and MIPS platforms • Designed around the operating system, emulating IO and other system-related events • Provided the base technology for VMWare products
Code Annotation • Instrumented program produces trace while the application is run • Three levels of annotation • Source code modification • Object code modification • Binary code modification • Pros: • Ease of implementation • Small slowdown (10x) • Inexpensive • Cons: • Limited completeness (OS, multiprocessing) • May not capture DLLs • Memory dilation
Source Code Annotation • TRAPEDS (Univ. of Illinois 1989) • Adds a call upon exit from a basic block • MPTrace (Univ. of Washington 1990) • i386, instruments only MP-relevant events • Tangolite (Stanford 1993) • Annotates all memory events in an MP environment
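Source-level annotation can be sketched with Python's standard `ast` module: the source is parsed, a trace call is inserted at the top of each function, loop, and `if` body, and the modified program is compiled and run. This is an illustration only and not how TRAPEDS, MPTrace, or Tangolite work internally; it marks block entries rather than exits, and `Annotator`, `annotate_and_run`, and the `__trace__` hook are hypothetical names.

```python
import ast


class Annotator(ast.NodeTransformer):
    """Prepend __trace__("<kind>@<line>") to function, loop, and if bodies."""

    def _annotate(self, node, kind):
        self.generic_visit(node)  # annotate nested blocks first
        call = ast.Expr(ast.Call(
            func=ast.Name(id="__trace__", ctx=ast.Load()),
            args=[ast.Constant(value=f"{kind}@{node.lineno}")],
            keywords=[]))
        node.body.insert(0, ast.copy_location(call, node))
        return node

    def visit_FunctionDef(self, node): return self._annotate(node, "func")
    def visit_For(self, node): return self._annotate(node, "loop")
    def visit_While(self, node): return self._annotate(node, "loop")
    def visit_If(self, node): return self._annotate(node, "if")


def annotate_and_run(source, func_name, *args):
    """Instrument `source`, call func_name(*args), return (result, trace)."""
    tree = Annotator().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    trace = []
    env = {"__trace__": trace.append}  # the inserted calls append to the trace
    exec(compile(tree, "<annotated>", "exec"), env)
    return env[func_name](*args), trace


SOURCE = """
def count_even(values):
    n = 0
    for v in values:
        if v % 2 == 0:
            n += 1
    return n
"""

if __name__ == "__main__":
    result, trace = annotate_and_run(SOURCE, "count_even", [1, 2, 4])
    print(result, trace)
```

Only code that passes through the annotator is traced, which mirrors the completeness limits of code annotation: library and OS code that is not rewritten stays invisible.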
Object Code Annotation • Epoxie (DEC WRL 1989) – Titan MP • Epoxie2 (DEC WRL 1993) – R3000 • ATOM (DEC WRL 1994) – Alpha • Alto (Univ. of Arizona 1996) – Alpha • PLTO (Univ. of Arizona 2001) – IA32
Binary Code Annotation • Pixie (DEC 1991) – MIPS • Goblin (IBM/CMU 1991) – RS/6000 • IDtrace (Univ. of Mich.) – i486 • QPT (Univ. of Wisc.) – MIPS, SPARC • EEL (Univ. of Wisc.) – MIPS, SPARC • DSPTune (NEU) – ADI SHARC DSP • Pin (Intel 2005) – X86, XScale, Itanium
Embedded Systems Profiling Tools • Enhance current embedded system compilation environments, providing profile-driven analysis and feedback capabilities • DSPTune - instrumentation and analysis package for the SHARC family of DSPs • Allows for full instrumentation of C and C++ codes at the source, assembly and ELF binary levels • Supported by Analog Devices and the NSF
The DSPTune Toolset • A set of library routines that enable the user to instrument C and assembly programs • Function calls can be inserted at various locations in the application code, enabling execution driven simulation • The user provides: • instrumentation routines, which specify the selected instrumentation events (e.g., loads, branches, traps) • analysis routines, which carry out the desired simulation (e.g., caches, stacks, branch predictors)
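The split between instrumentation routines and analysis routines can be sketched as follows. This Python example does not use the DSPTune API, which is not shown in these slides: `Instrumenter`, `DirectMappedCache`, and `instrumented_sum` are hypothetical stand-ins, and the cache geometry is an arbitrary choice.

```python
class DirectMappedCache:
    """Analysis routine: a direct-mapped cache model that counts hits and misses."""

    def __init__(self, num_lines=64, line_bytes=16):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.blocks = [None] * num_lines  # block number held by each line
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.line_bytes
        index = block % self.num_lines
        if self.blocks[index] == block:
            self.hits += 1
        else:
            self.misses += 1
            self.blocks[index] = block


class Instrumenter:
    """Instrumentation side: routes selected events to analysis routines."""

    def __init__(self):
        self.routines = {}

    def on(self, event, routine):
        self.routines.setdefault(event, []).append(routine)

    def emit(self, event, *payload):
        for routine in self.routines.get(event, ()):
            routine(*payload)


def instrumented_sum(base_addr, count, tool):
    """Stand-in for instrumented application code: every 4-byte array load
    reports its address before the real work is done."""
    total = 0
    for i in range(count):
        tool.emit("load", base_addr + 4 * i)  # inserted instrumentation call
        total += i
    return total


if __name__ == "__main__":
    cache = DirectMappedCache()
    tool = Instrumenter()
    tool.on("load", cache.access)
    instrumented_sum(0x1000, 64, tool)
    print(cache.hits, cache.misses)
```

Because the simulation runs inside the instrumented program as it executes, no trace file is stored; this is the execution-driven style described above.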
[Figure: DSPTune tool flow] • Step I – Parser: user application code → intermediate representation (IR) • Step II – Instrumenting tool: IR + user instrumentation code → instrumented IR • Step III – Code generator: instrumented IR → instrumented application code • Step IV – Assembler and linker: instrumented application code + user analysis code → instrumented application executable
BDSPTune • Provides capabilities similar to DSPTune • Allows ELF binaries to be instrumented • Enables instrumentation and profiling to include library routines
Counter-based Profiling and Instrumentation David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu
Counters are used to: • Identify Performance Bottlenecks • especially unpredictable dynamic stalls, e.g., cache misses, branch mispredicts, TLB misses, etc. • complex out-of-order processors make this difficult • Guide Optimizations • help programmers understand and improve code • automatic, profile-driven optimizations • Profile Production Workloads • low overhead • transparent • profile whole system
Performance Counters • Interfaced through a device driver and supporting GUI (e.g., VTune) • Counters increment based on a set of events of interest (e.g., cache misses, pipeline stalls) • An interrupt signals that a counter has overflowed • An interrupt service routine reads the counter information and tags it with a program counter (PC) value • Information is then available for offline analysis
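The overflow-and-sample flow above can be modeled in a few lines. The following Python sketch is a software simulation under stated assumptions, not a driver or a real counter interface: `SamplingCounter` and `profile` are hypothetical names, the "interrupt" is a plain method call, and the sampling period is chosen by the caller.

```python
from collections import Counter


class SamplingCounter:
    """Counts events and 'interrupts' every `period` events."""

    def __init__(self, period):
        self.period = period
        self.count = 0
        self.samples = []  # PCs recorded by the service routine

    def event(self, pc):
        self.count += 1
        if self.count >= self.period:  # counter overflow
            self.count = 0
            self._service_routine(pc)

    def _service_routine(self, pc):
        # Tag the overflow with the PC of the instruction that triggered it.
        self.samples.append(pc)


def profile(event_pcs, period):
    """Offline analysis step: histogram of sampled PCs."""
    counter = SamplingCounter(period)
    for pc in event_pcs:
        counter.event(pc)
    return Counter(counter.samples)


if __name__ == "__main__":
    # Hypothetical miss stream: nine misses at 0x400 for every one at 0x500.
    misses = ([0x400] * 9 + [0x500]) * 10
    for pc, n in profile(misses, period=7).items():
        print(hex(pc), n)
```

The histogram only approximates the true event distribution. With a period of 10 on this stream every sample would land on 0x500, so a period that lines up with a loop's event pattern can bias the profile badly.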
Performance Counters • Low overhead method for obtaining performance and profiling information • Typically less than 5% slowdown • Requires no modification of the binary • May require root level access to system • Lacks precision in cause/effect analysis • Come for free on most ISAs • Commonly used today to measure performance and estimate power usage
Counter Library • A number of counter libraries are available to provide an API to program and access the counters on common architectures • Rabbit • for Intel/AMD Processors and Linux • URL: www.scl.ameslab.gov/Projects/Rabbit/ • PAPI • Linux IA32, IA64 • Allows counters to be captured on a per thread basis • URL: icl.cs.utk.edu/projects/papi/