Quiz Wei Hsu 8/16/2006
Which of the following instructions are speculative in nature?
A. Data cache prefetch instruction
B. Non-faulting loads
C. Speculative loads (e.g. ld.s)
D. Advance load (e.g. ld.a)
E. Stores
Answer: A, B, C, D
Which of the following motivate dynamic optimization?
A. When the underlying micro-architecture is different from what the object code is compiled for.
B. When the program behaves very differently on different input data.
C. When the application is large and has a very flat profile.
D. When the application is written in C/C++.
Answer: A, B
Which of the following may increase MLP?
A. A larger instruction re-ordering window for OOO processors
B. Using code scheduling to overlap delinquent loads if the processor uses a stall-on-use model
C. Inserting cache prefetches for multiple delinquent loads
D. Decreasing the associativity of the cache
E. Using a helper thread running on the second core
Answer: A, B, C, E
Montecito is a dual-core CMP, but the two cores do not share on-chip caches (L1/L2/L3). How may we use helper threads?
A. We may use VMT that switches the main thread to a helper thread on an L3 cache miss.
B. It is hard to use the other core for helper threads since the synchronization overhead is high.
C. It is possible to use the other core to warm up the off-chip L4 shared cache, if there is one.
D. It is possible to use the other core to warm up near-memory-side caches.
Answer: A, B, C, D
Dynamic Instrumentation Techniques and their Applications Wei Hsu 8/16/2006
Program Instrumentation • Instrumentation: a technique for inserting extra code (or probes) into an application to observe its behavior • Program measurements (profiles, value profiles) • Trace generation (e.g. branch trace, memory trace) • Protection (program introspection) • Emulation (cache simulator, Shade) • Migration (e.g. PA 1.1 to 2.0) • Debugging tools (Pure software, purify, memory checker)
Instrumentation Time • In Source Code • For communicating high-level, domain-specific abstractions to tools • Portable across multiple compilers and platforms • At Compile Time • The compiler inserts instrumentation, e.g. using the -p, -pg, -a, -ax … flags. • At Post-Link Time • Often referred to as binary editing tools, e.g. ATOM, EEL, Pixie • No need to recompile applications • Language independent • Will not be invalidated or affected by compiler optimization • At Runtime • Instrumentation during program execution • No recompilation, no re-linking, no restarting • Can be inserted and/or removed at runtime, no disabled probe effect • Requires continuous porting efforts as computing platforms evolve
Static Binary Editing Tools • ATOM on DEC (COMPAQ) Unix • NTATOM on WindowsNT • HiProf/TracePoint (performance tools based on ATOM) • Pixie for MIPS • Etch (Instrumentation/optimization for Win32/x86 apps) • OM system (for link time optimizations, developed at DEC) • Spike for Alpha • iSPike for Itanium • CacheProf (now rolled into valgrind, a popular Linux tool set for profiling and debugging, based on dynamic instrumentation) • UQBT (resourceable and retargetable binary translator) • EEL
EEL: Machine-Independent Executable Editing • EEL (Executable Editing Library) is a C++ library that hides much of the complexity and system-specific detail of editing executables. • Applications appear unchanged, and data are collected as a side effect of execution. • Qpt/Qpt2 are tracing tools based on EEL. Qpt’s performance is better than Shade’s (Qpt is 2-6x slower than native execution even with tracing)
Instrumentation Example
if (A > B) {
    bb[0]++;
    A = 1;
} else {
    bb[1]++;
    A = 0;
}
Executable Editing • Typical binary editing • Decompose • Build IR • Insert Instrumentation • Convert IR to executable • EEL’s approach • Abstractions: executable, routine, CFG, instruction, snippet • Adding snippets to a routine’s CFG • Produce a new version of the routine from the edited CFG
Executable Editing
• Obstacles
• Addresses are bound
• Registers are bound
Example: if (a) a = b
    Bnz %r1, .+2
    Ld _b, %r1
Insert:
    Ld _counter1, %rx
    Add %rx, 1, %rx
    St %rx, _counter1
Q: Is reg %rx free? How about the branch instruction's offset?
Handling Registers
Insert:
    Ld _counter1, %rx
    Add %rx, 1, %rx
    St %rx, _counter1
Original code:
    Bnz %r1, .+2
    Ld _b, %r1
1) If there are free registers (dead at the point of insertion), the editor can replace %rx with a free register.
2) If there are no free registers, a wrapper routine must be used to spill %rx to the stack.
EEL uses data flow analysis to identify free registers, which is not trivial.
Handling Addresses
Insert:
    Ld _counter1, %rx
    Add %rx, 1, %rx
    St %rx, _counter1
Original code:
    Bnz %r1, .+2
    Ld _b, %r1
How to change the address in the branch instructions?
EEL uses control flow analysis to change addresses in branches, calls, and jumps.
One alternative is to change the Ld instruction to a branch to the instrumented code segment (like a procedure call) so that addresses of other branch instructions remain the same.
Handling Addresses
Out-of-line instrumented code:
    Ld _counter1, %rx
    Add %rx, 1, %rx
    St %rx, _counter1
    Ld _b, %r1
    ret
Patched original code:
    Bnz %r1, .+2
    Call xxxxx
Pros
• No need for a CFG; no adjustment to addresses in branch/jump instructions
Cons
• Less efficient instrumented code
• Don’t know how to handle variable-length instructions
EEL Abstractions • Executable • Object file, library, static or dynamically linked programs • Use symbol table information, but do not rely on it • Analysis to identify all routines • Using the symbol table to form the initial set of routines • If no symbol table, the initial set contains only the program entry address and the first location in the text segment • Examine instructions to locate jumps out of a routine, or calls on routines not in the initial set. • A CFG is constructed
Code Snippet • A code snippet can be written in assembly or in a high-level language. It is usually written in assembly for efficiency, which makes it machine dependent. • When a tool creates a snippet, it specifies the instructions, two register sets, and a call-back function. • Registers used in the snippet that must be assigned unused registers • Registers that EEL should not spill or assign • The call-back function edits displacements
Editing Example:
1*  sethi 0x1, %g6
2*  ld [%lo(0x1)+%g6], %g7
    add %g7, 1, %g7
3*  st %g7, [%lo(0x1)+%g6]
• EEL modifies calls, branches, and jumps to ensure correct control flow
CFG of a routine • EEL represents a routine as a CFG • Why CFG? • A profiling tool, qpt, requires the CFG to place instrumentation code on CFG edges. (What’s wrong with block counts?) • EEL uses the CFG to adjust addresses in branches and jumps • The CFG provides an architecture-independent view of control flow
Representing Delayed Branches
[Figure: CFG representations of a delayed branch (Bne %icc, L1 with Add %r1,%r2,%r3 in its delay slot) and of an annulled branch (Bne,a %icc, L1 with a nullified delay slot).]
Incomplete CFG • When control flow cannot be completely analyzed, runtime code ensures correct execution. • This paper claims that most indirect jumps occur in case statements (actually, most indirect branches are return jumps, shared lib calls and indirect calls). EEL uses backward slicing to find the jump table and complete the CFG. • EEL’s backward slicing makes runtime translation a rare occurrence: no unanalyzable indirect jumps in SPEC92 using SunOS’s compilers.
int main(int argc, char* argv[]) {
    executable* exec = new executable(argv[1]);
    exec->read_contents();
    routine* r;
    FOREACH_ROUTINE(r, exec->routines()) {
        instrument(r);
        ….
    }
}
void instrument(routine* r) {
    cfg* g = r->control_flow_graph();
    bb* b;
    FOREACH_BB(b, g->blocks()) {
        // Only blocks with more than one successor get edge counters.
        if (1 < b->succ()->size()) {
            edge* e;
            FOREACH_EDGE(e, b->succ()) {
                // num is the running counter index, declared outside this excerpt.
                e->add_code_along(incr_count(num));
                num++;
            }
        }
    }
}
Dynamic Instrumentation • Many advantages over static instrumentation: • No need of a separate instrumentation pass • Can instrument all user-level codes executed • Shared libraries • Dynamically generated code • Easy to distinguish code and data • Instrumentation can be turned on/off • Can attach and instrument an already running process • No disabled probe effect
PIN: A VM based Dynamic Instrumentation Tool • It uses dynamic code generation to make a less intrusive instrumentation system • Pin has the following advantages: • Easy-to-use • Portable • Transparent • Efficient
Easy-to-use and Portable • Instrumentation tools are written in C/C++ using PIN’s API • It allows tool writers to analyze an application by inserting calls at arbitrary locations in the executable. • Users do not need to manually in-line calls or save/restore registers • PIN’s API abstracts away instruction idiosyncrasies, so the tools can be portable. Various Pintools are available on IA32, Itanium, ARM, and EM64T • The API also allows access to architecture-specific information
Efficient and Robust • Code caching and trace linking • Pin applies register re-allocation, inlining, liveness analysis, and instruction scheduling to instrumented code. • Pin can dynamically attach to and detach from a process. This is important for large, long-running programs. • Pin can handle mixed code and data, variable-length instructions, statically unknown indirect jump targets, dynamically loaded libraries, and dynamically generated code
FILE * trace;
// Print a memory write record
VOID RecordMemWrite(VOID * ip, VOID * addr, UINT32 size) {
    fprintf(trace, "%p: W %p %d\n", ip, addr, size);
}
// Called for every instruction
VOID Instruction(INS ins, VOID *v) {
    // Instruments writes using a predicated call,
    // i.e. the call happens iff the store is actually executed
    if (INS_IsMemoryWrite(ins))
        INS_InsertPredicatedCall(
            ins, IPOINT_BEFORE, AFUNPTR(RecordMemWrite),
            IARG_INST_PTR, IARG_MEMORYWRITE_EA, IARG_MEMORYWRITE_SIZE,
            IARG_END);
}
int main(int argc, char *argv[]) {
    PIN_Init(argc, argv);
    trace = fopen("atrace.out", "w");
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_StartProgram();   // Never returns
    return 0;
}
A Sample Pintool for Tracing Memory Writes
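Pin’s software architecture
[Figure: the Pintool and the Application sit on top of Pin, which consists of the Instrumentation API, the Virtual machine (VM) with its JIT Compiler, Dispatcher, and Emulation unit, and the Code Cache; below are the Operating System and the Hardware.]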
Execution Drives Instrumentation
[Figure, two animation frames: the Compiler translates blocks 1 to 7 of the original code into the code cache on demand. In the first frame the code cache holds 1’, 2’, and 7’; in the second it holds 1’, 2’, 3’, 5’, 6’, and 7’.]
Instruction-level Instrumentation • Instrument relative to an instruction: • Before • After: • Fall-through edge • Taken edge (if it is a branch)
Example (the figure marks insertion points with count(10), count(20), and count(30)):
    cmp %esi, %edx
    jle <L1>
    mov $0x1, %edi
<L1>:
    mov $0x8, %edi
Pin Instrumentation APIs • Basic APIs are architecture independent: • Provide common functionalities such as finding out: • Control-flow changes • Memory accesses • Architecture-specific APIs for more detailed info • IA-32, EM64T, Itanium, Xscale • ATOM-based notion: • Instrumentation routines • Analysis routines
Instrumentation Routines • User writes instrumentation routines: • Walk list of instructions, and • Insert calls to analysis routines • Pin invokes instrumentation routines when placing new instructions in code cache • Repeated execution uses already instrumented code in code cache
Analysis Routines • User inserts calls to analysis routine: • User-specified arguments • E.g., increment counter, record data address, … • User writes in C, C++, ASM • Pin provides isolation so analysis does not affect application • Optimizations like inlining, register allocation, and scheduling make it efficient
Example: Instruction Count
[rscohn1@shli0005 Tests]$ hello
Hello world
[rscohn1@shli0005 Tests]$ icount -- hello
Hello world
ICount 496890
[rscohn1@shli0005 Tests]$
Example: Instruction Count
    counter++;
mov r2 = 2
    counter++;
add r3 = 4, r3
    counter++;
(p2) br.cond L1
    counter++;
add r4 = 8, r4
    counter++;
br.cond L2
#include <stdio.h>
#include "pinstr.H"
UINT64 icount = 0;
// Analysis Routine
void docount() { icount++; }
// Instrumentation Routine
void Instruction(INS ins) {
    PIN_InsertCall(IPOINT_BEFORE, ins, (AFUNPTR)docount, IARG_END);
}
VOID Fini() {
    fprintf(stderr, "ICount %lld\n", icount);
}
int main(int argc, char *argv[]) {
    PIN_AddInstrumentInstructionFunction(Instruction);
    PIN_AddFiniFunction(Fini);
    PIN_StartProgram();
}
Example: Instruction Trace
[rscohn1@shli0005 Trace]$ itrace -e hello
Hello world
[rscohn1@shli0005 Trace]$ head prog.trace
0x20000000000045c0
0x20000000000045c1
0x20000000000045c2
0x20000000000045d0
0x20000000000045d2
0x20000000000045e0
0x20000000000045e1
0x20000000000045e2
[rscohn1@shli0005 Trace]$
Example: Instruction Trace
    traceInst(ip);
mov r2 = 2
    traceInst(ip);
add r3 = 4, r3
    traceInst(ip);
(p2) br.cond L1
    traceInst(ip);
add r4 = 8, r4
    traceInst(ip);
br.cond L2
#include <stdio.h>
#include "pinstr.H"
FILE *traceFile;
void traceInst(long * ipsyll) {
    fprintf(traceFile, "%p\n", ipsyll);
}
void Instruction(INS ins) {
    PIN_InsertCall(IPOINT_BEFORE, ins, (AFUNPTR)traceInst, IARG_IP_SLOT, IARG_END);
}
int main(int argc, char *argv[]) {
    PIN_AddInstrumentInstructionFunction(Instruction);
    traceFile = fopen("prog.trace", "w");
    PIN_StartProgram();
}
Example: Faster Instruction Count
(the per-instruction counter++ probes are replaced by one probe per basic block)
    counter += 3;
mov r2 = 2
add r3 = 4, r3
(p2) br.cond L1
    counter += 2;
add r4 = 8, r4
br.cond L2
ManualExamples/inscount1.C
#include <stdio.h>
#include "pin.H"
UINT64 icount = 0;
// Analysis routine: add the size of the basic block that is about to run
VOID docount(INT32 c) { icount += c; }
// Instrumentation routine: one call per basic block instead of one per instruction
VOID Trace(TRACE trace, VOID *v) {
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
        BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)docount,
                       IARG_UINT32, BBL_NumIns(bbl), IARG_END);
    }
}
VOID Fini(INT32 code, VOID *v) {
    fprintf(stderr, "Count %lld\n", icount);
}
int main(int argc, char * argv[]) {
    PIN_Init(argc, argv);
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();
    return 0;
}
Instruction Information Accessed at Instrumentation Time • INS_Category(INS) • INS_Address(INS) • INS_Regr1, INS_Regr2, INS_Regr3, … • INS_Next(INS), INS_Prev(INS) • INS_BraType(INS) • INS_SizeType(INS) • INS_Stop(INS)
More Advanced Tools • Instruction cache simulation: replace the itrace analysis function • Data cache: like icache, but instrument loads/stores and pass the effective address • Malloc/Free trace: instrument entry/exit points • Detect out-of-bounds stack references • Instrument instructions that move the stack pointer • Instrument loads/stores to check that they are in bounds
Instrumentation is Transparent • When application looks at itself, sees same: • Code addresses • Data addresses • Memory contents • Don’t want to change behavior, expose latent bugs • When instrumentation looks at application, sees original application: • Code addresses • Data addresses • Memory contents • Observe original behavior
Pin Instruments All Code • Execution driven instrumentation: • Shared libraries • Dynamically generated code • Self modifying code • Instrumented the first time it is executed • Pin does not detect that code has been modified
Dynamic Instrumentation in Pin • While program is running: • Instrumentation can be turned on/off • Code cache can be invalidated • Reinstrumented the next time it is executed • Pin can detach and run application native • Use this for fast skip