Low Overhead Program Monitoring and Profiling
Naveen Kumar, Bruce Childers: Department of Computer Science, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, {naveen, childers}@cs.pitt.edu
Mary Lou Soffa: Department of Computer Science, University of Virginia, Charlottesville, Virginia 22904, soffa@virginia.edu
Introduction
• Program instrumentation: insertion of additional code into a program
  • Monitors program behavior or gathers information
  • Can be inserted at the source, intermediate, or binary level
• Applications
  • Detecting program invariants [Ernst]
  • Dynamic slicing [Zhang]
  • Software testing [Misurda]
  • Software security checks [Scott]
Running Example
• Consider a software security system that monitors the memory behavior of untrusted programs (e.g., DynamoRIO)
• Instrumentation at the binary instruction level
  • Instrument all loads and stores
• The program can be instrumented statically as well as dynamically
Static instrumentation
[Figure: a gzip code fragment in which each load and store branches to its own probe (jmp probe1 … jmp probe4), and each probe calls secure(…). The expanded probe1 saves r[l0] and the general-purpose registers (save, call save_gp_regs), computes the store's effective address, calls secure, restores the registers (call restore_gp_regs, restore), executes the displaced store M[r[l0] + 0x10] = r[o2], and returns via jmp probe1_ret.]
• Example from gzip
• Instrumentation performed before execution starts
Dynamic instrumentation
[Figure: the same gzip fragment, with its loads and stores redirected (jmp probe1 … jmp probe4) to probes that each call secure(…).]
• Instrumentation performed at run time, on code that executes
• More powerful than static instrumentation, possibly less expensive
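A toy model can make the static/dynamic difference concrete. It is a minimal sketch under stated assumptions: a program is a map from hypothetical basic-block labels to instructions taken from the gzip fragment, a probe is inserted before every load or store, and the dynamic instrumenter processes a block only the first time it executes, as a code cache would. The names `Instr`, `static_instrument`, and `dynamic_instrument` are illustrative and are not part of Strata, FIST, or DynamoRIO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    is_mem: bool  # True for loads and stores, which receive a probe


def static_instrument(program: dict[str, list[Instr]]) -> int:
    """Count probes inserted when every block is instrumented before execution."""
    return sum(1 for block in program.values() for ins in block if ins.is_mem)


def dynamic_instrument(program: dict[str, list[Instr]], executed: list[str]) -> int:
    """Count probes inserted when a block is instrumented on its first execution only."""
    translated: set[str] = set()
    probes = 0
    for label in executed:
        if label in translated:
            continue  # already translated and instrumented (e.g., held in a code cache)
        translated.add(label)
        probes += sum(1 for ins in program[label] if ins.is_mem)
    return probes


# Hypothetical split of the gzip fragment into blocks; "cold" never executes.
program = {
    "hot": [
        Instr("r[o1] = r[o1] << 10", False),
        Instr("M[r[l0] + 0x10] = r[o2]", True),
        Instr("M[r[o1] + 0x228] = r[o0]", True),
    ],
    "warm": [Instr("r[o1] = M[r[o0] + 0x3d0]", True)],
    "cold": [Instr("M[r[l0] + 0x20] = r[o0]", True)],
}

static_probes = static_instrument(program)                            # 4
dynamic_probes = dynamic_instrument(program, ["hot", "warm", "hot"])  # 3
print(static_probes, dynamic_probes)
```

The model counts probes inserted, not probes executed; the cold block is where instrumenting only code that actually runs saves work.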
Motivation
• Stumbling block: high overhead
  • Slowdowns of an order of magnitude or more [Ernst]
• Existing solutions are user-guided
  • Sampling [Arnold]
  • Analyzing smaller data sets (the SPEC test inputs instead of ref) [Mock]
  • Less aggressive uses, especially in dynamic settings [Duesterwald]
• The user has to decide how best to apply instrumentation
• What is needed: automatic techniques that mitigate the overheads systematically
Goals
• Gather exact information
  • Separate accuracy from efficiency
  • The user should focus on what to gather rather than on how to gather it efficiently
• Efficient
  • Comparable to hand-optimized instrumentation
• Automatic
  • Little or no user guidance
Instrumentation Optimization
• Costs associated with instrumentation
  • Dynamic probe count: number of probes executed
  • Probe cost: number of instructions in a probe
  • Payload cost: frequency of invocation and cost of the payload
• Optimize the instrumentation code to reduce these costs
  • Dynamic probe coalescing
  • Partial context switches
  • Partial payload inlining
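The three costs can be combined into a rough, assumed cost model: extra instructions ≈ dynamic probe count × probe cost + payload invocations × payload cost per invocation. This is a simplification for illustration (it ignores cache effects and the somewhat larger size of a coalesced probe), and the numbers below are made up, not measurements from this work. The sketch shows which term each optimization targets.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class InstrumentationCost:
    dynamic_probe_count: int   # probes executed at run time
    probe_cost: int            # instructions per probe (jump, register spills, return)
    payload_invocations: int   # calls made to the payload
    payload_cost: int          # instructions per payload call

    def overhead(self) -> int:
        """Extra instructions executed under this simplified model."""
        return (self.dynamic_probe_count * self.probe_cost
                + self.payload_invocations * self.payload_cost)


# Illustrative numbers only: four loads/stores, each with its own probe.
base = InstrumentationCost(dynamic_probe_count=4, probe_cost=30,
                           payload_invocations=4, payload_cost=20)
# Dynamic probe coalescing: fewer probes execute; the payload is still called 4 times.
coalesced = replace(base, dynamic_probe_count=2)
# Partial context switch: each probe spills and reloads fewer registers.
trimmed = replace(coalesced, probe_cost=10)
# Partial payload inlining: the common case avoids most of the payload.
inlined = replace(trimmed, payload_cost=5)

for name, cost in [("base", base), ("coalesced", coalesced),
                   ("trimmed", trimmed), ("inlined", inlined)]:
    print(f"{name:10s} {cost.overhead()}")
```

Each optimization shrinks a different factor, which is why they compose: coalescing lowers the probe count, the partial context switch lowers the per-probe cost, and inlining lowers the per-invocation payload cost.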
Base Instrumenter
[Figure: the gzip fragment with one probe per load or store (probe1 … probe4), each calling secure(…).]
• The base instrumenter generates a list of instrumentation points
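What a base instrumenter hands to the optimizer can be pictured as a list of instrumentation points (IPs). The sketch below is hypothetical: it scans the gzip fragment, written as register-transfer strings in the notation used on these slides (in the order they appear on the slide), and records an IP for each store (memory on the left-hand side) and each load (memory on the right-hand side). The actual FIST representation is not described here; `InstrumentationPoint` and `find_instrumentation_points` are illustrative names, and the textual matching is only adequate for this notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstrumentationPoint:
    index: int      # position of the instruction in the fragment
    kind: str       # "load" or "store"
    payload: str    # function the probe will call at this point


def find_instrumentation_points(fragment: list[str],
                                payload: str = "secure") -> list[InstrumentationPoint]:
    """Return one IP per load or store in a list of register-transfer strings."""
    ips = []
    for i, insn in enumerate(fragment):
        if " = " not in insn:
            continue  # jumps and calls carry no memory operand in this notation
        lhs, rhs = insn.split(" = ", 1)
        if lhs.startswith("M["):
            ips.append(InstrumentationPoint(i, "store", payload))
        elif "M[" in rhs:
            ips.append(InstrumentationPoint(i, "load", payload))
    return ips


gzip_fragment = [
    "r[o1] = r[o1] << 10",
    "r[o1] = r[o1] + 0x228",
    "r[o0] = r[o2] << 0x14",
    "r[l4] = r[o0] << 0x14",
    "M[r[l0] + 0x10] = r[o2]",
    "M[r[o1] + 0x228] = r[o0]",
    "r[i4] = r[o1]",
    "r[l1] = r[o0]",
    "jmp r[31]",
    "M[r[l0] + 0x20] = r[o0]",
    "r[sp] = r[sp] - 112",
    "r[o0] = r[o0] << 10",
    "r[o1] = M[r[o0] + 0x3d0]",
]

for ip in find_instrumentation_points(gzip_fragment):
    print(ip)
```

The fragment yields four IPs, the same number as the probes probe1 … probe4 in the figure.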
Dynamic Probe Coalescing
[Figure: probes for nearby instrumentation points are merged into a single probe that makes several calls to secure(…): probe5 makes two calls and probe6 three, so that in the final step the application code branches only to probe6 and probe4.]
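One way to decide which probes may be merged is sketched below. The legality rule is an assumption for illustration, not the algorithm used in this work: the merged probe runs just before the first access in a group, so a later access may join only if no branch intervenes and none of the registers that form its effective address is overwritten in between (otherwise the probe would compute the wrong address). `Insn` and `coalesce` are hypothetical names, and the instruction sequence is hand-made from instructions on the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Insn:
    text: str
    writes: frozenset[str] = frozenset()          # registers defined by this instruction
    addr_regs: Optional[frozenset[str]] = None    # address registers; set only for loads/stores
    is_branch: bool = False


def coalesce(code: list[Insn]) -> list[list[int]]:
    """Group load/store indices that can share one probe placed before the group."""
    groups: list[list[int]] = []
    for i, insn in enumerate(code):
        if insn.addr_regs is None:
            continue  # not an instrumentation point
        if groups:
            # Instructions that run after the shared probe but before access i.
            between = code[groups[-1][0]:i]
            clobbered = set().union(*(x.writes for x in between))
            if not any(x.is_branch for x in between) and not (insn.addr_regs & clobbered):
                groups[-1].append(i)
                continue
        groups.append([i])
    return groups


code = [
    Insn("M[r[l0] + 0x10] = r[o2]", addr_regs=frozenset({"l0"})),
    Insn("M[r[o1] + 0x228] = r[o0]", addr_regs=frozenset({"o1"})),
    Insn("r[o0] = r[o0] << 10", writes=frozenset({"o0"})),
    Insn("r[o1] = M[r[o0] + 0x3d0]", writes=frozenset({"o1"}), addr_regs=frozenset({"o0"})),
    Insn("M[r[l0] + 0x20] = r[o0]", addr_regs=frozenset({"l0"})),
    Insn("jmp r[31]", is_branch=True),
    Insn("M[r[l0] + 0x10] = r[o2]", addr_regs=frozenset({"l0"})),
]

groups = coalesce(code)
print(groups)  # [[0, 1], [3, 4], [6]]
```

Five instrumentation points collapse into three probes: the load at index 3 starts a new group because r[o0] is redefined just before it, and the branch at index 5 ends the second group.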
Partial Context Switch
[Figure: the coalesced probe6 spills r[l0] and r[o1], then computes three effective addresses and makes three calls to secure(…) inside a single save / call save_gp_regs … call restore_gp_regs / restore sequence before returning via jmp probe6_ret; the application branches to probe6 and probe4.]
• Analyze register usage in the payload
• Remove the spill and reload of general-purpose registers the payload does not use
• Registers used in the payload: {…}; not used: {g0…g7}
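The register analysis can be sketched as a walk over the payload's call graph. This is a minimal sketch under assumptions: the slide speaks of registers used by the payload, and here we save exactly the general-purpose registers that the payload or any function it calls writes (registers that are only read need no spill). The call graph, its register sets, and the names `clobbered_regs` and `partial_save_set` are hypothetical, and SPARC register-window details are ignored.

```python
# SPARC general-purpose register names: globals, outs, locals, ins.
ALL_GP_REGS = frozenset(f"{bank}{n}" for bank in "goli" for n in range(8))


def clobbered_regs(call_graph: dict[str, tuple[frozenset[str], list[str]]],
                   entry: str) -> set[str]:
    """Registers written by `entry` or by any function reachable from it."""
    seen: set[str] = set()
    clobbered: set[str] = set()
    stack = [entry]
    while stack:
        fn = stack.pop()
        if fn in seen:
            continue  # each function is visited once, even with recursion
        seen.add(fn)
        writes, callees = call_graph[fn]
        clobbered |= writes
        stack.extend(callees)
    return clobbered


def partial_save_set(call_graph: dict[str, tuple[frozenset[str], list[str]]],
                     entry: str) -> list[str]:
    """GP registers the probe must spill and reload around a call to `entry`."""
    return sorted(clobbered_regs(call_graph, entry) & ALL_GP_REGS)


# Hypothetical register usage for the secure() payload and one callee.
payload = {
    "secure": (frozenset({"o0", "o1"}), ["createReport"]),
    "createReport": (frozenset({"o2", "o3", "l0"}), []),
}

save = partial_save_set(payload, "secure")
print(save)                          # ['l0', 'o0', 'o1', 'o2', 'o3']
print(len(ALL_GP_REGS) - len(save))  # 27 registers need no spill or reload
```

None of g0 … g7 appears in the save set, which mirrors the slide's observation that those registers are unused by the payload and so need not be spilled.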
Partial Payload Inlining
[Figure: after the partial context switch, probe6 spills only r[l0] and r[o1] and adjusts r[sp] by 140 around its three calls instead of calling save_gp_regs and restore_gp_regs. The payload void secure(address) { if (address > REDZONE) return; redAlerts++; createReport(); if (critical(address)) assert(address); } is split: void __inlined_secure(address) keeps the REDZONE check and the redAlerts++ increment and replaces the calls to createReport and assert with a single call to __full_secure(address, tag), and void __full_secure(address, tag) holds the rest of the payload.]
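The split of secure() can be mimicked in a few lines. This is a minimal sketch, not the generated code: `inlined_secure` plays the role of __inlined_secure and keeps the REDZONE check and the redAlerts increment, while `full_secure` plays __full_secure and does the rest. The value of REDZONE, the `critical` test, the report list standing in for createReport(), and the meaning of `tag` (assumed here to identify the instrumentation point) are all hypothetical.

```python
from dataclasses import dataclass, field

REDZONE = 0x1000  # hypothetical boundary; addresses above it return immediately


@dataclass
class MonitorState:
    red_alerts: int = 0
    full_calls: int = 0
    reports: list[tuple[int, int]] = field(default_factory=list)


def critical(address: int) -> bool:
    return address < 0x100  # hypothetical critical region


def full_secure(state: MonitorState, address: int, tag: int) -> None:
    """Out-of-line remainder of secure(): reporting and the critical-address check."""
    state.full_calls += 1
    state.reports.append((tag, address))  # stands in for createReport()
    if critical(address):
        assert address != 0, "null access"  # mirrors C's assert(address)


def inlined_secure(state: MonitorState, address: int, tag: int) -> None:
    """Cheap part of secure(), the part that would be inlined into the probe."""
    if address > REDZONE:
        return  # common case: no call to the out-of-line payload
    state.red_alerts += 1
    full_secure(state, address, tag)


state = MonitorState()
for tag, addr in enumerate([0x2000, 0x3000, 0x800, 0x4000]):
    inlined_secure(state, addr, tag)
print(state.red_alerts, state.full_calls, state.reports)  # 1 1 [(2, 2048)]
```

Only one of the four accesses reaches `full_secure`; partial inlining pays off when, as here, the frequent path stays inside the probe and the expensive part runs rarely.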
Implementation
• Strata: dynamic translation system [Scott et al.]
  • Generates code for an application at run time
  • Suitable for dynamic instrumentation
• FIST: base instrumentation system [Kumar et al.]
  • Flexible enough for diverse instrumentation needs
  • Generates a list of instrumentation points (IPs)
• INS-OP: developed in this work
  • Constructs an IR for the list of IPs obtained from FIST
  • Each optimization is a pass that modifies the IR
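The "each optimization is a pass over the IR" structure can be sketched as a list of functions, each taking and returning the IR. This is illustrative only: the `Probe` record, the three toy passes, and the fixed register set `PAYLOAD_WRITES` are hypothetical, and the coalescing pass here merges every probe in a basic block without the legality checks a real pass would need.

```python
from dataclasses import dataclass, replace
from typing import Callable

ALL_REGS = frozenset(f"{bank}{n}" for bank in "goli" for n in range(8))
PAYLOAD_WRITES = frozenset({"l0", "o1"})  # hypothetical result of register analysis


@dataclass(frozen=True)
class Probe:
    block: str                    # basic block holding the instrumentation points
    ips: tuple[int, ...]          # instrumentation points served by this probe
    saved_regs: frozenset[str]    # registers spilled and reloaded around the payload
    inline_fast_path: bool = False


Pass = Callable[[list[Probe]], list[Probe]]


def coalesce_pass(ir: list[Probe]) -> list[Probe]:
    """Merge all probes of a block into one (no legality checks in this toy)."""
    merged: dict[str, Probe] = {}
    for p in ir:
        if p.block in merged:
            m = merged[p.block]
            merged[p.block] = replace(m, ips=m.ips + p.ips,
                                      saved_regs=m.saved_regs | p.saved_regs)
        else:
            merged[p.block] = p
    return list(merged.values())


def context_switch_pass(ir: list[Probe]) -> list[Probe]:
    """Keep only the registers the payload actually writes."""
    return [replace(p, saved_regs=p.saved_regs & PAYLOAD_WRITES) for p in ir]


def inlining_pass(ir: list[Probe]) -> list[Probe]:
    """Mark each probe to use an inlined fast path of the payload."""
    return [replace(p, inline_fast_path=True) for p in ir]


def run_passes(ir: list[Probe], passes: list[Pass]) -> list[Probe]:
    for run in passes:
        ir = run(ir)  # each pass rewrites the IR produced by the previous one
    return ir


# One probe per IP, each saving every register, as a base instrumenter might emit.
ir = [Probe("B0", (1,), ALL_REGS), Probe("B0", (2,), ALL_REGS),
      Probe("B0", (3,), ALL_REGS), Probe("B1", (4,), ALL_REGS)]
optimized = run_passes(ir, [coalesce_pass, context_switch_pass, inlining_pass])
for p in optimized:
    print(p.block, p.ips, sorted(p.saved_regs), p.inline_fast_path)
```

Keeping passes as independent IR-to-IR functions makes it easy to enable, disable, or reorder optimizations when measuring their individual effect.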
Case Studies
• Case study 1: Program profiling
  • Lightweight instrumentation application
  • Lower initial overhead leaves less room for improvement
  • Demonstrates the efficacy of the optimizations in an unfavorable scenario
• Case study 2: Memory simulation
  • Relatively heavyweight instrumentation application
  • Allows comparison with state-of-the-art systems to show the benefits of optimization
Case study 1: Program profiling
• The benefit of optimization varies with the initial overhead
• Speedups range from 1.26 to 2.63
Case study 2: Memory Simulation
• Strata-Embra is a SPARC implementation of the Embra cache simulator from SimOS
• Strata-Embra-Opt is the cache simulator optimized with INS-OP
• INS-OP speeds up the fastest cache simulator we could find by 2 to 3.3 times
Conclusions
• Introduced "instrumentation optimization" to reduce the cost of instrumented code
  • Reduce the probe count
  • Reduce the cost of an individual probe
  • Reduce the cost of the payload
  • Speedups between 1.2 and 3.3 times
• More detailed information gathering
  • Accuracy need not be sacrificed for efficiency
• Feasibility of certain applications
  • Run-time monitoring becomes more feasible
  • Example: applications that perform continuous testing