Low Overhead Program Monitoring and Profiling

Naveen Kumar, Bruce Childers
Department of Computer Science, University of Pittsburgh, Pittsburgh, Pennsylvania 15260
{naveen, childers}@cs.pitt.edu

Mary Lou Soffa
Department of Computer Science, University of Virginia, Charlottesville, Virginia 22904
soffa@virginia.edu

Introduction
• Program instrumentation: insertion of additional code into a program
    • Monitors program behavior or gathers information
    • Can be inserted at the source, intermediate, or binary level
• Applications
    • Detect program invariants [Ernst]
    • Dynamic slicing [Zhang]
    • Software testing [Misurda]
    • Software security checks [Scott]

Running Example
• Consider a software security system that monitors the memory behavior of untrusted programs (e.g., DynamoRIO)
• Instrumentation at the binary instruction level
• Instrument all loads and stores
• The program can be instrumented statically as well as dynamically

Static instrumentation
Example from gzip. Instrumentation performed before execution starts.
Application code:
    r[o1] = r[o1] << 10
    r[o1] = r[o1] + 0x228
    r[o0] = r[o2] << 0x14
    r[l4] = r[o0] << 0x14
    M[r[l0] + 0x10] = r[o2]
    M[r[o1] + 0x228] = r[o0]
    r[i4] = r[o1]
    r[l1] = r[o0]
    jmp r[31]
    …
    M[r[l0] + 0x20] = r[o0]
    r[sp] = r[sp] - 112
    r[o0] = r[o0] << 10
    r[o1] = M[r[o0] + 0x3d0]
    …
Jumps to probes: jmp probe1, jmp probe2, jmp probe3, jmp probe4
Probes:
    probe1: call secure(…)
    probe2: call secure(…)
    probe3: call secure(…)
    probe4: call secure(…)
probe1 in detail:
    probe1: M[r[sp] + -20] = r[l0]
    save
    call save_gp_regs
    …
    r[o0] = M[r[sp] + 0x68]
    r[o0] = r[o0] + 0x10
    call secure
    r[o1] = r[g0] + 1
    call restore_gp_regs
    restore
    r[sp] = r[sp] + 124
    M[r[l0] + 0x10] = r[o2]
    jmp probe1_ret

Dynamic instrumentation
• Instrumentation performed at run-time on code that executes
• More powerful than static instrumentation, possibly less expensive
Application code:
    r[o1] = r[o1] << 10
    r[o1] = r[o1] + 0x228
    r[o0] = r[o2] << 0x14
    r[l4] = r[o0] << 0x14
    M[r[l0] + 0x10] = r[o2]
    M[r[o1] + 0x228] = r[o0]
    r[i4] = r[o1]
    r[l1] = r[o0]
    jmp r[31]
    …
    M[r[l0] + 0x20] = r[o0]
    r[sp] = r[sp] - 112
    r[o0] = r[o0] << 10
    r[o1] = M[r[o0] + 0x3d0]
    …
Jumps to probes: jmp probe1, jmp probe2, jmp probe3, jmp probe4
Probes:
    probe1: call secure(…)
    probe2: call secure(…)
    probe3: call secure(…)
    probe4: call secure(…)

Motivation
• Stumbling block: high overhead
    • Slowdown by an order of magnitude or more [Ernst]
• Existing solutions: user guided
    • Sampling [Arnold]
    • Smaller data sets analyzed (Test data set of SPEC instead of Ref) [Mock]
    • Less aggressive uses, especially in dynamic settings [Duesterwald]
• The user has to decide how best to apply instrumentation
• What is needed: automatic techniques that mitigate the overheads systematically

Goals
• Gather exact information
    • Separate accuracy from efficiency
    • The user should focus on what to gather, rather than how to gather it efficiently
• Efficient
    • Comparable to hand-optimized instrumentation
• Automatic
    • Little or no user guidance

Instrumentation Optimization
• Costs associated with instrumentation
    • Dynamic probe count: number of probes executed
    • Probe cost: number of instructions in a probe
    • Payload cost: frequency of invocation and cost of payload
• Optimize instrumentation code to reduce costs
    • Dynamic probe coalescing
    • Partial context switches
    • Partial payload inlining
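The three costs above can be combined into a rough overhead estimate. The sketch below is only an illustration: the `Probe` fields, the `overhead` formula, and all instruction counts are assumptions made for this example, not figures or definitions from the talk.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    executions: int      # dynamic probe count: how often the probe runs
    probe_instrs: int    # probe cost: instructions in the probe itself
    payload_calls: int   # payload invocations per probe execution
    payload_instrs: int  # instructions per payload invocation


def overhead(probes):
    """Estimated extra instructions executed because of instrumentation."""
    return sum(
        p.executions * (p.probe_instrs + p.payload_calls * p.payload_instrs)
        for p in probes
    )


# Hypothetical numbers: two separate probes versus one probe doing both calls.
base = [Probe(1000, 40, 1, 25), Probe(1000, 40, 1, 25)]
coalesced = [Probe(1000, 40, 2, 25)]

print(overhead(base), overhead(coalesced))
```

Under this model, each of the three optimizations targets one term: coalescing lowers the number of probe executions, partial context switches lower `probe_instrs`, and partial inlining lowers the payload term.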

Base Instrumenter
• The base instrumenter generates a list of instrumentation points
Application code:
    r[o1] = r[o1] << 10
    r[o1] = r[o1] + 0x228
    r[o0] = r[o2] << 0x14
    r[l4] = r[o0] << 0x14
    M[r[l0] + 0x10] = r[o2]
    M[r[o1] + 0x228] = r[o0]
    r[i4] = r[o1]
    r[l1] = r[o0]
    jmp r[31]
    …
    M[r[l0] + 0x20] = r[o0]
    r[sp] = r[sp] - 112
    r[o0] = r[o0] << 10
    r[o1] = M[r[o0] + 0x3d0]
    …
Jumps to probes: jmp probe1, jmp probe2, jmp probe3, jmp probe4
Probes:
    probe1: call secure(…)
    probe2: call secure(…)
    probe3: call secure(…)
    probe4: call secure(…)

Dynamic Probe Coalescing
Application code:
    r[o1] = r[o1] << 10
    r[o1] = r[o1] + 0x228
    r[o0] = r[o2] << 0x14
    r[l4] = r[o0] << 0x14
    M[r[l0] + 0x10] = r[o2]
    M[r[o1] + 0x228] = r[o0]
    r[i4] = r[o1]
    r[l1] = r[o0]
    jmp r[31]
    …
    M[r[l0] + 0x20] = r[o0]
    r[sp] = r[sp] - 112
    r[o0] = r[o0] << 10
    r[o1] = M[r[o0] + 0x3d0]
    …
Probes before coalescing (jmp probe1, jmp probe2, jmp probe3, jmp probe4):
    probe1: call secure(…)
    probe2: call secure(…)
    probe3: call secure(…)
    probe4: call secure(…)
Probes with two calls coalesced (jmp probe5, jmp probe3, jmp probe4):
    probe5: call secure(…)
            call secure(…)
    probe3: call secure(…)
    probe4: call secure(…)
Probes with three calls coalesced (jmp probe6, jmp probe4):
    probe6: call secure(…)
            call secure(…)
            call secure(…)
    probe4: call secure(…)
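The slide shows several single-call probes being merged into one probe that makes several payload calls. A minimal sketch of that idea follows. The merging criterion used here (adjacent instrumentation points that share a region label) is an assumption for illustration, since the slides do not spell out the actual condition; `coalesce` and the region labels `B1`/`B2` are hypothetical names.

```python
def coalesce(points):
    """Merge adjacent instrumentation points that share a region.

    points: list of (region, offset) pairs in program order.
    Returns a list of probes, each a (region, [offsets]) pair, so one
    probe performs one payload call per collected offset.
    """
    probes = []
    for region, offset in points:
        if probes and probes[-1][0] == region:
            probes[-1][1].append(offset)   # reuse the previous probe
        else:
            probes.append((region, [offset]))  # start a new probe
    return probes


# Four memory operations, as in the running example; regions are made up.
points = [("B1", 0x10), ("B1", 0x228), ("B1", 0x20), ("B2", 0x3D0)]
probes = coalesce(points)
print(probes)
```

With these inputs, four probes become two: one that makes three payload calls and one that makes a single call, which matches the shape of probe6 and probe4 on the slide.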

Partial Context Switch
• Analyze register usage in the payload
• Remove spill and reload of GP registers
• Registers used in payload: {…}; not used: {g0…g7}
Application code:
    r[o1] = r[o1] << 10
    r[o1] = r[o1] + 0x228
    r[o0] = r[o2] << 0x14
    r[l4] = r[o0] << 0x14
    M[r[l0] + 0x10] = r[o2]
    M[r[o1] + 0x228] = r[o0]
    r[i4] = r[o1]
    r[l1] = r[o0]
    jmp r[31]
    …
    M[r[l0] + 0x20] = r[o0]
    r[sp] = r[sp] - 112
    r[o0] = r[o0] << 10
    r[o1] = M[r[o0] + 0x3d0]
    …
Jumps to probes: jmp probe6, jmp probe4
Probes:
    probe6: call secure(…)
            call secure(…)
            call secure(…)
    probe4: call secure(…)
probe6 in detail:
    probe6: M[r[sp] - 20] = r[l0]
    M[r[sp] - 28] = r[o1]
    save
    call save_gp_regs
    … effective address …
    call secure
    … effective address …
    call secure
    … effective address …
    call secure
    call restore_gp_regs
    restore
    …
    jmp probe6_ret
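The register analysis behind a partial context switch can be sketched as a set filter: keep only the registers that the payload may overwrite and skip the rest. The register lists below are hypothetical (the slide elides the payload's actual register set), and `partial_context` is an illustrative name, not part of the described system.

```python
def partial_context(full_context, payload_uses):
    """Return the registers that still need a spill and reload.

    full_context: registers a full context switch would save, in order.
    payload_uses: registers the payload may overwrite.
    """
    return [reg for reg in full_context if reg in payload_uses]


# A full switch here saves SPARC-style global and out registers.
full_context = [f"g{i}" for i in range(8)] + [f"o{i}" for i in range(8)]

# Assumed payload register usage; the globals g0..g7 are untouched.
payload_uses = {"o0", "o1", "o3"}

saved = partial_context(full_context, payload_uses)
print(saved)
```

In this example the probe spills three registers instead of sixteen, and none of the global registers are saved, in line with the "Not used: {g0…g7}" note on the slide.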

Partial Payload Inlining
Payload:
    void secure(address) {
        if(address > REDZONE) return;
        redAlerts++;
        createReport();
        if(critical(address)) assert(address);
    }
Payload code:
    r[o1] = M[r[g1] + 0]
    r[o1] = r[o1] - r[o0]
    r[i0] = 1
    jmp r[31]
    …
    r[o3] = M[r[g2] + 0]
    r[o3] = r[o3] + 1
    …
    !call createReport
    …
    !call assert
    call __full_secure
Split payload:
    void __inlined_secure(address) { … __full_secure(address, tag); }
    void __full_secure(address, tag) { … }
Application code:
    r[o1] = r[o1] << 10
    r[o1] = r[o1] + 0x228
    r[o0] = r[o2] << 0x14
    r[l4] = r[o0] << 0x14
    M[r[l0] + 0x10] = r[o2]
    M[r[o1] + 0x228] = r[o0]
    r[i4] = r[o1]
    r[l1] = r[o0]
    jmp r[31]
    …
    M[r[l0] + 0x20] = r[o0]
    r[sp] = r[sp] - 112
    r[o0] = r[o0] << 10
    r[o1] = M[r[o0] + 0x3d0]
    …
Jumps to probes: jmp probe6, jmp probe4
probe6 in detail:
    probe6: M[r[sp] - 20] = r[l0]
    M[r[sp] - 28] = r[o1]
    r[sp] = r[sp] - 140
    … effective address …
    call secure
    … effective address …
    call secure
    … effective address …
    call secure
    r[sp] = r[sp] + 140
    …
    jmp probe6_ret

Implementation
• Strata: dynamic translation system [Scott et al.]
    • Generates code at run-time for an application
    • Suitable for dynamic instrumentation
• FIST: base instrumentation system [Kumar et al.]
    • Flexible for diverse instrumentation needs
    • Generates a list of instrumentation points (IPs)
• INS-OP: developed in this work
    • Constructs an IR for the list of IPs obtained from FIST
    • Each optimization is a pass that modifies the IR
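The pass-based structure described for INS-OP can be pictured as a small pipeline over a list of probe records. Everything in this sketch is hypothetical: the dictionary-based IR, the field names, and the two toy passes are stand-ins for illustration and do not reflect the real INS-OP data structures.

```python
def coalesce_pass(ir):
    """Merge adjacent probes in the same region, summing payload calls."""
    out = []
    for probe in ir:
        if out and out[-1]["block"] == probe["block"]:
            merged_calls = out[-1]["calls"] + probe["calls"]
            out[-1] = {**out[-1], "calls": merged_calls}
        else:
            out.append(dict(probe))  # copy so the input IR is untouched
    return out


def partial_context_pass(ir):
    """Mark every probe as needing only a partial register save."""
    return [{**probe, "save": "partial"} for probe in ir]


def run_passes(ir, passes):
    """Apply each optimization pass in order and return the new IR."""
    for optimization in passes:
        ir = optimization(ir)
    return ir


ir = [
    {"block": "B1", "calls": 1, "save": "full"},
    {"block": "B1", "calls": 1, "save": "full"},
    {"block": "B2", "calls": 1, "save": "full"},
]
optimized = run_passes(ir, [coalesce_pass, partial_context_pass])
print(optimized)
```

Keeping each optimization as a function from IR to IR makes the passes easy to reorder or disable individually, which is one plausible reason to structure an optimizer this way.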

Case Studies
• Case study 1: Program profiling
    • Lightweight instrumentation application
    • Lower initial overhead implies smaller benefits
    • Demonstrates the efficacy of the optimizations in an unfavorable scenario
• Case study 2: Memory simulation
    • Relatively heavyweight instrumentation application
    • Can be compared with state-of-the-art systems to see the benefits of optimization

Case study 1: Program profiling
• The benefit of optimization varies; it depends upon the initial overhead
• The speedups range from 1.26 to 2.63

Case study 2: Memory Simulation
• Strata-Embra is a SPARC implementation of the cache simulator from SimOS
• Strata-Embra-Opt is the cache simulator optimized using INS-OP
• INS-OP speeds up the fastest cache simulator we could find by a factor of 2 to 3.3

Conclusions
• Introduced "instrumentation optimization" to reduce the cost of instrumented code
    • Reduce the probe count
    • Reduce the cost of an individual probe
    • Reduce the cost of the payload
• Speedups of 1.2 to 3.3 times
• More detailed information gathering
    • Accuracy need not be sacrificed for efficiency
• Feasibility of certain applications
    • Run-time monitoring becomes more feasible
    • Example: applications that perform continuous testing

Effectiveness of optimizations
