Dynamic Analysis: Looking Back and the Road Ahead Trishul Chilimbi Runtime Analysis & Design (RAD) Research in Software Engineering (RiSE) Microsoft Research
Dynamic Analysis Breakdown Measurement Representation Analysis WODA '09
Measurement Methodology • Program → Compiler → Executable → Machine • Source Instrumentation • Binary Instrumentation: ATOM (PLDI’94), EEL (PLDI’95) • Instruction Emulation: Dynamo (PLDI’00) • Hardware Instrumentation: DCPI (SOSP’97)
Measurement Efficiency • Hardware performance counters • DCPI (SOSP’97) • Sampling • Bursty Tracing (PLDI’01, FDDO’01) • Program Analysis • Path Profiling (MICRO’96)
Representation • Raw • Trace • Structured • Path Profile (MICRO’96) • Whole Program Paths (PLDI’99) • Whole Program Data Accesses (PLDI’01) • Custom • Eraser’s Lock Set (SOSP’97)
Analysis • Performance • Profiling and profile-driven optimization • Correctness • Bug detection, heap and concurrency checkers • Security • Security monitors, Taint Analysis
Dynamic Analysis: The Road Ahead • Industrial-strength dynamic analysis • Scaling dynamic analysis to process and analyze large quantities of data • System Level • Data Centers, Multi-core
Scaling Dynamic Analyses • System level analysis • Instrumentation • Event Tracing for Windows (ETW) • Data volume • Statistical Analysis • Visualization
ETW Tracing Infrastructure • General purpose real-time event logging facility • Core component of Windows operating systems starting with Windows 2000, continually extended and improved • High speed • 1200 to 2000 cycles per logged event • Low overhead • less than 5% of the total CPU cycles for 20,000 events/sec • Works for both user mode applications and drivers • Immune to app crashes and hangs • Writes to a file or to a real time listener • Dynamically enabled or disabled • No re-compile, no reboots, no app restarts, … • Designed for app tracing in production mode • Scalable
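As a rough sanity check on the speed and overhead figures above, the per-event cost can be multiplied by the event rate. This is only a back-of-the-envelope sketch: the 2 GHz single-core clock rate is an assumption made for illustration and does not come from the talk.

```python
# Back-of-the-envelope check of the ETW overhead figures quoted above.
# ASSUMPTION: a single 2 GHz core; the talk does not state a clock rate.
CLOCK_HZ = 2_000_000_000
EVENTS_PER_SEC = 20_000


def logging_overhead(cycles_per_event, events_per_sec=EVENTS_PER_SEC, clock_hz=CLOCK_HZ):
    """Return the fraction of one core's cycles spent logging events."""
    return cycles_per_event * events_per_sec / clock_hz


low = logging_overhead(1200)   # cheapest quoted cost per logged event
high = logging_overhead(2000)  # most expensive quoted cost per logged event
print(f"{low:.1%} to {high:.1%} of one core")
```

Under that assumed clock rate both ends of the range stay below the 5% bound; a slower clock would raise the fraction proportionally.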
ETW Architecture • Controller: session control, enable/disable events • Event Tracing Sessions (Session 1, Session 2, … Session 64): buffers, logged events, trace files, real time delivery • Providers (Provider A, Provider B, Provider C): events • Consumers
ETW Performance Diagnostics • OS • Process/thread activity • Module load • Disk and File IO • TCPIP/UDPIP • Pagefault • Registry • Context Switch • Heap and Critical Section • Server applications • Active Directory • IIS6 • File Server • Print Server • Exchange Server
ETW Statistics • Kernel logger outputs: • ~100K events in minutes • ~200KB binary file • ~100MB text dump • Multiple traces/day • Expert analysis • Processing a trace file: a few minutes • Manual diagnosis time: sometimes minutes, sometimes hours • Manual diagnosis cannot keep up with rate of trace collection
Scaling Dynamic Analyses • System level analysis • Instrumentation • Data volume • Statistical Analysis • Visualization
HangViz • Lock/resource contention lies at the root of many performance problems • Kernel manages most resources, so they are not visible to the application developer • Our solution • Start from an observed hang • Pull out all relevant lock-related waits, represented as a directed acyclic graph (DAG) • Highlight critical path • Provide visualization tool for further exploration • Iterative feedback cycle • Joint work with Alice Zheng, Steve Hsiao, David Andrzejewski
HangViz Outline • Constructing the Ready DAG • Finding the critical path • Visualization
Constructing A Ready DAG • Relevant ETW events • CSwitch: context switches • ReadyThread: thread releasing resource • Stack: lock functions • Currently ETW does not track lock object ID • Stack functions are used to differentiate between different locks, but the signature is not perfect • Sequence of wait and run intervals and ReadyThread signals can be represented as a directed acyclic graph (DAG)
Example: Simple Ready Chain Outlook UI (waiting) Outlook UI (running) ReadyThread file lock SearchIndexer (waiting) SearchIndexer (running) ReadyThread file lock eTrust (running)
Complications: Non-Immediate Waits • The immediate ready chain may not be the root cause of the problem
Example: Ready Tree Outlook UI (waiting) Outlook UI (running) ReadyThread file lock SearchIndexer (waiting) SI (running) SI (waiting) SI (running) ReadyThread file lock ReadyThread registry lock eTrust (running) Systems thread (running)
Solution: Follow Overlapping Waits • Look at all ready chains during the long wait • Follow any wait of the parent thread (e.g., SearchIndexer) that overlaps with the child wait • Repeat on parent thread • Optional search depth to limit branching factor
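The overlapping-wait search can be sketched as a bounded graph traversal. This is a minimal illustration, not the HangViz implementation: the `Wait` record, its `readied_by` field, and the timestamps are hypothetical stand-ins for what would be derived from CSwitch and ReadyThread events.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Wait:
    """One wait interval of a thread (hypothetical record layout)."""
    thread: str
    start: int                 # timestamps in arbitrary units
    end: int
    readied_by: Optional[str]  # thread whose ReadyThread signal ended this wait


def overlaps(a, b):
    """True if the two wait intervals share any time."""
    return a.start < b.end and b.start < a.end


def build_ready_graph(root, waits, max_depth=3):
    """Return (child_wait, parent_wait) edges reachable from the long wait `root`."""
    edges, seen, frontier = [], {root}, [(root, 0)]
    while frontier:
        child, depth = frontier.pop()
        if depth >= max_depth or child.readied_by is None:
            continue  # the optional search depth bounds the exploration
        for w in waits:
            # Follow every wait of the parent thread that overlaps the child wait.
            if w.thread == child.readied_by and overlaps(w, child):
                edges.append((child, w))
                if w not in seen:
                    seen.add(w)
                    frontier.append((w, depth + 1))
    return edges


# Illustrative data loosely modeled on the Outlook / SearchIndexer example.
ui = Wait("Outlook UI", 0, 100, readied_by="SearchIndexer")
si_file = Wait("SearchIndexer", 10, 40, readied_by="eTrust")
si_registry = Wait("SearchIndexer", 50, 90, readied_by="System")
edges = build_ready_graph(ui, [ui, si_file, si_registry])
for child, parent in edges:
    print(f"{child.thread} waited on {parent.thread}, readied by {parent.readied_by}")
```

Each wait is expanded at most once, so the traversal terminates even if the recorded signals happen to form a cycle.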
More Complications: False Runs • The thread runs, but not because the resource has been released • Timer wake up: thread wakes up, lock is still not available, thread goes back to sleep • APC: thread is woken up to execute code for someone else • Bottom line: timer wake ups and APCs do NOT terminate the wait, and should be counted towards total wait time
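One way to account for false runs is to stitch together consecutive wait intervals whenever the wake-up between them was a timer or an APC. The sketch below assumes each interval already carries a `woken_by` label; that field and its values are hypothetical, and deriving them from a real trace is not shown.

```python
# Wake-up reasons that do not end the logical wait (labels are assumptions).
FALSE_WAKEUPS = {"timer", "apc"}


def merge_false_runs(waits):
    """Merge wait intervals of one thread that are split only by false runs.

    `waits` is a list of dicts with 'start', 'end', and 'woken_by', sorted by
    start time. Returns (start, end, total_wait) tuples, one per logical wait.
    """
    merged = []
    cur_start, total = None, 0
    for w in waits:
        if cur_start is None:
            cur_start = w["start"]
        total += w["end"] - w["start"]
        if w["woken_by"] not in FALSE_WAKEUPS:
            # A genuine ReadyThread signal ends the logical wait.
            merged.append((cur_start, w["end"], total))
            cur_start, total = None, 0
    if cur_start is not None:
        # The trace ended while the thread was still waiting.
        merged.append((cur_start, waits[-1]["end"], total))
    return merged


waits = [
    {"start": 0, "end": 10, "woken_by": "timer"},
    {"start": 11, "end": 30, "woken_by": "apc"},
    {"start": 32, "end": 50, "woken_by": "ready"},
    {"start": 60, "end": 70, "woken_by": "ready"},
]
merged = merge_false_runs(waits)
print(merged)
```

The first three intervals collapse into a single logical wait, so its total reflects time spent blocked across the timer wake-up and the APC.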
Example: APCs Outlook UI (waiting) Outlook UI (running) ReadyThread file lock SI (waiting) SI (running) SI (waiting) SI (running) SI (waiting) SI (running) SearchIndexer (waiting) ReadyThreadAPC ReadyThread Systems thread IExplorer (running)
Finding Individual Critical Waits • Algorithm for finding individual critical waits • Bucket wait times by their lock set (set of lock-related functions on the stack) • For each lock set, build probabilistic model of wait time • Gaussian, exponential, Gamma, or mixture of Gaussians • Select the best model for each lock set • A long wait is critical if it has extremely low probability under the model
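The bucketing and model-selection steps can be sketched with the standard library alone. This is a simplified illustration: it considers only the Gaussian and exponential candidates (Gamma and mixture models are omitted), picks between them with AIC, and uses an arbitrary tail-probability threshold. All three choices are assumptions, not details from the talk.

```python
import math
from collections import defaultdict
from statistics import NormalDist, fmean, pstdev


def fit_best(samples):
    """Fit Gaussian and exponential models by maximum likelihood.

    Returns (model_name, tail_fn), where tail_fn(x) estimates P(wait >= x).
    Assumes all wait times are positive.
    """
    n, mu = len(samples), fmean(samples)
    sigma = pstdev(samples) or 1e-9
    # Log-likelihoods evaluated at the maximum-likelihood parameters.
    ll_gauss = -0.5 * n * (math.log(2 * math.pi * sigma ** 2) + 1)
    ll_expo = -n * (math.log(mu) + 1)
    aic_gauss = 2 * 2 - 2 * ll_gauss  # two parameters: mean and variance
    aic_expo = 2 * 1 - 2 * ll_expo    # one parameter: rate
    if aic_gauss <= aic_expo:
        dist = NormalDist(mu, sigma)
        return "gaussian", lambda x: 1.0 - dist.cdf(x)
    return "exponential", lambda x: math.exp(-x / mu)


def critical_waits(observations, threshold=1e-4):
    """Flag waits whose tail probability under their lock set's model is tiny.

    `observations` is an iterable of (lock_set, wait_time) pairs, where
    lock_set is a frozenset of lock-related function names on the stack.
    """
    buckets = defaultdict(list)
    for lock_set, wait in observations:
        buckets[lock_set].append(wait)
    flagged = []
    for lock_set, waits in buckets.items():
        name, tail = fit_best(waits)
        flagged += [(lock_set, w, name) for w in waits if tail(w) < threshold]
    return flagged


# Illustrative data: many short waits and one very long wait on the same lock set.
ls = frozenset({"EnterCriticalRegion"})
observations = [(ls, 9)] * 100 + [(ls, 11)] * 100 + [(ls, 500)]
flagged = critical_waits(observations)
print([(w, name) for _, w, name in flagged])
```

Note that the long wait is included in the data used to fit the model; a more careful version might fit on a robust subset so that outliers do not inflate the estimated spread.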
Probabilistic Model of Wait Times • Figure: EnterCriticalRegion wait time histogram with mixture of Gaussians model • Marked wait times: 7 µs, 10829 µs, 41 s • Annotation: low probability!
Finding the Critical Path • Ready DAGs can be complex • But there should be only one critical path • One resource holding up the entire chain (for example, network or I/O) • Multiple threads on the chain are experiencing long waits • Critical path probably has longest average wait time • Other possible metrics • maximum wait time: might be shared among multiple paths • longest chain: could have many short waits • longest chain with longest average wait time • Possible expansion to cross-trace analysis
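The "longest average wait time" heuristic can be sketched by enumerating root-to-leaf paths of the ready DAG and scoring each one. The adjacency-dict representation, node names, and wait times below are hypothetical and chosen only for illustration.

```python
def all_paths(dag, node):
    """Yield every path from `node` down to a leaf of the DAG."""
    children = dag.get(node, [])
    if not children:
        yield [node]
        return
    for child in children:
        for rest in all_paths(dag, child):
            yield [node] + rest


def critical_path(dag, root, wait_time):
    """Return the root-to-leaf path with the largest average wait time."""
    return max(
        all_paths(dag, root),
        key=lambda path: sum(wait_time[n] for n in path) / len(path),
    )


# Hypothetical ready tree: the UI wait depends on two SearchIndexer waits.
dag = {
    "Outlook UI": ["SI file wait", "SI registry wait"],
    "SI file wait": ["eTrust"],
    "SI registry wait": ["System thread"],
}
wait_time = {
    "Outlook UI": 100,
    "SI file wait": 30,
    "SI registry wait": 60,
    "eTrust": 5,
    "System thread": 50,
}
path = critical_path(dag, "Outlook UI", wait_time)
print(" -> ".join(path))
```

Exhaustive enumeration is exponential in the worst case, which is one reason a bounded search depth helps. Swapping the `key` function is enough to try the other metrics listed above, such as maximum wait time or chain length.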
Screen Shot I Generated ReadyTree (anomalous waits highlighted in red)
Screen Shot (close-up)
Screen Shot (Annotation) Changing anomaly annotation
AllocRay (with George Robertson, VIBE) • A picture is worth a “million words” [of trace data] • Heap Allocation “movies” expose problems • Easy to use and supports deep exploration • Observe instantaneous program behavior • Investigate memory footprint, working set (WS), fragmentation, leaks • Figure labels: BAD, GOOD
AllocRay Heap Allocation “Movies” • Colors and filters help focus on different behaviors • Memory footprint • Fragmentation • Pixels are tied to events and call stacks to facilitate investigation
Scaling Dynamic Analyses • Data centers • 10,000+ machines running web services such as search, mail, online shopping • Large opportunity for dynamic analyses to reduce data center operations cost • 10,000 x 100 metrics/minute -> 10+GB/day
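The data-volume estimate can be reproduced with a short calculation. The 8 bytes per sample is an assumption (one 64-bit value with no timestamp or key); the talk gives only the machine and metric counts and the resulting order of magnitude.

```python
MACHINES = 10_000
METRICS_PER_MACHINE_PER_MIN = 100
BYTES_PER_SAMPLE = 8  # ASSUMPTION: one 64-bit value per sample, no metadata

# Samples collected across the whole data center in one day.
samples_per_day = MACHINES * METRICS_PER_MACHINE_PER_MIN * 60 * 24
gb_per_day = samples_per_day * BYTES_PER_SAMPLE / 1e9
print(samples_per_day, round(gb_per_day, 1))  # prints: 1440000000 11.5
```

Any per-sample metadata such as timestamps or machine identifiers would push the total higher, which is consistent with the "10+" on the slide.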
Statistical Debugging (Liblit et al. PLDI’03) • Algorithm sketch • Collect code profiles for a large number of successful and failing runs of the program • Find code fragments that strongly correlate with failure • Cause vs. correlation • Assuming that correlation implies causation is a logical fallacy! • Example: error handling code correlates with failure but does not cause it • Statistical debugging: build a statistical model of program outcome that discriminates cause from correlation
Holmes (Chilimbi et al. ICSE’09) Statistical model … if (y=0) { x = x + 1 } … Statistical analysis Path profiles from successful and failing runs … … Bug predictors (likely root cause)
Statistical analysis • Differentiate cause from correlation • Key idea: find path fragments that strongly correlate with failure while the context in which the fragment occurs does not • Figure: context of a path in foo(x, y), with nodes a b c d e f
Statistical model • Inputs • A set of path profiles, one for each run • Each run’s outcome (success/failure) • Compute four statistics for each path • So(p), Fo(p) : number of successful/failing runs in which context of path p was executed • Se(p), Fe(p) : number of successful/failing runs in which path p was executed
Statistical model • How much is the context of a path correlated with failure? • In how many failures does the path occur? • How much more is the path correlated with failure than its context? • Overall measure that combines sensitivity and increase (specificity)
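The formulas for these scores are not present in the slide text. The sketch below assumes the definitions commonly used in the statistical-debugging literature (Context, Failure, Increase, and an Importance score that is the harmonic mean of Increase and a log-scaled sensitivity), applied to the four counts So, Fo, Se, Fe. The function name and the choice of these particular formulas are assumptions, not a transcription of the talk.

```python
import math


def path_scores(So, Fo, Se, Fe, num_failing):
    """Compute assumed statistical-debugging scores for one path.

    So, Fo: successful/failing runs in which the path's context was executed.
    Se, Fe: successful/failing runs in which the path itself was executed.
    num_failing: total number of failing runs.
    """
    context = Fo / (So + Fo)      # how strongly the context predicts failure
    failure = Fe / (Se + Fe)      # how strongly the path predicts failure
    increase = failure - context  # extra predictive power of the path
    # Sensitivity: log-scaled share of failing runs that execute the path.
    if Fe > 0 and num_failing > 1:
        sensitivity = math.log(Fe) / math.log(num_failing)
    else:
        sensitivity = 0.0
    if increase <= 0 or sensitivity <= 0:
        importance = 0.0
    else:
        # Harmonic mean of increase (specificity) and sensitivity.
        importance = 2 / (1 / increase + 1 / sensitivity)
    return {
        "context": context,
        "failure": failure,
        "increase": increase,
        "sensitivity": sensitivity,
        "importance": importance,
    }


# Illustrative counts: the context is reached in most runs, but the path
# itself is executed almost only in failing runs.
scores = path_scores(So=90, Fo=10, Se=2, Fe=8, num_failing=10)
print({k: round(v, 3) for k, v in scores.items()})
```

A path whose context is already highly correlated with failure gets a small Increase, which is how this style of scoring separates likely causes from code that merely executes on the way to a failure.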
Holmes in action: EDG C++ compiler • Importance • Context • Increase
Branches, predicates, and paths: how close do they get you? • Study of 45 bugs in 6 applications from the SIR benchmark suite • Path profiles take you down the right path!
Bug-directed Adaptive Profiling Production environment myapp.dll Profiles … Holmes profiling tools Statistical analysis Bug reports Bug predictors Holmes backend Root cause while (is_eof_token(ch)) { } if (id == 1) { } Static analysis myapp.cpp
Adaptive Profiling • Bootstrapping • Stack traces • Branch profiles • Iterative Profiling • Additional function selection using coupling • Strengthening weak predictors with richer profiles
Holmes: Non-Adaptive vs. Adaptive
Holmes: Adaptive Overheads • Time Overhead (%) • Space Overhead (%)
Dynamic Analysis & Data Centers • Data center environment is more controlled • System level vs. application level metrics • What is the analogue of paths that provides context? • Need predictive capability to take action • Reboot, Reimage, Notify operator
Conclusion • Dynamic analyses have been successfully used to improve program performance, reliability, and security • Efficient measurement • Need to scale dynamic analysis to industrial strength to address challenges posed by system-level analysis, multi-core, and data centers • Efficient data management and analysis • Data management: Database/ Map-Reduce style processing • Statistical Analysis Techniques