Advanced Performance Forensics
Uncovering the Mysteries of Performance and Scalability Incidents through Forensic Engineering
Stephen Feldman, Senior Director, Performance Engineering and Architecture
stephen.feldman@blackboard.com
Session Goals
The goals of today's session are:
• Introduce the practice of performance forensics.
• Present an argument for session-level analysis.
• Discuss the difference between resources and interfaces.
• Present tools that can be used for performance forensics at different layers of the architectural stack and at the client layer.
Definition of Performance Forensics
• The practice of collecting evidence, conducting interviews, and building models for the purpose of root-cause analysis of a performance or scalability problem.
• In the context of a performance (response time) problem, the focus is on an individual event: the session experience.
• Performance problems can be classified in two main categories:
• Response time latency
• Queuing latency
Putting Performance Forensics in Context
• Emphasis on the user and the user's actions and experiences. How can this be measured?
• Capture the response time experience and the response time expectations of the user.
• Put user actions in line with the goals of Method R (what's most important to the business).
• Identify the contributors to response latency.
• Everyone needs to be involved.
Measuring the Session
• When should this happen? When a problem statement cannot be developed from the data you already have (evidence or interviews) and more data needs to be collected.
• How should you go about this? Minimize disruption to the production environment.
• Adaptive collection: move from less intensive to more intensive over time: Basic Sampling, then Continuous Collection, then Profiling.
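The escalation policy above can be sketched as a small decision function. The level names follow the slide; the function itself is an illustrative assumption about how "less intensive to more intensive" might be automated, not part of the original deck.

```python
def next_collection_level(current, problem_confirmed):
    """Adaptive collection: escalate from less intrusive to more
    intrusive only when the cheaper level failed to yield a problem
    statement. Stop escalating once the evidence is sufficient."""
    levels = ["basic sampling", "continuous collection", "profiling"]
    if problem_confirmed:
        return current  # enough evidence; no need to disrupt production further
    i = levels.index(current)
    return levels[min(i + 1, len(levels) - 1)]  # never past profiling
```

The key design point is that escalation is one-way and bounded: profiling is the ceiling because it is the most disruptive to production.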
Resources vs. Interfaces
• One of the most critical distinctions when deciding which data points to collect.
• Interfaces are critical for understanding throughput and queuing models. Queuing is another cause of latency, and also a cause of time-outs.
• Resources are critical for understanding the cost of performing a transaction. Core resources: CPU, memory, and I/O.
• Response Time = Service Time + Queue Time
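The response-time equation above can be made concrete with a simple M/M/1 queuing sketch. The M/M/1 model is an assumption added here for illustration (the slide names no specific model); it shows why queue time, not service time, dominates as an interface saturates.

```python
def mm1_response_time(service_time, arrival_rate):
    """Response Time = Service Time + Queue Time for an M/M/1 queue.

    Queue time grows without bound as utilization approaches 1,
    which is why interface saturation, rather than per-transaction
    resource cost, often dominates observed latency."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        raise ValueError("system is saturated; the queue grows without bound")
    queue_time = (utilization / (1.0 - utilization)) * service_time
    return service_time + queue_time

# At 50% utilization (0.1 s service, 5 req/s) the queue already
# doubles the response time: 0.1 s service + 0.1 s queue = 0.2 s.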
The Importance of Wait Events
• The rise of session-level forensics: the underlying theme of all of these tools is that the "session" is more important than the "system."
• Wait-event tuning is used to account for latency. It exists in SQL Server (Waits and Queues) and Oracle (event 10046); other components of the stack are not yet mature enough to represent waits.
• Waits are statistical explanations of latency. Each individual wait event might be deceiving, but looking at both aggregates and outliers can explain why a performance problem exists.
• When sampling directly, you usually have only about one hour to act on the data.
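The point about aggregates and outliers can be sketched in a few lines: summarize waits per event name so that totals reveal the dominant event class while maxima expose the worst individual wait. The data shape and event names here are illustrative assumptions, not output from any specific database.

```python
from statistics import mean

def summarize_waits(wait_events):
    """Aggregate wait-event samples by event name.

    wait_events: list of (event_name, wait_seconds) tuples.
    A single sample can be deceiving; the total shows which event
    class dominates, while the max flags outlier waits."""
    by_name = {}
    for name, secs in wait_events:
        by_name.setdefault(name, []).append(secs)
    return {
        name: {"count": len(s), "total": sum(s),
               "mean": mean(s), "max": max(s)}
        for name, s in by_name.items()
    }
```

For example, many short I/O waits may add up to more total latency than one dramatic-looking outlier, or vice versa; only seeing both views tells you which.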
Categories of Tools
• HTTP and User Experience
• JVM Instrumentation Tools
• Database Instrumentation
• Session and Wait Events
• Cost Execution Plans
• Profilers
Fiddler2
• Fiddler 2 measures end-to-end client responsiveness of a web request with little to no overhead (less intrusive forensics).
• Captures requests in order to present HTTP status codes, object sizes, loading sequence, time to process each request, and performance by bandwidth speed.
• Gives a rough estimate of user experience based on locality.
• Inspects every detail of the HTTP request: detailed session inspection and a breakdown of the HTTP transaction.
• Other tools in this category: YSlow/Firebug, Charles Proxy, Live HTTP Headers, and IEInspector.
Coradiant TrueSight
• Commercial tool used for passive user-experience monitoring.
• Captures page-, object-, and session-level data.
• Capable of defining service-level thresholds and automatic incident management.
• Used to trace back a session as if you were watching over the user's shoulder.
• Exceptional tool for trend analysis (less intrusive).
• Primarily used in forensics as evidence for analysis.
• Other tools in this category: Quest User Experience and Citrix EdgeSight.
Log Analyzers
• Both commercial and open-source tools are available to parse and analyze HTTP access logs.
• They provide trend data, client statistics, and HTTP summary information.
• Recommended use: study request and bandwidth trends and correlate them with resource-utilization graphs.
• The data volume is large, so work within small time slices.
• Post-processing tools (no impact on the application).
• Examples: Urchin, Summary, WebTrends, SawMill, Surfstats, and AlterWind Log Analyzer.
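The "small time slices" recommendation can be sketched as a bucketing pass over an access log. This assumes the common/combined log timestamp shape (`[10/Oct/2008:13:55:36 -0700]`); it is an illustrative sketch, not one of the tools listed above.

```python
from collections import Counter
from datetime import datetime

def requests_per_slice(access_log_lines, slice_minutes=5):
    """Bucket combined-format access-log requests into small time
    slices so request trends can be lined up against resource graphs.

    Assumes the timestamp shape [10/Oct/2008:13:55:36 -0700]."""
    counts = Counter()
    for line in access_log_lines:
        start = line.find("[") + 1
        stamp = line[start:start + 20]  # e.g. 10/Oct/2008:13:55:36
        t = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
        bucket = t.replace(minute=t.minute - t.minute % slice_minutes,
                           second=0)  # floor to the slice boundary
        counts[bucket] += 1
    return counts
```

Plotting these per-slice counts next to CPU or I/O graphs for the same window is the correlation step the slide recommends.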
JSTAT
• A low-intrusion statistics collector that provides: percentage usage of each heap region, frequency/counts of collections, and time spent in the pause state.
• Can be invoked at any time without restarting the JVM by supplying the process ID. The exception is on Windows, when the JVM runs as a background service.
• Critical for understanding windows of stall time between samples: if you collect every 5 seconds and observe a 3-second pause, the application could only do useful work for 2 seconds of that window.
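The stall-window arithmetic in the last bullet is worth making explicit, since it is the core of interpreting sampled pause data:

```python
def working_time(sample_interval, pause_time):
    """If jstat samples every `sample_interval` seconds and reports
    `pause_time` seconds of GC pause inside that window, the
    application could only do useful work for the remainder."""
    if pause_time > sample_interval:
        raise ValueError("pause exceeds the sampling window")
    return sample_interval - pause_time

# The slide's example: a 5-second window with a 3-second pause
# leaves only 2 seconds of working time.
```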
-verbose:gc and -Xloggc
• JVM flags that enable GC logging.
• Verbose GC logging is a low-overhead collector (less intrusive measurement), but requires a restart of the instance.
• -XX:+PrintGCDetails is a recommended setting, used together with:
• -XX:+PrintGCApplicationConcurrentTime
• -XX:+PrintGCApplicationStoppedTime
• Together these provide aggregate statistics about pause time versus working time.
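The pause-versus-working aggregation can be done with a short post-processing script. The line patterns below match the shapes these HotSpot flags typically print ("Application time: N seconds" and "Total time for which application threads were stopped: N seconds"); exact wording varies by JVM version, so treat the regexes as assumptions to adjust against your own log.

```python
import re

STOPPED = re.compile(r"application threads were stopped: ([\d.]+) seconds")
RUNNING = re.compile(r"Application time: ([\d.]+) seconds")

def pause_vs_working(log_lines):
    """Sum JVM pause time and working time from GC log lines produced
    by -XX:+PrintGCApplicationStoppedTime and
    -XX:+PrintGCApplicationConcurrentTime."""
    stopped = working = 0.0
    for line in log_lines:
        m = STOPPED.search(line)
        if m:
            stopped += float(m.group(1))
            continue
        m = RUNNING.search(line)
        if m:
            working += float(m.group(1))
    return stopped, working
```

The ratio stopped / (stopped + working) is the fraction of wall-clock time the application was stalled, which is exactly the aggregate the slide describes.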
IBM Pattern Modeling Tool for Java GC
• Post-processing tool used to visualize a -verbose:gc or -Xloggc log file.
• Can make log-file analysis substantially easier.
• Represents pauses/stalls at particular points in time.
• Has no effect on the application environment, since it only reads a dormant log file.
JHAT, JMAP and SAP Memory Analyzer
• jhat (Java Heap Analysis Tool) takes a heap dump and parses the data into useful, human-digestible information about what is in the JVM's memory, with text and OQL views into the data.
• jmap (Java Memory Map) is a JVM tool that provides information about what is in the heap at a given time.
• SAP Memory Analyzer visualizes the jhat output.
• These should be run while a problem is occurring right now: when the system is unresponsive, or when the JVM runs into continuous collections.
ASH
• ASH: Active Session History.
• Samples session activity in the system every second.
• Keeps one hour of history in memory for immediate access at your fingertips.
• ASH in memory: collects active-session data only; history = v$session_wait + v$session plus extras.
• Circular buffer of 1 MB to 128 MB (roughly 2% of the SGA), flushed to disk every hour or when the buffer is two-thirds full (it protects itself, so you can relax).
• Tools to consider: SessPack and SessSnaper.
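The sampling and self-protection behavior described above can be modeled as a toy ring buffer. Everything here (class name, session dicts, flush target) is an illustrative assumption, not Oracle's implementation; it only demonstrates the pattern of sampling active sessions into a bounded buffer that flushes at two-thirds capacity.

```python
from collections import deque

class ActiveSessionHistoryToy:
    """Toy model of an ASH-style collector: sample only active
    sessions, keep a bounded in-memory history, and flush when the
    buffer is two-thirds full."""

    def __init__(self, capacity=3600):
        self.buffer = deque(maxlen=capacity)
        self.flush_threshold = capacity * 2 // 3
        self.flushed = []  # stand-in for the on-disk history

    def sample(self, sessions):
        for s in sessions:
            if s.get("active"):        # inactive sessions are skipped
                self.buffer.append(s)
        if len(self.buffer) >= self.flush_threshold:
            self.flushed.extend(self.buffer)
            self.buffer.clear()        # buffer protects itself
```

The design choice worth noting is the same one the slide praises: because the buffer is bounded and self-flushing, sampling can run continuously without risking memory growth in the monitored system.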
SQL Server Performance Dashboard
• A feature of SQL Server 2005 SP2.
• Template reports that take advantage of DMVs.
• Provides views into wait events, though the report doesn't link events to SQL IDs; wait events are shown in aggregate.
• Session-level DMVs: sys.dm_os_wait_stats and sys.dm_exec_sessions.
• Complementary tools: SQL Server Health and History Tool and Quest Spotlight for SQL Server.
Importance of Cost Execution Plans
• Can be generated on databases with low overhead; the literal bind values are not needed.
• Both SQL Server and Oracle can produce "estimated cost plans."
• Each database uses an optimizer that determines the best execution path for a SQL statement, calculating I/O, CPU, and the number of executions (loop conditions).
• Understanding the cost of operations on a particular object can help change your tuning strategy (e.g., TABLE ACCESS BY INDEX ROWID).
• Cost is time: query cost refers to the estimated elapsed time, in seconds, required to complete a query on a specific hardware configuration.
RML and Profiler
• The RML utilities process SQL Server trace files and produce reports showing how SQL Server is performing:
• Which application, database, or login is using the most resources, and which queries are responsible.
• Whether there were any plan changes for a batch while the trace was captured, and how each of those plans performed.
• Which queries run slower in today's data compared to a previous data set.
• Profiler captures statements, query counts/statistics, and wait events, and can capture and correlate profile data with Perfmon data.
• Both tools carry heavy overhead (more intrusive measurement).
• Other tools to consider: Quest Performance Analysis for SQL Server.
Oracle OEM and 10046
• Oracle finally delivered with OEM's web-based interface.
• The performance dashboard provides an excellent historical and current overview; access to ADDM and ASH simplifies the DBA's job; SQL history is available.
• Problems: licensing is somewhat cost-prohibitive, and the dashboard still doesn't expose individual wait events.
• For event 10046 tracing, you still need to run the trace yourself and use a profiler reader such as Hotsos P4; sessions can be difficult to trace and capture.
Want More?
• Check out my blog for postings of the presentation: http://sevenseconds.wordpress.com
• To view my resources and references for this presentation, visit www.scholar.com: click "Advanced Search" and search by sfeldman@blackboard.com with the tag 'bbworld08' or 'forensics'.