
Performance Forensics


Presentation Transcript


  1. Performance Forensics Uncovering the Mysteries of Performance and Scalability Incidents through Forensic Engineering Stephen Feldman Senior Director Performance Engineering and Architecture stephen.feldman@blackboard.com

  2. Welcome to BbWorld ’08 • Finishing my 5th year at Blackboard. • Brought in to build a Performance Engineering Practice. • Team of 15 including myself • Half of the team are Performance Test Engineers • Half of the team are Software Developers • Responsible for the performance and scalability of the BbLearn architecture.

  3. Session Housekeeping • Three hours of fun and excitement. • Feel free to fire up your laptops. • We will take one break at the halfway point • Take a break whenever you need to • Questions are welcome at any time.

  4. Our Session Schedule • Part One: Introduction to Performance Forensics • 1:00 to 2:25pm • Break • 2:25 to 2:35pm • Part Two: Advanced Performance Forensics • 2:35pm to 4:00pm

  5. Session Goals The goals of today’s session are… • Introduce you to the science of performance forensics. • Present a methodology for performing forensics. • Discuss techniques for arriving at root cause analysis. • Familiarize the audience with tools that can be used to assist the forensics process.

  6. Session Learning Objectives At the end of the session you should be able to… • Write your own problem statements. • Perform the process of evidence collection and interviewing. • Apply techniques for using data and analysis to avoid diagnosis bias and value attribution. • Perform root cause analysis as part of the performance forensics process. • Begin using different tools for capturing key performance data.

  7. Part One: Introduction to Performance Forensics What is forensic engineering?

  8. A Practical Definition • The term forensics means “The science and practice of collection, analysis, and presentation of information relating to a crime in a manner suitable for use in a court of law.” • This definition is in the context of a crime. • Forensic engineering is the application of accepted engineering practices and principles for purposes of discussion, debate, argumentation, or legal proceedings.

  9. Introduction to Performance Forensics

  10. Definition of Performance Forensics • The practice of collecting evidence, performing interviews and modeling for the purpose of root cause analysis of a performance or scalability problem. • Performance problems can be classified in two main categories: • Response Time Latency • Queuing Latency

  11. Cognition of Response Times

  12. Queuing Model: Visual of a Bottleneck
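The bottleneck pictured on this slide can be made concrete with the textbook M/M/1 approximation, in which response time equals service time divided by (1 − utilization). The Python sketch below is not from the presentation; the 200 ms service time is an arbitrary assumption used only to show how queuing latency explodes as a resource approaches saturation.

```python
# Minimal M/M/1 sketch: response time = service time / (1 - utilization).
# The 200 ms service time is a made-up value for illustration.

def mm1_response_time(service_time_s: float, utilization: float) -> float:
    """Approximate response time (service + queuing) at a single server."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)

if __name__ == "__main__":
    service_time_s = 0.2
    for u in (0.10, 0.50, 0.80, 0.90, 0.95, 0.99):
        r = mm1_response_time(service_time_s, u)
        print(f"utilization {u:4.0%} -> response time {r:5.2f} s")
```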

  13. Performance Forensics Methodology

  14. Performance Forensics Methodology

  15. Identify the Problem

  16. Identifying the Problem • Problems are not always easily identifiable. • When a problem is readily apparent, a simple problem statement should be declared so that the investigation can commence. • Call out symptoms; do not diagnose. • When the problem is not clear, the appropriate course of action is to narrow down the possibilities of what it could be. • Be willing to leave the problem statement open-ended until a better-formulated statement can be attained.

  17. Problem Statements • Example Weak Problem Statement: • Sally Simpleton is experiencing response time latency in the Grade Center. • Why is the statement weak? • Who is Sally Simpleton? • What defines response time latency? • What is she doing in the Grade Center? • When does it happen? • Can it be reproduced?

  18. Strengthen the Problem Statement • Sand College is reporting response time latency of 90 to 120 seconds when course administrators edit Grade Center cells. • The problem is reproducible when using Sally Simpleton’s login credentials and accessing her course section (Introduction to Software Performance Engineering). • The problem has been reproduced at all times of day, across different course sections, and on different systems.

  19. Evidence

  20. Evidence • Multiple types of evidence are gathered and used to solve performance problems. • Log artifacts • Monitoring/Measurement tools • Instrumentation/Sensors • Interactive evidence gathering through interviews. • Evidentiary support through discrete simulation • Improving future evidentiary capabilities by improving the Performance Maturity Model

  21. Log Artifacts • Understand what logs are in place and where they can be found. • Know what they are used for and whether they provide the right information. • Keep them slim and usable. • Learn how to associate and correlate • Associate multiple log artifacts • Correlate events to the problem statement
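As a concrete illustration of associating and correlating multiple log artifacts, the sketch below (not from the presentation) merges two hypothetical log files onto a single timeline, restricted to the problem window from the problem statement. The file names and the leading "YYYY-MM-DD HH:MM:SS" timestamp format are assumptions; real artifacts will need format-specific parsing.

```python
# Minimal sketch: merge entries from several log artifacts onto one timeline,
# limited to the problem window. File names and timestamp format are assumed.
from datetime import datetime

def parse_log(path):
    """Yield (timestamp, source, message) for lines starting with a timestamp."""
    with open(path) as f:
        for line in f:
            try:
                ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
            except ValueError:
                continue  # skip lines without a leading timestamp
            yield ts, path, line[19:].strip()

def correlate(paths, window_start, window_end):
    """Interleave entries from all artifacts that fall inside the window."""
    merged = []
    for path in paths:
        merged.extend(e for e in parse_log(path) if window_start <= e[0] <= window_end)
    return sorted(merged)

if __name__ == "__main__":
    start = datetime(2008, 7, 15, 13, 0)
    end = datetime(2008, 7, 15, 13, 5)
    for ts, source, message in correlate(["app-server.log", "database.log"], start, end):
        print(ts, source, message)
```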

  22. Example Log Visualization

  23. Example Log Visualization

  24. Putting Collectors/Sensors in Place • When should this happen? • When a problem statement cannot be developed from the data you do have (evidence or interviews) and more data needs to be collected. • How should you go about this? • Want to minimize disruption to the production environment. • Adaptive collection: Less Intensive to More Intensive over time. Basic Sampling → Continuous Collection → Profiling
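One way to read the "Basic Sampling → Continuous Collection → Profiling" progression is as an escalation policy: collect cheaply until the cheap signal looks suspicious, then tighten the interval. The sketch below is illustrative only; the probe, threshold, and intervals are all hypothetical.

```python
# Minimal sketch of adaptive collection: start with coarse sampling and
# escalate to more intensive collection when a cheap probe crosses a
# threshold. The probe, threshold, and intervals are hypothetical.
import random
import time

LEVELS = ("basic sampling", "continuous collection", "profiling")

def probe_latency_s() -> float:
    """Stand-in for a cheap health probe, e.g. a timed synthetic request."""
    return random.uniform(0.05, 2.0)

def collect(duration_s: float = 30.0, threshold_s: float = 1.0) -> None:
    level = 0
    interval_s = 5.0  # coarse to start; tighten as we escalate
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        latency = probe_latency_s()
        if latency > threshold_s and level < len(LEVELS) - 1:
            level += 1
            interval_s /= 5
            print(f"probe {latency:.2f}s > {threshold_s:.1f}s -> escalate to {LEVELS[level]}")
        time.sleep(interval_s)

if __name__ == "__main__":
    collect()
```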

  25. Monitoring and Measurement • Third-party components, whether commercial or open source, deployed to measure responsiveness and resource utilization • Excellent tools for trending and correlation • Specialization of tools to solve different types of problems. • Used in forensics to correlate resource utilization to event occurrences.
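The slide refers to commercial or open-source monitoring tools; purely as an illustration of what the simplest trending data looks like, the sketch below samples the Unix 1-minute load average on an interval and appends timestamped values to a CSV that can later be correlated against event occurrences. The file name and interval are arbitrary, and os.getloadavg() is Unix-only.

```python
# Minimal stand-in for a monitoring collector: periodically sample a resource
# metric and append timestamped values to a CSV for later trending/correlation.
# os.getloadavg() is Unix-only; file name and interval are arbitrary choices.
import csv
import os
import time
from datetime import datetime

def sample_load(path: str, interval_s: float = 5.0, samples: int = 12) -> None:
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for _ in range(samples):
            one_minute_load, _, _ = os.getloadavg()
            writer.writerow([datetime.now().isoformat(timespec="seconds"), one_minute_load])
            f.flush()
            time.sleep(interval_s)

if __name__ == "__main__":
    sample_load("load-samples.csv")
```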

  26. Ex 1: Thin-Slicing Monitoring Visualizations

  27. Ex 2: Thin-Slicing Monitoring Visualizations

  28. Ex 3: Thin-Slicing Monitoring Visualizations

  29. Ex 4: Thin-Slicing Monitoring Visualizations

  30. Interviewing • Techniques • Lassie Question • Time Association • User experienced • Locality • Component/Feature Specific • Gathering non-discrete clues • Making use of Method-R • Avoiding diagnosis bias • Eliminating value attribution • Can a pattern be identified?

  31. Diagnosis Bias • It is human nature to label people, ideas or things based on our initial opinions of them. • Not necessarily scientific, but rather a combination of gut feelings, irrational judgment or failure to process enough conclusive data. • We often diagnose before we can get to root cause analysis based on a hunch or perception.

  32. Value Attribution • Humans have a tendency to imbue someone or something with certain qualities based on its perceived value rather than objective data. • Example 1: The problem can’t be my SAN, I spent $250,000 on it. • Example 2: It can’t be the network, my engineers are the best in the field. They won’t allow a network problem to happen.

  33. Discrete Simulation as Evidentiary Support • Performance testing is another technique for gathering evidence. • Provides the opportunity to increase logging and watch for events or occurrences not seen originally. • Also provides the opportunity to reproduce conditions that cause the performance issue.
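A minimal sketch of what reproducing the conditions can look like in practice: drive a fixed level of concurrency against an endpoint while recording client-side response times, so that the extra server-side logging mentioned above can be examined afterwards. The URL, concurrency, and request count are hypothetical placeholders, not values from the presentation.

```python
# Minimal load-generation sketch: N concurrent workers issue requests against
# a placeholder endpoint and record response times for later analysis.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/grade-center/edit"  # hypothetical endpoint

def timed_request(_):
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(URL, timeout=30) as resp:
            resp.read()
        return time.perf_counter() - start
    except OSError:
        return None  # treat as a failed sample

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=10) as pool:  # ~10 concurrent users
        samples = list(pool.map(timed_request, range(100)))
    ok = sorted(s for s in samples if s is not None)
    if ok:
        print(f"{len(ok)}/{len(samples)} ok, median {ok[len(ok) // 2]:.2f}s, max {ok[-1]:.2f}s")
    else:
        print("no successful samples")
```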

  34. Modeling and Visualizing

  35. Modeling and Visualizing

  36. An Abstract Example • The role of temperature in O-ring failures was difficult to determine by focusing on individual cases. Attention was focused on two key cases with O-ring failures: • SRM 15 (cold launch) • SRM 22 (warm launch)

  37. Missed Opportunities for Visualizing Data

  38. Missed Opportunities for Visualizing Data

  39. Reshaping the Same Data

  40. Hypothesis versus Diagnosis • Hypothesis: A prediction or educated guess about a problem prior to proving it scientifically or mathematically. • Diagnosis: A scientific, empirical, or measured conclusion about a problem. • Not necessarily the correct answer, but enough data has been gathered to propose a diagnosis. • A problem statement needs to be in place for both to exist. • Both require supporting data to be developed.

  41. Quick Comments About Method-R • Method-R is a preferred methodology for problem statement development and problem diagnosis. • While it was created for Oracle performance analysis, it can be applied to all aspects of software performance forensics. • It focuses on identifying the user actions most important to the needs of the business in order to improve their performance.

  42. Correlation

  43. What is Correlation? • Correlation is a measure of the statistical relationship between two comparable data points. • Time associations are typically made. • Correlate to resource demand • Correlate to event or occurrence • Correlation is primarily part of forming the hypothesis and the diagnosis.
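To make the statistical relationship concrete, the sketch below computes a Pearson correlation coefficient between two time-aligned series, e.g. per-interval response time and CPU utilization. The sample values are invented solely for illustration.

```python
# Minimal sketch: Pearson correlation between two time-aligned series.
# The sample values below are invented for illustration only.
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

if __name__ == "__main__":
    response_time_s = [0.4, 0.5, 0.9, 1.8, 2.5, 2.9]      # hypothetical samples
    cpu_utilization = [0.30, 0.35, 0.55, 0.80, 0.92, 0.97]
    print(f"correlation = {pearson(response_time_s, cpu_utilization):.3f}")
```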

  44. Examples of Correlation

  45. Examples of Correlation

  46. Examples of Correlation

  47. Examples of Correlation

  48. Getting to Root Cause Analysis

  49. Performance Forensics Methodology

  50. Getting to Root Cause Analysis • Devising a strong problem statement • Foundation steps of Method-R • Knowing where to collect evidence • Formulating a data-driven hypothesis • Appropriate use of correlation, modeling and visualizing • Proving the hypothesis out (test-driven approach) • Establishing a diagnosis • Avoid diagnosis bias and value attribution • Treating the symptoms • A diagnosis is not always black and white
