Pinpoint: Problem Determination in Large, Dynamic Internet Services

Pinpoint localizes failures in large, dynamic Internet services by tracing real client requests and statistically correlating the components they use with request failures, without requiring changes to application code. The approach reduces human effort and improves the accuracy and precision of fault diagnosis.


Presentation Transcript


  1. Pinpoint: Problem Determination in Large, Dynamic Internet Services • Mike Chen (mikechen@cs.berkeley.edu), Emre Kıcıman, Eugene Fratkin ({emrek, fratkin}@cs.stanford.edu) • ROC Retreat, 2002/01

  2. Motivation • Systems are large and getting larger • 1000s of replicated HW/SW components • Used in different combinations • Systems are dynamic • Resources are allocated at runtime • e.g. load balancing, personalization • Failures are difficult to diagnose • How do we tell what is different about failed requests?

  3. Current Techniques • Dependency Models • Detect failures, then check all components that the failed requests depend on • Problem: • Need to check all dependencies • Models are hard to generate and keep up-to-date • Monitoring & Alarm Correlation • Detects non-functioning components, but often generates alarm storms • Alarms are filtered for root-cause analysis • Problem: • Need to instrument every component • Hard to detect interaction faults

  4. The Pinpoint Approach • Trace many real client requests • Record every component used in a request • Detect success/failure of requests • Traces can be used as dynamic dependency graphs • Statistical Correlation • Search for components that “cause” failures • Built into middleware • Requires no application code changes • Application knowledge needed only for end-to-end failure detection
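
The tracing step on this slide can be illustrated with a minimal sketch. It assumes an in-memory log; the class `TraceLog` and its methods are hypothetical names invented for this example, not part of Pinpoint. It records which components each request used and whether the request failed, then joins the two into one row per request for later analysis.

```python
from collections import defaultdict


class TraceLog:
    """Hypothetical in-memory stand-in for a component trace log and a failure log."""

    def __init__(self):
        self.components = defaultdict(set)  # request id -> components used
        self.failed = {}                    # request id -> True if the request failed

    def record_use(self, request_id, component):
        # Called by the middleware each time a request touches a component.
        self.components[request_id].add(component)

    def record_outcome(self, request_id, failed):
        # Called by the end-to-end failure detector when the request finishes.
        self.failed[request_id] = failed

    def rows(self):
        """Join both logs: one (request id, components, failed) row per finished request."""
        return [
            (rid, frozenset(self.components[rid]), self.failed[rid])
            for rid in sorted(self.failed)
        ]


log = TraceLog()
log.record_use(1, "A")
log.record_use(1, "C")
log.record_outcome(1, failed=False)
log.record_use(2, "B")
log.record_outcome(2, failed=True)
```

Each row doubles as one observed path through the system, which is the sense in which the traces can serve as a dynamic dependency graph.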

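  5. [Figure: Pinpoint framework. Requests #1 to #3 pass through components A, B, and C over a communications layer (tracing and internal failure detection). The component trace log (1,A; 1,C; 2,B; ...) and the external failure-detection log (1, success; 2, fail; 3, ...) feed statistical analysis, which outputs the detected faults.]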

  6. Prototype Implementation • Built on top of the J2EE platform • Sun J2EE 1.2 single-node reference implementation • Added logging of Beans, JSP, & JSP tags • Detects exceptions thrown out of components • Required no application code changes • Layer 7 network sniffer • Detects TCP timeouts and malformed HTML, and runs app-level string searches • PolyAnalyst statistical analysis • Bucket analysis & dependency discovery
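
The idea of logging component use and detecting escaped exceptions inside the middleware, rather than in application code, might look roughly like the sketch below. The prototype did this in the J2EE server; this Python version is only an analogy, and `invoke`, `used`, and `detected` are hypothetical names.

```python
used = []      # (request id, component) for every invocation
detected = []  # (request id, component) where an exception escaped the component


def invoke(request_id, component, fn, *args, **kwargs):
    """Hypothetical middleware hook: call a component on behalf of a request."""
    used.append((request_id, component))
    try:
        return fn(*args, **kwargs)
    except Exception:
        detected.append((request_id, component))
        raise  # re-raise unchanged so the application's behavior is preserved


# Two toy components; neither contains any tracing code.
def cart_total(prices):
    return sum(prices)


def broken_inventory(item):
    raise KeyError(item)


invoke(1, "Cart", cart_total, [3, 4])
try:
    invoke(2, "Inventory", broken_inventory, "fish")
except KeyError:
    pass  # the caller still sees the original exception
```

Because the hook wraps the call site, the components themselves stay unmodified, which mirrors the "no application code changes" point on the slide.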

  7. Experimental Setup • Demo app: J2EE Pet Store • e-commerce site w/~30 components • Load generator • Replays a trace of browsing • Approximates the TPC-W WIPSo load (~50% ordering) • Fault injection parameters • Trigger faults based on combinations of used components • Inject exceptions, infinite loops, null calls • 55 tests with single-component faults and interaction faults • 5-min runs of a single client (J2EE server limitation)
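
Triggering a fault on a combination of used components can be sketched as follows. The slide does not say how triggers were evaluated, so this assumes the simplest rule: the fault fires as soon as the request has used every component in the trigger set. `should_inject`, `run_request`, and `InjectedFault` are hypothetical names, and only the exception fault type is modeled.

```python
class InjectedFault(Exception):
    """Stands in for an injected exception."""


def should_inject(trigger, used_so_far):
    # True once every component in the trigger set has been used by the request.
    return trigger <= used_so_far


def run_request(path, trigger):
    """Walk a request through its components, injecting a fault when triggered."""
    used = set()
    for component in path:
        used.add(component)
        if should_inject(trigger, used):
            raise InjectedFault(component)
    return "success"
```

With a trigger of `{"A", "B"}`, a request that uses only A and C completes, while a request that uses A and then B fails at B. A single-element trigger set models a single-component fault under the same rule.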

  8. Application Observations • Number of components used in a dynamic request: median 14, min 6, max 23 • Large number of tightly coupled components that are always used together

  9. Metrics • Sets shown in the slide's diagram: Correctly Identified Faults (C), Predicted Faults (P), Actual Faults (A) • Precision: C/P • Recall: C/A • Accuracy: whether all actual faults are correctly identified (recall == 100%) • boolean measure
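
The three metrics can be computed directly from the predicted and actual fault sets. This is a minimal sketch; the function name `evaluate` and the conventions chosen for empty sets are assumptions, since the slide does not define them.

```python
def evaluate(predicted, actual):
    """Return (precision, recall, accurate) for sets of faulty components."""
    correct = predicted & actual  # C: correctly identified faults
    # Assumed conventions: no predictions gives precision 0.0,
    # no actual faults gives recall 1.0.
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(actual) if actual else 1.0
    accurate = recall == 1.0  # boolean: every actual fault was found
    return precision, recall, accurate


precision, recall, accurate = evaluate({"A", "B", "C"}, {"A", "B"})
```

Here the analysis blames three components when two are actually faulty, so precision is 2/3 while recall is 1.0 and the result counts as accurate.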

  10. 4 Analysis Techniques • Pinpoint: clusters of components that statistically correlate with failures • Detection: components where Java exceptions were detected • union across all failed requests • similar to what an event monitoring system outputs • Intersection: intersection of components used in failed requests • Union: union of all components used in failed requests
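
The three baseline techniques reduce to set operations over the failed requests, sketched below. For the Pinpoint technique, the sketch swaps in a simple per-component Jaccard score between "component was used" and "request failed" as a stand-in; it is not the clustering analysis named on the slides. The row layout and all function names are assumptions.

```python
def failed_rows(rows):
    return [r for r in rows if r["failed"]]


def union(rows):
    # Every component used by any failed request.
    return set().union(*(r["used"] for r in failed_rows(rows)))


def intersection(rows):
    # Components used by all failed requests.
    sets = [r["used"] for r in failed_rows(rows)]
    return set.intersection(*sets) if sets else set()


def detection(rows):
    # Components where an exception was detected, across all failed requests.
    return set().union(*(r["raised"] for r in failed_rows(rows)))


def jaccard_scores(rows):
    """Stand-in correlation: overlap of 'requests using c' with 'failed requests'."""
    failed_ids = {i for i, r in enumerate(rows) if r["failed"]}
    scores = {}
    for c in set().union(*(r["used"] for r in rows)):
        used_ids = {i for i, r in enumerate(rows) if c in r["used"]}
        scores[c] = len(used_ids & failed_ids) / len(used_ids | failed_ids)
    return scores


rows = [
    {"used": {"A", "B"}, "failed": True, "raised": {"B"}},
    {"used": {"A", "C"}, "failed": False, "raised": set()},
    {"used": {"B", "C"}, "failed": True, "raised": {"B"}},
    {"used": {"A"}, "failed": False, "raised": set()},
]
scores = jaccard_scores(rows)
```

In this toy data, Union blames all three components, while Intersection and Detection both return only B. B also has the highest score because it is used in exactly the failed requests.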

  11. Results • Pinpoint has high accuracy with relatively high precision

  12. Pinpoint Prototype Limitations • Assumptions • client requests provide good coverage over components and combinations • requests are autonomous (don’t corrupt state and cause later requests to fail) • Currently can’t detect the following: • faults that only degrade performance • faults due to pathological inputs

  13. Conclusions • Dynamic tracing and statistical analysis give improvements in accuracy & precision • Handles dynamic configurations well • Requires no application code changes • Reduces human work in large systems • But needs good coverage of component combinations and autonomous requests • Future Work • Test with real distributed systems running real applications • OceanStore? WebLogic/WebSphere? • Capture additional differentiating factors • Visualization of dynamic dependencies • Performance analysis • Online data analysis
