180 likes | 331 Views
Pinpoint: Problem Determination in Large, Dynamic Internet Services. M. Chen, E. Kıcıman, E. Fratkin, A. Fox and E. Brewer. Presenter Spiros Xanthos. Motivation. “Traditional” root cause analysis techniques are based on static dependency models
E N D
Pinpoint: Problem Determination in Large, Dynamic Internet Services M. Chen, E. Kıcıman, E. Fratkin, A. Fox and E. Brewer Presenter Spiros Xanthos
Motivation • “Traditional” root cause analysis techniques are based on static dependency models • Difficult to generate and maintain an accurate model of a constantly evolving Internet service. • Application-level knowledge of the systems being monitored is required in order to identify faults • Dynamic Problem Determination Technique
Outline • Data Clustering Approach • Request Tracing and Failure Detection • Statistical Analysis • Pinpoint Implementation • Evaluation • Limitations • Conclusions
Data Clustering Approach • Dynamically trace real client requests through a system. • For each request, record its believed success or failure in each of the components • Perform data clustering to correlate the failures to the components that most likely to have caused them.
Client Request Tracing • As a client request travels through the system, we are interested in recording all the components it uses • Each request is assigned an ID • Subsequently, the system can record the <ID, Component> for every request that arrive at a specific Component.
Client Request Tracing Request #1 Request #2 Request #2 Request #1 Request #1 A B C Request #2 Internal Tracing Framework Trace Log <1,A> <1,B> <2,B> <2,C> Statistical Analysis … …
Failure Detection • A failure detection system is attempting to detect whether the requests are successfully completing. • Internal Failure Detection • Failed assertions • Exceptions • External Failure Detection • Complete service failures, such as system crashes.
Client Request Tracing Request #1 Request #2 Request #2 Request #1 Request #1 A B C Request #2 F/D System Internal Tracing Framework Fault Log Trace Log 1, Success <1,A> 2, Fail <1,B> 3, … <2,B> <2,C> Statistical Analysis … …
Data Analysis - Clustering • Performing data clustering to correlate the failures of requests to the components most likely to have caused them. • Attempt to find the components or combinations of components that are most highly correlated with the failures of requests
Pinpoint Implementation • Modified J2EE Reference Implementation with: • client request tracing and • simple fault detection. • Every client HTTP request gets a unique ID • Internal fault detection mechanism logs exceptions • External sniffer detects TCP errors such as resets and timeouts and HTTP errors (404 etc.)
Evaluation • Used the J2EE PetStore demo application • Injected 4 different types of faults: • Declared exceptions, such as Java RemoteExceptions or IOExceptions. • Undeclared exceptions, such as runtime exceptions. • Infinite loops, where the called component never returns control to the callee. • Null calls, where the called component is never actually executed.
Metrics • Accuracy: • 2 components, A and B, are interacting to cause a failure • Identifying only one or neither, would not be accurate, • How often the results are accurate. • Precision: • A and B, are interacting to cause a failure • Predicting the set {A, B, C, D, E} gives a precision of 40%.
Compare to • Detection, returns the set of components recorded by the internal fault detection framework • Dependency Checking, returns the components that the failed requests use • A component is nominated as a potential fault if it occurs in more than some percentage of the failed requests. • Sensitivity to 0%, this technique returns only components that occurred in 100% of the failed requests
Results • Pinpoint achieves an accuracy of 80-90% with a relatively high precision of 50-60%. Accuracy vs False Positive Rate, Overall
Results Single components faults
Limitations • Cannot distinguish between sets of components that are tightly coupled and used together • It does not work with faults that corrupt state and affect subsequent requests. • Deterministic failures due to pathological inputs cannot be distinguished from other failures
More Limitations… • Cannot capture “fail-stutter” faults where components mask faults internally and exhibit only a decrease in performance. • Difficult to create Internal Fault Detectors, Java forces to write Exception handling code • More…
Conclusions • Pinpoint is an effective problem determination technique. • Better than previous methods because it does not rely on static dependency models • Pinpoint, requires no application-level knowledge of the systems being monitored or any knowledge of the requests. • Still has some limitations: • tightly couple components, state corrupting faults