The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch
Internet services are complex
[Figure: many concurrent tasks plotted over time]
Scale and heterogeneity make Internet services complex
Analysis Pipeline
Step 1: Identify segments
Step 2: Infer causal model
Step 3: Analyze individual requests
Step 4: Aggregate results
Challenges
Need a method that works at scale with heterogeneous components
• Previous methods derive a causal model by:
• Instrumenting the scheduler and communication
• Building the model through human knowledge
Opportunities
• Component-level logging is ubiquitous, giving tremendous detail about a request's execution
• A large number of requests can be handled, giving coverage of a large range of behaviors
The Mystery Machine
1) Infer causal model from a large corpus of traces
• Identify segments
• Hypothesize all possible causal relationships
• Reject hypotheses with observed counterexamples
2) Analysis
• Critical path, slack, anomaly detection, what-if
Step 1: Identify segments
Define a minimal schema
[Figure: a task divided into segments, each delimited by logged events]
Schema per log entry: request identifier, machine identifier, timestamp, task, event
Aggregate existing logs using this minimal schema
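The minimal schema above can be sketched in a few lines of Python. The field and function names here are illustrative, not the authors' actual implementation; the idea is only that any component log can be mapped onto five fields, and that consecutive events from the same task bound a segment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEvent:
    request_id: str   # groups events belonging to one end-to-end request
    machine_id: str   # where the event was logged
    timestamp: float  # local clock; only compared within one machine
    task: str         # component, e.g. "server", "network", "client"
    event: str        # named point in execution

def segments(events):
    """Pair consecutive events of the same (request, machine, task)
    into (task, start_event, end_event, duration) segments."""
    by_key = {}
    for e in sorted(events, key=lambda e: e.timestamp):
        by_key.setdefault((e.request_id, e.machine_id, e.task), []).append(e)
    segs = []
    for evs in by_key.values():
        for start, end in zip(evs, evs[1:]):
            segs.append((start.task, start.event, end.event,
                         end.timestamp - start.timestamp))
    return segs
```

Because timestamps are only ever subtracted within a single machine, clock skew across machines does not corrupt segment durations.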
Step 2: Infer causal model
Types of causal relationships
[Figure: timing diagrams of the relationship types: happens-before (A always completes before B starts), mutual exclusion (A and B never overlap), and pipeline (the ordering of one stage is preserved in the next)]
Producing causal model
[Figure: candidate causal model over segments S1, N1, C1 and S2, N2, C2]
Trace 1: hypothesized relationships contradicted by the observed ordering are rejected
Trace 2: additional traces provide further counterexamples, refining the model
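The hypothesize-and-reject step for happens-before relationships can be sketched as follows. This is a simplified illustration, not the paper's implementation: each trace is reduced to a map from segment name to a (start, end) interval, and a hypothesis "A happens-before B" survives only if no trace ever shows B starting before A ends.

```python
from itertools import permutations

def infer_happens_before(traces):
    """Hypothesize "A happens-before B" for every ordered pair of
    segments, then reject any hypothesis with an observed
    counterexample: a trace in which B starts before A ends."""
    names = set()
    for t in traces:
        names.update(t)
    hypotheses = set(permutations(sorted(names), 2))  # all ordered pairs
    for t in traces:
        for a, b in list(hypotheses):
            if a in t and b in t and t[b][0] < t[a][1]:
                hypotheses.discard((a, b))  # counterexample observed
    return hypotheses
```

With a large corpus of traces, the natural variation in timing supplies the counterexamples; edges that survive every trace form the inferred causal model.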
Step 3: Analyze individual requests
Critical path using causal model
[Figure: Trace 1 timeline over segments S1, N1, C1 and S2, N2, C2, with the critical path identified by the causal model highlighted]
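Given the inferred causal model, the critical path of one request is the longest-duration chain through the dependency DAG. A minimal sketch, assuming per-segment durations and happens-before edges as inputs (not the authors' code):

```python
import functools

def critical_path(durations, edges):
    """Return (length, path) of the longest-duration path through the
    causal DAG: the critical path that bounds end-to-end latency."""
    preds = {n: [] for n in durations}
    for a, b in edges:
        preds[b].append(a)

    @functools.lru_cache(maxsize=None)
    def best(n):
        # longest chain ending at n, including n itself
        if not preds[n]:
            return durations[n], (n,)
        length, path = max(best(p) for p in preds[n])
        return length + durations[n], path + (n,)

    return max(best(n) for n in durations)
```

Shortening any segment on this path shortens the request; shortening an off-path segment does not, which is why naive per-component averages mislead.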
Step 4: Aggregate results
Inaccuracies of naïve aggregation
Need a causal model to correctly understand latency
High variance in critical path
[Figure: CDFs of the percent of the critical path spent in client, server, and network]
• Breakdown of the critical path shifts drastically across requests
• Server, network, or client can each dominate latency
• For 20% of requests the server contributes only 10% of latency; for 20% of requests it contributes 50% or more
Diverse clients and networks
[Figure: per-request timelines in which the client, the network, or the server dominates latency]
Differentiated service
• Slack in server generation time: produce data slower, end-to-end latency stays the same
• No slack in server generation time: produce data faster, decrease end-to-end latency
Deliver data when needed and reduce average response time
Additional analysis techniques
[Figure: trace timeline over segments S1, N1, C1 and S2, N2, C2]
• Slack analysis
• What-if analysis
• Use natural variation in the large data set
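Slack analysis follows directly from the critical-path idea: a segment's slack is how much longer it could have run without lengthening the request. A sketch under the same assumed inputs as before (durations and happens-before edges), not the paper's implementation:

```python
def slack(durations, edges, segment):
    """Slack of `segment`: critical-path length minus the longest path
    passing through `segment`. Zero for critical-path segments."""
    preds = {n: [] for n in durations}
    succs = {n: [] for n in durations}
    for a, b in edges:
        preds[b].append(a)
        succs[a].append(b)

    def longest(n, nbrs, memo):
        # longest-duration chain reaching n via `nbrs`, including n
        if n not in memo:
            memo[n] = durations[n] + max(
                (longest(m, nbrs, memo) for m in nbrs[n]), default=0.0)
        return memo[n]

    into, out_of = {}, {}
    total = max(longest(n, preds, into) for n in durations)
    through = (longest(segment, preds, into)
               + longest(segment, succs, out_of) - durations[segment])
    return total - through
```

Positive slack in server generation time is exactly what the differentiated-service idea exploits: those requests can be produced more slowly with no end-to-end cost.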
What-if questions
• Does server generation time affect end-to-end latency?
• Can we predict which connections exhibit server slack?
Server slack analysis
[Figure: end-to-end latency vs. server generation time for low-slack and high-slack requests]
• Slack < 25 ms: end-to-end latency increases as server generation time increases
• Slack > 2.5 s: server generation time has little effect on end-to-end latency
Predicting server slack
[Figure: scatter of second slack (ms) vs. first slack (ms)]
• Predict slack at the receipt of a request
• Past slack is representative of future slack
• Classifies 83% of requests correctly (Type I error: 8%, Type II error: 9%)
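The predictor can be sketched as a simple threshold rule: classify a request as high-slack if the previous request on the same connection had slack above a threshold. The input sequence and the 2.5 s cutoff are illustrative assumptions mirroring the talk's buckets, not the actual classifier or data.

```python
def evaluate_slack_predictor(slacks, threshold_s=2.5):
    """Predict each request's slack from the previous request on the
    same connection. Returns (accuracy, type_i_rate, type_ii_rate)."""
    tp = tn = fp = fn = 0
    for prev, cur in zip(slacks, slacks[1:]):
        predicted, actual = prev > threshold_s, cur > threshold_s
        if predicted and actual:
            tp += 1
        elif not predicted and not actual:
            tn += 1
        elif predicted:
            fp += 1  # Type I error: predicted slack that was not there
        else:
            fn += 1  # Type II error: missed real slack
    n = tp + tn + fp + fn
    return (tp + tn) / n, fp / n, fn / n
```

Since Type I errors (serving slowly when there was no slack) directly hurt latency, a deployment would weigh them more heavily than Type II errors.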
Conclusion
Questions