The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch
Internet services are complex
[Figure: many concurrent tasks plotted over time]
Scale and heterogeneity make Internet services complex
Analysis Pipeline
Step 1: Identify segments
Step 2: Infer causal model
Step 3: Analyze individual requests
Step 4: Aggregate results
Challenges
Need a method that works at scale with heterogeneous components
• Previous methods derive a causal model by:
• Instrumenting the scheduler and communication
• Building the model through human knowledge
Opportunities
• Component-level logging is ubiquitous, giving tremendous detail about a request's execution
• A large number of requests can be handled, giving coverage of a large range of behaviors
The Mystery Machine
1) Infer causal model from a large corpus of traces
• Identify segments
• Hypothesize all possible causal relationships
• Reject hypotheses with observed counterexamples
2) Analysis
• Critical path, slack, anomaly detection, what-if
Step 1: Identify segments
Define a minimal schema
[Figure: a task divided into segments, each delimited by logged events]
Schema per log entry: request identifier, machine identifier, timestamp, task, event
Aggregate existing logs using this minimal schema
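The minimal schema above can be sketched in a few lines of Python. The field and function names here are illustrative, not the authors' actual implementation; the idea is only that any component log can be mapped onto five fields, and that consecutive events from the same task bound a segment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEvent:
    request_id: str   # groups events belonging to one end-to-end request
    machine_id: str   # where the event was logged
    timestamp: float  # local clock; only compared within one machine
    task: str         # component, e.g. "server", "network", "client"
    event: str        # named point in execution

def segments(events):
    """Pair consecutive events of the same (request, machine, task)
    into (task, start_event, end_event, duration) segments."""
    by_key = {}
    for e in sorted(events, key=lambda e: e.timestamp):
        by_key.setdefault((e.request_id, e.machine_id, e.task), []).append(e)
    segs = []
    for evs in by_key.values():
        for start, end in zip(evs, evs[1:]):
            segs.append((start.task, start.event, end.event,
                         end.timestamp - start.timestamp))
    return segs
```

Because timestamps are only ever subtracted within a single machine, clock skew across machines does not corrupt segment durations.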
Step 2: Infer causal model
Types of causal relationships
[Figure: timing diagrams of the relationship types: happens-before (A always completes before B starts), mutual exclusion (A and B never overlap), and pipeline (the ordering of one stage is preserved in the next)]
Producing causal model
[Figure: candidate causal model over segments S1, N1, C1 and S2, N2, C2]
Trace 1: hypothesized relationships contradicted by the observed ordering are rejected
Trace 2: additional traces provide further counterexamples, refining the model
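The hypothesize-and-reject step for happens-before relationships can be sketched as follows. This is a simplified illustration, not the paper's implementation: each trace is reduced to a map from segment name to a (start, end) interval, and a hypothesis "A happens-before B" survives only if no trace ever shows B starting before A ends.

```python
from itertools import permutations

def infer_happens_before(traces):
    """Hypothesize "A happens-before B" for every ordered pair of
    segments, then reject any hypothesis with an observed
    counterexample: a trace in which B starts before A ends."""
    names = set()
    for t in traces:
        names.update(t)
    hypotheses = set(permutations(sorted(names), 2))  # all ordered pairs
    for t in traces:
        for a, b in list(hypotheses):
            if a in t and b in t and t[b][0] < t[a][1]:
                hypotheses.discard((a, b))  # counterexample observed
    return hypotheses
```

With a large corpus of traces, the natural variation in timing supplies the counterexamples; edges that survive every trace form the inferred causal model.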
Step 3: Analyze individual requests
Critical path using causal model
[Figure: Trace 1 timeline over segments S1, N1, C1 and S2, N2, C2, with the critical path identified by the causal model highlighted]
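Given the inferred causal model, the critical path of one request is the longest-duration chain through the dependency DAG. A minimal sketch, assuming per-segment durations and happens-before edges as inputs (not the authors' code):

```python
import functools

def critical_path(durations, edges):
    """Return (length, path) of the longest-duration path through the
    causal DAG: the critical path that bounds end-to-end latency."""
    preds = {n: [] for n in durations}
    for a, b in edges:
        preds[b].append(a)

    @functools.lru_cache(maxsize=None)
    def best(n):
        # longest chain ending at n, including n itself
        if not preds[n]:
            return durations[n], (n,)
        length, path = max(best(p) for p in preds[n])
        return length + durations[n], path + (n,)

    return max(best(n) for n in durations)
```

Shortening any segment on this path shortens the request; shortening an off-path segment does not, which is why naive per-component averages mislead.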
Step 4: Aggregate results
Inaccuracies of naïve aggregation
Need a causal model to correctly understand latency
High variance in critical path
[Figure: CDFs of the percent of the critical path spent in client, server, and network]
• Breakdown of the critical path shifts drastically across requests
• Server, network, or client can each dominate latency
• For 20% of requests the server contributes only 10% of latency; for 20% of requests it contributes 50% or more
Diverse clients and networks
[Figure: per-request timelines in which the client, the network, or the server dominates latency]
Differentiated service
• Slack in server generation time: produce data slower, end-to-end latency stays the same
• No slack in server generation time: produce data faster, decrease end-to-end latency
Deliver data when needed and reduce average response time
Additional analysis techniques
[Figure: trace timeline over segments S1, N1, C1 and S2, N2, C2]
• Slack analysis
• What-if analysis
• Use natural variation in the large data set
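Slack analysis follows directly from the critical-path idea: a segment's slack is how much longer it could have run without lengthening the request. A sketch under the same assumed inputs as before (durations and happens-before edges), not the paper's implementation:

```python
def slack(durations, edges, segment):
    """Slack of `segment`: critical-path length minus the longest path
    passing through `segment`. Zero for critical-path segments."""
    preds = {n: [] for n in durations}
    succs = {n: [] for n in durations}
    for a, b in edges:
        preds[b].append(a)
        succs[a].append(b)

    def longest(n, nbrs, memo):
        # longest-duration chain reaching n via `nbrs`, including n
        if n not in memo:
            memo[n] = durations[n] + max(
                (longest(m, nbrs, memo) for m in nbrs[n]), default=0.0)
        return memo[n]

    into, out_of = {}, {}
    total = max(longest(n, preds, into) for n in durations)
    through = (longest(segment, preds, into)
               + longest(segment, succs, out_of) - durations[segment])
    return total - through
```

Positive slack in server generation time is exactly what the differentiated-service idea exploits: those requests can be produced more slowly with no end-to-end cost.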
What-if questions
• Does server generation time affect end-to-end latency?
• Can we predict which connections exhibit server slack?
Server slack analysis
[Figure: end-to-end latency vs. server generation time for low-slack and high-slack requests]
• Slack < 25 ms: end-to-end latency increases as server generation time increases
• Slack > 2.5 s: server generation time has little effect on end-to-end latency
Predicting server slack
[Figure: scatter of second slack (ms) vs. first slack (ms)]
• Predict slack at the receipt of a request
• Past slack is representative of future slack
• Classifies 83% of requests correctly (Type I error: 8%, Type II error: 9%)
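The predictor can be sketched as a simple threshold rule: classify a request as high-slack if the previous request on the same connection had slack above a threshold. The input sequence and the 2.5 s cutoff are illustrative assumptions mirroring the talk's buckets, not the actual classifier or data.

```python
def evaluate_slack_predictor(slacks, threshold_s=2.5):
    """Predict each request's slack from the previous request on the
    same connection. Returns (accuracy, type_i_rate, type_ii_rate)."""
    tp = tn = fp = fn = 0
    for prev, cur in zip(slacks, slacks[1:]):
        predicted, actual = prev > threshold_s, cur > threshold_s
        if predicted and actual:
            tp += 1
        elif not predicted and not actual:
            tn += 1
        elif predicted:
            fp += 1  # Type I error: predicted slack that was not there
        else:
            fn += 1  # Type II error: missed real slack
    n = tp + tn + fp + fn
    return (tp + tn) / n, fp / n, fn / n
```

Since Type I errors (serving slowly when there was no slack) directly hurt latency, a deployment would weigh them more heavily than Type II errors.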
Conclusion
Questions