240 likes | 388 Views
Performance Debugging for Distributed Systems of Black Boxes. Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha Muthitacharoen, MIT WISP 2004 11 November 2004. client. client. web server. web server. web server. authentication server.
E N D
Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha Muthitacharoen, MIT WISP 2004 11 November 2004
client client web server web server web server authentication server application server application server database server database server Example multi-tier system Project5 - SOSP
Motivation • Complex distributed systems are built from black box components • These systems may have performance problems • High or erratic latency • Locating the causes of these problems is hard • We can’t always examine or modify system components • We need tools to infer where bottlenecks are • Choose which black boxes to open Project5 - SOSP
Contributions of our work • Tools to highlight which black boxes have problems • Require only passive information, such as packet traces • Infer where most of time is spent from traces • Person can then use more invasive tools to examine those boxes • Reduce time and cost to debug complex systems • Improve quality of delivered systems Project5 - SOSP
client client web server web server web server authentication server application server application server 100ms database server database server Example causal path Project5 - SOSP
Goals of our tools • Find high-impact causal paths through a distributed system Causal path:series of nodes that sent/received messages • Each message is caused by receipt of previous message • Some causal paths occur many times High impact: • Occurs frequently • Contributes significantly to overall latency • Without modifications or semantic knowledge • Report per-node latencies on causal paths Project5 - SOSP
Overview of our approach • Obtain traces of messages between components • Ethernet packets, middleware messages, etc. • Collect traces as non-invasively as possible • Analyze traces using algorithms • Visualize results and highlight high-impact paths • Requires very little information: [timestamp, source, destination] Project5 - SOSP
Outline • Problem statement & goals • Overview of our approach • Algorithm • Experimental results • Related work • Conclusions Project5 - SOSP
The convolution algorithm: input Time From To 0.01 A B 0.02 A B 0.04 B D 0.05 C F ... Project5 - SOSP
A C B D E F E E F F G G G G G G The convolution algorithm: output .15 .10 0 0 .10 .10 0 0 0 0 0 0 Project5 - SOSP
Basic idea • Creates a “time signal” for messages from each node • Given time signals S1(t)=(AB) and S2(t)=(BX)Computes convolution of S2(t) and S1(–t) = S1 * S2 (can be computed quickly using fast fourier transforms) S1(t)=(AB msgs) time 1 2 3 4 5 6 7 Project5 - SOSP
S1(t)=(AB msgs) S2(t)=(BX msgs) S1 * S2=conv(S2(t), S1(-t)) • Spikes suggest causality between AB and BX msgs • Time shift of a spike indicates its characteristic delay Project5 - SOSP
A B C Details: first step • Choose starting node A • Use trace to add edges from it Time From To 0.01 A B 0.02 A B 0.04 A C 0.05 A B Project5 - SOSP
(AB)*(BD) (AB)*(BE) d Continuing Time From To … B D … B E … B F … B G A B C ?? Project5 - SOSP
How Time From To t1 A B t2 A B t3 A B t4 A B Time From To … t1+d B D … t2+d B D … t3+d B D t3+d B E … t4+d B D Project5 - SOSP
Heuristic to find spikes threshold 1: n1 stddev over mean threshold 2: n2 stddev over mean n1 = 2n2 = 1.5 Project5 - SOSP
Recursing to continue • Observations: 1. (BD) are not all msgs from B to D (only those caused by A) 2. Stop recursion when too few messages left or no more spikes found A B d D ?? Project5 - SOSP
Outline • Problem statement & goals • Overview of our approach • Algorithm • Experimental results • Conclusions Project5 - SOSP
Results: email service delays • Jeff logged all email headers for two months • Parsed 80K Received headers in 12K messages Received: from cceexg11.americas.cpqcorp.net ... by wera.hpl.hp.com ... ; Fri, 4 Apr 2003 15:35:54 -0800 • Yields (timestamp, sender, receiver) trace records • Used Convolution Algorithm to • Reconstruct message paths • Find typical delays • Note: this is NOT the most direct way to use email headers • We made the problem harder so as to test our algorithm Project5 - SOSP
60 39 37 67 4890,15 7680,10 7660 4600 7380,10 5230,10 40 38 40 40 38 38 4390 4780 5940 6350 6260 5120 41 41 41 41 41 41 Email trace: output Project5 - SOSP
Sun’s demo application for J2EE Stanford’s PinPoint project provided us with traces One trace has a node that is artificially slowed down Results: Petstore Project5 - SOSP
Future work • Automate trace gathering and conversion • Sliding-window versions of algorithms • Find phased behavior • Reduce memory usage of nesting algorithm • Improve speed of convolution algorithm • Validate usefulness on more complicated systems • What are limits of our approach? Project5 - SOSP
Conclusion • Looking for bottlenecks in black box systems • Use signal processing techniques to find causal pathsin the network and its delays • For more information • http://www.hpl.hp.com/research/project5/ • Contact us if you have multi-hop message traces! Project5 - SOSP