Performance Debugging for Distributed Systems of Black Boxes

Presentation Transcript


  1. Performance Debugging for Distributed Systems of Black Boxes. Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener (HP Labs); Patrick Reynolds (Duke); Athicha Muthitacharoen (MIT). WISP 2004, 11 November 2004

  2. Example multi-tier system [Figure: clients, web servers, an authentication server, application servers, and database servers] Project5 - SOSP

  3. Motivation • Complex distributed systems are built from black box components • These systems may have performance problems • High or erratic latency • Locating the causes of these problems is hard • We can’t always examine or modify system components • We need tools to infer where bottlenecks are • Choose which black boxes to open

  4. Contributions of our work • Tools to highlight which black boxes have problems • Require only passive information, such as packet traces • Infer from traces where most of the time is spent • A person can then use more invasive tools to examine those boxes • Reduce time and cost to debug complex systems • Improve quality of delivered systems

  5. Example causal path [Figure: the multi-tier system of clients, web servers, an authentication server, application servers, and database servers, with a delay label of 100ms]

  6. Goals of our tools • Find high-impact causal paths through a distributed system • Causal path: series of nodes that sent/received messages • Each message is caused by receipt of the previous message • Some causal paths occur many times • High impact: • Occurs frequently • Contributes significantly to overall latency • Without modifications or semantic knowledge • Report per-node latencies on causal paths

  7. Overview of our approach • Obtain traces of messages between components • Ethernet packets, middleware messages, etc. • Collect traces as non-invasively as possible • Analyze traces using algorithms • Visualize results and highlight high-impact paths • Requires very little information: [timestamp, source, destination]

  8. Outline • Problem statement & goals • Overview of our approach • Algorithm • Experimental results • Related work • Conclusions

  9. The convolution algorithm: input

     Time  From  To
     0.01  A     B
     0.02  A     B
     0.04  B     D
     0.05  C     F
     ...
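The input format above can be loaded with a few lines of Python. This is a minimal sketch, not the authors' tooling: the function names `parse_trace` and `group_by_edge` are hypothetical, and the whitespace-separated text layout is assumed from the slide.

```python
from collections import defaultdict


def parse_trace(lines):
    """Parse 'time from to' records; skip the header and malformed lines."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[0] == "Time":
            continue
        records.append((float(fields[0]), fields[1], fields[2]))
    return records


def group_by_edge(records):
    """Map each (source, destination) pair to its list of send timestamps."""
    edges = defaultdict(list)
    for timestamp, src, dst in records:
        edges[(src, dst)].append(timestamp)
    return dict(edges)


# The four rows shown on the slide (the trailing "..." is left out).
trace_text = """Time From To
0.01 A B
0.02 A B
0.04 B D
0.05 C F"""

records = parse_trace(trace_text.splitlines())
edges = group_by_edge(records)
```

Grouping by edge is a convenience for the later steps, which work on one time signal per (source, destination) pair.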

  10. The convolution algorithm: output [Figure: graph of causal paths among nodes A through G, with delay labels including .15, .10, and 0]

  11. Basic idea • Creates a “time signal” for messages from each node • Given time signals S1(t) = (A→B) and S2(t) = (B→X) • Computes the convolution of S2(t) and S1(–t), written S1 * S2 (can be computed quickly using fast Fourier transforms) [Figure: time signal S1(t) = (A→B msgs) plotted on a time axis from 1 to 7]

  12. S1(t)=(AB msgs) S2(t)=(BX msgs) S1 * S2=conv(S2(t), S1(-t)) • Spikes suggest causality between AB and BX msgs • Time shift of a spike indicates its characteristic delay Project5 - SOSP

  13. Details: first step • Choose starting node A • Use trace to add edges from it [Figure: node A with edges to B and C]

     Time  From  To
     0.01  A     B
     0.02  A     B
     0.04  A     C
     0.05  A     B

  14. (AB)*(BD) (AB)*(BE) d Continuing Time From To … B D … B E … B F … B G A B C ?? Project5 - SOSP

  15. How

     Time  From  To
     t1    A     B
     t2    A     B
     t3    A     B
     t4    A     B

     Time  From  To
     …
     t1+d  B     D
     …
     t2+d  B     D
     …
     t3+d  B     D
     t3+d  B     E
     …
     t4+d  B     D

  16. Heuristic to find spikes • Threshold 1: n1 stddev over mean • Threshold 2: n2 stddev over mean • n1 = 2, n2 = 1.5
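A "mean plus n standard deviations" threshold can be sketched as follows. The slide does not say how the two thresholds are combined, so this example only applies each one on its own. The function name `find_spikes`, the sample values, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def find_spikes(values, n_stddev):
    """Return indices whose value exceeds mean + n_stddev * stddev."""
    mu = mean(values)
    sigma = pstdev(values)  # population stddev; an assumption, not from the slides
    threshold = mu + n_stddev * sigma
    return [i for i, v in enumerate(values) if v > threshold]


# Hypothetical convolution output: mostly background noise, one clear peak.
values = [1, 0, 1, 6, 0, 1, 1, 0, 1, 2]

spikes_n1 = find_spikes(values, 2.0)   # n1 = 2 from the slide
spikes_n2 = find_spikes(values, 1.5)   # n2 = 1.5 from the slide
```

In this toy input both settings flag only index 3. On noisier data the lower multiplier would admit more candidate spikes.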

  17. Recursing to continue • Observations: 1. (B→D) are not all msgs from B to D (only those caused by A) 2. Stop recursion when too few messages are left or no more spikes are found [Figure: path A → B → D with delay d at B, followed by a not-yet-determined edge (??)]

  18. Outline • Problem statement & goals • Overview of our approach • Algorithm • Experimental results • Conclusions

  19. Results: email service delays • Jeff logged all email headers for two months • Parsed 80K Received headers in 12K messages • Example: Received: from cceexg11.americas.cpqcorp.net ... by wera.hpl.hp.com ... ; Fri, 4 Apr 2003 15:35:54 -0800 • Each header yields a (timestamp, sender, receiver) trace record • Used the convolution algorithm to: • Reconstruct message paths • Find typical delays • Note: this is NOT the most direct way to use email headers • We made the problem harder so as to test our algorithm
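Turning one Received header into a trace record can be sketched as below. This is not the authors' parser: the regular expression and the function name `parse_received` are assumptions. The expression handles only the simple "from ... by ... ; date" shape of the slide's example, and real headers vary more than this. The elided parts of the example header are kept as literal "...".

```python
import re
from email.utils import parsedate_to_datetime

# Assumed shape: "from <sender> ... by <receiver> ... ; <RFC 2822 date>"
RECEIVED_RE = re.compile(r"from\s+(\S+).*?\bby\s+(\S+).*;\s*(.+)$", re.DOTALL)


def parse_received(value):
    """Return (timestamp, sender, receiver), or None if the header does not match."""
    match = RECEIVED_RE.search(value)
    if match is None:
        return None
    sender, receiver, date_text = match.groups()
    return parsedate_to_datetime(date_text.strip()), sender, receiver


header_value = (
    "from cceexg11.americas.cpqcorp.net ... by wera.hpl.hp.com ... ; "
    "Fri, 4 Apr 2003 15:35:54 -0800"
)
record = parse_received(header_value)
timestamp, sender, receiver = record
```

The timestamp keeps its UTC offset, so records stamped by hosts in different time zones can be compared on one timeline.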

  20. Email trace: output [Figure: graph of reconstructed email paths with numeric labels]

  21. Results: Petstore • Sun’s demo application for J2EE • Stanford’s PinPoint project provided us with traces • One trace has a node that is artificially slowed down

  22. Future work • Automate trace gathering and conversion • Sliding-window versions of algorithms • Find phased behavior • Reduce memory usage of nesting algorithm • Improve speed of convolution algorithm • Validate usefulness on more complicated systems • What are limits of our approach?

  23. Conclusion • Looking for bottlenecks in black box systems • Use signal processing techniques to find causal paths in the network and their delays • For more information • http://www.hpl.hp.com/research/project5/ • Contact us if you have multi-hop message traces!