1 / 33

Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems

Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems. Computer Science Purdue University. Karthik Nagaraj Charles Killian Jennifer Neville. Distributed Systems: BitTorrent. 20% slower But why?. Transmission. CDF. 180 nodes. Download Time (sec).

gaetan
Download Presentation

Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Structured Comparative Analysisof Systems Logs to Diagnose Performance Problems Computer Science Purdue University Karthik Nagaraj Charles Killian Jennifer Neville

  2. Distributed Systems: BitTorrent 20% slower But why? Transmission CDF 180 nodes Download Time (sec) Nagaraj, Killian and Neville

  3. BitTorrent: Compare Piece Times Transmission 54 sec at finish CDF 10 sec at start sed awk grep Per-piece Download Time (sec) Nagaraj, Killian and Neville

  4. BitTorrent: Compare Logs 19 MB 14 MB Compare Difficulties: • Non-determinism • Concurrency Transmission Peer nodes Nagaraj, Killian and Neville

  5. BitTorrent: Compare Logs 19 MB 14 MB Compare Difficulties: • Non-determinism • Concurrency Transmission Peer nodes Nagaraj, Killian and Neville

  6. BitTorrent: Compare Logs 19 MB 14 MB Compare Difficulties: • Non-determinism • Concurrency • Large volume of logs Transmission 3.3 GB 2.3 GB Nagaraj, Killian and Neville

  7. Related Work: Performance Debugging Detection Diagnosis Detecting System Problems Mining Console Logs [Xu et. al. SOSP 2009] Distalyzer MacePC [Killian. et. al. FSE 2010] Request Processing Spectroscope [Sambasivan et. al. NSDI 2011] Magpie [Barham et. al. OSDI 2004] Correlating instrumentation data to system states [Cohen et. al. SOSP 2004] Nagaraj, Killian and Neville

  8. Distalyzer Highlights Distalyzer diagnoses performance • Practical • Existing logs • Automatic • Machine Learning • For non-expert developers • Identifies: • Significant divergences • Impact on Performance Successful demonstration Transmission Nagaraj, Killian and Neville

  9. Design of Distalyzer Logs • Feature creation • Extract behaviors influencing overall performance • Predictive modeling • Find distinguishing features • Descriptive modeling • Component interactions • Attention focusing • Root cause description State Event Nagaraj, Killian and Neville

  10. Code Instrumentation Free text logging • Systems already have logging • Categorize logs • log4j • printf Event logs: some event happened Practical 1309116804.465942 10.0.0.74:8090 Middle-ground Received BT_PIECE for block 1058 By post-processing logs State logs: value of system variable Fully structured logging • Pip [1] • Xtrace[2] 1309116871.387768 Progress: 25.05% Speed: 237 KiB/s / 0 KiB/s Detailed • [1]: Reynolds et. al. NSDI 2006 • [2]: Fonseca et. al. NSDI 2007 Nagaraj, Killian and Neville

  11. Design: Feature Creation Capture summarized log characteristics of complete execution Events List of Timestamps E.g. receive_bt_piece Longer execution Log file Time Normalized Time First Median Last Count Relative {First, Median, Last} Nagaraj, Killian and Neville

  12. Design: Feature Creation E.g. Download speed (KB/s) States List of Values {42,204,191,…,55} Final Maximum Average Minimum ¼ ½ ¾ Time Quarter Snapshots +Relative Nagaraj, Killian and Neville

  13. Design: Predictive Modeling • Mean • Variance • Which features differ most? • E.g.First(recv_bt_piece) • E.g.Max(Download speed) • Welch’s T-Test • Significance: Really different? • Magnitude of difference • Rank significant variables Nagaraj, Killian and Neville

  14. Design: Descriptive Modeling • How do system components interact? • Dependency networks [1] Example Feature Identify Dependencies A B C First occurrence Events : recv_bt_piece : sent_bt_piece : recv_bt_request … A A B B A C C Strength (edge) D D F F B D E F E E C Dependency graph [1]: Heckerman et. al. JMLR 2001 Nagaraj, Killian and Neville

  15. Design: Attention Focusing • Best smallest explanation of the problem Performance metric Relative Median Feature type: ƒ Count Relative First Last Median First Relative Last B B B B B B B B A A A A A A A A C C C C C C C C D D D D D D D D E E E E E E E E F F F F F F F F G G G G G G G G H H H H H H H H Nagaraj, Killian and Neville

  16. Design: Attention Focusing • Score divergence and dependence equally Performance metric: B B A C Remove weak edges D E F Max spanning tree Final Score= Σw(node) * w(edge) G H Nagaraj, Killian and Neville

  17. Implementation Logs • Parsers for text logs • Distalyzer API to categorize logs • Small (<100 lines Perl/Python) • One-time • Distalyzer • 4000 lines of code in Python & C++ • Offline usage, Highly parallel • Web frontend using JavaScript • Helps explore logs • Publicly available Parser Distalyzer Nagaraj, Killian and Neville

  18. Diagnosis Case Studies 3 systems 6 performance problems • TritonSort (Sorting) [NSDI 2011] • Different versions • HBase (BigTable) • Different requests • Transmission (BitTorrent) • Different implementations Transmission Nagaraj, Killian and Neville

  19. TritonSort (Different Versions) Performance Fast Slow Nightly builds • Suddenly 74% slower on a day • Baseline: Faster execution with same workload • Diagnosed in 3-4hrs • Manual debugging took “the better part of 2 days” Disclaimer: This plot was generated using fictitious data, and is only meant for illustration purposes. Nagaraj, Killian and Neville CDF Sorting Time (sec)

  20. TritonSort Events States • Writeris significant • Outputs to disk Hard disk drive Cache Writer module Write-through Nagaraj, Killian and Neville

  21. HBase (Different Requests) Slow Fast • Yahoo Cloud Storage Benchmark • 1Million operations: 95% reads, 5% writes • Baseline: Fast requests in same execution ≤ 15msec ≥ 100msec CDF 0 Request latency (msec) Nagaraj, Killian and Neville

  22. HBase: Slow Outliers • client.HTable.get_lookup suspicious • Code just preceding this log • Region (Tablet)lookup by the client Edge triggered divergence increase Events Nagaraj, Killian and Neville

  23. HBase: Root Cause Request Row(r) Success Client Server failed Tablet server I do not serve this tablet anymore! Master 1 second wasted Nagaraj, Killian and Neville

  24. BitTorrent (Different Implementations) Slow Fast • Transmission 20% slower than Azureus • Baseline: the Azureus implementation CDF Transmission 180 nodes 250 KBps upload cap Download Time (sec) Nagaraj, Killian and Neville

  25. BitTorrent: Transmission • Dependency graph for Last feature • Sent_Bt_Interested lags in Transmission • A timer initiates this log Control flow Events Nagaraj, Killian and Neville

  26. Transmission: Root Cause 10 second I am interested I am interested I am interested 1 second Interested Interested Interested Interested Interested Interested Nagaraj, Killian and Neville

  27. Conclusion • Diagnose performance problems by comparing against baseline • Different implementations, runs and requests • Distalyzer automatically diagnoses performance • Successful in 3 mature & popular distributed systems Distalyzeris free software http://www.macesystems.org/distalyzer NSDI 2012 Nagaraj, Killian and Neville

  28. Backup slides Nagaraj, Killian and Neville

  29. Facts of Life • Need enough samples • For statistical significance • Similar execution environments • Experimental artifact or performance problem? • Can be relaxed with developer annotations • Needs log data to succeed! • Performance overheads from logging • Logs should contain the bug Nagaraj, Killian and Neville

  30. Transmission Fixed • 288.0 sec Transmission vs. 287.7 sec Azureus Transmission CDF Download Time (sec) Nagaraj, Killian and Neville

  31. Clock Synchrony • Responsibility of the instrumentation framework • Relative synchrony • Master can broadcast START, and all nodes log it • No, if granularity is already small Nagaraj, Killian and Neville

  32. Related work (2) • NetMedic[Kandula et. Al. Sigcomm2009] • Giza [Mahimkar et. al. Sigcomm2009] • WISE [Tariq et. al. Sigcomm2008] • Distalyzer • Software problems • Inter-component Dependencies and faults Nagaraj, Killian and Neville

  33. Diagnosed Performance Problems • TritonSort • Hard disk loose cache battery (previously diagnosed) • Hbase (22%) • Tablet server lookup mishandled • Linux default disk scheduler (CFQ) • Network slowdown (Nagle’s algorithm) • Transmission (45%) • Same-IP peers mishandled • Sent BT_Interested timer Nagaraj, Killian and Neville

More Related