340 likes | 644 Views
Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems. Computer Science Purdue University. Karthik Nagaraj Charles Killian Jennifer Neville. Distributed Systems: BitTorrent. 20% slower But why?. Transmission. CDF. 180 nodes. Download Time (sec).
E N D
Structured Comparative Analysisof Systems Logs to Diagnose Performance Problems Computer Science Purdue University Karthik Nagaraj Charles Killian Jennifer Neville
Distributed Systems: BitTorrent 20% slower But why? Transmission CDF 180 nodes Download Time (sec) Nagaraj, Killian and Neville
BitTorrent: Compare Piece Times Transmission 54 sec at finish CDF 10 sec at start sed awk grep Per-piece Download Time (sec) Nagaraj, Killian and Neville
BitTorrent: Compare Logs 19 MB 14 MB Compare Difficulties: • Non-determinism • Concurrency Transmission Peer nodes Nagaraj, Killian and Neville
BitTorrent: Compare Logs 19 MB 14 MB Compare Difficulties: • Non-determinism • Concurrency Transmission Peer nodes Nagaraj, Killian and Neville
BitTorrent: Compare Logs 19 MB 14 MB Compare Difficulties: • Non-determinism • Concurrency • Large volume of logs Transmission 3.3 GB 2.3 GB Nagaraj, Killian and Neville
Related Work: Performance Debugging Detection Diagnosis Detecting System Problems Mining Console Logs [Xu et. al. SOSP 2009] Distalyzer MacePC [Killian. et. al. FSE 2010] Request Processing Spectroscope [Sambasivan et. al. NSDI 2011] Magpie [Barham et. al. OSDI 2004] Correlating instrumentation data to system states [Cohen et. al. SOSP 2004] Nagaraj, Killian and Neville
Distalyzer Highlights Distalyzer diagnoses performance • Practical • Existing logs • Automatic • Machine Learning • For non-expert developers • Identifies: • Significant divergences • Impact on Performance Successful demonstration Transmission Nagaraj, Killian and Neville
Design of Distalyzer Logs • Feature creation • Extract behaviors influencing overall performance • Predictive modeling • Find distinguishing features • Descriptive modeling • Component interactions • Attention focusing • Root cause description State Event Nagaraj, Killian and Neville
Code Instrumentation Free text logging • Systems already have logging • Categorize logs • log4j • printf Event logs: some event happened Practical 1309116804.465942 10.0.0.74:8090 Middle-ground Received BT_PIECE for block 1058 By post-processing logs State logs: value of system variable Fully structured logging • Pip [1] • Xtrace[2] 1309116871.387768 Progress: 25.05% Speed: 237 KiB/s / 0 KiB/s Detailed • [1]: Reynolds et. al. NSDI 2006 • [2]: Fonseca et. al. NSDI 2007 Nagaraj, Killian and Neville
Design: Feature Creation Capture summarized log characteristics of complete execution Events List of Timestamps E.g. receive_bt_piece Longer execution Log file Time Normalized Time First Median Last Count Relative {First, Median, Last} Nagaraj, Killian and Neville
Design: Feature Creation E.g. Download speed (KB/s) States List of Values {42,204,191,…,55} Final Maximum Average Minimum ¼ ½ ¾ Time Quarter Snapshots +Relative Nagaraj, Killian and Neville
Design: Predictive Modeling • Mean • Variance • Which features differ most? • E.g.First(recv_bt_piece) • E.g.Max(Download speed) • Welch’s T-Test • Significance: Really different? • Magnitude of difference • Rank significant variables Nagaraj, Killian and Neville
Design: Descriptive Modeling • How do system components interact? • Dependency networks [1] Example Feature Identify Dependencies A B C First occurrence Events : recv_bt_piece : sent_bt_piece : recv_bt_request … A A B B A C C Strength (edge) D D F F B D E F E E C Dependency graph [1]: Heckerman et. al. JMLR 2001 Nagaraj, Killian and Neville
Design: Attention Focusing • Best smallest explanation of the problem Performance metric Relative Median Feature type: ƒ Count Relative First Last Median First Relative Last B B B B B B B B A A A A A A A A C C C C C C C C D D D D D D D D E E E E E E E E F F F F F F F F G G G G G G G G H H H H H H H H Nagaraj, Killian and Neville
Design: Attention Focusing • Score divergence and dependence equally Performance metric: B B A C Remove weak edges D E F Max spanning tree Final Score= Σw(node) * w(edge) G H Nagaraj, Killian and Neville
Implementation Logs • Parsers for text logs • Distalyzer API to categorize logs • Small (<100 lines Perl/Python) • One-time • Distalyzer • 4000 lines of code in Python & C++ • Offline usage, Highly parallel • Web frontend using JavaScript • Helps explore logs • Publicly available Parser Distalyzer Nagaraj, Killian and Neville
Diagnosis Case Studies 3 systems 6 performance problems • TritonSort (Sorting) [NSDI 2011] • Different versions • HBase (BigTable) • Different requests • Transmission (BitTorrent) • Different implementations Transmission Nagaraj, Killian and Neville
TritonSort (Different Versions) Performance Fast Slow Nightly builds • Suddenly 74% slower on a day • Baseline: Faster execution with same workload • Diagnosed in 3-4hrs • Manual debugging took “the better part of 2 days” Disclaimer: This plot was generated using fictitious data, and is only meant for illustration purposes. Nagaraj, Killian and Neville CDF Sorting Time (sec)
TritonSort Events States • Writeris significant • Outputs to disk Hard disk drive Cache Writer module Write-through Nagaraj, Killian and Neville
HBase (Different Requests) Slow Fast • Yahoo Cloud Storage Benchmark • 1Million operations: 95% reads, 5% writes • Baseline: Fast requests in same execution ≤ 15msec ≥ 100msec CDF 0 Request latency (msec) Nagaraj, Killian and Neville
HBase: Slow Outliers • client.HTable.get_lookup suspicious • Code just preceding this log • Region (Tablet)lookup by the client Edge triggered divergence increase Events Nagaraj, Killian and Neville
HBase: Root Cause Request Row(r) Success Client Server failed Tablet server I do not serve this tablet anymore! Master 1 second wasted Nagaraj, Killian and Neville
BitTorrent (Different Implementations) Slow Fast • Transmission 20% slower than Azureus • Baseline: the Azureus implementation CDF Transmission 180 nodes 250 KBps upload cap Download Time (sec) Nagaraj, Killian and Neville
BitTorrent: Transmission • Dependency graph for Last feature • Sent_Bt_Interested lags in Transmission • A timer initiates this log Control flow Events Nagaraj, Killian and Neville
Transmission: Root Cause 10 second I am interested I am interested I am interested 1 second Interested Interested Interested Interested Interested Interested Nagaraj, Killian and Neville
Conclusion • Diagnose performance problems by comparing against baseline • Different implementations, runs and requests • Distalyzer automatically diagnoses performance • Successful in 3 mature & popular distributed systems Distalyzeris free software http://www.macesystems.org/distalyzer NSDI 2012 Nagaraj, Killian and Neville
Backup slides Nagaraj, Killian and Neville
Facts of Life • Need enough samples • For statistical significance • Similar execution environments • Experimental artifact or performance problem? • Can be relaxed with developer annotations • Needs log data to succeed! • Performance overheads from logging • Logs should contain the bug Nagaraj, Killian and Neville
Transmission Fixed • 288.0 sec Transmission vs. 287.7 sec Azureus Transmission CDF Download Time (sec) Nagaraj, Killian and Neville
Clock Synchrony • Responsibility of the instrumentation framework • Relative synchrony • Master can broadcast START, and all nodes log it • No, if granularity is already small Nagaraj, Killian and Neville
Related work (2) • NetMedic[Kandula et. Al. Sigcomm2009] • Giza [Mahimkar et. al. Sigcomm2009] • WISE [Tariq et. al. Sigcomm2008] • Distalyzer • Software problems • Inter-component Dependencies and faults Nagaraj, Killian and Neville
Diagnosed Performance Problems • TritonSort • Hard disk loose cache battery (previously diagnosed) • Hbase (22%) • Tablet server lookup mishandled • Linux default disk scheduler (CFQ) • Network slowdown (Nagle’s algorithm) • Transmission (45%) • Same-IP peers mishandled • Sent BT_Interested timer Nagaraj, Killian and Neville