Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems
S. Bhatia, A. Kumar, M. E. Fiuczynski, L. Peterson (Princeton & Google), OSDI’08
Presented by Shimin Chen, LBA Reading Group

Presentation Transcript


  1. Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems, S. Bhatia, A. Kumar, M. E. Fiuczynski, L. Peterson (Princeton & Google), OSDI’08. Shimin Chen, LBA Reading Group

  2. Introduction • Troubleshooting complex software systems is difficult • Undetected software bugs • Intermittent hardware failures • Configuration errors • Unanticipated workloads

  3. Why are production-only problems hard to reproduce and diagnose? • Need enough data for diagnosis • Faults are unexpected • Data must be collected at all times, to capture temporally distant faults and failures • Whole system: “domino effects” • System crashes and non-determinism • Need low cost • Monitoring and diagnosis overhead • Modifications to applications or the OS • Must not require taking the system offline

  4. New Tool: Chopstix • Data collection: • Continuously collects summaries of the system’s behavior • Low-level OS operations with detailed contextual information • Keeps data for weeks • <1% CPU utilization, 256 KiB RAM, 16 MiB of logs per day • Aggregation and visualization

  5. Help diagnose a class of problems • For example: “What was the system doing last Wednesday around 5pm when the ssh prompt latency was temporarily high, yet system load appeared to be low?”

  6. Contributions • How a probabilistic data structure (the sketch) enables detailed yet lightweight data collection • How sysadmins can use this data to diagnose problems (e.g., on PlanetLab)

  7. Outline • Introduction • System Design (Section 2 & 4) • Usage Model (Section 3) • Evaluation (Section 5) • Discussion (Section 6) • Summary

  8. System Components • Data collector • Implemented in the kernel • Companion user process periodically copies data from the kernel to disk • A polling process fetches data from every machine in a networked system to a central location • Aggregator • Visualizer

  9. Data Collector • Goal: low resource usage & high coverage • Solution: sketch-based sampling • The problem with uniform sampling: • Infrequent events are unlikely to be recorded • The idea of a sketch: • The sampling rate is adjusted to each event’s frequency • Here, a “sketch” records event frequencies

  10. Data Collector: five steps • A trigger fires for a potentially interesting event; • The relevant event data structure is retrieved and a sketch is updated; • A sampling function is evaluated to determine if this event should be recorded; • If so, a sample of relevant information is saved; • A user-level process periodically polls the kernel for this data and resets the data structure, thereby defining a data collection epoch.

  11. Event Triggers • A trigger fires for a potentially interesting event; • Polling has problems • Waste when no activity • Polling frequency may not be high enough if high activity • Triggers: • Instrument OS interfaces in kernel: • E.g., calls to page allocators • For HW stats, such as L2 misses, make the processor generate periodic interrupts. Interrupt handlers are the triggers.

  12. The relevant event data structure is retrieved and a sketch is updated; Sketch Update • A “sketch” records event frequencies within a given data-collection epoch • To reduce space overheads, use a single hash table for all sketches • Hash key: event type, VM address, executable identifier, uid, etc. (event-type specific) • Experiments show a low probability of false negatives • When an event is triggered, compute the hash index and increment the sketch by a weight • Weight: the event’s duration in jiffies • Longer events get larger weights
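The sketch update on this slide can be sketched in a few lines. This is an illustrative model, not the Chopstix kernel code: the table size, key fields, and function name are assumptions, and a real kernel implementation would use a fixed hash function rather than Python’s `hash`.

```python
TABLE_SIZE = 1024  # number of sketch buckets (illustrative)

sketch = [0] * TABLE_SIZE

def sketch_update(event_type, vm_addr, exe_id, uid, weight=1):
    """Hash the event's identifying attributes into a bucket and add the
    event's weight (e.g., its duration in jiffies) to that bucket.
    Returns the bucket index and its updated count."""
    idx = hash((event_type, vm_addr, exe_id, uid)) % TABLE_SIZE
    sketch[idx] += weight
    return idx, sketch[idx]
```

Because distinct events can hash to the same bucket, a small table trades space for a chance of collisions; the slide’s “false negatives” refer to events whose counts are distorted this way.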

  13. A sampling function is evaluated to determine if this event should be recorded; Logarithmic Sampling Function • Goal: # samples of any one event ≈ log(event frequency) • Implementation: • Choose an integer t = 2^k • If the event’s frequency count is a power of t, record a sample • In other words, record a sample when the count, in binary, is a 1 followed by k zeros, 2k zeros, 3k zeros, … • How to choose t? • Low and high watermarks on the CPU cycles consumed by Chopstix • Double t if above the high watermark in the previous epoch • Halve t if below the low watermark in the previous epoch
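The power-of-t test and the watermark adaptation can be written out directly. This is a minimal sketch under the slide’s description (t = 2^k, so powers of t are powers of two whose exponent is a multiple of k); the function names and watermark parameters are illustrative.

```python
def should_record(count, k):
    """Record a sample when the sketch count is a power of t = 2**k,
    i.e., a 1 followed by a multiple of k zero bits in binary.
    This yields ~log_t(frequency) samples for an event seen `count` times."""
    if count <= 0 or count & (count - 1):   # reject non-powers of two
        return False
    return (count.bit_length() - 1) % k == 0

def adjust_k(k, cycles_used, low_wm, high_wm):
    """Between epochs, adapt t = 2**k to keep Chopstix's CPU cost between
    the watermarks: double t when over budget, halve it when under."""
    if cycles_used > high_wm:
        return k + 1        # double t -> sample less often
    if cycles_used < low_wm and k > 1:
        return k - 1        # halve t -> sample more often
    return k
```

For k = 2 (t = 4), samples are taken at counts 1, 4, 16, 64, …, so a rare event is caught on its first occurrence while a hot event contributes only logarithmically many samples.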

  14. Event Sample • If so, a sample of relevant information is saved; • Stack trace, uid, program id, and other event-specific details • Kernel stack trace: obtained by walking frame pointers • User-level stack trace: similar • But the walk stops if the stack is paged out

  15. A user-level process periodically polls the kernel for this data and resets the data structure, thereby defining a data collection epoch. Epochs • Periodically copy the samples and hash table out • Reinitialize the hash table • Two copies of the in-kernel Chopstix data structures allow a fast swap
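The double-buffered epoch swap can be modeled as follows. This is an illustrative sketch, not the kernel implementation: the class and method names are assumptions, and the real system swaps pointers atomically so the kernel keeps recording while user space drains the retired buffer.

```python
class EpochBuffers:
    """Two sketch tables: the kernel writes into the active one while the
    user-level process drains and resets the other."""

    def __init__(self, size):
        self.tables = [[0] * size, [0] * size]
        self.active = 0

    def record(self, idx, weight):
        # In-kernel path: update the active table only.
        self.tables[self.active][idx] += weight

    def end_epoch(self):
        """Swap buffers, return the retired epoch's data, and reset the
        retired table for reuse two epochs from now."""
        old = self.active
        self.active ^= 1
        snapshot = list(self.tables[old])
        self.tables[old] = [0] * len(snapshot)
        return snapshot
```

The swap makes the epoch boundary cheap: copying and resetting happen on the retired buffer, off the event-recording fast path.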

  16. Vital Signs (Event Types) • 11 vital signs: • CPU utilization • scheduling delay • resource blocking • disk I/O activity • user page allocation • kernel page allocation • L2-cache utilization • system call invocation • signal delivery • socket transmission • mutex/semaphore locking.

  17. More details

  18. Aggregator & Visualizer • Macromedia Flash web interface

  19. Outline • Introduction • System Design • Usage Model • Evaluation • Discussion • Summary

  20. Workflow • Want to diagnose a misbehavior: • Temporal ambiguity: not tied to a specific moment and seemingly cannot be reproduced • Spatial ambiguity: cannot be localized to a component • Search the Chopstix data for symptoms (unusual vital-sign behaviors) • Given specific times, zoom into the corresponding epochs • Search these epochs first, then earlier ones • Look for outliers in vital signs using threshold filters • Correlate candidate symptoms • Given a set of symptoms, address the problem: • Reproduce the problem by artificially triggering the symptoms • Or avoid the symptoms
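A threshold filter for the symptom search could look like the following. This is a hypothetical sketch (the paper does not specify the filter); here an epoch is flagged when a vital sign deviates from its mean by more than `nsig` standard deviations.

```python
from statistics import mean, stdev

def find_symptoms(series, nsig=3.0):
    """Return the indices of epochs whose vital-sign value is an outlier,
    i.e., more than nsig standard deviations from the series mean."""
    if len(series) < 2:
        return []
    m, s = mean(series), stdev(series)
    if s == 0:
        return []   # perfectly flat signal: nothing unusual
    return [i for i, v in enumerate(series) if abs(v - m) > nsig * s]
```

An admin would run such a filter over each vital sign around the reported time, then look for epochs where several signs are flagged together.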

  21. Correlating Symptoms • Interpreting symptoms is an “art” • The paper describes a collection of 10 guiding rules for interpreting symptoms

  22. Rules: • Rule #1: High CPU utilization (cpuu) with a low (nearly-zero) L2-cache-miss rate is a symptom of a busy loop • Rule #2: An increase in the net (user or kernel) memory allocated is an indicator of a memory leak • Rule #3: Unsatisfied I/O requests (size of data requested > size of data returned) indicate bad blocks on the disk

  23. Rules: • Rule #4: When the combined value of cpuu for processes is low, scheduling delay (sched) is high, and the total cpuu is high for the system, it is often a sign of a kernel bottleneck • See paper for the others
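Rules like these can be encoded as simple predicates over per-epoch statistics. The sketch below covers Rules #1 and #4; the function names and thresholds are illustrative placeholders, not values from the paper.

```python
def busy_loop_suspected(cpuu, l2_miss_rate, cpuu_hi=0.9, miss_lo=1e-4):
    """Rule #1: high CPU utilization with a near-zero L2-cache-miss rate
    suggests a busy loop (spinning in a small, cache-resident code path)."""
    return cpuu > cpuu_hi and l2_miss_rate < miss_lo

def kernel_bottleneck_suspected(per_proc_cpuu, sched_delay, total_cpuu,
                                proc_lo=0.2, sched_hi=0.5, total_hi=0.9):
    """Rule #4: every process shows low CPU use, scheduling delay is high,
    yet total system CPU use is high -- the time is going to the kernel."""
    return (max(per_proc_cpuu, default=0) < proc_lo
            and sched_delay > sched_hi
            and total_cpuu > total_hi)
```

In practice the thresholds would be tuned per deployment; the value of the rules is in naming which combinations of vital signs to correlate.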

  24. Case Study: observed behavior • PlanetLab nodes were observed to crash every 1–7 days without leaving any information on the console or in the system logs. • Shortly before such crashes, ssh sessions to the nodes would stall for tens of seconds. • Some nodes running an identical software stack did not suffer these crashes, indicating that the problem was load-dependent. • KDB and the NMI watchdog were not effective

  25. First Attempt • Resource blocking • High I/O activity • Found that request-response latencies stayed low and I/O throughput degraded negligibly during these periods • Not the problem

  26. Second Attempt • Rule #4 • High scheduling delays with heavy CPU utilization • A bug in the scheduling loop

  27. Other Examples • Brief descriptions of five other examples in the paper

  28. Outline • Introduction • System Design • Usage Model • Evaluation • Discussion • Summary

  29. Experimental Setup • Core 2 Duo, 4GB RAM • Linux 2.6.20.1 • NMI interrupts every 10^7 CPU cycles and every 6×10^5 L2-cache misses

  30. Aggregation/Visualization • Using a data set spanning three days, initialization cost: • ~80 seconds if not cached • ~3 seconds if cached

  31. Coverage of sketches • Compute the false-negative probabilities • The formula is not explained • Vary the hash table size and the threshold value t • Hash table size: 128–4 KiB • False-negative rate for each vital sign: 10^-3 to 10^-4

  32. Discussion • Properties of problems that may be diagnosed via Chopstix: • Must impact the system’s behavior • Must persist for longer than one epoch • Applicable to other OSes • The general scheme should be portable • Implementation details may vary

  33. Summary • Chopstix: • Logs succinct summaries of low-level OS events • Sketch-based sampling • Real implementation • Experience with PlanetLab • Guiding rules for interpreting vital signs
