Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems
S. Bhatia, A. Kumar, M. E. Fiuczynski, L. Peterson (Princeton & Google), OSDI'08
Shimin Chen, LBA Reading Group
Introduction • Troubleshooting complex software systems is difficult • Undetected software bugs • Intermittent hardware failures • Configuration errors • Unanticipated workloads
Why are production-only problems hard to reproduce and diagnose? • Need enough data for diagnosis • Faults are unexpected, so data must be collected ahead of time • Data must be kept at all times to catch temporally distant faults and failures • Must cover the whole system to capture "domino effects" • System crashes and non-determinism complicate collection • Need low costs • Low monitoring and diagnosis overhead • No modifications to applications or the OS • No need to take the system offline
New Tool: Chopstix • Data collection: • Continuously collects summaries of the system's behavior • Low-level OS operations with detailed contextual information • Keeps data for weeks • <1% CPU utilization, 256 KiB RAM, 16 MiB of log per day • Aggregation and visualization
Help diagnose a class of problems • For example: "What was the system doing last Wednesday around 5pm when the ssh prompt latency was temporarily high, yet system load appeared to be low?"
Contributions • How a probabilistic data structure (sketches) enables detailed yet lightweight data collection • How sysadmins can use this data to diagnose problems (e.g., on PlanetLab)
Outline • Introduction • System Design (Sections 2 & 4) • Usage Model (Section 3) • Evaluation (Section 5) • Discussion (Section 6) • Summary
System Components • Data collector • Implemented in the kernel • Companion user process periodically copies data from the kernel to disk • A polling process fetches data from every machine in a networked system to a central location • Aggregator • Visualizer
Data Collector • Goal: low resource usage & high coverage • Solution: sketch-based sampling • Problem with uniform sampling: infrequent events are less likely to be recorded • The idea of a sketch: adjust the sampling rate per event frequency • Here, a "sketch" tracks event frequencies
Data Collector: five steps • 1. A trigger fires for a potentially interesting event • 2. The relevant event data structure is retrieved and a sketch is updated • 3. A sampling function is evaluated to determine whether this event should be recorded • 4. If so, a sample of relevant information is saved • 5. A user-level process periodically polls the kernel for this data and resets the data structure, thereby defining a data-collection epoch
Event Triggers (step 1) • A trigger fires for a potentially interesting event • Why not polling? • Wasteful when there is no activity • Polling frequency may be too low when activity is high • Triggers instead: • Instrument OS interfaces in the kernel, e.g., calls to page allocators • For HW stats such as L2 misses, make the processor generate periodic interrupts; the interrupt handlers are the triggers (see the sketch below)
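A minimal sketch of step 1, assuming invented names (chopstix_trigger, CS_USER_PAGE_ALLOC): wrapping an allocation path so every call fires a trigger, capturing activity exactly when it happens instead of when polled. The real hooks live in the kernel; this user-space stand-in only illustrates the shape.

```c
#include <stdio.h>
#include <stdlib.h>

/* Invented event-type and hook names, for illustration only. */
enum cs_event { CS_USER_PAGE_ALLOC /* ... one per vital sign ... */ };

/* Step 1 entry point: in the real system this would go on to update
 * the sketch (step 2) and evaluate the sampling function (step 3). */
static void chopstix_trigger(enum cs_event type, void *ctx)
{
    printf("trigger: event %d, ctx %p\n", (int)type, ctx);
}

/* Instrumented wrapper around an allocator: each call fires a trigger,
 * so there is no wasted work when idle and no missed burst when busy. */
static void *instrumented_alloc(size_t bytes)
{
    void *p = malloc(bytes);   /* stand-in for a kernel page allocator */
    chopstix_trigger(CS_USER_PAGE_ALLOC, p);
    return p;
}

int main(void)
{
    void *p = instrumented_alloc(4096);
    free(p);
    return 0;
}
```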
Sketch Update (step 2) • The relevant event data structure is retrieved and a sketch is updated • A "sketch" tracks event frequencies within a given data-collection epoch • To reduce space overhead, one hash table holds all sketches • Hash key: event type, VM address, executable identifier, uid, etc. (event-type specific) • Experiments show a low probability of false negatives from collisions • When an event triggers, compute the hash index and increment the sketch by a weight • Weight: jiffies consumed by the event, so longer events get larger weights (see the sketch below)
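A minimal user-space sketch of step 2, with an assumed table size and an assumed hash function (the paper specifies neither here); the real tables live in the kernel. Each bucket approximates the weighted frequency of the events hashing to it.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_BUCKETS 1024   /* assumed size; the paper varies this */

/* All sketches share one hash table; a bucket holds the weighted
 * frequency of every event whose key hashes to it (collisions are
 * tolerated and surface as rare false negatives). */
static uint64_t sketch[SKETCH_BUCKETS];

/* FNV-1a over the event key (event type, VM address, executable id,
 * uid, ...); one plausible hash, not necessarily the paper's. */
static uint32_t hash_key(const void *key, size_t len)
{
    const unsigned char *p = key;
    uint32_t h = 2166136261u;
    while (len--) { h ^= *p++; h *= 16777619u; }
    return h;
}

/* Step 2: increment the bucket by the event's weight (e.g., jiffies
 * consumed), so longer events count for more. Returns the new count
 * for the step-3 sampling function to inspect. */
static uint64_t sketch_update(const void *key, size_t len, uint64_t weight)
{
    uint32_t idx = hash_key(key, len) % SKETCH_BUCKETS;
    sketch[idx] += weight;
    return sketch[idx];
}
```

A trigger for, say, a 60-jiffy blocking event would call sketch_update(&key, sizeof key, 60) and hand the returned count to the sampling function below.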
Logarithmic Sampling Function (step 3) • A sampling function is evaluated to determine whether this event should be recorded • Goal: # samples of any one event ≈ log(event frequency) • Implementation: • Choose an integer t = 2^k • Record a sample when the event count is a power of t • In other words, record when the count is a power of two with k, 2k, 3k, … zeros in its least significant bits • How to choose t? Low and high watermarks on the CPU cycles consumed by Chopstix: • Double t if the previous epoch exceeded the high watermark • Halve t if the previous epoch fell below the low watermark (see the sketch below)
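A sketch of step 3 as the slide describes it: sample when the count is a power of t = 2^k, and adapt k (hence t) at epoch boundaries. The initial k and the use of the GCC/Clang __builtin_ctzll builtin are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

static unsigned k = 3;   /* t = 2^k; the initial value is an assumption */

/* Step 3: record a sample iff the sketch count is a power of t = 2^k,
 * i.e., a power of two whose exponent is a multiple of k. An event
 * seen f times therefore yields about log_t(f) samples. */
static bool should_sample(uint64_t count)
{
    if (count == 0 || (count & (count - 1)) != 0)
        return false;                        /* not a power of two */
    return __builtin_ctzll(count) % k == 0;  /* k, 2k, 3k, ... low zeros */
}

/* Epoch-boundary adaptation: compare the CPU cycles Chopstix consumed
 * last epoch against the watermarks, doubling or halving t. */
static void adapt_t(uint64_t cycles_used, uint64_t low_mark,
                    uint64_t high_mark)
{
    if (cycles_used > high_mark)
        k++;                    /* t *= 2: sample less aggressively */
    else if (cycles_used < low_mark && k > 1)
        k--;                    /* t /= 2: sample more aggressively */
}
```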
Event Sample (step 4) • If so, a sample of relevant information is saved • Stack trace, uid, program id, other event-specific details • Kernel stack trace: walk the frame-pointer chain • User-level stack trace: similar, but stop if the stack is paged out (see the sketch below)
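A hedged sketch of the frame-pointer walk in step 4. Real kernel code must validate every address before dereferencing it; the resident callback here stands in for the check that stops the user-level walk when the stack is paged out.

```c
#include <stddef.h>

/* With frame pointers enabled, each stack frame begins with the saved
 * frame pointer of the caller followed by the return address. */
struct frame {
    struct frame *prev;   /* caller's saved frame pointer */
    void *ret_addr;       /* return address into the caller */
};

/* Returns nonzero if addr is safe to read; for a user stack this is
 * where the walk would stop on a paged-out page. */
typedef int (*resident_fn)(const void *addr);

static int capture_stack(void *fp, void **trace, int max_depth,
                         resident_fn resident)
{
    struct frame *f = fp;
    int depth = 0;
    while (f != NULL && depth < max_depth && resident(f)) {
        trace[depth++] = f->ret_addr;
        f = f->prev;      /* step to the next-older frame */
    }
    return depth;         /* number of return addresses captured */
}
```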
Epochs (step 5) • A user-level process periodically polls the kernel for this data and resets the data structures, thereby defining a data-collection epoch • Periodically copy the samples and hash table out, then reinitialize the hash table • Two copies of the in-kernel Chopstix data structures allow a fast swap (see the sketch below)
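A minimal sketch of the double-buffered swap, with invented names: the kernel writes one copy while the poller drains the other, so an epoch boundary is a pointer flip rather than a long copy under a lock.

```c
#include <string.h>
#include <stdint.h>

#define CS_BUCKETS 1024   /* assumed sketch size, as in earlier sketches */

/* One copy of the in-kernel state. */
struct cs_state {
    uint64_t sketch[CS_BUCKETS];
    /* ... recorded samples would live here too ... */
};

static struct cs_state state[2];
static int active;        /* index of the copy the kernel writes into */

/* Called once per epoch by the user-level poller: flip buffers and
 * hand back the now-inactive copy for the caller to copy out. */
static struct cs_state *swap_epoch(void)
{
    struct cs_state *drained = &state[active];
    active ^= 1;                      /* kernel now uses the other copy */
    return drained;
}

/* After the drained copy has been written to disk, reset it for reuse. */
static void reset_state(struct cs_state *s)
{
    memset(s, 0, sizeof *s);
}
```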
Vital Signs (Event Types) • 11 vital signs: • CPU utilization • scheduling delay • resource blocking • disk I/O activity • user page allocation • kernel page allocation • L2-cache utilization • system call invocation • signal delivery • socket transmission • mutex/semaphore locking.
Aggregator & Visualizer • Macromedia Flash web interface
Outline • Introduction • System Design • Usage Model • Evaluation • Discussion • Summary
Workflow • Want to diagnose a misbehavior with: • temporal ambiguity: it does not occur at a predictable time and seemingly cannot be reproduced • spatial ambiguity: it cannot be localized to a single component • Search Chopstix data for symptoms (unusual vital-sign behaviors) • Given specific times, zoom into the corresponding epochs • Search these epochs first, then earlier epochs • Look for outliers in vital signs using threshold filters (one possible filter is sketched below) • Correlate candidate symptoms • Given a set of symptoms, address the problem: • Reproduce it by artificially triggering the symptoms • Or avoid the symptoms
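The paper leaves "threshold filters" abstract; one assumed form, shown below, flags epochs whose vital-sign value deviates from the mean by more than a chosen number of standard deviations.

```c
#include <math.h>
#include <stddef.h>

/* One possible threshold filter (an assumption, not the paper's):
 * flag epoch indices where the vital sign deviates from its mean by
 * more than nsigma standard deviations. Returns how many outlier
 * epoch indices were written to out. */
static size_t flag_outliers(const double *sign, size_t n, double nsigma,
                            size_t *out, size_t out_cap)
{
    if (n == 0)
        return 0;

    double mean = 0.0, var = 0.0;
    for (size_t i = 0; i < n; i++)
        mean += sign[i];
    mean /= (double)n;
    for (size_t i = 0; i < n; i++)
        var += (sign[i] - mean) * (sign[i] - mean);
    var /= (double)n;

    double threshold = nsigma * sqrt(var);
    size_t found = 0;
    for (size_t i = 0; i < n && found < out_cap; i++)
        if (fabs(sign[i] - mean) > threshold)
            out[found++] = i;   /* epoch worth zooming into */
    return found;
}
```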
Correlating Symptoms • Interpreting symptoms is an "art" • The paper describes a collection of 10 guiding rules for understanding symptoms
Rules: • Rule #1: High CPU utilization (cpuu) with a low (nearly-zero) L2-cache-miss rate is a symptom of a busy loop • Rule #2: An increase in the net (user or kernel) memory allocated is an indicator of a memory leak • Rule #3: Unsatisfied I/O requests (size of data requested > size of data returned) indicate bad blocks on the disk
Rules: • Rule #4: When the combined cpuu of processes is low, scheduling delay (sched) is high, and the total cpuu of the system is high, it is often a sign of a kernel bottleneck • See the paper for the others (a minimal encoding of two rules is sketched below)
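The rules are heuristics over per-epoch vital-sign values, so they can be encoded as simple predicates. The field names and threshold constants below are invented for illustration; the paper gives the rules only in prose.

```c
#include <stdbool.h>

/* Per-epoch vital-sign summary; fields and thresholds are assumptions. */
struct vitals {
    double cpuu_total;      /* system-wide CPU utilization, 0..1 */
    double cpuu_processes;  /* combined utilization of processes, 0..1 */
    double l2_miss_rate;    /* L2 misses per instruction */
    double sched_delay;     /* scheduling delay per epoch, seconds */
};

/* Rule #1: high CPU utilization with a near-zero L2 miss rate suggests
 * a busy loop (a tight loop stays in cache yet burns CPU). */
static bool rule1_busy_loop(const struct vitals *v)
{
    return v->cpuu_total > 0.9 && v->l2_miss_rate < 1e-4;
}

/* Rule #4: low per-process utilization, high scheduling delay, and
 * high total utilization together suggest a kernel bottleneck. */
static bool rule4_kernel_bottleneck(const struct vitals *v)
{
    return v->cpuu_processes < 0.2 && v->sched_delay > 0.1 &&
           v->cpuu_total > 0.9;
}
```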
Case Study: observed behavior • PlanetLab nodes were observed to crash every 1–7 days without leaving any information on the console or in the system logs • Shortly before such crashes, ssh sessions to the nodes would stall for tens of seconds • Some nodes running an identical software stack did not suffer these crashes, indicating that the problem was load-dependent • KDB and the NMI watchdog were not effective in diagnosing it
First Attempt • Hypothesis: resource blocking caused by high I/O activity • Found that request-response latencies stayed low and I/O throughput degraded negligibly during these periods • So this was not the problem
Second Attempt • Rule #4: high scheduling delays together with heavy CPU utilization • Pointed to a bug in the scheduling loop
Other Examples • Brief descriptions of five other examples in the paper
Outline • Introduction • System Design • Usage Model • Evaluation • Discussion • Summary
Experimental Setup • Core 2 Duo, 4 GB RAM • Linux 2.6.20.1 • NMI interrupts every 10^7 CPU cycles and every 6×10^5 L2-cache misses (a user-space approximation is sketched below)
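Chopstix programs the PMU inside the kernel; as a rough user-space analogue (an assumption, not the paper's code, and using the perf_event_open API, which postdates the paper), one can ask the cycle counter to overflow every 10^7 cycles.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

/* glibc provides no wrapper for this system call. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 10000000;   /* overflow every 10^7 cycles */
    attr.disabled = 1;

    int fd = perf_event_open(&attr, 0 /* self */, -1 /* any CPU */, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    /* A real monitor would request an overflow signal (fcntl F_SETOWN/
     * F_SETSIG) and treat its handler as the trigger, analogous to
     * Chopstix's NMI handlers. */
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    return 0;
}
```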
Aggregation/Visualization • Using a data set spanning three days, the initialization cost is: • ~80 seconds if not cached • ~3 seconds if cached
Coverage of sketches • Compute the false-negative probabilities (the formula is not explained in the paper) • Vary the hash table size and the threshold value t over practical ranges • Hash table size: 128 to 4 KiB • False-negative probability for each vital sign: 10^-3 to 10^-4
Discussion • Properties of problems that may be diagnosed via Chopstix: • They impact the system's behavior • They persist for longer than one epoch • Applicability to other OSes: • The general schemes should be portable • Implementation details may vary
Summary • Chopstix: • Logs succinct summaries of low-level OS events • Sketch-based sampling • Real implementation • Experience with PlanetLab • Guiding rules for interpreting vital signs