
Model-driven Anomaly Characterization for System Performance Debugging

This study presents a model-driven approach to comprehensively identify and understand performance problems in complex systems, such as operating systems, by constructing models that predict system performance and by characterizing performance anomalies. The approach helps make systems faster and more predictable through debugging, and identifies problematic settings that can be avoided.





Presentation Transcript


  1. I/O System Performance Debugging Using Model-driven Anomaly Characterization
  Kai Shen, Ming Zhong, Chuanpeng Li
  Dept. of Computer Science, Univ. of Rochester

  2. Motivation
  • Implementations of complex systems (e.g., operating systems) contain performance "problems"
    • over-simplification, mishandling of special cases, …
    • these problems degrade system performance and make system behavior unpredictable
  • Such problems are hard to identify and understand in complex systems
    • many system features and configuration settings
    • dynamic workload behaviors
    • problems manifest only under special conditions
  • Goal
    • comprehensively identify performance problems over wide ranges of system configurations and workload conditions

  3. Bird's Eye View of Our Approach
  • Construct models to predict system performance
    • "simple": model system components following their high-level design algorithms
    • "comprehensive": consider wide ranges of system configurations and workload conditions
  • Model-driven anomaly characterization
    • discover performance anomalies (discrepancies between model predictions and measured performance)
    • characterize them and attribute them to possible causes
  • What can you do with the anomaly characterizations?
    • make the system perform better and more predictably through debugging
    • identify problematic settings for avoidance

  4. Operating System Support for Disk I/O-Bound Online Servers
  • Disk I/O-bound online servers
    • server processing accesses large disk-resident data
    • examples: Web servers serving large Web data, index searching, database-driven server systems
    • complex workload characteristics affect performance
  • Operating system support
    • I/O prefetching
    • disk I/O scheduling (elevator, anticipatory, …)
    • file system layout and metadata management
    • memory caching

  5. A "Simple" Yet "Comprehensive" Throughput Model
  • Decompose the complex system into weakly coupled sub-components (layers)
  • Each layer transforms the workload and alters the I/O throughput
  • Consider wide ranges of workloads and server concurrency levels
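To make the layered-model idea concrete, here is a hypothetical sketch in Python: each layer transforms a workload description, and the final layer maps it to a throughput prediction. All names, parameters, and formulas below are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of a layered throughput model: each OS layer
# transforms the workload description; the last layer maps it to a
# predicted I/O throughput. Names and formulas are illustrative only.
from dataclasses import dataclass

@dataclass
class Workload:
    concurrency: int        # concurrent server request streams
    stream_len_kb: float    # sequential run length per stream (KB)
    think_time_ms: float    # application think time between I/Os

def prefetch_layer(w: Workload, depth_kb: float = 128.0) -> Workload:
    """Prefetching enlarges the effective sequential run length."""
    return Workload(w.concurrency, max(w.stream_len_kb, depth_kb), w.think_time_ms)

def disk_layer(w: Workload, seek_ms: float = 8.0,
               transfer_mb_s: float = 50.0) -> float:
    """Predict throughput assuming one seek per stream switch followed
    by a sequential transfer of stream_len_kb (simplistic disk model)."""
    transfer_ms = w.stream_len_kb / transfer_mb_s / 1024.0 * 1000.0
    # With high concurrency, think time overlaps with other streams' I/O.
    per_switch_ms = seek_ms + transfer_ms + w.think_time_ms / max(w.concurrency, 1)
    return (w.stream_len_kb / 1024.0) / (per_switch_ms / 1000.0)  # MB/s

def predict(w: Workload) -> float:
    return disk_layer(prefetch_layer(w))

print(predict(Workload(concurrency=128, stream_len_kb=256, think_time_ms=0)))
```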

  6. Model-Driven Anomaly Characterization
  • An OS implementation may deviate from model predictions
    • over-simplification, mishandling of special cases, …
    • a "performance bug" may manifest only under specific system configurations or workload conditions

  7. Parameter Sampling
  • We choose a set of system configurations and workload properties at which to check for performance anomalies
  • Sample parameters are chosen from a multi-dimensional parameter space (e.g., system configurations x and y, workload property z)
  • If we choose samples randomly and independently, the chance of missing a bug decreases exponentially as the sample number increases
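Why exponential: a bug region covering fraction p of the parameter space is missed by all n independent, uniformly drawn samples with probability (1 − p)^n. A quick arithmetic check (the value p = 0.02 is an assumed illustration, not from the study):

```python
# Probability that n independent uniform samples all miss a bug region
# covering fraction p of the parameter space: (1 - p)^n.
p = 0.02                            # bug manifests in 2% of the space (illustrative)
for n in (50, 100, 200, 400):
    print(n, (1 - p) ** n)          # ~0.364, ~0.133, ~0.018, ~0.0003
```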

  8. Sampling Parameter Space
  • Workload properties
    • server concurrency
    • I/O access pattern
    • application inter-I/O think time
  • OS configurations
    • prefetching: enabled (with a prefetching depth) or disabled
    • I/O scheduling: elevator or anticipatory
    • memory caching: enabled or disabled
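A minimal sketch of drawing random, independent samples from the parameter space enumerated above; the concrete value domains below are assumptions for illustration, not the study's exact settings.

```python
import random

# Parameter space from the slide; value ranges are assumed for illustration.
SPACE = {
    "concurrency":       [1, 2, 4, 8, 16, 32, 64, 128, 256],
    "access_pattern":    ["sequential", "random", "mixed"],
    "think_time_ms":     [0.0, 1.0, 5.0, 10.0],
    "prefetch_depth_kb": [0, 32, 64, 128, 256],   # 0 = prefetching disabled
    "io_scheduler":      ["elevator", "anticipatory"],
    "caching":           [True, False],
}

def sample_settings(n: int, seed: int = 0):
    """Draw n independent samples, one random choice per dimension."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in SPACE.items()} for _ in range(n)]

samples = sample_settings(400)   # the study used 400 samples
```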

  9. Anomaly Clustering
  • Anomalous settings may be due to multiple causes (bugs)
    • hard to draw conclusions from all anomalous settings at once
    • desirable to cluster anomalous settings into groups, each likely attributable to an individual cause
  • Existing clustering algorithms (EM, K-means) do not handle cross-intersected clusters
  • We perform hyper-rectangle clustering in the parameter space (dimensions: system configurations x and y, workload property z)
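The paper's actual clustering algorithm is not reproduced here; below is a greedy sketch of the hyper-rectangle idea under simplifying assumptions (numeric dimensions only, a fixed purity threshold): grow an axis-aligned box around an anomalous point for as long as the box stays mostly anomalous.

```python
# Greedy hyper-rectangle clustering sketch: repeatedly take an uncovered
# anomalous point and grow an axis-aligned box around it while the box
# remains "mostly anomalous". Illustrative only.

def bounding_box(points):
    dims = len(points[0])
    return [(min(p[d] for p in points), max(p[d] for p in points))
            for d in range(dims)]

def inside(p, box):
    return all(lo <= x <= hi for x, (lo, hi) in zip(p, box))

def cluster(anomalous, normal, purity=0.9):
    clusters, remaining = [], list(anomalous)
    while remaining:
        members = [remaining.pop()]
        grew = True
        while grew:
            grew = False
            for q in list(remaining):
                box = bounding_box(members + [q])
                bad = sum(inside(p, box) for p in normal)
                good = sum(inside(p, box) for p in anomalous)
                if good / (good + bad) >= purity:   # box still mostly anomalous
                    members.append(q)
                    remaining.remove(q)
                    grew = True
        clusters.append(bounding_box(members))
    return clusters
```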

  10. Anomaly Characterization
  • It is hard to derive useful debugging information from a raw group of anomalous settings
    • succinct characterizations are desirable
  • Characterization is easy after hyper-rectangle clustering
    • simply project the hyper-rectangle onto all dimensions
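Once a cluster is an axis-aligned hyper-rectangle, the projection step is just its per-dimension range; a sketch (dimension names hypothetical):

```python
def characterize(box, dim_names):
    """Project a hyper-rectangle cluster onto each dimension, yielding a
    succinct 'range per parameter' description."""
    return {name: (lo, hi) for name, (lo, hi) in zip(dim_names, box)}

# Might yield something like the characterization on slide 13, e.g.
# {'concurrency': (128, 256), 'stream_len_kb': (256, 1024),
#  'prefetching': (1, 1)}   # 1 = enabled
```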

  11. Experimental Setup
  • A micro-benchmark that can be configured to exhibit any desired workload pattern
  • Linux 2.6.10 kernel
  • Workflow: parameter sampling (400 samples) → anomaly clustering and characterization → for each possible bug, human debugging (assisted by a kernel tracing tool)
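A hedged sketch of what such a configurable micro-benchmark might look like (parameters and file layout are placeholders; a real run would also bypass or flush the page cache when caching is disabled):

```python
import os, random, threading, time

def worker(path, stream_len, think_time_s, duration_s, chunk=4096):
    """One server-like request stream: sequential runs of stream_len
    bytes at random offsets, with think time between runs."""
    size = os.path.getsize(path)
    end = time.time() + duration_s
    with open(path, "rb") as f:
        while time.time() < end:
            f.seek(random.randrange(0, max(size - stream_len, 1)))
            remaining = stream_len
            while remaining > 0:
                data = f.read(min(chunk, remaining))
                if not data:
                    break
                remaining -= len(data)
            time.sleep(think_time_s)

def run(path, concurrency, stream_len, think_time_s, duration_s=30):
    """Emulate a disk I/O-bound server with the given concurrency."""
    threads = [threading.Thread(target=worker,
                                args=(path, stream_len, think_time_s, duration_s))
               for _ in range(concurrency)]
    for t in threads: t.start()
    for t in threads: t.join()
```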

  12. Result – Top 50 Model/Measurement Errors out of 400 Samples
  • Error defined as: 1 − (measured throughput / model-predicted throughput)
  • [Figure: model/measurement error (0%–100%) over the sample parameter settings ranked on error; one curve each for original Linux 2.6.10 and for cumulative fixes #1; #1, #2; #1, #2, #3; and #1, #2, #3, #4]
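Using the error definition above, ranking the samples is straightforward; a sketch assuming a hypothetical `results` list of (setting, measured, predicted) tuples:

```python
def model_error(measured, predicted):
    """Error = 1 - measured/predicted, i.e. how far the system falls
    short of the model (slide 12's definition, as reconstructed here)."""
    return 1.0 - measured / predicted

def top_errors(results, k=50):
    """Rank sampled settings by model/measurement error; inspect the
    worst k (the study examined the top 50 of 400 samples)."""
    ranked = sorted(results, key=lambda r: model_error(r[1], r[2]),
                    reverse=True)
    return ranked[:k]
```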

  13. Result – Anomaly #1
  • Characterization
    • workload property: concurrency 128 and above; stream length 256KB and above
    • system configuration: prefetching enabled
  • The cause
    • when the disk queue is "congested", prefetching is cancelled
    • however, prefetching sometimes includes synchronously requested data, which is then resubmitted as single-page "makeup" I/O
  • Solutions
    • do not cancel prefetching that includes synchronously requested data
    • or block reads when the disk queue is "congested"

  14. Result – Anomaly #2, #3, #4
  • Anomaly #2
    • concerns the anticipatory I/O scheduler
    • it uses the average seek distance of past requests to estimate seek time
  • Anomaly #3
    • concerns the elevator I/O scheduler
    • it always searches from block address 0 for the next request after a "reset"
  • Anomaly #4
    • concerns the anticipatory I/O scheduler
    • a large I/O operation is often split into small disk requests, and the anticipation timer is started after the first disk request returns

  15. Result – Overall Predictability
  • [Figure: two panels, "Original Linux 2.6.10" and "After four bug fixes", each plotting model-predicted and measured I/O throughput (0–35 MB/sec) over the ranked sample parameter settings]

  16. Support for Real Applications
  • Index searching from the Ask Jeeves search engine
    • search workload following a 2002 Ask Jeeves trace
    • anticipatory I/O scheduler
  • Apache Web server
    • media-clips workload following the IBM 1998 World Cup trace
    • elevator I/O scheduler
  • [Figure: I/O throughput (MB/sec) vs. server concurrency (1–256) for each application, comparing original Linux 2.6.10 against cumulative bug fixes: #1; #1, #2; #1, #2, #4 for index searching, and #1; #1, #3 for Apache]

  17. Related Work
  • I/O system performance modeling
    • storage devices [Ruemmler & Wilkes 1994] [Kotz et al. 1994] [Worthington et al. 1994] [Shriver et al. 1998] [Uysal et al. 2001]
    • OS I/O subsystem [Cao et al. 1995] [Shenoy & Vin 1998] [Shriver et al. 1999]
  • Performance debugging
    • fine-grain system instrumentation & simulation [Goldberg & Hennessy 1993] [Rosenblum et al. 1997]
    • analyzing online traces [Chen et al. 2002] [Aguilera et al. 2003]
  • Correctness (non-performance) debugging
    • code analysis [Engler et al. 2001] [Li et al. 2004]
    • configuration debugging [Nagaraja et al. 2004] [Wang et al. 2004]

  18. Summary
  • Model-driven anomaly characterization
    • a systematic approach to assist performance debugging of complex systems over wide ranges of runtime conditions
  • For disk I/O-bound online servers, we discovered several performance bugs in the Linux 2.6.10 kernel
  • Linux 2.6.10 kernel patch for bug fix #1 available at http://www.cs.rochester.edu/~cli/Publication/patch1.htm
