Failure Data Collection and Analysis

Archana Ganapathi, Peter Bodik, Wei Xu

Motivation (1): My machine crashes… Since 3/1/04: • 3 system crashes • 18 application errors • 96 application hangs Who cares? • I do! • People who share similar experiences • In general, customer uproar

Motivation (2): An Internet service has failures… Who cares? • Internet service users • Internet service system administrators • Anyone affected by the Internet service's loss of revenue Total: 61 user-visible failures in 12 months at Online Service

Motivation (3) • ROC/RADS needs real failure/attack information to: • drive benchmarks • evaluate our prototypes • help us select which problems to work on

Data Sources • 1000s of individual machines • Cory/Soda Hall, BOINC • Large clusters at real Internet services • Internet services • Distributed applications on 100s of machines • PlanetLab

Individual Machines

Data Collection • Collect minidumps that contain: • The Stop message, parameters, and data • Loaded drivers • Processor context for the processor that stopped • Process info and kernel context for the process/thread that stopped • The kernel-mode call stack for the thread that stopped • Frequency of collection: synchronized with application and system crashes on the monitored computers

Analysis results • What happened that is immediately responsible for the crash • Exact error code • Brief description, primarily for debugging • Bucketing info, e.g. "driver fault" • Details for debugging, e.g. stack contents • Use Microsoft’s publicly available analysis tools • Caveat: results vary significantly between the internal and public versions of the tool!
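To make the "bucketing info" bullet concrete, here is a minimal sketch of how crash records might be grouped into coarse buckets by faulting module name. The bucket labels other than "driver fault", the function names, and the extension-based rule are all assumptions made for illustration; this is not how Microsoft's analysis tools assign buckets.

```python
from collections import Counter


def bucket_for(module: str) -> str:
    """Map a faulting module name to a coarse bucket (illustrative rule only)."""
    name = module.lower()
    if name in ("memory_corruption", "pool_corruption"):
        return "memory corruption"
    if name.endswith((".sys", ".dll")):
        # Assumption: kernel-mode .sys/.dll modules are treated as drivers.
        return "driver fault"
    if name.endswith(".exe"):
        return "kernel image"
    return "unknown"


def bucket_counts(modules):
    """Count how many crash records fall into each bucket."""
    return Counter(bucket_for(m) for m in modules)
```

For example, `bucket_counts(["watchdog.sys", "ntoskrnl.exe", "Pool_Corruption"])` yields one crash in each of three buckets.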

How we collect minidumps (1): Corporate Error Reporting (CER) http://www.microsoft.com/resources/satech/cer/ • Manages error reports and messages generated by Windows Error Reporting (WER) and other programs • Configure clients to redirect reports to the CER shared directory

Sample Statistics (25 nodes, 5 days)

How we collect minidumps (2): BOINC • For SETI@home-esque apps that pool resources • Provides a client API to send/receive data to/from the BOINC server • Write tools to read info in the minidump directory and send it to us
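The "read info in the minidump directory" step could look roughly like the sketch below. It only enumerates dump files and gathers basic metadata; the `*.dmp` pattern, the function name, and the record fields are assumptions, and the upload step through the BOINC client API is deliberately left out rather than guessed at.

```python
from pathlib import Path


def list_minidumps(directory):
    """Return metadata for each *.dmp file in `directory`, sorted by file name.

    The record layout here is hypothetical; a real tool would also parse
    the dump contents before sending anything to a server.
    """
    records = []
    for path in sorted(Path(directory).glob("*.dmp")):
        stat = path.stat()
        records.append({
            "name": path.name,
            "size_bytes": stat.st_size,
            "modified": stat.st_mtime,  # seconds since the epoch
        })
    return records
```

A collector could call `list_minidumps` periodically and forward only the records it has not seen before.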


Sample Statistics (50 system crashes) • watchdog.sys 7 • ar5211.sys 6 • ibmpmdrv.sys 6 • ati3duag.dll 5 • SYMEVENT.SYS 3 • ipsecw2k.sys 3 • memory_corruption 3 • ialmdev5.DLL 2 • PSCRIPT4.DLL 2 • ntoskrnl.exe 2 • CLASSPNP.SYS 2 • win32k.sys 2 • SynTP.sys 1 • TDI.SYS 1 • ino_fltr.sys 1 • ks.sys 1 • drvnddm.sys 1 • ntkrnlmp.exe 1 • Pool_Corruption 1
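As a quick check on the counts above, the following sketch totals them and computes how much of the sample the most frequent entries account for. The counts are copied from the slide; the function name `top_share` is only illustrative.

```python
# Crash counts per faulting module or bucket, as listed on the slide.
crash_counts = {
    "watchdog.sys": 7, "ar5211.sys": 6, "ibmpmdrv.sys": 6, "ati3duag.dll": 5,
    "SYMEVENT.SYS": 3, "ipsecw2k.sys": 3, "memory_corruption": 3,
    "ialmdev5.DLL": 2, "PSCRIPT4.DLL": 2, "ntoskrnl.exe": 2,
    "CLASSPNP.SYS": 2, "win32k.sys": 2, "SynTP.sys": 1, "TDI.SYS": 1,
    "ino_fltr.sys": 1, "ks.sys": 1, "drvnddm.sys": 1, "ntkrnlmp.exe": 1,
    "Pool_Corruption": 1,
}

total = sum(crash_counts.values())  # matches the 50 crashes in the slide title


def top_share(counts, n):
    """Fraction of all crashes attributed to the n most frequent entries."""
    ranked = sorted(counts.values(), reverse=True)
    return sum(ranked[:n]) / sum(ranked)
```

Here `top_share(crash_counts, 3)` is 19/50, so the three most frequent entries account for 38% of the sampled crashes.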

Metrics (Windows & Linux) • Availability: system uptime, % time BOINC running • CPU(s): # processes, processor queue length, % non-idle • Memory: available physical memory, free swap space • Disk(s): free space • Network(s): IP address, packets and bytes sent and received per second, bandwidth to/from SETI@home server, first-hop bandwidth*, network coordinates* • Static: CPU type, #, and benchmarks; total memory; OS type
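A small subset of these metrics can be sampled with the Python standard library alone, as in the sketch below. It covers only OS type, CPU count, and disk space; uptime, memory, and network counters need platform-specific sources and are omitted. The function name and the field names are assumptions chosen for illustration.

```python
import os
import platform
import shutil
import time


def collect_snapshot(path="."):
    """Sample a few portable metrics for the volume that contains `path`."""
    usage = shutil.disk_usage(path)
    return {
        "timestamp": time.time(),         # when the sample was taken
        "os_type": platform.system(),     # e.g. "Windows" or "Linux"
        "cpu_count": os.cpu_count(),      # logical CPUs, may be None
        "disk_total_bytes": usage.total,
        "disk_free_bytes": usage.free,
    }
```

Calling `collect_snapshot()` at a fixed interval and appending the results to a log would give a simple time series per machine.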

Questions • Other metrics? • Frequency with which to measure them? • What research questions can we answer with this data set? • Original goal: workload to evaluate our node discovery service • Evaluate effectiveness of network coordinates • Evaluate potential to run more than just “embarrassingly parallel” apps on this type of infrastructure, depending on machines’ uptime, network connectivity, and available disk space • Distributed analysis? • Security uses?

Internet Services

Data characteristics • Real companies • Multitude of users • Voluminous data (several terabytes) • Systems are complex • Treat as black box • Use SLT (statistical learning theory) algorithms for analysis • More data => better models

Analysis Results • Study event logs • Not necessarily failures • Can derive models of good & bad behavior • Models with varying granularity • Use different algorithms • Vary boundary parameters • For more details see poster: “Towards a General Approach for Event Log Analysis”
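One very simple way to picture a "model of good behavior" with an adjustable boundary parameter is a per-window event count baseline, sketched below. This is an illustrative stand-in, not the algorithms used in the study or described in the poster; the function name and the parameter `k` are assumptions.

```python
from statistics import mean, pstdev


def flag_windows(counts, k=2.0):
    """Return indices of time windows whose event count is unusual.

    `counts` holds the number of events logged in each time window.
    A window is flagged when its count lies more than `k` population
    standard deviations from the mean; `k` is the boundary parameter.
    """
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []  # all windows identical, nothing stands out
    return [i for i, c in enumerate(counts) if abs(c - mu) > k * sigma]
```

Raising `k` makes the model more permissive: with counts `[10, 11, 9, 10, 12, 10, 50]`, the last window is flagged at `k=2.0` but not at `k=3.0`.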

Distributed Apps

PlanetLab • “An open platform for developing, deploying, and accessing planetary-scale services” • 392 nodes at 164 sites around the world • Per-site system administration • Applications: OceanStore, PIER

Why? • Platform for injecting faults and testing our algorithms • Applications run in a RADS-like environment • Research platform • More accessible • University-developed apps are most likely to be tested on PlanetLab

Applications 1) OceanStore • Global persistent data store. • In the process of running prototype on PlanetLab • Good source of failure data 2) PIER • Distributed query processor • Currently running on PlanetLab • Good source of failure data + analysis engine

What do we do with these apps? • Instrument applications to collect any type of information • Choice of granularity • Open source - no longer black box • Can modify it as much as necessary

Questions • What other applications can we use? • What should we measure and model? • What information is useful for industry? • Do you have any failure/attack data you are willing to share with us?
