Diagnosing Failures

Diagnosing Failures Dagstuhl 2010-10-28 Kenny Wong, U Alberta

Motivation
• Need timely problem determination when failures happen
  • diagnose the root cause of a failure in an enterprise system
  • how to determine likely causes?
    • e.g., a slow report might mean a failing disk, network congestion, too many database entries, a revoked access right, etc.
  • often a manually intensive process across many sources of information
October 28, 2010

Motivation
• Various approaches to assisted problem determination
  • rule-based
  • model-based
  • impact and dependency analysis
• Issues
  • how to populate?
  • how to evolve or refine?

Approach
• Idea
  • most failures have recurrent causes
  • offline: inject different kinds of faults on the system, consider the different effects on the resulting log files, produce a classifier using supervised learning
  • online: when a failure happens, run classifier on log files to suggest cause
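The offline/online split above can be sketched in a few lines. This is only an illustration under stated assumptions: each run is reduced to the set of log patterns it produced, the training data is hypothetical, and a 1-nearest-neighbour lookup by Jaccard overlap stands in for whatever supervised learner is actually used.

```python
def jaccard(a, b):
    """Overlap between two sets of log patterns (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def train(labelled_runs):
    """Offline step: keep the (patterns, injected_fault) pairs from fault-injection runs."""
    return list(labelled_runs)

def diagnose(model, observed_patterns):
    """Online step: suggest the fault of the most similar training run."""
    best_patterns, best_fault = max(model, key=lambda run: jaccard(run[0], observed_patterns))
    return best_fault

# Hypothetical training data: pattern sets seen after each injected fault.
model = train([
    ({"connection refused", "retry"}, "blocked database port"),
    ({"io error", "retry"}, "failing disk"),
    ({"heartbeat"}, "no fault"),
])

suggested = diagnose(model, {"io error"})
print(suggested)
```

A stored-examples classifier keeps the sketch short; any supervised learner that maps log-derived features to an injected fault label would fit the same two-phase shape.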

Approach
• Challenges
  • what are all the possible (known) faults?
  • how to perturb the system with certain faults? (e.g., failing disk, flaky memory)
  • manifestation of a fault may take some time
  • processing large log files
  • what kind of classifier?
  • how to deal with high dimensionality data?

Approach
• Log abstraction technique
  • each log entry is not arbitrary text, but has structure
    Connection from 192.168.1.137 port 8080
  • what part seems fixed? what part seems varying?
  • discover log entry patterns
    • use SLCT (clustering) algorithm by Vaarandi
  • turn log files into sequences of log entry pattern occurrences
    • ~280K log entries -> ~800 patterns
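To make the fixed-versus-varying idea concrete, here is a minimal sketch. It is not SLCT itself, only a simplified variant of its first idea: a word is treated as fixed if the same word occurs at the same position in at least `support` entries, and everything else becomes a `*` wildcard. The function name `abstract_logs`, the `support` parameter, and all log lines except the first are assumptions made for illustration.

```python
from collections import Counter

def abstract_logs(lines, support=2):
    """Replace words that are rare at their position with '*' (simplified, SLCT-inspired)."""
    tokenized = [line.split() for line in lines]
    # Count how often each word appears at each position across all entries.
    freq = Counter((i, w) for words in tokenized for i, w in enumerate(words))
    return [
        " ".join(w if freq[(i, w)] >= support else "*" for i, w in enumerate(words))
        for words in tokenized
    ]

# Hypothetical log entries; only the first comes from the slide.
logs = [
    "Connection from 192.168.1.137 port 8080",
    "Connection from 192.168.1.20 port 8080",
    "Connection from 10.0.0.5 port 443",
    "Disk sda1 read error",
    "Disk sdb2 read error",
]

patterns = abstract_logs(logs)       # one pattern per log entry, in order
pattern_counts = Counter(patterns)   # how often each pattern occurs
for pattern, count in pattern_counts.items():
    print(count, pattern)
```

The list `patterns` is the "sequence of log entry pattern occurrences" that later stages would consume. A fuller implementation would also discard patterns that themselves fall below the support threshold.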

Approach
• Associative classifiers
  • generated classification rules relate effect <- cause
    pattern: org.apache.derby.client.net.NetConnection40.setClientInfo (Unknown Source)
    <- injected fault: blocked database port
    [with probability 0.81]
    more rules …
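A rule of this shape can be read as "when this pattern appears, the injected fault was X in a given fraction of training runs". The sketch below computes that fraction (the rule confidence) for single-pattern rules and uses the rules to vote on a cause. It is a simplification rather than the associative classifier used in the experiments; the thresholds, function names, and training runs are all hypothetical.

```python
from collections import Counter, defaultdict

def mine_rules(runs, min_support=2, min_confidence=0.6):
    """runs: list of (set_of_patterns, fault). Returns {pattern: (fault, confidence)}."""
    pattern_count = Counter()
    pair_count = Counter()
    for patterns, fault in runs:
        for p in patterns:
            pattern_count[p] += 1
            pair_count[(p, fault)] += 1
    rules = {}
    for (p, fault), n in pair_count.items():
        confidence = n / pattern_count[p]  # fraction of runs with p that had this fault
        if n >= min_support and confidence >= min_confidence:
            # Keep only the most confident rule per pattern.
            if p not in rules or confidence > rules[p][1]:
                rules[p] = (fault, confidence)
    return rules

def suggest_cause(rules, observed_patterns):
    """Sum the confidences of matching rules per fault and return the top fault."""
    scores = defaultdict(float)
    for p in observed_patterns:
        if p in rules:
            fault, confidence = rules[p]
            scores[fault] += confidence
    return max(scores, key=scores.get) if scores else None

# Hypothetical fault-injection runs, already abstracted to pattern sets.
runs = [
    ({"setClientInfo", "heartbeat"}, "blocked database port"),
    ({"setClientInfo", "heartbeat"}, "blocked database port"),
    ({"setClientInfo"}, "blocked database port"),
    ({"setClientInfo", "disk read error"}, "failing disk"),
    ({"disk read error", "heartbeat"}, "failing disk"),
    ({"disk read error"}, "failing disk"),
]

rules = mine_rules(runs)
print(rules["setClientInfo"])
print(suggest_cause(rules, {"disk read error"}))
```

Because each rule names a concrete log pattern, a suggested cause can be traced back to the entries that triggered it.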

Early Experiments
L. Huang et al., “Symptom-Based Problem Determination Using Log Data Abstraction”, CASCON 2010.

Ending Thoughts
• Recognizing “normal” versus failing operation
  • train also with no-fault logs
  • log-file-based “invariants” as classification rules
  • use classifier continuously (not only at failure time)
• Going beyond log files to other system status
  • discover invariants of a system in terms of what it requires to run correctly (e.g., filesystem state, accounts, open ports, resources)
  • run “preflight” checks before and “inflight” checks during run time (V & V)
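A "preflight" check can be as simple as a named list of predicates over system state, evaluated before the system starts. The sketch below assumes that shape. The check names, the helper `port_open`, and the file name are hypothetical, and in the approach outlined above the invariants would be discovered rather than hand-written.

```python
import os
import socket
import tempfile

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def preflight(checks):
    """Run each (name, predicate) pair and return the names of the checks that fail."""
    return [name for name, check in checks if not check()]

# Hypothetical invariants expressed as predicates over filesystem state.
tmp = tempfile.gettempdir()
failed = preflight([
    ("temp dir exists", lambda: os.path.isdir(tmp)),
    ("required file present", lambda: os.path.isfile(os.path.join(tmp, "no-such-file-for-demo"))),
])
print(failed)
```

The same `preflight` function could be called periodically for "inflight" checks, and a predicate such as `lambda: port_open("localhost", 1527)` (host and port chosen only for illustration) would cover the open-ports case.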
