Diagnosing Failures
Dagstuhl 2010-10-28
Kenny Wong, U Alberta
Motivation
• Need timely problem determination when failures happen
  • diagnose the root cause of a failure in an enterprise system
  • how to determine likely causes?
  • e.g., a slow report might mean a failing disk, network congestion, too many database entries, a revoked access right, etc.
  • often a manually intensive process across many sources of information
October 28, 2010
Motivation
• Various approaches to assisted problem determination
  • rule-based
  • model-based
  • impact and dependency analysis
• Issues
  • how to populate?
  • how to evolve or refine?
Approach
• Idea
  • most failures have recurrent causes
  • offline: inject different kinds of faults into the system, consider the different effects on the resulting log files, produce a classifier using supervised learning
  • online: when a failure happens, run the classifier on the log files to suggest a cause
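The offline/online split on this slide can be sketched in a few lines. This is only an illustration: it assumes each fault-injection run has already been reduced to a list of log-entry pattern IDs, and it uses a smoothed frequency score (in the spirit of naive Bayes, ignoring class priors) as a stand-in for whichever supervised learner is actually chosen. The fault labels, pattern IDs, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_runs):
    """Offline step: count pattern occurrences per injected fault.

    labeled_runs is a list of (fault_label, [pattern_id, ...]) pairs,
    one per fault-injection run (hypothetical input format).
    """
    counts = defaultdict(Counter)
    for fault, patterns in labeled_runs:
        counts[fault].update(patterns)
    vocab = {p for c in counts.values() for p in c}
    return counts, vocab


def suggest_cause(model, patterns):
    """Online step: rank fault labels by add-one-smoothed log-likelihood."""
    counts, vocab = model
    scores = {}
    for fault, c in counts.items():
        total = sum(c.values()) + len(vocab)
        scores[fault] = sum(
            math.log((c[p] + 1) / total) for p in patterns if p in vocab
        )
    return sorted(scores, key=scores.get, reverse=True)


# Made-up training runs: pattern IDs seen in the logs after each injected fault.
runs = [
    ("blocked_db_port", ["P17", "P17", "P03"]),
    ("full_disk", ["P42", "P42", "P03"]),
]
model = train(runs)

# Patterns observed in the logs of a new failure; most likely cause comes first.
ranking = suggest_cause(model, ["P17", "P03"])
```

The ranking is a suggestion for the person diagnosing the failure, not a verdict; patterns never seen during training are simply skipped.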
Approach
• Challenges
  • what are all the possible (known) faults?
  • how to perturb the system with certain faults? (e.g., failing disk, flaky memory)
  • manifestation of a fault may take some time
  • processing large log files
  • what kind of classifier?
  • how to deal with high-dimensional data?
Approach
• Log abstraction technique
  • each log entry is not arbitrary text, but has structure
    Connection from 192.168.1.137 port 8080
  • what part seems fixed? what part seems varying?
  • discover log entry patterns
  • use SLCT (clustering) algorithm by Vaarandi
  • turn log files into sequences of log entry pattern occurrences
  • ~280K log entries reduced to ~800 patterns
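A rough sketch of the fixed-versus-varying idea follows. It is a simplified two-pass scheme in the spirit of SLCT, not a reimplementation of Vaarandi's tool: words that occur often enough at a given position are treated as fixed, everything else becomes a wildcard. Apart from the first log line, which comes from the slide, the sample entries and the `support` threshold are made up for illustration.

```python
from collections import Counter


def discover_patterns(lines, support=2):
    """Return {pattern: count} for patterns seen at least `support` times."""
    tokenized = [line.split() for line in lines]

    # Pass 1: count how often each word occurs at each position.
    word_counts = Counter(
        (i, w) for toks in tokenized for i, w in enumerate(toks)
    )

    # Pass 2: keep frequent (position, word) pairs, wildcard the rest,
    # and count how many entries collapse onto each candidate pattern.
    candidates = Counter(
        " ".join(
            w if word_counts[(i, w)] >= support else "*"
            for i, w in enumerate(toks)
        )
        for toks in tokenized
    )

    # Only candidates that are themselves frequent count as patterns.
    return {p: n for p, n in candidates.items() if n >= support}


logs = [
    "Connection from 192.168.1.137 port 8080",  # from the slide
    "Connection from 192.168.1.20 port 8080",   # made-up
    "Connection from 10.0.0.5 port 443",        # made-up
    "Disk /dev/sda1 almost full",               # made-up
]
patterns = discover_patterns(logs, support=2)
# patterns == {"Connection from * port 8080": 2}
```

Note that the port `8080` stays fixed because it recurs, while the varying address is abstracted away. Entries that match no frequent pattern are dropped here; a fuller version would keep them as outliers.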
Approach
• Associative classifiers
  • generated classification rules relate effect <- cause
    pattern: org.apache.derby.client.net.NetConnection40.setClientInfo (Unknown Source)
    <- injected fault: blocked database port
    [with probability 0.81]
    more rules …
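One way to read the probability attached to a rule is as its confidence: of the runs whose logs contain the pattern, the fraction that had that fault injected. The sketch below mines single-pattern rules under that assumption. The runs, pattern IDs, fault labels, and thresholds are hypothetical, and real associative classifiers typically also mine multi-item rules and prune them.

```python
from collections import Counter


def mine_rules(runs, min_support=2, min_confidence=0.6):
    """Return (pattern, fault, confidence) rules, highest confidence first.

    runs is a list of (fault_label, set_of_pattern_ids) pairs.
    """
    pattern_count = Counter()  # runs in which a pattern appears
    pair_count = Counter()     # runs in which pattern and fault co-occur
    for fault, patterns in runs:
        for p in set(patterns):
            pattern_count[p] += 1
            pair_count[(p, fault)] += 1

    rules = []
    for (p, fault), n in pair_count.items():
        confidence = n / pattern_count[p]
        if n >= min_support and confidence >= min_confidence:
            rules.append((p, fault, confidence))
    return sorted(rules, key=lambda r: r[2], reverse=True)


# Made-up fault-injection runs.
runs = [
    ("blocked_db_port", {"P17", "P03"}),
    ("blocked_db_port", {"P17"}),
    ("blocked_db_port", {"P17", "P42"}),
    ("full_disk", {"P42", "P03"}),
    ("full_disk", {"P42", "P17"}),
]
rules = mine_rules(runs)
# P17 appears in 4 runs, 3 of them with a blocked database port: confidence 0.75.
```

The support threshold keeps rules backed by a single run from being reported, however confident they look.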
Early Experiments
L. Huang et al. “Symptom-Based Problem Determination Using Log Data Abstraction”, CASCON 2010.
Ending Thoughts
• Recognizing “normal” versus failing operation
  • train also with no-fault logs
  • log-file-based “invariants” as classification rules
  • use classifier continuously (not only at failure time)
• Going beyond log files to other system status
  • discover invariants of a system in terms of what it requires to run correctly (e.g., filesystem state, accounts, open ports, resources)
  • run “preflight” checks before and “inflight” checks during run time (V & V)
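As a small illustration of the preflight idea, invariants can be written as named checks and evaluated before the system starts; the same list could be re-run periodically as inflight checks. This is a sketch only: the helper names, the chosen invariants, and the nonexistent path are assumptions made for the example, and discovering the invariants automatically is the open question the slide raises.

```python
import os
import socket
import tempfile


def dir_writable(path):
    """Filesystem invariant: the directory exists and is writable."""
    return os.path.isdir(path) and os.access(path, os.W_OK)


def port_open(host, port, timeout=1.0):
    """Network invariant: something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(invariants):
    """Return the names of the invariants that do not hold."""
    return [name for name, check in invariants if not check()]


# Hypothetical invariants; a real list might also call port_open(...)
# for the database port, check accounts, free disk space, and so on.
invariants = [
    ("temp dir writable", lambda: dir_writable(tempfile.gettempdir())),
    ("data dir writable", lambda: dir_writable("/no/such/dir/hypothetical")),
]
failed = run_checks(invariants)
```

Here `failed` names the second invariant, because that path does not exist. Reporting a violated invariant before a run gives a candidate cause without any log analysis.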