
Diagnosing Failures

Diagnosing Failures. Dagstuhl, 2010-10-28. Kenny Wong, U Alberta. Motivation: timely problem determination is needed when failures happen. The goal is to diagnose the root cause of a failure in an enterprise system. How can the likely causes be determined?

Presentation Transcript


  1. Diagnosing Failures. Dagstuhl, 2010-10-28. Kenny Wong, U Alberta

  2. Motivation
     • Need timely problem determination when failures happen
       • diagnose the root cause of a failure in an enterprise system
       • how to determine likely causes?
       • e.g., a slow report might mean a failing disk, network congestion, too many database entries, a revoked access right, etc.
       • often a manually intensive process across many sources of information

  3. Motivation
     • Various approaches to assisted problem determination
       • rule-based
       • model-based
       • impact and dependency analysis
     • Issues
       • how to populate?
       • how to evolve or refine?

  4. Approach
     • Idea
       • most failures have recurrent causes
       • offline: inject different kinds of faults on the system, consider the different effects on the resulting log files, produce a classifier using supervised learning
       • online: when a failure happens, run classifier on log files to suggest cause
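The offline/online split on this slide can be sketched as follows. This is a minimal illustration, not the method from the talk: it swaps in a small naive Bayes model over log tokens as a stand-in classifier, and the fault labels and log lines are made up for the example.

```python
import math
from collections import Counter, defaultdict


def tokens(log_text):
    """Split a log into lowercase whitespace-separated tokens."""
    return log_text.lower().split()


def train(labeled_logs):
    """Offline step: labeled_logs is a list of (fault_label, log_text)
    pairs, one per fault-injection run (hypothetical input format)."""
    counts = defaultdict(Counter)  # fault -> token frequencies
    runs = Counter()               # fault -> number of training runs
    for fault, text in labeled_logs:
        counts[fault].update(tokens(text))
        runs[fault] += 1
    vocab = {t for c in counts.values() for t in c}
    return counts, runs, vocab


def suggest_cause(model, log_text):
    """Online step: score each known fault against a new log and
    return the most likely one (naive Bayes with add-one smoothing)."""
    counts, runs, vocab = model
    total_runs = sum(runs.values())
    scores = {}
    for fault in counts:
        total = sum(counts[fault].values())
        score = math.log(runs[fault] / total_runs)
        for t in tokens(log_text):
            if t in vocab:  # ignore tokens never seen in training
                score += math.log((counts[fault][t] + 1) / (total + len(vocab)))
        scores[fault] = score
    return max(scores, key=scores.get)


# Made-up training data standing in for logs from injected faults.
model = train([
    ("blocked_db_port", "connection refused to database port 1527"),
    ("full_disk", "write failed no space left on device"),
])
cause = suggest_cause(model, "database connection refused")
```

The model is built once from the fault-injection runs and then reused at failure time, which is the point of the offline/online separation.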

  5. Approach
     • Challenges
       • what are all the possible (known) faults?
       • how to perturb the system with certain faults? (e.g., failing disk, flaky memory)
       • manifestation of a fault may take some time
       • processing large log files
       • what kind of classifier?
       • how to deal with high-dimensionality data?

  6. Approach
     • Log abstraction technique
       • each log entry is not arbitrary text, but has structure, e.g.:
         Connection from 192.168.1.137 port 8080
       • what part seems fixed? what part seems varying?
       • discover log entry patterns
       • use SLCT (clustering) algorithm by Vaarandi
       • turn log files into sequences of log entry pattern occurrences
       • ~280K log entries -> ~800 patterns
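The fixed-versus-varying idea can be illustrated with a simplified sketch in the spirit of SLCT. It is not the actual SLCT implementation: it only keeps words that are frequent at a given position and replaces the rest with a wildcard, and it omits SLCT's cluster-candidate step. Only the first log line comes from the slide; the others and the `min_support` value are assumptions for the example.

```python
from collections import Counter


def abstract_logs(lines, min_support):
    """Turn each log line into a pattern: words that occur at the same
    position in at least min_support lines are kept as fixed text,
    all other words become the wildcard '*'."""
    # Pass 1: count how often each word appears at each position.
    word_counts = Counter()
    for line in lines:
        for pos, word in enumerate(line.split()):
            word_counts[(pos, word)] += 1

    # Pass 2: rewrite every line using only the frequent words.
    patterns = []
    for line in lines:
        words = line.split()
        patterns.append(" ".join(
            w if word_counts[(i, w)] >= min_support else "*"
            for i, w in enumerate(words)
        ))
    return patterns


lines = [
    "Connection from 192.168.1.137 port 8080",  # from the slide
    "Connection from 192.168.1.20 port 8080",   # made up
    "Connection from 10.0.0.5 port 443",        # made up
]
patterns = abstract_logs(lines, min_support=3)
distinct = sorted(set(patterns))
```

With these three lines and `min_support=3`, all entries collapse into the single pattern `Connection from * port *`, which is the kind of reduction the slide describes at a much larger scale.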

  7. Approach
     • Associative classifiers
       • generated classification rules relate effect <- cause
         pattern: org.apache.derby.client.net.NetConnection40.setClientInfo (Unknown Source)
         <- injected fault: blocked database port [with probability 0.81]
       • more rules …
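One way such rules could be derived is sketched below. The slide does not say how the probability is computed, so this sketch assumes it is the usual association-rule confidence: the fraction of runs containing a pattern that were produced under a given injected fault. The pattern names (`P1`, `P2`, `P3`), fault labels, and thresholds are hypothetical.

```python
from collections import Counter


def mine_rules(runs, min_support=2, min_confidence=0.6):
    """runs: list of (fault_label, set_of_patterns), one per
    fault-injection run. Returns (pattern, fault, confidence) rules,
    highest confidence first."""
    pattern_count = Counter()  # runs in which a pattern occurs
    pair_count = Counter()     # runs in which pattern and fault co-occur
    for fault, patterns in runs:
        for p in set(patterns):
            pattern_count[p] += 1
            pair_count[(p, fault)] += 1

    rules = []
    for (p, fault), n in pair_count.items():
        confidence = n / pattern_count[p]
        if n >= min_support and confidence >= min_confidence:
            rules.append((p, fault, confidence))
    return sorted(rules, key=lambda r: -r[2])


# Made-up runs: each lists the log entry patterns seen under a fault.
runs = [
    ("blocked_db_port", {"P1", "P2"}),
    ("blocked_db_port", {"P1"}),
    ("full_disk", {"P2", "P3"}),
    ("full_disk", {"P3"}),
    ("blocked_db_port", {"P1", "P3"}),
]
rules = mine_rules(runs)
for pattern, fault, confidence in rules:
    print(f"pattern: {pattern} <- injected fault: {fault} "
          f"[with probability {confidence:.2f}]")
```

Rules below the support or confidence thresholds are dropped, which keeps the rule set small enough to inspect by hand.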

  8. Early Experiments
     • L. Huang et al., “Symptom-Based Problem Determination Using Log Data Abstraction”, CASCON 2010.

  9. Ending Thoughts
     • Recognizing “normal” versus failing operation
       • train also with no-fault logs
       • log-file-based “invariants” as classification rules
       • use classifier continuously (not only at failure time)
     • Going beyond log files to other system status
       • discover invariants of a system in terms of what it requires to run correctly (e.g., filesystem state, accounts, open ports, resources)
       • run “preflight” checks before and “inflight” checks during run time (V & V)
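A “preflight” check of system invariants might look like the sketch below. It is only an illustration of the idea: the helper names, paths, and thresholds are hypothetical, and the invariants are hand-written here rather than discovered automatically as the slide proposes.

```python
import os
import shutil
import socket


def path_exists(path):
    """Invariant: a required file or directory is present."""
    return lambda: os.path.exists(path)


def min_free_bytes(path, needed):
    """Invariant: the filesystem holding path has enough free space."""
    return lambda: shutil.disk_usage(path).free >= needed


def port_open(host, port, timeout=1.0):
    """Invariant: a TCP service is reachable at host:port."""
    def check():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
    return check


def run_checks(checks):
    """Run (name, check) pairs and return the names of the
    invariants that do not hold."""
    return [name for name, check in checks if not check()]


# Hypothetical invariants for illustration only.
failed = run_checks([
    ("working directory exists", path_exists(".")),
    ("config file present", path_exists("/no/such/path/app.conf")),
    ("at least 1 byte free", min_free_bytes(".", 1)),
])
```

The same `run_checks` call could be repeated periodically during operation to serve as the “inflight” variant.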
