
Detailed and understandable network diagnosis


Presentation Transcript


1. Detailed and understandable network diagnosis. Ratul Mahajan, with Srikanth Kandula, Bongshin Lee, Zhicheng Liu (GaTech), Patrick Verkaik (UCSD), Sharad Agarwal, Jitu Padhye, Victor Bahl.

2. Network diagnosis explains faulty behavior. It starts with problem symptoms and ends at likely culprits. [Diagram: a user cannot access a remote folder via the photo viewer; the path leads through the file server to its configuration, where a configuration change denies permission.]

3. Current landscape of network diagnosis systems. [Figure: diagnosis systems plotted by network size; large ISPs and big enterprises are covered, while small enterprises are a question mark.]

4. Why study small enterprise networks separately? [Figure: alongside large ISPs and big enterprises, small enterprises run a diverse application mix, e.g., IIS, SQL, Exchange, …]

5. Our work
• Uncovers the need for detailed and understandable diagnosis
• Develops NetMedic for detailed diagnosis
• Diagnoses application faults without application knowledge
• Develops NetClinic for explaining diagnostic analysis

6. Understanding problems in small enterprises. We studied 100+ cases, characterizing their symptoms and root causes.

7. And the survey says: handle app-specific as well as generic faults, and identify culprits at a fine granularity. In short: detailed diagnosis.

8. Example problem 1: Server misconfig. [Diagram: two browsers talk to a web server; the server's configuration is the culprit.]

9. Example problem 2: Buggy client. [Diagram: SQL clients C1 and C2 send requests to a SQL server; one client's buggy behavior is the culprit.]

10. Example problem 3: Client misconfig. [Diagram: two Outlook clients talk to an Exchange server; one client's Outlook configuration is the culprit.]

11. Current formulations sacrifice detail (to scale)
• Dependency graph based formulations (e.g., Sherlock [SIGCOMM 2007])
• Model the network as a dependency graph at a coarse level
• Simple dependency model

12. Example problem 1 revisited: Server misconfig. [Diagram as before.] The network model is too coarse in current formulations to pinpoint the server's configuration.

13. Example problem 2 revisited: Buggy client. [Diagram as before.] The dependency model is too simple in current formulations to capture how one client's requests affect another through the shared server.

14. Example problem 3 revisited: Client misconfig. [Diagram as before.] The failure model is too simple in current formulations to cover misconfigurations like this one.

15. A formulation for detailed diagnosis. Model the network as a dependency graph of fine-grained components, where each component's state is a multi-dimensional vector. [Diagram: SQL clients C1 and C2 decomposed into process, OS, and config components, connected to the Exchange, IIS, and SQL servers and the IIS config.]
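As a rough sketch of what "component state is a multi-dimensional vector" could look like in code (the class and variable names below are hypothetical illustrations, not NetMedic's actual data model):

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """A fine-grained component (process, OS, config, server, ...)."""
    name: str                                   # e.g., "sql-client-C1"
    # history[t] is the multi-dimensional state vector at time step t,
    # mapping a state-variable name to its observed value
    history: list[dict[str, float]] = field(default_factory=list)

    def observe(self, snapshot: dict[str, float]) -> None:
        """Record one state snapshot, e.g., {"cpu": 0.3, "req_rate": 0.6}."""
        self.history.append(dict(snapshot))

# The dependency graph is a set of directed edges between such components,
# e.g., {("sql-client-C1", "sql-server"), ("iis-config", "iis-server")}.
```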

16. The goal of diagnosis: identify likely culprits for components of interest, without using the semantics of state variables, i.e., with no application knowledge. [Diagram: clients C1 and C2 decomposed into process, OS, and config components, connected to the server.]

17. Using joint historical behavior to estimate impact. To estimate how much a source component S impacts a destination D: identify time periods when the state of S was “similar” to its current state, then measure how “similar”, on average, the states of D were at those same times. [Diagram: high- and low-impact edges (H, L) among C1, C2, and the server.]
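A minimal sketch of this idea, assuming state vectors are dicts of values pre-normalized to [0, 1] and time-aligned histories; the similarity metric and the parameter k are my assumptions, not NetMedic's actual choices:

```python
def _similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """1 minus the mean absolute difference over shared variables
    (assumes values normalized to [0, 1]); 1.0 means identical."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    return 1.0 - sum(abs(a[k] - b[k]) for k in shared) / len(shared)

def edge_impact(src_hist, dst_hist, src_now, dst_now, k=5):
    """Joint-historical impact of source on destination: find the k past
    times when the source looked most like it does now, and report how
    much the destination looked like it does now at those same times."""
    if not src_hist:
        return 0.0
    times = sorted(range(len(src_hist)),
                   key=lambda t: _similarity(src_hist[t], src_now),
                   reverse=True)[:k]
    return sum(_similarity(dst_hist[t], dst_now) for t in times) / len(times)
```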

18. Robust impact estimation (a toy illustration of two of these heuristics follows this list)
• Ignore state variables that represent redundant info
• Place higher weight on state variables likely related to the fault being diagnosed
• Ignore state variables irrelevant to the interaction with the neighbor
• Account for aggregate relationships among state variables of neighboring components
• Account for disparate ranges of state variables
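To make two of these concrete, the sketch below normalizes each variable by its historical range (disparate ranges) and upweights variables whose current value is historically abnormal (likely fault-related). The formulas are my illustrations, not NetMedic's actual ones:

```python
import statistics

def variable_weights(history: list[dict[str, float]],
                     now: dict[str, float]) -> dict[str, float]:
    """Weight each state variable by how abnormal its current value is,
    after normalizing by the variable's own historical range."""
    weights: dict[str, float] = {}
    for var, value in now.items():
        past = [h[var] for h in history if var in h]
        if len(past) < 2:
            weights[var] = 0.0                    # too little history to judge
            continue
        span = (max(past) - min(past)) or 1.0     # guard constant variables
        deviation = abs(value - statistics.mean(past)) / span
        weights[var] = min(deviation, 1.0)        # 0 = typical, 1 = very unusual
    return weights
```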

19. Ranking likely culprits. [Diagram: a small dependency graph over components A, B, C, and D with per-edge impacts; a path's weight is the product of its edge impacts, each candidate's global impact on the affected component aggregates the weights of the paths from it, and candidates are ranked by global impact.]
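A sketch of this ranking step on a toy graph; whether the aggregation over paths is a sum, a max, or something else is not stated on the slide, so the sum below is an assumption:

```python
def path_weight(edge_impacts, path):
    """Weight of a path = product of the impacts of its edges."""
    w = 1.0
    for u, v in zip(path, path[1:]):
        w *= edge_impacts[(u, v)]
    return w

def rank_culprits(edge_impacts, paths_to_target):
    """Aggregate path weights into a per-source global impact and rank
    candidates from most to least likely culprit."""
    impact: dict[str, float] = {}
    for path in paths_to_target:
        impact[path[0]] = impact.get(path[0], 0.0) + path_weight(edge_impacts, path)
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: C reaches the target A directly (0.8) and via B (0.8 * 0.8),
# so it outranks B (0.8) and D (0.2).
edges = {("C", "A"): 0.8, ("C", "B"): 0.8, ("B", "A"): 0.8, ("D", "A"): 0.2}
paths = [("C", "A"), ("C", "B", "A"), ("B", "A"), ("D", "A")]
print(rank_culprits(edges, paths))   # [('C', 1.44), ('B', 0.8), ('D', 0.2)]
```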

20. Implementation of NetMedic. [Pipeline: a monitoring stage captures component states over time; given the component states, a diagnosis time, a reference time, and the target components, the diagnosis stage computes edge impacts and then path impacts, and outputs a ranked list of likely culprits.]
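Tying the earlier sketches together, a hypothetical end-to-end driver might look like the following. `acyclic_paths_to` is an assumed helper (not shown), and the role of the reference time is only noted in a comment:

```python
def diagnose(components, edges, target, t_diag, t_ref):
    """Sketch of the slide-20 pipeline, reusing Component, edge_impact,
    and rank_culprits from the sketches above (all names hypothetical).
    t_ref marks a time when behavior was acceptable; a fuller version
    would use it to decide which components look abnormal at t_diag."""
    impacts = {}
    for src, dst in edges:
        impacts[(src, dst)] = edge_impact(
            components[src].history[:t_diag],      # past behavior only
            components[dst].history[:t_diag],
            components[src].history[t_diag],       # state at diagnosis time
            components[dst].history[t_diag])
    paths = acyclic_paths_to(edges, target)        # assumed helper, not shown
    return rank_culprits(impacts, paths)
```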

21. Evaluation setup: servers running IIS, SQL, Exchange, …, plus 10 actively used desktops, subjected to a diverse set of faults drawn from those observed in the logs.

22. NetMedic assigns low ranks to actual culprits, i.e., the true culprit appears at or near the top of the ranked list.

23. NetMedic handles concurrent faults well, even when 2 faults occur simultaneously.

24. Other empirical results
• NetMedic needs a modest amount (~60 mins) of history
• The key to effectiveness is correctly identifying many low-impact edges
• It compares favorably with a method that understands variable semantics

25. Unleashing (systems like) NetMedic on admins. [Chart: accuracy vs. fault coverage; rule-based systems represent the state of the practice, inference-based systems the bulk of research activity.]
• How to present the analysis results?
• Need human verification
• (Fundamental?) trade-off between coverage and accuracy

26. The understandability challenge
• Admins should be able to verify the correctness of the analysis
• Identify culprits themselves if the analysis is incorrect
• Two sub-problems at the intersection with HCI:
  • Visualizing complex analysis (NetClinic)
  • Intuitiveness of analysis (ongoing work)

27. NetClinic: Visualizing diagnostic analysis
• Underlying assumption: admins can verify the analysis if information is presented appropriately; they have expert, out-of-band information
• Views diagnosis as a multi-level analysis
• Makes results at all levels accessible on top of a semantic graph layout
• Allows top-down and bottom-up navigation across levels while retaining context

28. NetClinic user study
• 11 participants with knowledge of computer networks but not of NetMedic
• Given 3 diagnostic tasks each, after training
• 88% task completion rate
• Uncovered a rich mix of user strategies that the visualization must support

29. Intuitiveness of analysis. [Chart: understandability vs. accuracy trade-off.]
• What if you could modify the analysis itself to make it more accessible to humans?
• This counters the tendency to “optimize” for incremental gains in accuracy

30. Intuitiveness of analysis (2)
• Goal: go from mechanical measures to more human-centric measures
• Example: the MOS (mean opinion score) measure for VoIP
• Factors to consider:
  • What information is used? E.g., local vs. global
  • What operations are used? E.g., arithmetic vs. geometric means (see the sketch after this list)
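A toy illustration of why the choice of operation matters: a geometric mean punishes any single weak value, while an arithmetic mean lets strong values mask it (the numbers are made up):

```python
import math

edge_impacts = [0.9, 0.9, 0.1]     # one weak link among otherwise strong ones

arithmetic = sum(edge_impacts) / len(edge_impacts)               # ~0.63
geometric = math.prod(edge_impacts) ** (1 / len(edge_impacts))   # ~0.43

# The geometric mean reflects the weak link far more strongly, which may
# align better with human intuition about a mostly broken path.
print(f"arithmetic: {arithmetic:.2f}, geometric: {geometric:.2f}")
```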

31. Conclusions. Thinking small (networks) can provide new perspectives. [Diagram: recurring trade-offs among accuracy, detail, coverage, and understandability.] NetMedic enables detailed diagnosis in enterprise networks without application knowledge. NetClinic enables admins to understand and verify complex diagnostic analyses.
