
Internet Networking and Application Troubleshooting

Internet Networking and Application Troubleshooting. Yao Zhao EECS Department Northwestern University. Outline. Motivation Dissertation Overview Network Layer Troubleshooting VScope, Lend, FAD and SPA Application Layer Troubleshooting Rake Conclusions and Future Work. Motivation.


Presentation Transcript


  1. Internet Networking and Application Troubleshooting Yao Zhao EECS Department Northwestern University

  2. Outline • Motivation • Dissertation Overview • Network Layer Troubleshooting • VScope, Lend, FAD and SPA • Application Layer Troubleshooting • Rake • Conclusions and Future Work

  3. Motivation “When something breaks in the Internet, the Internet's very decentralized structure makes it hard to figure out what went wrong and even harder to assign responsibility.” - “Looking Over the Fence at Networks: A Neighbor's View of Networking Research”, by Committees on Research Horizons in networking, National Research Council, 2001.

  4. Troubleshooting Philosophy • Entity Oriented Troubleshooting • Monitor each entity separately • E.g. router packet drop rates, queue sizes and other SNMP counters • E.g. machine CPU load, I/O intensity, network utilization and other performance counters • Potential problems • Not all entities can be monitored • Inferring entity performance from the counters may be challenging

  5. Troubleshooting Philosophy • Entity Oriented Troubleshooting • Task Based Troubleshooting • Use task performance to infer entity performance • E.g. infer link-level loss rates from Internet path loss rates • Advantages • Works with limited monitor points (e.g. end hosts) • Focuses on target performance directly

  6. Thesis Statement • We design troubleshooting systems that monitor and diagnose Internet distributed systems at both the network layer and the application layer using the task-based troubleshooting philosophy.

  7. Publications • Papers • Y. Zhao, Y. Chen, S. Ratnasamy, Load balanced and Efficient Hierarchical Data-Centric Storage in Sensor Networks, in the Proc. of SECON 2008 • Y. Gao, Y. Zhao, R. Schweller, S. Venkataraman, Y. Chen, D. Song, and M. Kao, Detecting Stealthy Spreaders Using Online Outdegree Histograms, in the Proc. of IWQoS, 2007 • Y. Zhao and Y. Chen, A Suite of Schemes for User-level Network Diagnosis without Infrastructure, in the Proc. of IEEE INFOCOM, 2007 • P. Narayana, R. Chen, Y. Zhao, Y. Chen, Z. Fu, and H. Zhou, Automatic Vulnerability Checking of IEEE 802.16 WiMAX Protocols through TLA+, in Proc. of NPSec, 2006 • Y. Zhao, Y. Chen, and D. Bindel, Towards Unbiased End-to-End Network Diagnosis, in Proc. of ACM SIGCOMM 2006 • Y. Zhao, Q. Zhang, B. Li, Y. Chen and W. Zhu, Hop ID based Routing in Mobile Ad Hoc Networks, in Proceedings of ICNP, 2005 • Patents • E. C. Gillum, Q. Ke, Y. Xie, F. Yu and Y. Zhao, Graph Based Bot-User Detection, being filed through Microsoft Corporation, MS docket number 324953.01. • J. Wang, Y. Chen, D. Pei, Y. Zhao, and Z. Zhu, Towards Efficient Large-Scale Network Monitoring and Diagnosis Under Operational Constraints, being filed through AT&T, docket number 1209-144.

  8. Outline • Motivation • Dissertation Overview • Network Layer Troubleshooting • VScope, Lend, FAD and SPA • Application Layer Troubleshooting • Rake • Conclusions and Future Work

  9. Motivation [Figure labels: Diagnosis, Monitoring, Model; Application, Transport, Network, Data Link]

  10. Components in Network Troubleshooting • Model • Defines the extrinsic observations and the intrinsic faults, as well as the relationship between them • Monitoring • Collects the observations • Diagnosis • Identifies the fault location and finds the root cause

  11. Thesis Research Topics [Figure labels: VScope; Lend, FAD and SPA; Rake; Diagnosis, Monitoring, Model; Application, Transport, Network, Data Link]

  12. Outline • Motivation • Dissertation Overview • Network Layer Troubleshooting • VScope, Lend, FAD and SPA • Application Layer Troubleshooting • Rake • Conclusions and Future Work

  13. Network Layer Troubleshooting • LEND [Sigcomm06] • Tomography diagnosis with the fewest statistical assumptions • FAD & SPA [Infocom07] • On-demand loss rate diagnosis without infrastructure • VScope [Patent] • Experimental design for ISP VPN network monitoring and diagnosis

  14. LEND • Basic Assumptions • End-to-end measurement can infer the end-to-end properties accurately • Link level properties are independent • Problem Formulation • Given end-to-end measurements, what is the finest granularity of link properties that we can achieve under the basic assumptions? [Figure labels: Better accuracy; Basic assumptions; More and stronger statistical assumptions; Diagnosis granularity?; Virtual link]
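
The independence assumption on this slide can be made concrete with a minimal sketch. It assumes that "independent" means each link drops packets independently, so a packet survives a path only if it survives every link on it. The function name and the loss values are hypothetical and are not taken from LEND.

```python
from math import prod

def path_loss_rate(link_loss_rates):
    """Loss rate of a path whose links drop packets independently."""
    # A packet survives the path only if it survives every link.
    return 1.0 - prod(1.0 - p for p in link_loss_rates)

# Two hypothetical ways of placing the loss on a two-link path.
loss_on_first_link = path_loss_rate([0.10, 0.00])
loss_on_second_link = path_loss_rate([0.00, 0.10])
```

Both assignments give the same end-to-end loss rate, so a single path measurement cannot tell which of the two links is lossy. This is the granularity question the problem formulation asks.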

  15. LEND • Contributions • Define the minimal identifiable unit under the basic assumptions (MILS) • Prove that only E2E paths are MILS with a directed graph topology (e.g., the Internet) • Propose the good path algorithm (incorporating measurement path properties) for finer MILS [Figure labels: Better accuracy; Basic assumptions; More and stronger statistical assumptions; Diagnosis granularity?; Virtual link]

  16. FAD & SPA • Motivation • How do end users, with no special privileges, identify packet loss inside the network with one or two computers? • Conclusions • We proposed three user-level loss rate diagnosis approaches • The combo of our approaches and Tulip [SOSP03] is much better than any single approach

  17. VScope Motivation • Two Important Services Provided by ISP • Internet access service • VPN service • Monitoring and Diagnosis on ISP Networks • Ensure Service Level Agreement (SLA) • Help Network Operations

  18. Problem Definition (1) • Challenges in ISP Network Monitoring and Diagnosis • Operational constraints on monitors and links • A monitor can measure a certain number of paths at a time • The measurement traffic through a link cannot exceed a threshold (e.g. 1% of the link bandwidth) • Path and monitor selection constraints • Monitor installation is costly • Real-time diagnosis • Special star-like topology features of ISP networks • Access links should be monitored • The backbone topology extended with access links (backboneExt) is large and star-like

  19. Problem Definition (2) • Monitor Setup Phase • From the monitor candidates, select a minimal number of monitors that, in the measurement phase, can measure a path set covering all links in the network under the given measurement constraints • NP-hard even without considering constraints • Monitoring and Fault Diagnosis Phase • When faulty paths are discovered in the path monitoring phase, how can we quickly select paths to measure further, under the operational constraints, so that the faulty link(s) can be accurately identified?
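
Because covering all links with a minimal selection is NP-hard, a common fallback is the standard greedy set-cover heuristic. The sketch below is only an illustration of that heuristic and is not VScope's algorithm. It ignores the operational constraints on monitors and links, and the path and link names are hypothetical.

```python
def greedy_path_cover(paths):
    """paths: dict mapping a path name to the set of links it traverses.
    Returns path names that together cover every link (greedy heuristic)."""
    uncovered = set().union(*paths.values())
    chosen = []
    while uncovered:
        # Pick the path covering the most still-uncovered links;
        # sorting the names makes tie-breaking deterministic.
        best = max(sorted(paths), key=lambda name: len(paths[name] & uncovered))
        chosen.append(best)
        uncovered -= paths[best]
    return chosen

example_paths = {
    "p1": {"a", "b"},
    "p2": {"b", "c"},
    "p3": {"c", "d"},
    "p4": {"a"},
}
selected = greedy_path_cover(example_paths)
```

The greedy choice is not guaranteed to be minimal in general. A real design would also have to reject candidate paths that exceed a monitor's path budget or a link's measurement-traffic threshold.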

  20. Outline • Motivation • Dissertation Overview • Network Layer Troubleshooting • VScope, Lend, FAD and SPA • Application Layer Troubleshooting • Rake • Conclusions and Future Work

  21. Rake: Semantic Assisted Large Distributed System Diagnosis • Motivation • Related Work • Rake • Evaluation • Conclusions

  22. Motivation • Large distributed systems involve hundreds or thousands of nodes • E.g. search system, CDN • Host-based monitoring cannot infer the performance or detect bugs • Hard to translate OS-level info (such as CPU load) into application performance • Application log may not be enough • Task-based approach is adopted in many diagnosis systems • WAP5, Magpie, Sherlock

  23. Task-based Approaches • The Critical Problem – Message Linking • Link the messages in a task together into a path or tree

  24. Example of Message Linking in Search System URL URL URL Search keyword Search keyword Doc ID

  25. Task-based Approaches • The Critical Problem – Message Linking • Link the messages in a task together into a path or tree • Black-box approaches • Do not need to instrument the application or to understand its internal structure or semantics • Use time correlation to link messages • Project 5, WAP5, Sherlock • White-box approaches • Extract application-level data and require instrumenting the application and possibly understanding the application's source code • Insert a unique ID into messages in a task • X-Trace, Pinpoint

  26. Problems of Black-Box • Time Correlation • Affected by cross traffic

  27. Related Work Invasiveness Application Knowledge

  28. Rake • Key Observations • Generally there is no unique ID linking the messages associated with the same request • Polymorphic IDs exist in different stages of the request • Semantic Assisted • Use the semantics of the system to identify polymorphic IDs and link messages

  29. Message Linking Example URL URL URL Search keyword Search keyword Doc ID

  30. Questions on Semantics • What Are the Necessary Semantics? • In the worst case, re-implement the application • How Does Rake Use the Semantics? • The naïve design is to implement Rake for each application with specific application semantics • How Efficient Is Rake with Semantics? • Can message linking be accurate? • What’s the computational complexity of Rake?

  31. Necessary Semantics • Intra-node linking • The system semantics • Inter-node linking • The protocol semantics [Figure labels: Node; P, Q, R, S]

  32. Utilize Semantics in Rake • Implementing different Rakes for different applications is time consuming • Lesson learnt from implementing two versions of Rake for CoralCDN and IRC • Design Rake to take general semantics • A unified infrastructure • Provide a simple language for users to supply semantics

  33. Example of Rake Language (IRC) • <?xml version="1.0" encoding="ISO-8859-1"?> • <Rake> • <Message name="IRC PRIVMSG"> • <Signature> • <Protocol> TCP </Protocol> • <Port> 6667 </Port> • </Signature> • <Link_ID> • <Type> Regular expression </Type> • <Pattern> PRIVMSG\s+(.*) </Pattern> • </Link_ID> • <Follow_ID id="0"> • <Type> Same as Link ID </Type> • </Follow_ID> • <Query_ID> • <Type> No Return ID </Type> • </Query_ID> • </Message> • </Rake> Follow_ID Link_ID = Query_ID P Q = Response_ID S R
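
As a rough illustration of what the Link_ID pattern on this slide extracts, the sketch below applies the same regular expression with Python's `re` module. Rake's own regular-expression engine is not described in the slides, so the Python dialect, the function name `irc_link_id`, and the sample IRC line are all assumptions.

```python
import re

# Pattern copied from the <Pattern> element of the slide.
PRIVMSG_PATTERN = re.compile(r"PRIVMSG\s+(.*)")

def irc_link_id(payload):
    """Return the captured Link_ID, or None if the payload is not a PRIVMSG."""
    match = PRIVMSG_PATTERN.search(payload)
    return match.group(1) if match else None

# Hypothetical IRC line, shown without the trailing CRLF.
sample = ":alice!a@example.org PRIVMSG #rake :hello"
sample_id = irc_link_id(sample)
```

Here the captured ID is the target plus the message text. Since the slide declares Follow_ID as "Same as Link ID", the same string would be expected in the message this one triggers.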

  34. Signature • Signature to Classify Messages • <Signature> • <Protocol> TCP </Protocol> • <Port> 6667 </Port> • </Signature> • Formats of Signatures • Socket information • Protocol, port • Expression for TCP/IP header • udp[10] & 128 == 0 • Regular expression • User defined function
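
The header expression on this slide can be read as a byte test, sketched below under one assumption: that `udp[10]` counts bytes from the start of the UDP header, as in pcap-style filters. Under that reading, byte 10 is the first flags byte of a DNS message (8 bytes of UDP header plus the 2-byte DNS ID), and bit 128 is the QR bit, which is 0 for queries. The helper names are hypothetical.

```python
import struct

def is_dns_query(udp_segment):
    """udp_segment: bytes starting at the UDP header."""
    if len(udp_segment) < 12:
        return False
    # Equivalent of the signature expression: udp[10] & 128 == 0
    return (udp_segment[10] & 128) == 0

def build_udp_dns(flags):
    """Build a UDP header plus a bare DNS header for illustration only."""
    # Source port, destination port, length, checksum (left as 0 here).
    udp_header = struct.pack("!HHHH", 40000, 53, 20, 0)
    # ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT.
    dns_header = struct.pack("!HHHHHH", 0x1234, flags, 1, 0, 0, 0)
    return udp_header + dns_header

query_segment = build_udp_dns(0x0100)     # QR = 0, recursion desired
response_segment = build_udp_dns(0x8180)  # QR = 1, typical response flags
```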

  35. Link_ID and Follow_ID • Follow_IDs • The IDs that will appear in the messages triggered by this message • One message may have multiple Follow_IDs for triggering multiple messages • Link_ID • The ID of the current message • Matched against Follow_IDs previously seen • Linking of Link_ID and Follow_ID • Mainly for intra-node message linking
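
The matching rule on this slide can be sketched as a small table of outstanding Follow_IDs. This is a minimal illustration, not Rake's implementation. The class name, the message names, and the choice to consume a Follow_ID once it has matched are assumptions.

```python
class IntraNodeLinker:
    """Links a message to the earlier message whose Follow_ID equals its Link_ID."""

    def __init__(self):
        self.pending = {}  # Follow_ID -> name of the message that announced it

    def observe(self, name, link_id, follow_ids=()):
        """Record a message; return the name of its parent, or None."""
        # The Link_ID of the current message is matched against Follow_IDs seen so far.
        parent = self.pending.pop(link_id, None)
        # Each Follow_ID is expected to show up in a message triggered by this one.
        for follow_id in follow_ids:
            self.pending[follow_id] = name
        return parent

# Hypothetical search request whose ID changes form at each stage.
linker = IntraNodeLinker()
first = linker.observe("http_request", "url:/search?q=rake", ["kw:rake"])
second = linker.observe("index_lookup", "kw:rake", ["doc:42"])
third = linker.observe("doc_fetch", "doc:42")
```

The chain URL, keyword, document ID mirrors the polymorphic IDs of the search example: no single ID survives the whole request, but each stage announces the ID of the next.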

  36. Query_ID and Response_ID • Query_IDs • The communication is in Query/Response style, e.g. RPC call and DNS query/response. • The IDs will be in the response messages to this message • Response_ID • The ID of the current message to match Query_ID previously seen • By default requires the query and response to use the same socket • Linking of Query_ID and Response_ID • Mainly for inter-node message linking
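
A minimal sketch of the query/response rule, assuming a socket is identified by a (source address, source port, destination address, destination port) tuple and that a response carries the same ID string as its query. The class and the sample values are hypothetical.

```python
class QueryResponseLinker:
    """Links a response to the earlier query with the same ID on the same socket."""

    def __init__(self):
        self.open_queries = {}  # (socket, Query_ID) -> name of the query message

    def on_query(self, sock, query_id, name):
        self.open_queries[(sock, query_id)] = name

    def on_response(self, sock, response_id):
        """Return the matching query name, or None if no query is outstanding."""
        return self.open_queries.pop((sock, response_id), None)

sock_a = ("10.0.0.1", 40000, "10.0.0.2", 53)
sock_b = ("10.0.0.3", 40001, "10.0.0.2", 53)

qr = QueryResponseLinker()
qr.on_query(sock_a, "www.example.org", "dns_query_1")
wrong_socket = qr.on_response(sock_b, "www.example.org")
same_socket = qr.on_response(sock_a, "www.example.org")
```

Keying on the socket as well as the ID keeps two nodes that issue the same query at the same time from being linked to each other's responses.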

  37. Complicated Semantics • The process of generating IDs may be complicated • XML or regular expressions are not good at complex computations • So let users provide their own functions • Users provide shared/dynamic libraries • Specify the functions for IDs in XML • Implemented using Libtool to load user defined functions at runtime

  38. Example for DNS • <?xml version="1.0" encoding="ISO-8859-1"?> • <Rake> • <Message name="DNS Query"> • <Signature> • <Protocol> UDP </Protocol> • <Port> 53 </Port> • <Expression> udp[10] & 128 == 0 </Expression> • </Signature> • <Link_ID > • <Type> User Function </Type> • <Libray> dns.so </Libray> • <Function> Link_ID </Function> • </Link_ID> • <Follow_ID id="0"> • <Type> Link_ID </Type> • </Follow_ID> • <Query_ID> • <Type> Link_ID </Type> • </Query_ID> • </Message> • …………………………….. Extract the queried host
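
The slides do not show the contents of dns.so. As a Python stand-in for what a "extract the queried host" function might do, the sketch below reads the first question name from a DNS message laid out as in RFC 1035. It assumes the input starts at the DNS header (the UDP payload) and does not handle compressed names. The function name is hypothetical.

```python
def queried_host(dns_message):
    """Return the first question name of a DNS message as a dotted string."""
    labels = []
    pos = 12  # skip the fixed 12-byte DNS header
    while True:
        length = dns_message[pos]
        if length == 0:
            break  # a zero-length label ends the name
        if length & 0xC0:
            raise ValueError("compressed names are not handled in this sketch")
        labels.append(dns_message[pos + 1 : pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return ".".join(labels)

# Hypothetical query for www.example.org, type A, class IN.
header = bytes([0x12, 0x34, 0x01, 0x00, 0, 1, 0, 0, 0, 0, 0, 0])
question = b"\x03www" + b"\x07example" + b"\x03org" + b"\x00" + bytes([0, 1, 0, 1])
host = queried_host(header + question)
```

Using the queried host as Link_ID, Follow_ID and Query_ID, as the XML on this slide does, lets the same string tie a DNS query to both the message it triggers and its response.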

  39. Accuracy Analysis • One-to-one ID Transforming • Examples • In search, URL -> Keywords -> Canonical format • In CoralCDN, URL -> Sha1 hash value • Ideally no error if requests are distinct • Request ambiguity • Search keywords • Microsoft search data • Less than 1% of messages duplicated within 1s • Web URL • Two real http traces • Less than 1% of messages duplicated within 1s • Chat messages • No duplication with timestamps
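
Two ideas on this slide lend themselves to a short sketch: a one-to-one ID transform (URL to SHA-1 hash) and the fraction of requests whose ID repeats within one second. How CoralCDN canonicalizes a URL before hashing, and exactly how duplication was counted in the traces, are not given in the slides. The definitions below are assumptions.

```python
import hashlib

def url_to_hash_id(url):
    """Map a URL to a SHA-1 hex digest (no canonicalization is applied here)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

def duplicate_fraction(events, window=1.0):
    """events: list of (timestamp_seconds, request_id) sorted by time.
    Returns the fraction of events whose ID was seen less than `window` seconds earlier."""
    last_seen = {}
    duplicates = 0
    for timestamp, request_id in events:
        if request_id in last_seen and timestamp - last_seen[request_id] < window:
            duplicates += 1
        last_seen[request_id] = timestamp
    return duplicates / len(events) if events else 0.0

# Hypothetical trace: only the second request repeats an ID within one second.
trace = [(0.0, "a"), (0.5, "a"), (2.0, "a"), (2.1, "b")]
fraction = duplicate_fraction(trace)
```

A low duplicate fraction matters because two identical IDs inside the matching window are the main way ID-based linking can attach a message to the wrong request.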

  40. Potential Applications • Search • Verified by a Microsoft guy • CDN • CoralCDN is studied and evaluated • Chat System • IRC is tested • Distributed File System • Hadoop DFS is tested

  41. Evaluation • Application • CoralCDN • Deployed on PlanetLab • Experiment • Employ PlanetLab hosts as web clients • Retrieve URLs from real traces with different frequency • Metrics • Linking accuracy (false positive, false negative) • Diagnosis ability • Compared Approach • WAP5

  42. CoralCDN Task Tree

  43. Message Linking Accuracy • Rake Linking Accuracy is 100% for CoralCDN • Sha1 hash provides almost one-to-one URL to HashID mapping • The cache mechanism • If the same URL is received twice, the 2nd one will be blocked until the first one has retrieved the webpage • Use Rake Linking as Ground Truth to Evaluate WAP5
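
Scoring one linker against another taken as ground truth can be sketched as a set comparison. The exact definitions of false positive and false negative used in the evaluation are not given in the slides. The ones below (wrong links over inferred links, missed links over true links) and the sample links are assumptions.

```python
def linking_errors(truth_links, inferred_links):
    """Each link is a (parent_message, child_message) pair.
    Returns (false_positive_rate, false_negative_rate)."""
    truth, inferred = set(truth_links), set(inferred_links)
    # Inferred links that are not in the ground truth.
    false_pos = len(inferred - truth) / len(inferred) if inferred else 0.0
    # Ground-truth links that the evaluated linker missed.
    false_neg = len(truth - inferred) / len(truth) if truth else 0.0
    return false_pos, false_neg

# Hypothetical message IDs: the evaluated linker gets two links right,
# invents one, and misses two.
truth = [(1, 2), (2, 3), (4, 5), (5, 6)]
inferred = [(1, 2), (2, 3), (4, 6)]
fp_rate, fn_rate = linking_errors(truth, inferred)
```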

  44. Message Linking Accuracy (1) The higher the request rate, the lower the accuracy of WAP5.

  45. Message Linking Accuracy (1) The higher the request rate, the lower the accuracy of WAP5.

  46. Diagnosis Ability • Controlled Experiments • Inject junk CPU-intensive processes • Calculate the packet processing time using WAP5 and Rake • Rake can identify the slow machine, while WAP5 fails.

  47. Discussion • Implementation Experience • How hard is it for users to provide semantics? • CoralCDN – 1 week source code study • DNS – a couple of hours • Hadoop DFS – 1 week source code study • Inter-process Communication • Encryption • Dynamic library interposition

  48. Conclusions of Rake • Feasibility • Rake works for many popular applications in different categories • Easiness • Rake allows users to write semantics via XML • Necessary semantics are easy to obtain, given our experience • Accuracy • Much more accurate than black-box approaches and probably matches white-box approaches

  49. Outline • Motivation • Dissertation Overview • Network Layer Troubleshooting • VScope, Lend, FAD and SPA • Application Layer Troubleshooting • Rake • Conclusions and Future Work

  50. Conclusions and Future Work • Demonstrate Task-based Troubleshooting Is Promising • Network layer troubleshooting • VScope, LEND, FAD and SPA • Application layer troubleshooting • Rake • Future Work • Extend Rake in diagnosis • Timeline for Thesis Writing • From present to Feb. 1
