
Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies



Presentation Transcript


  1. Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies Defense by Chen, Jiazhen & Chen, Shiqi

  2. Introduction • Using a network-based service can be a frustrating experience… Even inside the network of a single enterprise, where traffic does not cross the open Internet, user-perceptible service degradations are rampant.

  3. Introduction When 13% of the requests take 10x longer than the normal, acceptable time to process, it’s time for us to do something.

  4. Introduction • Conventional approach: up OR down • Our approach: up, troubled or down. Troubled: responses fall significantly outside the normal response time, including the case when only a subset of service requests perform poorly.

  5. Sherlock • Detects the existence of faults and performance problems by monitoring the response time of services. • Determines the set of components that could be responsible. • Localizes the problem to the most likely component.

  6. Inference Graph and Ferret Algorithm • Inference Graph: Captures the dependencies between all components of the IT infrastructure. Used to determine the set of components that could be responsible. • Ferret Algorithm: Efficiently localizes faults in enterprise-scale networks using the Inference Graph and measurements of service response times made by agents. Used to localize the problem to the most likely component.

  7. Tradeoffs • Sherlock falls short of full problem diagnosis. • It has not been evaluated in an environment where dependencies are deliberately and frequently changed. • … • BUT, measurements indicate that the vast majority of enterprise applications do not fall into these classes.

  8. Inference Graph in detail

  9. Node structure • Root-cause nodes: physical components whose failure can cause an end-user to experience failures. (Two special root-cause nodes: AT and AD.) • Observation nodes: accesses to network services whose performance can be measured by Sherlock. • Meta-nodes: model the dependencies between root causes and observations.

  10. Node State • The state of each node is expressed by a three-tuple: (P_up, P_trouble, P_down) • P_up + P_trouble + P_down = 1 • The state of a root-cause node is independent of any other node in the Inference Graph.

  11. Edges • Edge from node A to B: the dependency that node A has to be in the up state for node B to be up. • Each edge is given a dependency probability, indicating how strong the dependency is.

  12. Meta-Nodes and the Propagation of State Noisy-Max Meta-Nodes • Max: The node gets the worst condition of its parents. • Noisy: if the weight of a parent’s edge is d, then with probability (1-d) the child is not affected by that parent.

  13. Meta-Nodes and the Propagation of State Selector Meta-Nodes • Used to model load balancing scenarios. • NLB: Network Load Balancer • ECMP: A commonly used technique in enterprise networks where routers send packets to a destination along several paths.

  14. Meta-Nodes and the Propagation of State Failover Meta-Nodes • Failover: Clients access primary production servers and failover to backup servers when the primary server is inaccessible.

  15. Calculation of State Propagation For Noisy-Max Meta-Nodes: Time Complexity: O(n) instead of O(3^n).
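
The linear-time claim can be illustrated with a minimal Python sketch. It assumes the noisy-max semantics from slide 12: a parent with edge weight d is ignored (treated as up) with probability 1 - d, and the child takes the worst state among the parents that are not ignored. The function name `noisy_max` and the input layout are hypothetical, chosen only for this illustration.

```python
from math import prod

def noisy_max(parents):
    """Propagate state through a noisy-max meta-node in O(n).

    parents: list of ((p_up, p_troubled, p_down), d) pairs, where d is the
    dependency probability on the edge from that parent (assumed layout).
    Returns the child's (p_up, p_troubled, p_down).
    """
    # The child is up only if every parent is either up or ignored.
    p_up = prod(1.0 - d * (1.0 - up) for (up, _tr, _dn), d in parents)
    # The child is not down only if no parent is both down and not ignored.
    p_not_down = prod(1.0 - d * dn for (_up, _tr, dn), d in parents)
    p_down = 1.0 - p_not_down
    # Whatever probability mass remains is the troubled state.
    p_troubled = 1.0 - p_up - p_down
    return (p_up, p_troubled, p_down)
```

Each product makes one pass over the n parents, so the 3^n joint parent states never need to be enumerated.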

  16. Calculation of State Propagation For Selector and Failover meta-nodes: • Still needs O(3^n) time. • HOWEVER, those two types of meta-nodes have no more than 6 parents.

  17. Fault Localization • Assignment-vector: An assignment of state to every root-cause node in the Inference Graph where the root-cause node has probability 1 of being either up, troubled or down. • Our target: Find the assignment-vector that can explain the observation best.

  18. Ferret Algorithm • Basically, we need to compute, for EACH assignment vector, a score for how well the vector explains the observations, and report the most likely vectors in sorted order. • But there are 3^r possible vectors for r root-cause nodes. How do we search through all of them?

  19. Ferret Algorithm • IMPORTANT OBSERVATION: It is very likely that at any point in time only a few root-cause nodes are troubled or down. • Ferret Algorithm evaluates only the vectors with no more than k nodes troubled or down, so it processes at most (2*r)^k vectors.
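
The pruned search space can be sketched in Python as follows. This is only an illustration of the enumeration, not the Ferret implementation; the generator name `candidate_vectors` and the string state labels are hypothetical.

```python
from itertools import combinations, product

def candidate_vectors(root_causes, k):
    """Yield every assignment vector with at most k nodes not up.

    root_causes: list of root-cause node names (hypothetical identifiers).
    Each yielded vector is a dict mapping name -> "up", "troubled" or "down".
    """
    for i in range(k + 1):
        # Choose which i root causes are not up ...
        for subset in combinations(root_causes, i):
            # ... and whether each of them is troubled or down.
            for states in product(("troubled", "down"), repeat=i):
                vector = {name: "up" for name in root_causes}
                vector.update(zip(subset, states))
                yield vector

vectors = list(candidate_vectors(["a", "b", "c", "d"], 2))
print(len(vectors))  # 33, versus 3**4 = 81 for the full search
```

For r = 4 and k = 2 the count is 1 + 4*2 + 6*4 = 33, which stays under the (2*r)^k = 64 bound.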

  20. Ferret algorithm • The approximation error of Ferret algorithm becomes vanishingly small for k=4 and onwards

  21. Ferret Algorithm • OBSERVATION: Since a root cause is assigned to be up in most assignment vectors, the evaluation of an assignment vector only requires re-evaluation of states at the descendants of root-cause nodes that are not up. • So Ferret Algorithm can be sped up further.

  22. Ferret Algorithm Score for a given assignment vector: • Track the history of response times and fit two Gaussian distributions to the data, namely Gaussian_up and Gaussian_troubled. • If the observed response time is t and the predicted state of the observation node is (p_up, p_troubled, p_down), then the score of this vector is calculated as: p_up*Prob(t|Gaussian_up) + p_troubled*Prob(t|Gaussian_troubled)
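
A minimal Python sketch of the score for a single observation is given below. It assumes that Prob(t|Gaussian) means the Gaussian density evaluated at t and that each fitted Gaussian is supplied as a (mean, standard deviation) pair; the function names and the numbers in the usage lines are hypothetical.

```python
import math

def gaussian_pdf(t, mean, std):
    """Density of a normal distribution with the given mean and std at t."""
    z = (t - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def observation_score(t, state, up_model, troubled_model):
    """Score one observed response time t against a predicted node state.

    state: (p_up, p_troubled, p_down) predicted for the observation node.
    up_model, troubled_model: (mean, std) of the two fitted Gaussians.
    """
    p_up, p_troubled, _p_down = state
    return (p_up * gaussian_pdf(t, *up_model)
            + p_troubled * gaussian_pdf(t, *troubled_model))

# Made-up history: normal responses near 0.1 s, troubled ones near 1.0 s.
up_model, troubled_model = (0.1, 0.02), (1.0, 0.3)
slow = 0.9
score_if_up = observation_score(slow, (1.0, 0.0, 0.0), up_model, troubled_model)
score_if_troubled = observation_score(slow, (0.0, 1.0, 0.0), up_model, troubled_model)
```

With these made-up parameters, a slow response of 0.9 s is explained far better by a vector that predicts the troubled state than by one that predicts up.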

  23. Ferret Algorithm • We use a statistical test to determine if the prediction is sufficiently meaningful to deserve attention. • Null hypothesis: all root causes are up • Score(best prediction) – Score(null hypothesis)

  24. The Sherlock System Step 1: Discover Service-Level Dependencies Step 2: Construct the Inference Graph Step 3: Fault Localization using Ferret Remark: Sherlock requires no changes to routers, applications, or middleware used in the enterprise.

  25. Discovering Service-Level Dependencies • OBSERVATION: If accessing service B depends on service A, then packets exchanged with A and B are likely to co-occur. • Dependency probability: conditional probability of accessing service A within the dependency interval, prior to accessing service B. • Dependency interval: 10ms • Chance of co-occurrence: 10ms/I, where I is the average interval between accesses to the same service. We retain only the dependency probabilities that are much greater than the chance of co-occurrence.
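
The dependency probability can be estimated from a single client's packet trace roughly as in the Python sketch below. The trace layout (a list of (timestamp in seconds, service name) pairs), the function name and the sample data are assumptions made for illustration; the slides do not specify the agent's data format.

```python
def dependency_probability(trace, a, b, interval=0.010):
    """Estimate how strongly accessing service b depends on service a.

    trace: list of (timestamp_seconds, service_name) pairs for one client
    (assumed layout). Returns the fraction of accesses to b that were
    preceded by an access to a within the dependency interval (10 ms).
    """
    times_a = [t for t, service in trace if service == a]
    times_b = [t for t, service in trace if service == b]
    if not times_b:
        return 0.0
    hits = sum(
        any(0.0 <= tb - ta <= interval for ta in times_a)
        for tb in times_b
    )
    return hits / len(times_b)

# Hypothetical trace: two of the four web accesses follow a DNS lookup closely.
trace = [(0.000, "dns"), (0.004, "web"), (1.000, "web"),
         (2.000, "dns"), (2.003, "web"), (3.000, "dns"), (3.500, "web")]
p = dependency_probability(trace, "dns", "web")
```

An estimate like this would then be compared with the chance of co-occurrence, and kept only if it is much greater.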

  26. Aggregating Probabilities across Clients • OBSERVATION: Many clients in an enterprise network have similar host, software and network configurations and are likely to have similar dependencies. • Aggregation allows us to deal better with infrequently accessed services and false dependencies.

  27. Constructing the Inference Graph A: Create a noisy-max meta-node to represent the service, and create an observation node for each client.

  28. Constructing the Inference Graph B: Identify the set of services Ds that the clients depend on when accessing S, and recurse.

  29. Constructing the Inference Graph C: Create a root-cause node to represent the host on which the service runs and make this root-cause node a parent of the meta-node.

  30. Constructing the Inference Graph D: Add network topology information to the Inference Graph. Add a noisy-max meta-node to represent each path and root-cause nodes to represent routers and links.

  31. Constructing the Inference Graph E: Finally, put AT and AD into the Inference Graph. Give each of the edges connecting AT/AD to the observation nodes a weight of 0.001, and give the edges between a router and a path meta-node a probability of 0.999.

  32. Implementation • Sherlock Agent implemented as a user-level daemon • Windows XP • Pcap-based sniffer

  33. Implementation • Centralized Inference Engine • Aggregates information easily • Scalability • Extremely large network (10^5 agents): bandwidth requirement of about 10 Mbps

  34. Implementation • Computational complexity of fault localization • Scales linearly with graph size

  35. Evaluation • Deploy Sherlock system in our enterprise network • 40 servers • 34 routers • 54 IP links • 2 LANs • 3 Weeks • Agents periodically send requests to the web and file servers • Mimic user behavior • Controlled environments: Testbed and simulation

  36. Evaluation – Discovering Service Dependencies • Evaluate Sherlock’s algorithm for discovering service-level dependency graphs • Two examples: • Web Server • File Server • In the graphs: • Arrows: • Point from the server that provides a service to the server that relies on it. • Edges: • Represent the strength of the dependencies with weights

  37. Evaluation – Discovering Service Dependencies • Service-level dependency graphs for visiting web portal and sales website • Clients depend on name lookup servers to access websites • Both websites share substantial portions of their back-end dependencies

  38. Evaluation – Discovering Service Dependencies • Service-level dependency graphs for visiting the file server • It turns out the file server is the root name of a distributed file system

  39. Evaluation – Discovering Service Dependencies • Summary • Service-level dependencies vary • Similar dependencies have different strengths • Dependencies change over time

  40. Evaluation – Discovering Service Dependencies • Number of samples • The dependency estimates converge after about 4,000 samples. • Once converged, service-level dependencies are stable over several days to weeks.

  41. Evaluation – Discovering Service Dependencies • Number of clients • Sherlock aggregates samples from more clients • Aggregating across 20 clients greatly reduces false dependencies

  42. Evaluation – Localizing Faults • Evaluate “Ferret” • Ability to localize faults • Sensitivity to errors

  43. Evaluation – Localizing Faults • Testbeds • Three web servers, • SQL server, • Authentication server • Failed or Overloaded: • Overloaded link: 5% of packets drop • Overloaded server: high CPU and disk utilization

  44. Evaluation – Inference Graph • Service-level dependencies • Arrows at the bottom level

  45. Features of Ferret • Localizes faults • Correlates observations from multiple points • No “Threshold” setting • Clients’ observations of the web server’s performance • Probabilistic correlation algorithm for reasoning • List of suspects for being the root cause • Handles multiple simultaneous failures

  46. Fault Localization Capability • Affected by • The number of vantage points • The number of observations available in time window • Having observations from the same time period that exercise all paths in the Inference Graph.

  47. Fault Localization Capability • In a time window of 10 seconds • At least 70 unique clients access every one of the top 20 servers • The top 20 servers alone provide enough observations to localize faults

  48. Evaluation - Sherlock Field Deployment • The resulting Inference Graph contains: • 2565 nodes • 358 components • Can fail independently • Experiment: • Running Sherlock system over 120 hours • Found 1029 performance problems

  49. Evaluation - Sherlock Field Deployment • The result • X: time • Y: one component

  50. Evaluation - Sherlock Field Deployment • Ferret’s computation: • 87% of the problems are caused by 16 components • The Server1, Server2, and Link on R1 for instance
