George J. Lee <gjl@mit> Advanced Network Architecture Group

CAPRI: A Common Architecture for Autonomous, Distributed Internet Fault Diagnosis using Probabilistic Relational Models
George J. Lee <gjl@mit.edu>
Advanced Network Architecture Group
Computer Science and Artificial Intelligence Lab
Massachusetts Institute of Technology

Automated Internet fault diagnosis is difficult
• Knowledge, data, and reasoning are distributed
  • Agents need a common extensible language for expressing knowledge & data
• Agents have incomplete information
  • Agents must perform probabilistic diagnosis when evidence is unavailable
• Distributed diagnosis is costly
  • Agents must minimize probing and communication cost
[Figure labels: Failure Report, Data, Reasoning, Knowledge, Diagnosis; DA = Diagnostic Agent]
We need a Common Architecture for Probabilistic Reasoning in the Internet (CAPRI)

Overview
• An extensible language for expressing diagnostic data & knowledge
  • Based on Bayes nets and Probabilistic Relational Models
• Distributed probabilistic reasoning while minimizing probing and communication cost
  • Trading off accuracy and cost
  • Incorporating past evidence
  • Propagating evidence to other agents
  • Simulations: accuracy vs. cost
• Learning diagnostic knowledge for real-world diagnosis
  • Passive diagnosis of HTTP proxy connections
  • Evaluation: accuracy using learned knowledge

Bayes nets can express diagnostic data
• Data = evidence about a particular failure
  • Diagnostic test results
  • Component status
• Diagnosis without domain-specific knowledge
• Allows distributed inference
[Figure: Bayes net for an IP path A, B, C, …, N. Nodes: A-B Link, B-C Link, AN Path, BN Path, CN Path, AN Probe. Evidence: A-B Link=FAIL]

Probabilistic Relational Models (PRMs) can express diagnostic knowledge
• Knowledge = shared knowledge about component and test classes
  • Class dependencies
  • Diagnostic tests
• Agents generate a Bayes net using the PRM
• Provided by experts or learned by agents
• Extensible
  • New component and test classes
  • Subclassing (e.g. Wireless Link)
[Figure: PRM classes and attributes: Link (Status); IP Path (First, Status, Rest); Ping Test (Path, Result)]
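To make the step from class-level knowledge to a per-failure Bayes net concrete, here is a minimal sketch. It assumes, following the figure labels, that an IP Path's status depends on its First link and on the Rest of the path, and that a probe result depends on the status of the path it tests. The function name `build_bayes_net` and the node-naming scheme are hypothetical, not part of CAPRI.

```python
def build_bayes_net(hops):
    """Unroll a hypothetical path template into a node -> parents mapping.

    hops is an ordered list of routers or ASes, e.g. ["A", "B", "C", "N"].
    Assumed class-level dependencies (an illustration, not CAPRI's schema):
      Path.Status  depends on  First.Status (a Link) and Rest.Status (a Path)
      Probe.Result depends on  Path.Status
    """
    if len(hops) < 2:
        raise ValueError("a path needs at least two hops")
    dest = hops[-1]
    parents = {}
    for i in range(len(hops) - 1):
        link = f"{hops[i]}-{hops[i + 1]} Link"
        path = f"{hops[i]}{dest} Path"
        parents[link] = []  # link status is a root node with a prior
        if i + 1 < len(hops) - 1:
            # Recursive case: first link plus the rest of the path.
            parents[path] = [link, f"{hops[i + 1]}{dest} Path"]
        else:
            # Base case: the last path segment is a single link.
            parents[path] = [link]
    # One end-to-end probe observes the status of the whole path.
    parents[f"{hops[0]}{dest} Probe"] = [f"{hops[0]}{dest} Path"]
    return parents


net = build_bayes_net(["A", "B", "C", "N"])
for node, deps in net.items():
    print(node, "<-", deps)
```

Because the template is recursive, the same class-level knowledge yields a net for a path of any length, which is the property that lets agents share knowledge without sharing topology.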

Probabilistic models enable agents to reduce diagnosis cost
• Diagnosis cost = probing + communication cost
Diagnosis Procedure:
1. Receive failure report
2. Construct Bayes net from PRM
3. Incorporate current and past evidence using a Dynamic Bayes Net (DBN)
4. Infer most probable explanation (MPE) for failure
5. While mpe_confidence < confThresh:
   Perform local tests or request diagnosis from other agents to maximize relevance/cost
6. Propagate evidence to other agents
7. Return diagnosis
Architectural points:
• Agents can trade off accuracy vs. cost using a confidence threshold
• Agents can infer current status from past evidence given a temporal failure model
• Agents can reduce load and improve robustness by propagating evidence
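The loop in the diagnosis procedure can be sketched for a single failed path as follows. This is an illustration under stated assumptions, not CAPRI's implementation: components fail independently with known priors, the path failure means at least one component failed, inference is brute-force enumeration, and "relevance" is approximated by how uncertain a component's posterior failure probability is. The names `posterior`, `diagnose`, `run_test`, and the example priors and costs are all hypothetical.

```python
from itertools import product


def posterior(priors, evidence):
    """Posterior over joint failure states, given that the path failed.

    priors:   component -> prior failure probability
    evidence: component -> observed status (True = failed) for tested components
    """
    names = sorted(priors)
    weights = {}
    for states in product([False, True], repeat=len(names)):
        if not any(states):
            continue  # the path failed, so at least one component failed
        if any(n in evidence and evidence[n] != s for n, s in zip(names, states)):
            continue  # contradicts a test result
        w = 1.0
        for n, s in zip(names, states):
            w *= priors[n] if s else 1.0 - priors[n]
        weights[states] = w
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("evidence is inconsistent with a path failure")
    return names, {k: v / total for k, v in weights.items()}


def diagnose(priors, costs, run_test, conf_thresh):
    """Test components until the MPE confidence reaches conf_thresh."""
    evidence, total_cost = {}, 0.0
    while True:
        names, post = posterior(priors, evidence)
        mpe, conf = max(post.items(), key=lambda kv: kv[1])
        untested = [n for n in names if n not in evidence]
        if conf >= conf_thresh or not untested:
            return dict(zip(names, mpe)), conf, total_cost

        def relevance(n):
            # Assumed relevance measure: uncertainty of the failure marginal.
            i = names.index(n)
            m = sum(p for s, p in post.items() if s[i])
            return min(m, 1.0 - m)

        nxt = max(untested, key=lambda n: relevance(n) / costs[n])
        evidence[nxt] = run_test(nxt)
        total_cost += costs[nxt]


# Hypothetical priors, unit test costs, and ground truth for illustration.
priors = {"A-B": 0.10, "B-N": 0.30, "N-Dest": 0.05}
costs = {"A-B": 1.0, "B-N": 1.0, "N-Dest": 1.0}
truth = {"A-B": False, "B-N": True, "N-Dest": False}

cheap = diagnose(priors, costs, truth.get, conf_thresh=0.5)
careful = diagnose(priors, costs, truth.get, conf_thresh=0.9)
print(cheap)
print(careful)
```

With the lower threshold the agent answers from priors alone and spends nothing on probes; with the higher threshold it runs tests before answering, which is the accuracy vs. cost tradeoff the confidence threshold controls.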

Minimizing cost for IP path diagnosis
• IP path diagnosis: ISP (A→B), rest of path (B→N), or destination (N→Dest)
• Simulated topology of 6000 Autonomous Systems (ASes)
• 1 DA per AS that can test links and destinations associated with that AS
• All diagnostic agents have knowledge of prior link failure probabilities
• Diagnostic agents are reachable up to the point of failure
• Status of inter-AS links and destination hosts drawn from prior probabilities
• Evidence collection and propagation follow DAs in the AS path
[Figure: IP path User, A, B, …, K, …, N, Dest, with DA 1 (User AS A), DA 2 (AS B), …, DA k (AS K), …, DA n (Dest AS N). Labels: Failure report, Evidence collection, Evidence propagation, Diagnosis]

Agents can trade off accuracy and cost
• 13 confidence thresholds, 500 users, 5 trials
[Figure: plot with points labeled by confThresh (0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)]

Incorporating past evidence reduces probing costs
• Cache duration = number of past time steps of evidence to consider
• Inter-AS link failures modeled as a Markov chain (Gilbert model)
• 100 users, 5 trials, 30 time steps
• >95% accuracy

Evidence propagation reduces probing and communication costs
• 10,000 users
• 5 trials
• 1 time step
• >95% accuracy
[Figure: plot series labeled 50,000 failures and 100,000 failures]

Agents can learn probabilistic models for TCP overlay connection diagnosis
1. Learn inter-AS TCP failure probabilities from Planetseer (28.3 million TCP connections from 196 hosts over 10 hours)
2. Diagnose HTTP proxy connections on CoDeeN without using probes
[Figure: TCP overlay path User, Proxy, Server. Model nodes: TCP Conn. User→Proxy and TCP Conn. Proxy→Server, each with Src AS, Dst AS, and Hour; HTTP Proxy Conn. User→Server]
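The two steps above can be illustrated with a small sketch. It assumes connection records of the form (source AS, destination AS, failed?), estimates a failure probability per AS pair with add-one smoothing, and then attributes a failed proxy connection to one of its two TCP legs under an independence assumption. The record format, the smoothing, the AS numbers, and the function names are all hypothetical and are not taken from Planetseer or CoDeeN.

```python
from collections import defaultdict


def learn_failure_probs(records, alpha=1.0):
    """Estimate P(TCP failure | src AS, dst AS) from observed connections.

    records: iterable of (src_as, dst_as, failed) tuples (assumed format).
    alpha:   add-alpha smoothing so unseen outcomes keep nonzero probability.
    """
    counts = defaultdict(lambda: [0, 0])  # (src, dst) -> [failures, total]
    for src, dst, failed in records:
        counts[(src, dst)][0] += int(failed)
        counts[(src, dst)][1] += 1
    return {k: (f + alpha) / (n + 2 * alpha) for k, (f, n) in counts.items()}


def diagnose_proxy_failure(p_user_proxy, p_proxy_server):
    """Posterior failure probability of each leg, given the proxy connection failed.

    Assumes the two legs fail independently and that the end-to-end
    connection fails exactly when at least one leg fails.
    """
    p_any = 1.0 - (1.0 - p_user_proxy) * (1.0 - p_proxy_server)
    return {
        "user_proxy": p_user_proxy / p_any,
        "proxy_server": p_proxy_server / p_any,
    }


# Made-up training data: 10 connections per AS pair.
records = (
    [(1, 2, True)] * 2 + [(1, 2, False)] * 8   # user AS 1 -> proxy AS 2
    + [(2, 3, False)] * 10                     # proxy AS 2 -> server AS 3
)
probs = learn_failure_probs(records)
blame = diagnose_proxy_failure(probs[(1, 2)], probs[(2, 3)])
print(probs)
print(blame)
```

No probe is sent at diagnosis time: the answer comes entirely from the learned probabilities, which is what makes the diagnosis passive.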

Learned diagnostic knowledge improves accuracy
• Accuracy: 80% vs. 53%
  • Train on hour x
  • Test on hour x + 1
• Accuracy improves as training interval increases
  • Train on first x hours, test on hour x + 1
• Accuracy remains high as training set age increases
  • Train on hour 1, test on hour x > 1

Benefits of CAPRI
• An extensible language for diagnostic data and knowledge
  • Based on Bayes nets and PRMs
• Distributed diagnosis while minimizing probing and communication cost
  • Accuracy/cost tradeoff
  • Incorporating past evidence
  • Evidence propagation
• Robustness to missing data
  • Probabilistic inference using cached data
• Ability to learn diagnostic knowledge
  • Learn conditional failure probabilities using PRMs

Future Work
• Costs and incentives
  • Learning the true network costs of diagnostic tests
  • Dynamically adjusting cost
  • Incentives for agents to reveal evidence
• Intelligent routing of diagnostic queries
• Temporal failure models
  • Learning temporal failure models
  • Predicting failure duration
• Diagnosis using data from end users

Modeling dynamic networks
• Model network component state as a Markov chain (Gilbert model)
• Dynamic Bayes net (DBN): s1 → s2 → s3
[Figure: two-state Markov chain with states OK and FAIL; transition probabilities 0.03, 0.97, 0.71, 0.29]
P(s3=OK | s1=FAIL) = P(s3=OK | s2=OK) × P(s2=OK | s1=FAIL) + P(s3=OK | s2=FAIL) × P(s2=FAIL | s1=FAIL)
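As a worked instance of the two-step formula, the sketch below propagates a past observation forward through a two-state Markov chain. The slide's figure does not survive in the text, so the assignment of the four numbers to transitions is an assumption: P(OK→OK) = 0.97, P(OK→FAIL) = 0.03, P(FAIL→FAIL) = 0.71, P(FAIL→OK) = 0.29. The function names are hypothetical.

```python
# ASSUMPTION: this mapping of the slide's numbers to transitions is a guess.
# Each row gives P(next state | current state) and sums to 1.
TRANSITIONS = {
    "OK": {"OK": 0.97, "FAIL": 0.03},
    "FAIL": {"OK": 0.29, "FAIL": 0.71},
}


def step(dist):
    """Advance a distribution over {OK, FAIL} by one time step."""
    nxt = {state: 0.0 for state in TRANSITIONS}
    for prev, p_prev in dist.items():
        for cur, p_trans in TRANSITIONS[prev].items():
            nxt[cur] += p_prev * p_trans
    return nxt


def predict(observed_state, steps):
    """Distribution over the state `steps` time steps after an observation."""
    dist = {s: 1.0 if s == observed_state else 0.0 for s in TRANSITIONS}
    for _ in range(steps):
        dist = step(dist)
    return dist


# P(s3 | s1 = FAIL): marginalize over the unobserved intermediate state s2.
two_step = predict("FAIL", 2)
print(two_step)
```

Under the assumed transitions, P(s3=OK | s1=FAIL) = 0.97 × 0.29 + 0.29 × 0.71 = 0.4872. This is how an agent can reuse cached evidence: an old observation still shifts the belief about current status, with its influence decaying as more time steps pass.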
