AI Approaches to Network Fault Management Andrew Learn 29 Nov 2001
Outline • Fault Management Process • AI Approaches • Expert Systems • Neural Networks • Case-based Reasoning
Network Faults • Hardware • Wear and tear • Cut cables • Improper installation • Software • Incorrect design • Bugs • Incorrect data (e.g. routing tables)
Fault Management Process • Collect alarms • Filter and correlate alarms • Diagnose faults • Restoration and repair • Evaluate effectiveness
1. Collect Alarms • Types of alarms • Physical: Failure in communication • e.g. loss of signal, CRC failure • Logical: Statistical values exceed threshold • e.g. number of packets dropped • Communication with components • Control protocol: Simple Network Management Protocol (SNMP) • Data format: Management Information Base (MIB-II, 1990) has ~170 manageable objects
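A logical alarm of the kind described above can be sketched as a threshold check on a counter polled from a device. This is a minimal illustration only: the `Alarm` record, the `check_dropped_packets` helper, the device names, and the threshold value are all assumptions, not part of SNMP or of any particular management system.

```python
from dataclasses import dataclass

@dataclass
class Alarm:
    source: str   # device that raised the alarm (hypothetical name)
    kind: str     # "physical" or "logical"
    detail: str

def check_dropped_packets(source, dropped_before, dropped_now, threshold):
    """Return a logical alarm if the drop counter grew by more than `threshold` between two polls."""
    delta = dropped_now - dropped_before
    if delta > threshold:
        return Alarm(source, "logical", f"{delta} packets dropped since last poll")
    return None

# Two polls of a hypothetical dropped-packet counter on two devices.
alarm = check_dropped_packets("router-1", dropped_before=1200, dropped_now=1750, threshold=500)
quiet = check_dropped_packets("router-2", dropped_before=300, dropped_now=310, threshold=500)
print(alarm)
print(quiet)
```

In practice the counter values would come from SNMP polls of MIB objects rather than from literals.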
Sample MIB Entry
ipInReceives OBJECT-TYPE
    SYNTAX Counter
    ACCESS read-only
    STATUS mandatory
    DESCRIPTION "The total number of input datagrams received from interfaces, including those received in error."
    ::= { ip 3 }
• Sample SNMP “get” call
snmpget netdev-kbox.cc.cmu.edu public system.sysUpTime.0
Name: system.sysUpTime.0
Timeticks: (2270351) 6:18:23
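The Timeticks value in the sample output counts hundredths of a second, which is why 2270351 is displayed as 6:18:23. A short sketch of that conversion follows; the function name `timeticks_to_hms` is a hypothetical helper, not part of any SNMP tool.

```python
def timeticks_to_hms(ticks):
    """Convert SNMP TimeTicks (hundredths of a second) to an H:MM:SS string."""
    total_seconds = ticks // 100          # drop the fractional hundredths
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

uptime = timeticks_to_hms(2270351)
print(uptime)  # 6:18:23
```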
2. Filter and Correlate Alarms • Filter • Eliminate redundant alarms • Suppress noncritical alarms • Inhibit low-priority alarms in presence of high-priority alarms • Correlate • Analyze and interpret multiple alarms to assign new meaning (derived alarm)
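The filtering and correlation steps above might look roughly like the following sketch. The `Alarm` tuple, the priority scale (1 = highest), the `max_priority` cutoff, and the `upstream` topology map are all assumptions made for illustration; real systems use much richer alarm models.

```python
from collections import namedtuple

Alarm = namedtuple("Alarm", "source kind priority")  # priority 1 = highest (assumed scale)

def filter_alarms(alarms, max_priority=3):
    # Eliminate redundant alarms: keep the first alarm per (source, kind).
    seen, unique = set(), []
    for a in alarms:
        if (a.source, a.kind) not in seen:
            seen.add((a.source, a.kind))
            unique.append(a)
    # Suppress noncritical alarms: drop anything below the priority cutoff.
    critical = [a for a in unique if a.priority <= max_priority]
    # Inhibit lower-priority alarms when the same source has a higher-priority one.
    top = {}
    for a in critical:
        top[a.source] = min(top.get(a.source, a.priority), a.priority)
    return [a for a in critical if a.priority == top[a.source]]

def correlate(alarms, upstream):
    """Emit a derived alarm for each link whose downstream nodes all report loss of signal."""
    down = {a.source for a in alarms if a.kind == "loss_of_signal"}
    derived = []
    for link in sorted(set(upstream.values())):
        children = {node for node, l in upstream.items() if l == link}
        if children <= down:
            derived.append(Alarm(link, "derived_link_fault", 1))
    return derived

raw = [
    Alarm("bs1", "loss_of_signal", 1),
    Alarm("bs1", "loss_of_signal", 1),   # redundant duplicate
    Alarm("bs1", "crc_failure", 2),      # inhibited by the priority-1 alarm on bs1
    Alarm("bs2", "loss_of_signal", 1),
    Alarm("bs3", "fan_warning", 5),      # noncritical, suppressed
]
upstream = {"bs1": "link-A", "bs2": "link-A", "bs3": "link-B"}  # hypothetical topology

filtered = filter_alarms(raw)
derived = correlate(filtered, upstream)
print(filtered)
print(derived)
```

Here five raw alarms reduce to two, and the two remaining alarms are reinterpreted as a single derived alarm on the shared link.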
3. Diagnose Faults • May require additional tests/diagnostics on circuits or components • Automated or manual • Analyze all info from alarms, tests, performance monitoring • Identify smallest system module that needs to be repaired or replaced
4. Restoration and Repair • Restoration: Continue service in presence of fault • Switch over to spares • Reroute around trouble spot • Restore software or data from backup • Repair • Replace parts • Repair cables • Debug software • Retest to verify fault is eliminated
5. Evaluate Effectiveness • Questions to answer: • How often do faults occur? • How many faults affect service? • How long is service interrupted? • How long does repair take? • Provides assessment of: • Performance of fault management system • Reliability of equipment
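The four questions above map onto simple statistics over a fault log. The sketch below assumes a hypothetical log format (one dict per fault with `service_affecting`, `outage_min`, and `repair_min` fields) and made-up numbers.

```python
def evaluate(faults, period_days):
    """Summarize a fault log over an observation period of `period_days` days."""
    affecting = [f for f in faults if f["service_affecting"]]
    return {
        # How often do faults occur?
        "faults_per_day": len(faults) / period_days,
        # How many faults affect service?
        "service_affecting_faults": len(affecting),
        # How long is service interrupted (mean over service-affecting faults)?
        "mean_outage_min": sum(f["outage_min"] for f in affecting) / len(affecting) if affecting else 0.0,
        # How long does repair take (mean over all faults)?
        "mean_repair_min": sum(f["repair_min"] for f in faults) / len(faults) if faults else 0.0,
    }

# Hypothetical 30-day fault log.
log = [
    {"service_affecting": True,  "outage_min": 30, "repair_min": 120},
    {"service_affecting": False, "outage_min": 0,  "repair_min": 45},
    {"service_affecting": True,  "outage_min": 90, "repair_min": 240},
]
report = evaluate(log, period_days=30)
print(report)
```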
AI Approaches to Fault Management • Well-developed approach: • Expert systems • New approaches: • Neural networks • Case-based reasoning • Other
Why AI? • Need for intelligence • Data analysis • Pattern recognition • Clustering and categorization • Problem solving • Need for automation • Manual analysis/solution takes time • Limited manpower • Limited expertise
Well-developed approach: Expert Systems • Expert systems = Rule-base + Working Memory • Three parts to rules: • Context trigger (when should the rule be considered) • Condition (if X . . .) • Conclusion (. . . then Y) • Used since the 1980s by major telecom companies • Bell: Automated Cable Expertise (ACE) system • GTE: Central Office Maintenance Printout Analysis & Suggestion System (COMPASS) • AT&T: Network Management Expert System (NEMESYS)
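The rule-base plus working-memory structure can be illustrated with a tiny forward-chaining loop. This is a generic sketch, not the design of ACE, COMPASS, or NEMESYS; the rule names, facts, and contexts are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    context: str                        # context trigger: when the rule is considered
    condition: Callable[[set], bool]    # "if X": a test on working memory
    conclusion: str                     # "then Y": fact added to working memory

def run(rules, facts, context):
    """Fire rules of the given context until working memory stops changing."""
    memory = set(facts)
    fired = []
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.context != context or rule.conclusion in memory:
                continue
            if rule.condition(memory):
                memory.add(rule.conclusion)
                fired.append(rule.name)
                changed = True
    return memory, fired

rules = [
    Rule("cable_cut", "link",
         lambda m: {"physical_link_suspect", "recent_construction"} <= m,
         "cable_cut_likely"),
    Rule("suspect_link", "link",
         lambda m: {"loss_of_signal", "crc_failure"} <= m,
         "physical_link_suspect"),
    Rule("routing", "software",
         lambda m: "routing_loop" in m,
         "bad_routing_table"),
]
facts = {"loss_of_signal", "crc_failure", "recent_construction", "routing_loop"}

memory, fired = run(rules, facts, context="link")
print(fired)
print(sorted(memory))
```

The `routing` rule never fires here because its context does not match, and `cable_cut` fires only after `suspect_link` has added its conclusion to working memory.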
Need for New Approaches • Weaknesses of expert systems • Brittle in unforeseen situations • Cannot learn from experience • Hard to maintain (adding/deleting/modifying rules) • Knowledge acquisition bottleneck • Can’t handle incomplete or probabilistic data • Factors driving new approach • Rapidly changing technology • Dynamic network topology • Network complexity • Competition, demand for QoS
Neural Nets • Structure: input, hidden, output layers • Training • Supervised: Input pattern & desired output • Unsupervised: Clustering of similar inputs
[Figure: input, hidden, and output layers connected by weights]
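The layered structure above can be made concrete with a forward pass through one hidden layer. The sketch below uses only the standard library; the sigmoid activation, the layer sizes, and the hand-picked weights are assumptions for illustration, and training (adjusting the weights from examples) is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully connected layer: each row of `weights` feeds one output unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, w_hidden, b_hidden)
    return layer(hidden, w_out, b_out)

# Three binary alarm inputs, two hidden units, two output units (all values hypothetical).
w_hidden = [[2.0, -1.0, 0.5], [-1.5, 2.0, 1.0]]
b_hidden = [0.0, -0.5]
w_out = [[3.0, -2.0], [-2.0, 3.0]]
b_out = [0.0, 0.0]

scores = forward([1.0, 0.0, 1.0], w_hidden, b_hidden, w_out, b_out)
print(scores)
```

Each output is a value between 0 and 1 that could be read as a score for one candidate fault class.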
Neural Nets • Advantages • Pattern matching & generalization • Fast & efficient • Trainable • Handles incomplete, ambiguous data • Disadvantages • Black box • Lack of training data
Neural Net Example • Example: Alarm correlation in cell phone networks (University of Hannover, Germany)
[Figure: network topology showing mobile units, base stations (BS1, BS2), microwave links, the base station controller (BSC), switching centers, and the maintenance center (MC)]
Neural Net Example • Test Results: • 94 alarms • 99.76% correct classification with up to 25% noise
[Figure: neural network relating BSC, BS-1, and BS-2 alarms to the initial cause (ML-1 fault, ML-2 fault, . . .)]
Case-Based Reasoning • Case-based reasoning = matching previous examples • Case library: Set of previous faults, diagnoses, solutions • Usually based on “trouble ticket” help-desk databases • Design considerations: • What are key attributes of a case? • What attributes will be used to index & access a case?
Case-Based Reasoning • Advantages • Easier knowledge acquisition than expert systems • Can learn by adding new cases • Doesn’t require extensive maintenance • Disadvantages • Requires time-consuming user interaction • No help for first-time problems
Case-Based Reasoning Example
Case 134
Problem Type: Performance
Description: High error rate in comm between POA-SP & DF
No access: Intermittent
Retrieval: Case 103 [Similarity = 0.69]
Description: 64kb line from VendorX drops big datagrams.
Additional Info requested: Is there loss of big datagrams in ping test? (Result: Yes)
Cause: Link 34 inside Bldg 207 was defective
Solution: Vendor replaced cabling.
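Retrieval from a case library can be sketched as a weighted attribute match. The source does not say how the similarity of 0.69 was computed, so the measure below is an assumption, as are the attribute names, weights, and library contents; the numbers it produces are not meant to reproduce the example above.

```python
def similarity(query, case_attrs, weights):
    """Fraction of total attribute weight on which the query and the case agree."""
    total = sum(weights.values())
    matched = sum(w for attr, w in weights.items()
                  if attr in query and query[attr] == case_attrs.get(attr))
    return matched / total

def retrieve(query, library, weights):
    """Return the most similar stored case and its similarity score."""
    best = max(library, key=lambda c: similarity(query, c["attributes"], weights))
    return best, similarity(query, best["attributes"], weights)

# Hypothetical index attributes and weights.
weights = {"problem_type": 2, "symptom": 2, "access": 1}

library = [
    {"id": "A",
     "attributes": {"problem_type": "performance", "symptom": "dropped_datagrams", "access": "intermittent"},
     "solution": "replace cabling"},
    {"id": "B",
     "attributes": {"problem_type": "connectivity", "symptom": "no_link", "access": "none"},
     "solution": "restart interface"},
]

query = {"problem_type": "performance", "symptom": "high_error_rate", "access": "intermittent"}
best, score = retrieve(query, library, weights)
print(best["id"], score)

# Learning by adding a case: once solved, the new problem joins the library.
library.append({"id": "C", "attributes": query, "solution": best["solution"]})
```

The choice of which attributes to index and how to weight them is exactly the design question raised on the earlier slide.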
Summary of 3 AI Methods • Expert systems • If / then rules • Well-developed technology • Brittle, hard to maintain • Neural networks • Output = weighted transform of inputs • Fast pattern matching, robust to noise • Black box, lack of training data • Case-based systems • Trouble-ticket retrieval • Easy to build, maintain • Slower diagnosis, takes time to build
Other Approaches • Bayesian networks • Model statistical probabilities and dependence of faults • Mobile intelligent agents • Independent software agents cooperate to collect info, suggest solutions
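The probabilistic reasoning behind a Bayesian network reduces, for a single fault and a single alarm, to Bayes' rule. The sketch below works one such case; all three probabilities are made-up values chosen only to show the arithmetic.

```python
def posterior(prior, p_alarm_given_fault, p_alarm_given_ok):
    """P(fault | alarm) by Bayes' rule for one fault and one alarm."""
    p_alarm = p_alarm_given_fault * prior + p_alarm_given_ok * (1.0 - prior)
    return p_alarm_given_fault * prior / p_alarm

# Hypothetical numbers: faults are rare, and the alarm sometimes fires spuriously.
p_fault = posterior(prior=0.01, p_alarm_given_fault=0.9, p_alarm_given_ok=0.05)
print(round(p_fault, 4))
```

With these inputs most alarms are false positives, since the fault is rare compared with the spurious alarm rate. A full Bayesian network chains many such conditional dependencies together.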
Future Trends • Proactive fault detection • Recognizing trouble signs and taking corrective action before service degrades • Hybrid systems • Multiple AI methods integrated