Troubleshooting Wireless Mesh Networks. Victor Bahl (bahl@microsoft.com), joint work with Lili Qiu, Ananth Rao (UCB) & Lidong Zhou. Microsoft Research, April 1, 2004
Mesh Network Management “Network management is a process of controlling a complex data network so as to maximize its efficiency and productivity” ISO’s definition of network management: • Fault Management • Configuration Management • Security Management • Performance management • Accounting
Goals Assist with Mesh Router configuration Reactive and Proactive Troubleshooting • Investigate reported performance problems • Time-series analysis to detect deviation from normal behavior • Localize and isolate trouble spots • Collect and analyze traffic reports from mesh nodes • Determine possible causes for the trouble spots • Interference, hardware problems, network congestion, malicious nodes, ... Respond to trouble spots • Re-route traffic • Rate limit • Change topology via power control & directional antenna control • Flag environmental changes & problems
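The "time-series analysis to detect deviation from normal behavior" goal can be illustrated with a minimal sketch. It is not the method used in this work: it assumes "normal" is simply the mean of a short sliding window and flags a sample that falls more than k standard deviations away. The function name `find_deviations` and the parameters `window`, `k`, and `min_sigma` are hypothetical.

```python
from statistics import mean, stdev


def find_deviations(samples, window=5, k=3.0, min_sigma=1e-9):
    """Return indices of samples that deviate from recent history.

    A sample is flagged when it lies more than k standard deviations
    from the mean of the `window` samples that precede it.
    Assumption: a sliding-window mean is an adequate model of "normal".
    """
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        # Floor sigma so a perfectly flat history does not give a zero band.
        sigma = max(stdev(history), min_sigma)
        if abs(samples[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged


# Hypothetical per-interval throughput readings (Mb/s) for one link.
throughput = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 3.0, 10.0]
suspect_intervals = find_deviations(throughput)
```

Note that an outlier inflates the variance of the windows that follow it, so this simple detector is less sensitive immediately after a flagged sample.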
Nomenclature Mesh Management Module (M3) • Runs on every node Mesh Management Server (MMS) • Runs on gateway or designated nodes Mesh Network Management Protocol (MNMP) • Protocol (similar to SNMPv3) between M3 and MMS
Focus of this talk • Gathering & Distributing Data • Cleaning Data • Fault Isolation & Diagnosis
Challenges in Fault Diagnosis Characteristics of multi-hop wireless networks • Unpredictable physical medium, prone to link errors • Network topology is dynamic • Resource limitation calls for a diagnosis approach with low overhead • Vulnerable to link attacks Identifying root causes • Just knowing link statistics is insufficient • Signature Based Techniques don’t work well • Determining normal behavior is hard Handling multiple faults • Complicated interactions between faults and traffic, and among faults themselves
Previous Approaches to Fault Diagnosis Protocols for Network Management • ANMP [singh99] • Guerrilla [shen02] Detecting Routing and MAC misbehavior • Watchdog & pathrater [Baker00] • MACMis [Vaidya03] Fault Management in Infrastructure mode • AirWave, AirDefense, UniCenter, Symbol’s WNMS, IBM’s WSA, Wibhu’s SpectraMon, …
Our Approach Use a network simulator as a real-time diagnostic tool
Fault Detection, Isolation & Diagnosis Process [Diagram] Agent Module: Collect Data (raw data sources: SNMP MIBs, Performance Counters, WRAPI, MCL, NativeWiFi). Manager Module: Clean Data, then Simulate (inputs: Routes, Link Loads, Signal Strength) to produce a Performance Estimate; Diagnose Faults compares the estimate with Measured Performance, injects Candidate Faults into the simulation, and outputs Root Causes.
Our Fault Diagnosis Framework Advantages • Flexible & customizable for a large class of networks • Captures complicated interactions within the network, between the network & environment, and among multiple faults • Extensible to detect new faults • Facilitates what-if analysis Challenges • To accurately reproduce the behavior of the network inside a simulator • To build a fault diagnosis technique using the simulator as a diagnosis tool
Handling the Challenges Reproducing network behavior • Identify the set of traces to collect • Rule out erroneous data from the trace • Drive the simulator with the cleaned traces Building fault diagnosis • Use performance results from trace-driven simulation to establish the normal behavior • Deviation from the normal behavior indicates a potential fault • Identify root causes by efficiently searching the fault space to reproduce the faulty symptoms
Simulator Accuracy: RF Propagation RF propagation model versus measured signal strengths for IEEE 802.11a cards from different vendors
Simulator Accuracy: Throughput Estimated versus actual throughput when channel conditions are good (IEEE 802.11a)
Simulator Accuracy: Throughput (2) Estimated matches measured throughput till the channel conditions become poor
Simulator Accuracy: Throughput Estimated matches measured throughput for poor channel conditions when loss rate is incorporated
How Stable is the Channel? Under good environmental conditions, received signal strength remains stable
Data Collection What should we collect? • Network Topology/Connectivity Info (Neighbor Table) • Noise level & signal strength • Traffic load to direct neighbor • Loss rate to direct neighbor (retransmission count)
Data Distribution Design Goal Minimize bandwidth consumption Techniques • Dynamic scoping • Each node takes a local view of the network • The coverage of the local view adapts to traffic patterns • Adaptive monitoring • Minimize measurement overhead in normal case • Change update period • Push and pull • Delta compression • Multicast
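Delta compression, one of the techniques listed above, can be sketched as follows: a node sends only the neighbor-table entries that changed since its last report, and the receiver patches its stored copy. The report layout (a dict mapping neighbor ID to a `(signal_dbm, packets_sent)` tuple) and the function names are assumptions for illustration; the slides do not specify the MNMP wire format.

```python
def delta_encode(previous, current):
    """Return only what changed between two reports.

    `previous` and `current` map a neighbor ID to its latest statistics.
    The layout is hypothetical; any comparable values would work.
    """
    changed = {
        key: value
        for key, value in current.items()
        if key not in previous or previous[key] != value
    }
    removed = [key for key in previous if key not in current]
    return {"changed": changed, "removed": removed}


def delta_apply(previous, delta):
    """Rebuild the current report from the previous one plus a delta."""
    merged = dict(previous)
    merged.update(delta["changed"])
    for key in delta["removed"]:
        merged.pop(key, None)
    return merged


# Hypothetical reports: neighbor -> (signal strength in dBm, packets sent).
last_report = {"n2": (-60, 100), "n3": (-72, 40), "n4": (-80, 5)}
new_report = {"n2": (-60, 100), "n3": (-70, 55), "n5": (-65, 1)}

delta = delta_encode(last_report, new_report)  # "n2" is unchanged and omitted
restored = delta_apply(last_report, delta)
```

In a stable network most entries do not change between updates, so the delta stays small; this sketch assumes sender and receiver agree on which report the delta is relative to.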
Management Overhead • Info distributed: • Routing changes • Traffic counters (e.g., packets sent & received) • Signal strength [Chart: Avg: 1 to 5 hops; values shown: 40 Kb/sec, 25 Kb/sec, 15 Kb/sec] Bandwidth requirement does not go up much with network size
Data Cleaning Data may not be pristine. Why? • Liars, malicious users • Missing data • Measurement errors Clean the Data • Detect Liars • Assumption: most nodes are honest • Approach: • Neighborhood Watch • Find the smallest number of lying nodes to explain inconsistency in traffic reports • Smoothing & Interpolation
Example: Resiliency against Liars/Lossy Links Results Problem • Identify nodes that report incorrect information (liars) • Detect lossy links Assume • Nodes monitor neighboring traffic, build traffic reports and periodically share info. • Most nodes provide reliable information Challenge • Wireless links are error prone and unstable Approach • Find the smallest number of lying nodes to explain inconsistency in traffic reports • Use the consistent information to estimate link loss rates
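The approach of finding the smallest number of lying nodes that explains the inconsistencies can be sketched as below. This is an illustration under stated assumptions, not the algorithm from this work: each link has one count reported by the sender and one by the receiver, a link is "inconsistent" when the two differ by more than a tolerance, and the smallest explaining set is found by exhaustive search (exponential, so only suitable for small neighborhoods). The names `inconsistent_links` and `smallest_liar_set` are hypothetical.

```python
from itertools import combinations


def inconsistent_links(reports, tolerance=0):
    """Return links whose two endpoints disagree about the traffic.

    `reports` maps (sender, receiver) to
    (packets the sender says it sent, packets the receiver says it got).
    A nonzero `tolerance` leaves room for ordinary wireless loss.
    """
    return [
        link
        for link, (sent, received) in reports.items()
        if abs(sent - received) > tolerance
    ]


def smallest_liar_set(bad_links):
    """Smallest node set that touches every inconsistent link.

    Brute force: try all candidate sets in order of increasing size.
    Assumption: if most nodes are honest, the smallest explanation
    is the most plausible one.
    """
    nodes = sorted({node for link in bad_links for node in link})
    for size in range(len(nodes) + 1):
        for candidate in combinations(nodes, size):
            suspects = set(candidate)
            if all(a in suspects or b in suspects for a, b in bad_links):
                return suspects


# Hypothetical traffic reports for a five-link neighborhood.
reports = {
    ("A", "B"): (100, 100),
    ("B", "C"): (80, 50),
    ("C", "D"): (60, 20),
    ("C", "A"): (40, 10),
    ("D", "A"): (30, 30),
}
liars = smallest_liar_set(inconsistent_links(reports))
```

Here every inconsistent link involves node C, so blaming C alone explains all three mismatches. Links that touch no suspected node carry consistent reports, which can then be used to estimate link loss rates.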
Fault Diagnosis Algorithm
1. Initialization: diagnosed fault set F = { }
2. Forward addition: while (diff(MeasuredPerf, SimulatedPerf(F)) > threshold) { Find the candidate fault that explains the most of the mismatch between current and predicted performance, and add it to F }
3. Backward deletion: while (diff(MeasuredPerf, SimulatedPerf(F)) > threshold) { Find the fault in F that explains the least of the mismatch. Delete it from F if excluding it results in little change }
4. Report F
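The forward-addition and backward-deletion search can be sketched in a few lines. This is a minimal interpretation, with several assumptions: `diff` is taken to be the sum of absolute per-flow differences, the backward step removes a fault when dropping it raises the mismatch by no more than a small `slack`, and a toy additive function stands in for the network simulator. All names (`mismatch`, `diagnose`, `toy_simulate`, the fault labels) are hypothetical.

```python
def mismatch(measured, predicted):
    """Sum of absolute per-flow differences (an assumed diff metric)."""
    return sum(abs(measured[k] - predicted[k]) for k in measured)


def diagnose(measured, simulate, candidates, threshold, slack=1e-9):
    """Greedy search over the fault space: forward add, then backward delete."""
    faults = set()
    remaining = set(candidates)

    # Forward addition: add the fault that shrinks the mismatch the most.
    while remaining and mismatch(measured, simulate(faults)) > threshold:
        best = min(
            sorted(remaining),
            key=lambda f: mismatch(measured, simulate(faults | {f})),
        )
        before = mismatch(measured, simulate(faults))
        after = mismatch(measured, simulate(faults | {best}))
        if after >= before:
            break  # no candidate improves the fit; stop adding
        faults.add(best)
        remaining.discard(best)

    # Backward deletion: drop faults whose removal changes little.
    for f in sorted(faults):
        current = mismatch(measured, simulate(faults))
        without = mismatch(measured, simulate(faults - {f}))
        if without - current <= slack:
            faults.discard(f)
    return faults


# Toy stand-in for the simulator: each fault lowers flow throughput (Mb/s).
BASELINE = {"flow1": 10.0, "flow2": 8.0}
EFFECTS = {
    "drop@n3": {"flow1": 4.0},
    "noise@n7": {"flow2": 3.0},
    "mac@n5": {"flow1": 0.1},
}


def toy_simulate(faults):
    perf = dict(BASELINE)
    for fault in faults:
        for flow, loss in EFFECTS[fault].items():
            perf[flow] -= loss
    return perf


measured = {"flow1": 6.0, "flow2": 5.0}
root_causes = diagnose(measured, toy_simulate, EFFECTS, threshold=0.5)
```

In a real deployment `simulate` would be a trace-driven simulator run, which is far more expensive than this toy function, so the number of candidate evaluations per step matters.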
Performance 25 node random topology • Faults detected: • Random packet dropping • MAC misbehavior • External noise
What-if Analysis Improvement on removing flows
Thanks! http://www.research.microsoft.com/sn/mesh
Detection of Intentional Packet Drops Scenario - 49 node network - Randomly pick nodes that drop packets