Troubleshooting Wireless Mesh Networks

Victor Bahl (bahl@microsoft.com), joint work with Lili Qiu, Ananth Rao (UCB) & Lidong Zhou. Microsoft Research, April 1, 2004

Mesh Network Management “Network management is a process of controlling a complex data network so as to maximize its efficiency and productivity” ISO’s definition of network management: • Fault Management • Configuration Management • Security Management • Performance Management • Accounting Management

Goals Assist with Mesh Router configuration Reactive and Proactive Troubleshooting • Investigate reported performance problems • Time-series analysis to detect deviation from normal behavior • Localize and isolate trouble spots • Collect and analyze traffic reports from mesh nodes • Determine possible causes for the trouble spots • Interference, hardware problems, network congestion, malicious nodes, … Respond to trouble spots • Re-route traffic • Rate limit • Change topology via power control & directional antenna control • Flag environmental changes & problems

Nomenclature Mesh Management Module (M3) • Runs on every node Mesh Management Server (MMS) • Runs on gateway or designated nodes Mesh Network Management Protocol (MNMP) • Protocol (similar to SNMPv3) between M3 and MMS

Focus of this talk • Gathering & Distributing Data • Cleaning Data • Fault Isolation & Diagnosis

Challenges in Fault Diagnosis Characteristics of multi-hop wireless networks • Unpredictable physical medium, prone to link errors • Network topology is dynamic • Resource limitation calls for a diagnosis approach with low overhead • Vulnerable to link attacks Identifying root causes • Just knowing link statistics is insufficient • Signature Based Techniques don’t work well • Determining normal behavior is hard Handling multiple faults • Complicated interactions between faults and traffic, and among faults themselves

Previous Approaches to Fault Diagnosis Protocols for Network Management • ANMP [singh99] • Guerrilla [shen02] Detecting Routing and MAC misbehavior • Watchdog & Pathrater [Baker00] • MACMis [Vaidya03] Fault Management in Infrastructure mode • AirWave, AirDefense, UniCenter, Symbol’s WNMS, IBM’s WSA, Wibhu’s SpectraMon, …

Our Approach Use a network simulator as a real-time diagnostic tool

Fault Detection, Isolation & Diagnosis Process [Diagram: the Agent Module collects raw data. The Manager Module cleans the data, drives the simulator with routes, link loads and signal strength, compares the resulting performance estimate with measured performance, injects candidate faults, diagnoses faults and reports root causes.] Data sources: • SNMP MIBs • Performance Counters • WRAPI • MCL • NativeWiFi

Root Cause Analysis Module

Our Fault Diagnosis Framework Advantages • Flexible & customizable for a large class of networks • Captures complicated interactions within the network, between the network & environment, and among multiple faults • Extensible to detecting new faults • Facilitates what-if analysis Challenges • To accurately reproduce the behavior of the network inside a simulator • To build a fault diagnosis technique using the simulator as a diagnosis tool

Handling the Challenges Reproducing network behavior • Identify the set of traces to collect • Rule out erroneous data from the trace • Drive the simulator with the cleaned traces Building fault diagnosis • Use performance results from trace-driven simulation to establish the normal behavior • Deviation from the normal behavior indicates a potential fault • Identify root causes by efficiently searching the fault space to reproduce the faulty symptoms
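The "deviation from the normal behavior" check can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the function name `deviates`, the relative-tolerance test, and the 20% default are all assumptions made for the example. The simulated throughput plays the role of the normal-behavior baseline.

```python
def deviates(measured, simulated, rel_tol=0.2):
    """Return the links whose measured throughput departs from the simulated baseline.

    measured, simulated: dicts mapping a link id to throughput (same unit in both).
    rel_tol: hypothetical tolerance; 0.2 flags differences above 20% of the baseline.
    """
    suspects = []
    for link, expected in simulated.items():
        observed = measured.get(link)
        if observed is None:
            continue  # missing data is left to the data-cleaning step
        if expected == 0:
            # No traffic expected: any observed traffic counts as a deviation.
            if observed != 0:
                suspects.append(link)
            continue
        if abs(observed - expected) / expected > rel_tol:
            suspects.append(link)
    return suspects


simulated = {"a-b": 10.0, "b-c": 8.0}
measured = {"a-b": 9.5, "b-c": 3.0}
print(deviates(measured, simulated))  # only "b-c" exceeds the 20% tolerance
```

A fixed relative tolerance is the simplest possible choice; a real deployment would more likely derive the threshold from the observed variability of each link.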

Why Simulator?

Simulator Accuracy: RF Propagation RF propagation model versus measured signal strengths for IEEE 802.11a cards from different vendors

Simulator Accuracy: Throughput Estimated versus actual throughput when channel conditions are good (IEEE 802.11a)

Simulator Accuracy: Throughput (2) Estimated matches measured throughput till the channel conditions become poor

Simulator Accuracy: Throughput Estimated matches measured throughput for poor channel conditions when loss rate is incorporated

How Stable is the Channel? Under good environmental conditions, received signal strength remains stable

Data Collection What should we collect? • Network Topology/Connectivity Info (Neighbor Table) • Noise level & signal strength • Traffic load to direct neighbor • Loss rate to direct neighbor (retransmission count)

Data Distribution Design Goal Minimize bandwidth consumption Techniques • Dynamic scoping • Each node takes a local view of the network • The coverage of the local view adapts to traffic patterns • Adaptive monitoring • Minimize measurement overhead in normal case • Change update period • Push and pull • Delta compression • Multicast
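Two of the techniques listed above, delta compression and a changing update period, can be sketched together. This is only an illustration under assumptions: the report layout (a dict keyed by neighbor), the helper names `make_delta`, `apply_delta` and `next_period`, and the halve/double policy with its bounds are hypothetical and do not come from the talk.

```python
_MISSING = object()  # sentinel so that a stored value of None still compares correctly


def make_delta(previous, current):
    """Describe `current` relative to `previous`: changed or new entries, plus removed keys."""
    changed = {k: v for k, v in current.items() if previous.get(k, _MISSING) != v}
    removed = [k for k in previous if k not in current]
    return {"changed": changed, "removed": removed}


def apply_delta(previous, delta):
    """Rebuild the sender's current report on the receiving side."""
    state = dict(previous)
    state.update(delta["changed"])
    for key in delta["removed"]:
        state.pop(key, None)
    return state


def next_period(period, delta, min_period=1.0, max_period=60.0):
    """Report more often when something changed, less often when nothing did."""
    if delta["changed"] or delta["removed"]:
        return max(min_period, period / 2)
    return min(max_period, period * 2)


# Signal strength (dBm) per neighbor in two consecutive reports.
last_report = {"n1": -60, "n2": -72, "n3": -80}
this_report = {"n1": -60, "n2": -75, "n4": -66}
delta = make_delta(last_report, this_report)
assert apply_delta(last_report, delta) == this_report
```

Only the entries for `n2`, `n4` and the removal of `n3` would be transmitted here; the unchanged `n1` entry costs no bandwidth.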

Management Overhead • Info distributed: • Routing changes • Traffic counters (e.g. packets sent and received) • Signal strength [Chart labels: Avg, 1 to 5 hops; 40 Kb/sec, 25 Kb/sec, 15 Kb/sec] BW requirement does not go up much with network size

Measurement Overhead on Throughput

Data Cleaning Data may not be pristine. Why? • Liars, malicious users • Missing data • Measurement errors Clean the Data • Detect Liars • Assumption: most nodes are honest • Approach: • Neighborhood Watch • Find the smallest number of lying nodes to explain inconsistency in traffic reports • Smoothing & Interpolation

Example: Resiliency against Liars/Lossy Links Results Problem • Identify nodes that report incorrect information (liars) • Detect lossy links Assume • Nodes monitor neighboring traffic, build traffic reports and periodically share info. • Most nodes provide reliable information Challenge • Wireless links are error prone and unstable Approach • Find the smallest number of lying nodes to explain inconsistency in traffic reports • Use the consistent information to estimate link loss rates
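The "smallest number of lying nodes" idea can be illustrated with a small sketch. It assumes that each link's traffic is reported by both endpoints and that a pair is inconsistent when the two counts differ by more than a tolerance; the smallest liar set is then the smallest set of nodes touching every inconsistent pair. The exhaustive search below is an assumption chosen for clarity, suitable only for small examples, and is not necessarily how the authors compute it. The function names and the report layout are hypothetical.

```python
from itertools import combinations


def inconsistent_pairs(reports, tolerance=0):
    """reports maps (a, b) to (count seen by a, count seen by b) for traffic on link a-b."""
    return [pair for pair, (x, y) in reports.items() if abs(x - y) > tolerance]


def smallest_liar_set(pairs):
    """Smallest set of nodes such that every inconsistent pair contains at least one of them."""
    nodes = sorted({node for pair in pairs for node in pair})
    for size in range(len(nodes) + 1):
        for candidate in combinations(nodes, size):
            chosen = set(candidate)
            if all(a in chosen or b in chosen for a, b in pairs):
                return chosen
    return set()


reports = {
    ("A", "B"): (100, 100),  # both ends agree
    ("B", "C"): (80, 40),
    ("C", "D"): (50, 90),
    ("A", "C"): (70, 20),
}
print(smallest_liar_set(inconsistent_pairs(reports)))  # {'C'}
```

A nonzero `tolerance` is one way to keep ordinary wireless loss from being mistaken for lying; reports from the pairs that remain consistent can then feed the link loss-rate estimate.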

Fault Diagnosis Algorithm
1. Initialization: diagnosed fault set F = { }
2. Forward addition: while (diff(MeasuredPerf, SimulatedPerf(F)) > threshold) { find the candidate fault that best explains the mismatch between current and predicted performance, and add it to F }
3. Backward deletion: while (some fault in F can be excluded with little change to diff(MeasuredPerf, SimulatedPerf(F))) { find the fault in F that explains the mismatch the least, and delete it from F }
4. Report F
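The forward-addition and backward-deletion steps can be sketched as a greedy search. This is a minimal sketch under stated assumptions, not the authors' code: `diff` is taken to be the sum of absolute per-link differences, "little change" is interpreted as "the mismatch stays within the threshold after removal", and the simulator is replaced by a toy function (`toy_simulate`, `EFFECTS`) invented for the example.

```python
def diagnose(measured, simulate, candidates, threshold):
    """Greedy fault search: forward addition followed by backward deletion.

    measured:   dict mapping link id to measured performance
    simulate:   callable taking a frozenset of faults, returning predicted performance per link
    candidates: list of hashable fault descriptions
    """
    def mismatch(faults):
        predicted = simulate(frozenset(faults))
        return sum(abs(measured[link] - predicted[link]) for link in measured)

    faults = []
    remaining = list(candidates)

    # Forward addition: add the fault that reduces the mismatch the most.
    while remaining and mismatch(faults) > threshold:
        best = min(remaining, key=lambda f: mismatch(faults + [f]))
        if mismatch(faults + [best]) >= mismatch(faults):
            break  # no remaining candidate improves the match
        faults.append(best)
        remaining.remove(best)

    # Backward deletion: drop the least useful fault while the match stays within threshold.
    while faults:
        weakest = min(faults, key=lambda f: mismatch([x for x in faults if x != f]))
        without = [x for x in faults if x != weakest]
        if mismatch(without) > threshold:
            break
        faults = without
    return set(faults)


# Toy stand-in for the simulator: every link carries 10 units unless a fault reduces it.
EFFECTS = {"drop@A": {"A-B": 1}, "drop@B": {"A-B": 5, "B-C": 5}, "noise@D": {"C-D": 4}}


def toy_simulate(faults):
    perf = {"A-B": 10, "B-C": 10, "C-D": 10}
    for fault in faults:
        for link, loss in EFFECTS[fault].items():
            perf[link] -= loss
    return perf


measured = {"A-B": 5, "B-C": 5, "C-D": 6}
print(diagnose(measured, toy_simulate, list(EFFECTS), threshold=0.5))
```

In this toy run the search selects `drop@B` and `noise@D` and leaves out `drop@A`, because adding it would worsen the match. Each call to `simulate` stands for a full simulation run, so the number of candidates tried per iteration dominates the cost.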

Performance 25 node random topology • Faults detected: • Random packet dropping • MAC misbehavior • External noise

What-if Analysis Improvement on removing flows

Mesh Visualization Module

Thanks! http://www.research.microsoft.com/sn/mesh

Backup

Detection of Intentional Packet Drops Scenario - 49 node network - Randomly pick nodes that drop packets
