Diagnosing and Debugging Wireless Sensor Networks Eric Osterweil Nithya Ramanathan
Contents • Introduction • Network Management • Parallel Processing • Distributed Fault Tolerance • WSNs • Calibration / Model Based • Conclusion
What do apples, oranges, peaches have in common? Well, they are all fruits, they all grow in groves of trees, etc. However, grapes are also fruits, but they grow on vines! ;)
Defining the Problem • Debugging – an iterative process of detecting and discovering the root-cause of faults • Distinct debugging phases • Pre-deployment • During deployment • Post-deployment • Ongoing Maintenance / Performance Analysis – How different from debugging?
Characteristic Failures1,2 • Pre-Deployment • Bugs characteristic of wireless, embedded, and distributed platforms • During Deployment • Not receiving data at the sink • Neighbor density (or lack thereof) • badly placed nodes • Flaky/variable link connectivity 1 R. Szewczyk, J. Polastre, A. Mainwaring, D. Culler “Lessons from a Sensor Network Expedition”. In EWSN, 2004 2 A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler “Wireless Sensor Networks for Habitat Monitoring”. In ACM International Workshop on Wireless Sensor Networks and Applications.
Characteristic Failures (continued) • Post-Deployment • Failed/rebooted nodes • “Funny” nodes/sensors • batteries with low-voltage levels • Un-calibrated sensors • Ongoing Maintenance / Performance • Low bandwidth / dropped data from certain regions • High power consumption • Poor load-balancing, or high re-transmission rate
Scenarios • You have just deployed a sensor network in the forest, and are not getting data from any node – what do you do? • You are getting wildly fluctuating averages from a region – is this caused by • Actual environmental fluctuations • Bad sensors • Data randomly dropped • Calculation / algorithmic errors • Tampered nodes
Challenges • Existing tools fall-short for sensor networks • Limited visibility • Resource constrained nodes (Can’t run “gdb”) • Bugs characteristic of embedded, distributed, and wireless platforms • Can’t always use existing Internet fault-tolerance techniques (i.e. rebooting) • Extracting Debugging Information • With minimal disturbance to the network • Identifying information used to infer internal state • Minimizing central processing • Minimizing resource consumption
Challenges (continued) • Applications behave differently in the field • Testing configuration changes • Can’t easily log on to nodes • Identifying performance-blocking bugs • Can’t continually manually monitor the network (often physically impossible depending on deployment environment)
Contents • Introduction • Network Management • Parallel Processing • Distributed Fault Tolerance • WSNs • Calibration / Model Based • Conclusion
What is Network Management? I don’t have to know anything about my neighbors to count on them…
Network Management • Observing and tracking nodes • Routers • Switches • Hosts • Ensuring that nodes are providing connectivity • i.e. doing their jobs
Problem • Connectivity failures versus device failures • Correlating outages with their cause(s)
Outage Example (figure: hosts → switches → core switches → routers)
Approach • Polling • ICMP • SNMP • “Downstream event suppression” • If routing has failed, ignore events about downstream nodes • Modeling
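The “downstream event suppression” idea above can be sketched in a few lines. This is an illustrative toy, not any particular management product’s algorithm: given a set of unreachable nodes and a parent map describing the topology, report only the outages that are not explained by a failed node closer to the root.

```python
def suppress_downstream(down, parent):
    """Keep only root-cause outages: drop any down node whose
    path to the root passes through another down node."""
    roots = set()
    for node in down:
        p = parent.get(node)
        ancestor_down = False
        while p is not None:
            if p in down:
                ancestor_down = True
                break
            p = parent.get(p)
        if not ancestor_down:
            roots.add(node)
    return roots

# Hypothetical topology: two hosts hang off switch1, which hangs off router1.
parent = {"host1": "switch1", "host2": "switch1",
          "switch1": "router1", "router1": None}
down = {"host1", "host2", "switch1"}
print(suppress_downstream(down, parent))  # {'switch1'}
```

The host outages are suppressed because their switch is already down, so the operator sees one alarm instead of three.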
Applied to WSNs • Similarities • Similar topologies • Intersecting operations • Network forwarding, routing, etc. • Connectivity vs. device failures • Differences • Network links • Topology dynamism
Contents • Introduction • Network Management • Parallel Processing • Distributed Fault Tolerance • WSNs • Calibration / Model Based • Conclusion
What is Parallel Processing? If one car is fast, are 1,000 cars 1,000 times faster?
Parallel Processing • Coordinating large sets of nodes • Cluster sizes can range to the order of 10^4 nodes • Knowing nodes’ states • Efficient resource allocation • Low communication overhead
Problem • Detecting faults • Recovery of faults • Reducing communication overhead • Maintenance • Software distributions, upgrades, etc.
Approach • Low-overhead state checks • ICMP • UDP-based protocols and topology sensitivity • Ganglia • Process recovery • Process checkpoints • Condor
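The low-overhead state checks above boil down to liveness tracking: each node reports in periodically, and nodes that stay silent too long are presumed failed. A minimal sketch of that idea (illustrative names; not Ganglia’s actual implementation, which multicasts XML state over UDP):

```python
class HeartbeatMonitor:
    """Minimal liveness check: each node periodically reports in;
    nodes silent longer than `timeout` seconds are presumed failed."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        # Record the most recent report from this node.
        self.last_seen[node] = now

    def suspected_failed(self, now):
        # Any node whose last report is older than the timeout.
        return {n for n, t in self.last_seen.items()
                if now - t > self.timeout}

m = HeartbeatMonitor(timeout=30)
m.heartbeat("node-a", now=0)
m.heartbeat("node-b", now=0)
m.heartbeat("node-a", now=25)
print(m.suspected_failed(now=40))  # {'node-b'}
```

In a real cluster the timestamps come from UDP packet arrivals; the trade-off between heartbeat frequency and detection latency is exactly the communication-overhead concern on this slide.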
Applied to WSNs • Similarities • Potentially large sets of nodes • Tracking state is difficult (due to resource constraints) • Communication overheads are limiting
Applied to WSNs (continued) • Differences • Topology is more dynamic in WSNs • Communications are more constrained • Deployment is not structured around computation • Energy is limiting rather than computation overhead • WSNs are much less latency sensitive
Contents • Introduction • Network Management • Parallel Processing • Distributed Fault Tolerance • WSNs • Calibration / Model Based • Conclusion
What is Distributed Fault Tolerance? Put me in coach… PUT ME IN!
Distributed Fault Tolerance • High Availability is a broad category • Hot backups (failover) • Load balancing • etc.
Problem(s) • HA • Track status of nodes • Keeping access to critical resources available as much as possible • Sacrifice hardware for low-latency • Load balancing • Track status of nodes • Keeping load even
Approach • HA • High frequency/low latency heartbeats • Failover techniques • Virtual interfaces • Shared volume mounting • Load balancing • Metric (Round robin, least connections, etc.)
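The load-balancing metrics named above (round robin, least connections) are each a one-line selection rule. A sketch of least-connections selection, with hypothetical backend names:

```python
from itertools import cycle

def pick_least_connections(connections):
    """Route the next request to the backend currently
    serving the fewest connections."""
    return min(connections, key=connections.get)

conns = {"replica-1": 12, "replica-2": 4, "replica-3": 9}
print(pick_least_connections(conns))  # replica-2

# Round robin, by contrast, ignores load entirely and just rotates:
rr = cycle(sorted(conns))
print(next(rr), next(rr))  # replica-1 replica-2
```

The choice of metric matters: round robin is cheapest but assumes uniform request cost, while least-connections adapts to uneven load at the price of tracking per-backend state.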
Applied to WSNs • HA / Load balancing • Similarities • Redundant resources • Differences • Where to begin…MANY
Contents • Introduction • Network Management • Parallel Processing • Distributed Fault Tolerance • WSNs • Calibration / Model Based • Conclusion
What are WSNs? Warning, any semblance of an orderly system is purely coincidental…
BluSH1 • Shell interface for Intel’s IMotes • Enables interactive debugging – can walk up to a mote and access internal state 1 Tom Schoellhammer
Sympathy1,2 • Aids in debugging • pre, during, and post-deployment • Nodes collect metrics & periodically broadcast to the sink • Sink ensures “good qualities” specified by programmer • based on metrics and other gathered information • Faults are identified and categorized by metrics and tests • Spatial-temporal correlation of distributed events to root-cause failures • Test Injection • Proactively injects network probes to validate a fault hypothesis • Triggers self-tests (internal actuation) 1 N. Ramanathan, E. Kohler, D. Estrin, "Towards a Debugging System for Sensor Networks", International Journal for Network Management, 2005. 2 N. Ramanathan, E. Kohler, L. Girod, D. Estrin. "Sympathy: A Debugging System for Sensor Networks". in Proceedings of The First IEEE Workshop on Embedded Networked Sensors, Tampa, Florida, USA, November 16, 2004
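The sink-side triage Sympathy performs can be illustrated with a loose sketch (this is a simplification for intuition, not the published algorithm): each node’s reported metrics include its delivered-packet count and next hop; nodes delivering too little data are flagged, and if a node’s next hop is also below threshold, the route is blamed before the node itself.

```python
def classify_faults(metrics, expected_pkts):
    """Toy Sympathy-style triage: flag under-delivering nodes,
    preferring a route explanation when the next hop is also failing."""
    low = {n for n, m in metrics.items() if m["pkts"] < expected_pkts}
    faults = {}
    for n in low:
        hop = metrics[n]["nexthop"]
        if hop in low:
            faults[n] = "upstream/route failure via " + hop
        else:
            faults[n] = "node or sensor failure"
    return faults

# Hypothetical per-epoch metrics collected at the sink.
metrics = {
    "n1": {"pkts": 0,  "nexthop": "n2"},
    "n2": {"pkts": 1,  "nexthop": "sink"},
    "n3": {"pkts": 10, "nexthop": "sink"},
}
print(classify_faults(metrics, expected_pkts=5))
```

Here n1’s silence is attributed to its failing next hop n2 rather than to n1 itself, mirroring the spatio-temporal correlation idea: localize the root cause instead of alarming on every symptom.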
SNMS1 • Enables interactive health monitoring of WSN in the field • 3 Pieces • Parallel dissemination and collection • Query system for exported attributes • Logging system for asynchronous events • Small footprint / low overhead • Introduces overhead only with human querying 1Gilman Tolle, David Culler, “Design of an Application-Cooperative Management System for WSN” Second EWSN, Istanbul, Turkey, January 31 - February 2, 2005
Contents • Introduction • Network Management • Parallel Processing • Distributed Fault Tolerance • WSNs • Calibration / Model Based • Conclusion
What is Calibration and Modeling? Hey, if you and I both think the answer is true, then who’s to say we’re wrong? ;)
Modeling1,2,3 • “Root-cause Localization” in large scale systems • Process of “identifying the source of problems in a system using purely external observations” • Identify “anomalous” behavior based on externally observed metrics • Statistical analysis and Bayesian networks used to identify faults 1E. Kiciman, A. Fox “Detecting application-level failures in component-based internet services”. In IEEE Transactions on Neural Networks, Spring 2004 2 A. Fox, E. Kiciman, D. Patterson, M. Jordan, R. Katz. “Combining statistical monitoring and predictable recovery for self-management”. In Procs. Of Workshop on Self-Managed Systems, Oct 2004 3 E. Kiciman, L Subramanian. “Root cause localization in large scale systems”
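The simplest form of “identifying anomalous behavior from externally observed metrics” is a threshold on deviation from the historical distribution. The cited work uses richer statistical models and Bayesian networks; this sketch shows only the basic idea, with an illustrative threshold of three standard deviations:

```python
from statistics import mean, stdev

def anomalous(history, value, k=3.0):
    """Flag a metric sample as anomalous if it lies more than k
    standard deviations from its historical mean."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(value - mu) > k * sigma

# Hypothetical per-epoch metric, e.g. packets delivered per node.
history = [100, 103, 98, 101, 99, 102, 100]
print(anomalous(history, 101))  # False
print(anomalous(history, 160))  # True
```

Purely external detection like this says *that* something is wrong, not *why*; root-cause localization then correlates which components the anomalous requests or packets passed through.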
Calibration1,2 • Model physical phenomena in order to predict which sensors are faulty • Model can be based on: • The environment being monitored – e.g. assume that the majority of sensors are providing correct data, then identify sensors that make this model inconsistent1 • Assumptions about the environment – e.g. in a densely sampled area, values of neighboring sensors should be “similar”2 • Debugging can be viewed as sensor network system calibration • Use system metrics instead of sensor data • Based on a model of what metrics should look like in a properly behaving system, identify faulty behavior from inconsistent metrics • Locating and using ground truth • In situ deployments • Low communication/energy budgets • Bias • Noise 1 Jessica Feng, S. Megerian, M. Potkonjak “Model-based Calibration for Sensor Networks”. IEEE International Conference on Sensors, Oct 2003 2 V. Bychovskiy, S. Megerian et al. “A Collaborative Approach to In-Place Sensor Calibration”
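The neighbor-similarity assumption above can be sketched directly: in a densely sampled field, a sensor whose reading is far from the median of its peers is suspect. The tolerance and sensor names below are illustrative, not taken from the cited papers:

```python
from statistics import median

def flag_outliers(readings, tol):
    """Flag sensors whose readings deviate from the group median
    by more than `tol` (assumes most sensors are correct)."""
    m = median(readings.values())
    return {s for s, v in readings.items() if abs(v - m) > tol}

# Hypothetical co-located temperature sensors (deg C).
readings = {"s1": 21.0, "s2": 21.4, "s3": 20.8, "s4": 35.2, "s5": 21.1}
print(flag_outliers(readings, tol=2.0))  # {'s4'}
```

The median makes the check robust when, as assumed, the majority of sensors are correct; the same shape of check applied to system metrics rather than sensor values is the “debugging as calibration” view on this slide.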
Contents • Introduction • Network Management • Parallel Processing • Distributed Fault Tolerance • WSNs • Calibration / Model Based • Conclusion
Promising Ideas • Management by Delegation • Naturally supports heterogeneous architectures by distributing control over the network • Dynamically tasks/empowers less-capable nodes using mobile code • AINs • A node can monitor its own behavior, then detect, diagnose, and repair issues • Model-based fault detection • Models of the physical environment • Bayesian inference engines
Comparison • Network Management • Close, but includes some inflexible assumptions • Parallel Processing • Many similar, but divergent constraints • Distributed Fault Tolerance • Almost totally different • WSNs • New techniques emerging • Calibration • WSN related work becoming available
Conclusion • Distributed debugging is as distributed debugging does1 • WSNs are a particular class of distributed system • There are numerous techniques for distributed debugging • Different conditions warrant different approaches • OR different spins to existing techniques 1 F. Gump et al
References • Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny, “Condor - A Distributed Job Scheduler”, in Thomas Sterling, editor, Beowulf Cluster Computing with Linux, The MIT Press, 2002. ISBN: 0-262-69274-0 • http://www.open.com/pdfs/alarmsuppression.pdf • http://www.top500.org/ • D.E. Culler and J.P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1999. ISBN 1-55860-343-3 • Matthew L. Massie, Brent N. Chun, and David E. Culler, “The Ganglia Distributed Monitoring System: Design, Implementation, and Experience”, Parallel Computing, Vol. 30, Issue 7, July 2004 • “HA-OSCAR Release 1.0 Beta: Unleashing HA-Beowulf”, 2nd Annual OSCAR Symposium, Winnipeg, Manitoba, Canada, May 2004
Questions? No? Great! ;)