
Memento: Efficient Monitoring of Sensor Network Health



Presentation Transcript


  1. Memento: Efficient Monitoring of Sensor Network Health. Stanislav Rost and Hari Balakrishnan, CSAIL, MIT. SECON, September 2006

  2. “Sed quis custodiet ipsos custodes?” “But who watches the watchmen?” - Juvenal, Satire VI

  3. Goals and Challenges of Monitoring • Goals • Accuracy: minimize false alarms, maintenance • Timeliness: repair quickly, preserve sensor coverage • Efficiency: in power, bandwidth, to help longevity • Challenges • Packet loss: inherent to the wireless medium • Dynamic routing topology: adapts to link quality • Resource constraints: internal monitoring is not the primary application

  4. Memento Monitoring Suite Breakdown • Failure detection: which nodes have failed? • Collection protocol: gathering network-wide health status • Watchdogs • Logging • Remote inspection

  5. Typical Sensornet Framework • Assume a routing protocol, optimized by a path metric to the root • Example metric: ETX, the expected transmission count to reliably transfer a packet • Periodic communication • Protocol advertisements • Collection sweeps (1 per sweep period) • [Figure: data collection server, gateway (root) node, and sensor nodes]

  6. Two Modules of Memento • Fail-stop failure: a node permanently stops communicating (until reset or repaired) • Failure detectors • Track communication of a subset of neighbors • Detect failures • Form liveness beliefs • Collection protocol • Send liveness updates to the root • Aggregate along the way, vote on status by aggregation • Heartbeats: periodic beacons of other protocols, or Memento’s own; known period of transmission; packets include the source address • Scope of opportunistic monitoring: all neighbors? children? some subset? • Liveness update: a bitmap s.t. bit k = 1 if some node in the subtree thinks node k is alive • Failure set calculation: at the gateway, [roster - live]
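
The liveness bitmaps combine with a bitwise OR on the way up the tree, and the gateway subtracts the reported-live set from the roster. A minimal sketch of both steps (hypothetical helper names; node IDs are assumed to index bits directly, following the slide's "bit k" convention):

```python
def merge_liveness(local_bitmap: int, child_bitmaps: list) -> int:
    """Bitwise-OR child updates into the local view: bit k = 1 if any node
    in the subtree believes node k is alive."""
    merged = local_bitmap
    for bm in child_bitmaps:
        merged |= bm
    return merged

def failure_set(roster: set, live_bitmap: int) -> set:
    """At the gateway: failed = roster minus the nodes reported live."""
    return {k for k in roster if not (live_bitmap >> k) & 1}

# Example: nodes 0-4 on the roster; updates report nodes 0, 1 and 3 alive.
root_view = merge_liveness(0b00001, [0b00010, 0b01000])
print(sorted(failure_set({0, 1, 2, 3, 4}, root_view)))  # -> [2, 4]
```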

  7. Part I: Failure Detection, Problem Statement • Given a maximum false positive rate parameter, develop a scheme which minimizes detection time • Using distributed failure detection: every node is a participant and may monitor a number of other nodes

  8. Adaptive Failure Detectors • Declare a neighbor failed after an abnormally long gap in the sequence of heartbeat arrivals • Estimate “normal” vs. “abnormal” loss bursts from each neighbor • May produce false positives: beliefs that a node has failed when it is alive

  9. Variance-Bound Detector • Samples and estimates the mean and stdev of loss bursts • Provides a guarantee on the rate of false positives • Based on the one-sided Chebyshev inequality • FPreq: goal for maximum false positive rate • Gi: number of consecutive missed heartbeats from neighbor i • HTOi: heartbeat timeout (in heartbeats) indicating failure
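
The timeout follows from the one-sided Chebyshev (Cantelli) bound, P(G - mu >= k*sigma) <= 1 / (1 + k^2). A minimal sketch of the computation in Python, with illustrative names rather than the paper's TinyOS code:

```python
import math

def variance_bound_hto(mean_gap: float, stdev_gap: float, fp_req: float) -> int:
    """Smallest heartbeat timeout HTO such that the one-sided Chebyshev
    (Cantelli) bound guarantees P(gap >= HTO) <= fp_req.
    Setting 1 / (1 + k^2) = fp_req gives k = sqrt(1/fp_req - 1),
    so HTO = mean + k * stdev, rounded up to a whole heartbeat."""
    k = math.sqrt(1.0 / fp_req - 1.0)
    return math.ceil(mean_gap + k * stdev_gap)
```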

  10. Loose Bounds Lead to Long Timeouts • Chebyshev’s inequality provides the worst case for the extremes • Example data set: PMF of loss burst durations from a neighbor; target FP rate = 5%, mean = 4.61, stdev = 3.76 • Heartbeat timeout = 22
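
Plugging the example's numbers into the Cantelli bound shows where a timeout of this size comes from: k = sqrt(1/0.05 - 1) = sqrt(19) ≈ 4.36, so HTO ≈ 4.61 + 4.36 × 3.76 ≈ 21 heartbeats, on the order of the reported timeout of 22 (the exact value depends on how the displayed statistics were rounded).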

  11. Empirical-CDF Detector • Samples gap lengths, maintains counters • If we want FPreq = X%, calculate an HTO that has less than an X% chance of occurring • FPreq: goal for maximum false positive rate • Gi: number of consecutive missed heartbeats from neighbor i • HTOi: heartbeat timeout (in heartbeats) indicating failure • Count: vector of counters of occurrences of each gap length
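
A minimal sketch of that percentile computation, assuming the counters simply record how often each gap length has been observed (illustrative names, not the paper's implementation):

```python
from collections import Counter

def empirical_cdf_hto(gap_counts: Counter, fp_req: float) -> int:
    """Smallest timeout such that, empirically, fewer than fp_req of the
    observed loss bursts were at least that long."""
    total = sum(gap_counts.values())
    tail = total  # number of observed gaps >= the current candidate timeout
    for gap in sorted(gap_counts):
        if tail / total <= fp_req:
            return gap
        tail -= gap_counts[gap]
    return max(gap_counts) + 1  # no observed gap was ever this long

# Toy example: 100 observed gaps, mostly short with a few long outliers.
gaps = Counter({1: 60, 2: 20, 3: 10, 5: 6, 8: 3, 12: 1})
print(empirical_cdf_hto(gaps, 0.05))  # -> 8 (only 4% of past gaps reached length 8)
```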

  12. Same Example, Better Bound • Bounds the timeout by the outliers within the requisite percentile • Example data set: CDF of loss burst durations from a neighbor; target FP rate = 5% • Heartbeat timeout = 12

  13. Testing the Tradeoffs on the Experimental Testbed • Deployed 55-node testbed • 16,076 square feet • Implemented in TinyOS v1.4 • Runs on mica2 motes, crickets, EmStar

  14. Failure Detector Comparison • 45 minutes in duration • Pick X nodes randomly (X ∈ {2, 4, 6, 8}) • Schedule their failure at a random time • sweep = 30 seconds • hb = 10 seconds • Routing stability threshold = 1.5 • Run the same failure schedule for all detector algorithms • Routing = ETX-based tree

  15. Contenders • Direct-Heartbeat • Sends descendant’s liveness bitmaps to the root, with aggregation a la TinyDB • If root hears no update about X, assumes X is dead • Variance-Bound, 1% FP target • Each node monitors its children • Empirical-CDF, 1% FP target • Each node monitors its children • Opportunistic Variance-Bound, 1% FP • Each node monitors any neighbor whose packet loss < 30%

  16. Evaluating Failure Detectors: False Positive Rate

  17. Evaluating Failure Detectors: Detection Time

  18. Explaining the Results • Empirical-CDF has trouble during the learning phase • The learning happens whenever a node gets new children • After another node has failed • After routing reconfiguration • Opportunistic monitoring inflates the detection time • Neighbors with higher loss need more time to achieve confidence in failure

  19. Meeting the False Positive Guarantee • How far can we push our FP target?

  20. Tradeoffs and Limits of Guarantees

  21. Tradeoffs and Limits of Guarantees

  22. Take-Home Lessons • 5x patience gets you 1000x confidence • Neighborhood opportunism is a must to make failure detection practically useful in wireless environments

  23. Part II: Collecting the Network Status • Aggregation: TinyAggregation [TinyDB]

  24. Our Collection Protocol • Parent caches the result • A node sends an update only if its result or parent changes • [Figure: Memento’s cached collection vs. plain aggregation]
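
A minimal sketch of the suppression rule, assuming each node remembers the last result and parent it reported (illustrative names, not the TinyOS implementation):

```python
class CollectionNode:
    """Suppresses redundant updates: transmit only when the aggregated
    result or the chosen parent has changed since the last send."""

    def __init__(self, node_id: int):
        self.node_id = node_id
        self.last_sent_result = None
        self.last_sent_parent = None

    def on_sweep(self, parent: int, result: int, send) -> None:
        if result != self.last_sent_result or parent != self.last_sent_parent:
            send(parent, result)              # parent caches this result
            self.last_sent_result = result
            self.last_sent_parent = parent
        # Otherwise stay silent: the parent's cached copy is still valid.
```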

  25. Collection Protocol Summary • Uses caching to suppress unnecessary communication • Network-associative cache coherence is tricky; we propose mechanisms to maintain it • Saves 80-90% of bandwidth relative to the state of the art • More sensitive to the rate of change of the updates than to routing reconfigurations

  26. Conclusions • Memento collection protocol is very efficient in terms of bandwidth/energy, and well-suited for monitoring • [In paper] Monitoring more neighbors does not lead to better performance • New failure detectors, based on application needs • Need to use neighborhood opportunism to get acceptably low false positive rate

  27. End of Talk • Questions?

  28. Memento’s Approach to Cache Coherence • Children switch away? Snoop routing packets carrying the parent address • Node failures? Failure detectors clear the cache • Parent cache out of sync? Snoop parent updates and check that they are consistent with your results; parents advertise a vector of result sequence numbers • Finite cache slots for child results? Augment routing to subscribe to parents

  29. Collection Protocol Modules

  30. Collection Protocol Evaluation • Sensitivity to the rate of switching parents? • Use ETX, vary the stability threshold (the minimum improvement in “goodness” necessary to switch parents)

  31. Collection Protocol Performance vs Routing Stability

  32. Collection Protocol Evaluation • Sensitivity to the rate of change in node results? • Fix the topology • Vary the fraction of nodes whose result changes every sweep

  33. Collection Protocol Performance vs Rate of Change of State

  34. Status Collection Byte Overhead

  35. Related Work • Sympathy for the Sensor Network Debugger [Ramanathan, Kohler, Estrin | SenSys ’05] • Nucleus [Tolle, Culler | EWSN ’05] • TiNA: Temporal Coherency-Aware In-Network Aggregation [Sharaf, Beaver, Labrinidis, Chrysanthis | MobiDE ’03] • On Failure Detection Algorithms in Overlay Networks [Zhuang, Geels, Stoica, Katz | INFOCOM ’05] • Unreliable Failure Detectors [Chandra, Toueg | JACM ’96] [Gupta, Chandra, Goldszmidt | PODC ’01]

  36. More Memento • Symptom alerts: similar to liveness bitmaps • Watchdogs: core health metrics crossing danger thresholds trigger alarms • Logging: to stable storage or to neighbors • Inspection: cached alert aggregates serve as “breadcrumbs” on the way back to the sources and prune query floods • Example app: detecting network partitioning • Node X dies and becomes the point of fracture • Its parent P sends its bitmap of children as “partitioned”

  37. Future Work • State management in ad-hoc networks • Dynamic, yet stateful protocols • Working on: management of transfers of large samples • Static statistical properties of non-mobile deployments • Leverage models of group sampling to reduce redundancy, provide load-balancing • Working on: statistical modeling, building local models representative of global behavior

  38. Simple Failure Detectors • “Direct-Heartbeat” • A neighbor is alive if one or more of its heartbeats has been received since the last sweep • A neighbor has failed if the failure detector has missed its last K consecutive heartbeats
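
A minimal sketch of this fixed-K rule (illustrative, not the paper's code):

```python
def update_neighbor(missed_heartbeats: int, heartbeat_heard: bool, k: int):
    """One heartbeat interval for one monitored neighbor: reset the miss
    counter on any arrival, declare failure after K consecutive misses."""
    missed_heartbeats = 0 if heartbeat_heard else missed_heartbeats + 1
    return missed_heartbeats, missed_heartbeats >= k
```

The adaptive detectors described earlier replace the fixed, network-wide K with a per-neighbor timeout learned from that neighbor's own loss history.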

  39. Dilemma: False Failure Alarms vs Detection Time • Choose network-wide K given CDF of loss bursts:

  40. Memento Performance Summary • Intended for non-mobile deployments • When node status fluctuates, approaches the costs of the cache-less scheme • Results so far for a long, narrow tree • 6 hops max depth • 2.5 average children

  41. Scope of the Opportunism • Which neighbors are worth monitoring?

  42. Picking Neighbors to Monitor • Candidate scopes: all neighbors, or children only • Pick neighbors whose heartbeat delivery probability > X
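
A one-line illustration of that selection rule (hypothetical names; a 0.7 threshold corresponds to the earlier "packet loss < 30%" criterion):

```python
def monitoring_set(delivery_prob: dict, threshold: float = 0.7) -> set:
    """Opportunistic scope: monitor every neighbor whose heartbeat delivery
    probability exceeds the threshold."""
    return {nbr for nbr, p in delivery_prob.items() if p > threshold}
```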

  43. Tradeoffs in Failure Detection
