Memento: Efficient Monitoring of Sensor Network Health
Stanislav Rost and Hari Balakrishnan, CSAIL, MIT
SECON, September 2006
“Sed quis custodiet ipsos custodes?” “But who watches the watchmen?” - Juvenal, Satire VI
Goals and Challenges of Monitoring • Goals • Accuracy: minimize false alarms and unnecessary maintenance • Timeliness: repair quickly, preserve sensor coverage • Efficiency: in power and bandwidth, to help longevity • Challenges • Packet loss: inherent to the wireless medium • Dynamic routing topology: adapts to link quality • Resource constraints: internal monitoring is not the primary application
Memento Monitoring Suite Breakdown • Failure detection: which nodes have failed? • Collection protocol: gathering network-wide health status • Watchdogs • Logging • Remote inspection
Typical Sensornet Framework • [Figure: sensor nodes route over a tree to a gateway/root node attached to a data collection server] • Assume a routing protocol, optimized by a path metric to the root • Example metric: ETX, the expected transmission count to reliably transfer a packet • Periodic communication • Protocol advertisements • Collection sweeps (one per sweep period)
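For context, a sketch of the ETX metric as commonly defined (standard background from De Couto et al., not spelled out on the slide): each link's ETX is derived from its forward and reverse delivery ratios, and a path's metric is the sum over its links.

```latex
% Common ETX definition (background, not from the slide):
% d_f = forward delivery ratio (data packet reaches the neighbor)
% d_r = reverse delivery ratio (link-layer ACK returns)
\mathrm{ETX}_{\text{link}} = \frac{1}{d_f \, d_r},
\qquad
\mathrm{ETX}_{\text{path}} = \sum_{\text{links on path}} \mathrm{ETX}_{\text{link}}
```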
Two Modules of Memento • Failure detectors • Track communication of a subset of neighbors • Detect failures • Form liveness beliefs • Collection protocol • Send liveness updates to the root • Aggregate along the way, vote on status by aggregation • Definitions • Fail-stop failure: node permanently stops communicating (until reset or repaired) • Heartbeats: periodic beacons of other protocols, or Memento’s own; known period of transmission; packets include the source address • Scope of opportunistic monitoring: all neighbors? children? some? • Liveness update: a bitmap s.t. bit k = 1 iff some node in the subtree believes node k is alive • Failure set calculation: at the gateway, [roster – live]
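A minimal sketch of how the liveness bitmaps and the failure-set calculation above could be realized; the roster size, type names, and helpers are illustrative assumptions, not code from the paper.

```c
#include <stdint.h>

#define MAX_NODES    64                       /* illustrative roster size */
#define BITMAP_BYTES ((MAX_NODES + 7) / 8)

typedef struct { uint8_t bits[BITMAP_BYTES]; } bitmap_t;

/* Mark node `id` as believed alive in this node's own liveness update. */
void bitmap_set(bitmap_t *b, unsigned id)
{
    b->bits[id / 8] |= (uint8_t)(1u << (id % 8));
}

/* In-network aggregation: OR a child's liveness update into this node's own,
 * so bit k stays 1 if any node in the subtree believes node k is alive. */
void merge_liveness(bitmap_t *mine, const bitmap_t *from_child)
{
    for (unsigned i = 0; i < BITMAP_BYTES; i++)
        mine->bits[i] |= from_child->bits[i];
}

/* At the gateway: failure set = roster minus the nodes reported live. */
void failure_set(const bitmap_t *roster, const bitmap_t *live, bitmap_t *failed)
{
    for (unsigned i = 0; i < BITMAP_BYTES; i++)
        failed->bits[i] = roster->bits[i] & (uint8_t)~live->bits[i];
}
```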
Part I: Failure Detection. Problem Statement • Given a maximum false positive rate parameter, develop a scheme which minimizes detection time • Using distributed failure detection: every node is a participant and may monitor a number of other nodes
Adaptive Failure Detectors • Declare a neighbor failed after an abnormally long gap in its sequence of heartbeat arrivals • Estimate “normal” vs. “abnormal” loss-burst lengths for each neighbor • May produce false positives: beliefs that a node has failed when it is in fact alive
Variance-Bound Detector • Samples and estimates mean, stdev of loss-burst length • Provides a guarantee on the rate of false positives • Based on one-sided Chebyshev’s inequality (see the sketch below) • FPreq: goal for maximum false positive rate • Gi: number of consecutive missed heartbeats from neighbor i • HTOi: heartbeat “timeout” (in heartbeats) indicating failure
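A minimal sketch of a Chebyshev-based timeout, assuming the one-sided (Cantelli) bound P(Gi ≥ mean + a) ≤ stdev² / (stdev² + a²); the slide does not show the exact formula, so the function below illustrates the idea rather than reproducing the paper’s code.

```c
#include <math.h>

/* Chebyshev/Cantelli-based heartbeat timeout (illustrative).
 * Setting stdev^2 / (stdev^2 + a^2) = fp_req and solving for a gives
 * a = stdev * sqrt((1 - fp_req) / fp_req); waiting for mean + a missed
 * heartbeats then keeps the false positive rate at or below fp_req
 * regardless of the shape of the loss-burst distribution. */
unsigned hto_variance_bound(double mean_gap, double stdev_gap, double fp_req)
{
    double a = stdev_gap * sqrt((1.0 - fp_req) / fp_req);
    return (unsigned)ceil(mean_gap + a);
}
```

With the example on the next slide (mean 4.61, stdev 3.76, 5% target) this evaluates to about 21, consistent with the 22-heartbeat timeout shown there once rounding of the printed mean and stdev is taken into account.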
Loose Bounds Lead to Long Timeouts • Chebyshev’s inequality guards against the worst case, so the bound is loose at the extremes • Example data set: PMF of loss-burst durations from a neighbor; target FP rate = 5%; mean = 4.61; stdev = 3.76 • Heartbeat timeout = 22
Empirical-CDF Detector • Samples gap lengths, maintains counters • If we want FPreq = X%, calculate an HTO that has less than an X% chance of occurring (see the sketch below) • FPreq: goal for maximum false positive rate • Gi: number of consecutive missed heartbeats from neighbor i • HTOi: heartbeat “timeout” (in heartbeats) indicating failure • Count: vector of counters of occurrences of each gap length
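A minimal sketch of deriving the timeout from the Count vector just described; the function name, array layout, and the conservative default are illustrative assumptions, not the paper’s code.

```c
/* Empirical-CDF heartbeat timeout (illustrative).
 * counts[g] holds how many loss bursts of exactly g missed heartbeats have
 * been observed for this neighbor (g = 0..max_gap).  Returns the smallest
 * timeout HTO such that, empirically, a gap of HTO or more heartbeats
 * occurred with probability at most fp_req. */
unsigned hto_empirical_cdf(const unsigned *counts, unsigned max_gap, double fp_req)
{
    unsigned long total = 0;
    for (unsigned g = 0; g <= max_gap; g++)
        total += counts[g];
    if (total == 0)
        return max_gap + 1;                 /* no history yet: be conservative */

    unsigned long tail = 0;                 /* # of observed bursts of length >= g */
    unsigned hto = max_gap + 1;             /* worst case: longer than anything seen */
    for (unsigned g = max_gap; ; g--) {
        tail += counts[g];
        if ((double)tail / (double)total > fp_req)
            break;                          /* including g would exceed the FP budget */
        hto = g;
        if (g == 0)
            break;
    }
    return hto;
}
```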
Same Example, Better Bound • Bounds the timeout by the outliers within the requisite percentile • Example data set: CDF of loss-burst durations from a neighbor; target FP rate = 5% • Heartbeat timeout = 12
Testing the Tradeoffs on the Experimental Testbed • Deployed 55-node testbed • 16,076 square feet • Implemented in TinyOS v1.4 • Runs on mica2 motes, Crickets, EmStar
Failure Detector Comparison • 45 minutes in duration • Pick X nodes randomly (X ∈ {2, 4, 6, 8}) • Schedule their failure at a random time • Sweep period = 30 seconds • Heartbeat period = 10 seconds • Routing stability threshold = 1.5 • Run the same failure schedule for all detector algorithms • Routing = ETX-based tree
Contenders • Direct-Heartbeat • Sends descendants’ liveness bitmaps to the root, with aggregation a la TinyDB • If the root hears no update about X, it assumes X is dead • Variance-Bound, 1% FP target • Each node monitors its children • Empirical-CDF, 1% FP target • Each node monitors its children • Opportunistic Variance-Bound, 1% FP target • Each node monitors any neighbor whose packet loss < 30%
Explaining the Results • Empirical-CDF has trouble during the learning phase • The learning happens whenever a node gets new children • After another node has failed • After routing reconfiguration • Opportunistic monitoring inflates the detection time • Neighbors with higher loss need more time to achieve confidence in failure
Meeting the False Positive Guarantee • How far can we push our FP target?
Take-Home Lessons • 5x patience gets you 1000x confidence • Neighborhood opportunism is a must to make failure detection practically useful in wireless environments
Part II: Collecting the Network Status • Aggregation • TinyAggregation [TinyDB]
Our Collection Protocol: Memento vs. Aggregation • The parent caches each child’s result • A node sends an update only if its result or its parent changes
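A minimal sketch of the suppression rule above, reusing the bitmap_t type from the earlier liveness sketch; the state layout and names are illustrative assumptions, not the paper’s code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Per-sweep update decision (illustrative): re-send only if the aggregated
 * result differs from what the current parent is believed to have cached,
 * or if this node has switched parents since its last update. */
typedef struct {
    bitmap_t last_sent;     /* result the current parent should have cached */
    uint16_t last_parent;   /* parent the last update was sent to */
    bool     have_sent;     /* whether anything has been sent yet */
} upstream_state_t;

bool should_send_update(upstream_state_t *st, const bitmap_t *result, uint16_t parent)
{
    bool changed = !st->have_sent
                || parent != st->last_parent
                || memcmp(result->bits, st->last_sent.bits, BITMAP_BYTES) != 0;
    if (changed) {
        st->last_sent   = *result;   /* remember what the parent will now cache */
        st->last_parent = parent;
        st->have_sent   = true;
    }
    return changed;
}
```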
Collection Protocol Summary • Uses caching to suppress unnecessary communication • Network-associative cache coherence is tricky; we propose mechanisms to maintain it • Saves 80-90% of bandwidth relative to the state of the art • More sensitive to the rate of change of the updates than to routing reconfigurations
Conclusions • Memento collection protocol is very efficient in terms of bandwidth/energy, and well-suited for monitoring • [In paper] Monitoring more neighbors does not lead to better performance • New failure detectors, based on application needs • Need to use neighborhood opportunism to get acceptably low false positive rate
End of Talk • Questions?
Memento’s Approach to Cache Coherence • Children switch away? Snoop routing packets carrying the parent address • Node failures? Failure detectors clear the cache • Parent cache out of sync? Snoop parent updates and check consistency with your own results; parents advertise a vector of result sequence numbers • Finite cache slots for child results? Augment routing so that children subscribe to parents
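One way the “parent cache out of sync” check could look, assuming each update carries a per-child sequence number that the parent echoes in its advertised vector; this is an illustration of the mechanism named on the slide, not the paper’s implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Consistency check against the parent's advertised sequence numbers
 * (illustrative).  seq_advertised[] is the vector the parent broadcasts,
 * indexed by the cache slot it assigned to each child; my_slot and
 * my_last_seq are what this node believes the parent should be holding. */
bool parent_cache_stale(const uint16_t *seq_advertised,
                        unsigned my_slot, uint16_t my_last_seq)
{
    /* A mismatch means our last update was lost or evicted from the parent's
     * cache, so the result should be retransmitted on the next sweep. */
    return seq_advertised[my_slot] != my_last_seq;
}
```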
Collection Protocol Evaluation • Sensitivity to the rate of switching parents? • Use ETX, vary the stability threshold (the minimum improvement in “goodness” necessary to switch parents)
Collection Protocol Evaluation • Sensitivity to the rate of change in node results? • Fix the topology • Vary the fraction of nodes whose result changes every sweep
Related Work • Sympathy for the Sensor Network Debugger [Ramanathan, Kohler, Estrin | SenSys ’05] • Nucleus [Tolle, Culler | EWSN ’05] • TiNA: Temporal Coherency-Aware In-Network Aggregation [Sharaf, Beaver, Labrinidis, Chrysanthis | MobiDE ’03] • On Failure Detection Algorithms in Overlay Networks [Zhuang, Geels, Stoica, Katz | INFOCOM ’05] • Unreliable Failure Detectors [Chandra, Toueg | JACM ’96] [Gupta, Chandra, Goldszmidt | PODC ’01]
More Memento • Symptom alerts: similar to liveness bitmaps • Watchdogs: core health metrics crossing danger thresholds trigger alarms • Logging: to stable storage, to neighbors • Inspection: cached alert aggregates serve as “breadcrumbs” on the way back to the sources and prune query floods • Example app: detecting network partitioning • Node X dies and becomes the point of fracture • Its parent P sends the bitmap of X’s children as “partitioned”
Future Work • State management in ad-hoc networks • Dynamic, yet stateful protocols • Working on: management of transfers of large samples • Static statistical properties of non-mobile deployments • Leverage models of group sampling to reduce redundancy, provide load-balancing • Working on: statistical modeling, building local models representative of global behavior
Simple Failure Detectors • “Direct-Heartbeat” • A neighbor is alive if one or more of its heartbeats has been received since the last sweep • A neighbor has failed if the failure detector has missed its last K consecutive heartbeats
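A minimal sketch of the fixed-K rule in the last bullet; the per-neighbor bookkeeping and names are illustrative assumptions.

```c
#include <stdbool.h>

/* Fixed-threshold detector (illustrative): count consecutive missed heartbeats
 * per neighbor and declare failure once K in a row have been missed. */
typedef struct {
    unsigned consecutive_misses;
    bool     declared_failed;
} neighbor_state_t;

void on_heartbeat_slot(neighbor_state_t *n, bool heard, unsigned K)
{
    if (heard) {
        n->consecutive_misses = 0;        /* any heartbeat resets the count */
        n->declared_failed = false;
    } else if (++n->consecutive_misses >= K) {
        n->declared_failed = true;        /* K misses in a row: presume fail-stop */
    }
}
```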
Dilemma: False Failure Alarms vs Detection Time • Choose network-wide K given CDF of loss bursts:
Memento Performance Summary • Intended for non-mobile deployments • When node status fluctuates, its cost approaches that of the cache-less scheme • Results so far are for a long, narrow tree • 6 hops maximum depth • 2.5 children on average
Scope of the Opportunism • Which neighbors are worth monitoring?
Picking Neighbors to Monitor • Pick neighbors whose heartbeat delivery probability > X • (Compared against the other scopes: all neighbors, or children only)