
Memento: Efficient Monitoring of Sensor Network Health



Presentation Transcript


  1. Memento: Efficient Monitoring of Sensor Network Health. Stanislav Rost and Hari Balakrishnan, CSAIL, MIT. SECON, September 2006

  2. “Sed quis custodiet ipsos custodes?” “But who watches the watchmen?” - Juvenal, Satire VI

  3. Goals and Challenges of Monitoring • Goals • Accuracy: minimize false alarms, maintenance • Timeliness: repair quickly, preserve sensor coverage • Efficiency: in power, bandwidth, to help longevity • Challenges • Packet loss: inherent to the wireless medium • Dynamic routing topology: adapts to link quality • Resource constraints: internal monitoring is not the primary application

  4. Memento Monitoring Suite Breakdown • Failure detection: which nodes have failed? • Collection protocol: gathering network-wide health status • Watchdogs • Logging • Remote inspection

  5. Typical Sensornet Framework • Assume a routing protocol, optimized by a path metric to the root • Example metric: ETX, the expected transmission count to reliably transfer a packet • Periodic communication • Protocol advertisements • Collection sweeps (1 per sweep period) • [Figure: data collection server, gateway (root) node, and sensor nodes]

  6. Two Modules of Memento • Fail-stop failure: a node permanently stops communicating (until reset or repaired) • Failure detectors • Track communication of a subset of neighbors • Detect failures • Form liveness beliefs • Collection protocol • Send liveness updates to the root • Aggregate along the way, vote on status by aggregation • Heartbeats: periodic beacons of other protocols, or Memento’s own; known period of transmission; packets include the source address • Scope of opportunistic monitoring: all neighbors? children? some subset? • Liveness update: a bitmap s.t. bit k = 1 if some node in the subtree thinks node k is alive • Failure set calculation: at the gateway, [roster - live]
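
The liveness bitmaps combine with a bitwise OR on the way up the tree, and the gateway subtracts the reported-live set from the roster. A minimal sketch of both steps (hypothetical helper names; node IDs are assumed to index bits directly, following the slide's "bit k" convention):

```python
def merge_liveness(local_bitmap: int, child_bitmaps: list) -> int:
    """Bitwise-OR child updates into the local view: bit k = 1 if any node
    in the subtree believes node k is alive."""
    merged = local_bitmap
    for bm in child_bitmaps:
        merged |= bm
    return merged

def failure_set(roster: set, live_bitmap: int) -> set:
    """At the gateway: failed = roster minus the nodes reported live."""
    return {k for k in roster if not (live_bitmap >> k) & 1}

# Example: nodes 0-4 on the roster; updates report nodes 0, 1 and 3 alive.
root_view = merge_liveness(0b00001, [0b00010, 0b01000])
print(sorted(failure_set({0, 1, 2, 3, 4}, root_view)))  # -> [2, 4]
```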

  7. Part I: Failure Detection, Problem Statement • Given a maximum false positive rate parameter, develop a scheme which minimizes detection time • Using distributed failure detection: every node is a participant and may monitor a number of other nodes

  8. Adaptive Failure Detectors • Declare a neighbor failed after an abnormally long gap in the sequence of heartbeat arrivals • Estimate “normal” vs. “abnormal” loss bursts from each neighbor • May produce false positives: beliefs that a node has failed when it is alive

  9. Variance-Bound Detector • Samples and estimates the mean and stdev of loss bursts • Provides a guarantee on the rate of false positives • Based on the one-sided Chebyshev inequality • FPreq: goal for maximum false positive rate • Gi: number of consecutive missed heartbeats from neighbor i • HTOi: heartbeat timeout (in heartbeats) indicating failure
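
The timeout follows from the one-sided Chebyshev (Cantelli) bound, P(G - mu >= k*sigma) <= 1 / (1 + k^2). A minimal sketch of the computation in Python, with illustrative names rather than the paper's TinyOS code:

```python
import math

def variance_bound_hto(mean_gap: float, stdev_gap: float, fp_req: float) -> int:
    """Smallest heartbeat timeout HTO such that the one-sided Chebyshev
    (Cantelli) bound guarantees P(gap >= HTO) <= fp_req.
    Setting 1 / (1 + k^2) = fp_req gives k = sqrt(1/fp_req - 1),
    so HTO = mean + k * stdev, rounded up to a whole heartbeat."""
    k = math.sqrt(1.0 / fp_req - 1.0)
    return math.ceil(mean_gap + k * stdev_gap)
```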

  10. Loose Bounds Lead to Long Timeouts • Chebyshev’s inequality provides the worst case for the extremes • Example data set: PMF of loss burst durations from a neighbor; target FP rate = 5%, mean = 4.61, stdev = 3.76 • Heartbeat timeout = 22
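
Plugging the example's numbers into the Cantelli bound shows where a timeout of this size comes from: k = sqrt(1/0.05 - 1) = sqrt(19) ≈ 4.36, so HTO ≈ 4.61 + 4.36 × 3.76 ≈ 21 heartbeats, on the order of the reported timeout of 22 (the exact value depends on how the displayed statistics were rounded).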

  11. Empirical-CDF Detector • Samples gap lengths, maintains counters • If we want FPreq = X%, calculate an HTO that has less than an X% chance of occurring • FPreq: goal for maximum false positive rate • Gi: number of consecutive missed heartbeats from neighbor i • HTOi: heartbeat timeout (in heartbeats) indicating failure • Count: vector of counters of occurrences of each gap length
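
A minimal sketch of that percentile computation, assuming the counters simply record how often each gap length has been observed (illustrative names, not the paper's implementation):

```python
from collections import Counter

def empirical_cdf_hto(gap_counts: Counter, fp_req: float) -> int:
    """Smallest timeout such that, empirically, fewer than fp_req of the
    observed loss bursts were at least that long."""
    total = sum(gap_counts.values())
    tail = total  # number of observed gaps >= the current candidate timeout
    for gap in sorted(gap_counts):
        if tail / total <= fp_req:
            return gap
        tail -= gap_counts[gap]
    return max(gap_counts) + 1  # no observed gap was ever this long

# Toy example: 100 observed gaps, mostly short with a few long outliers.
gaps = Counter({1: 60, 2: 20, 3: 10, 5: 6, 8: 3, 12: 1})
print(empirical_cdf_hto(gaps, 0.05))  # -> 8 (only 4% of past gaps reached length 8)
```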

  12. Same Example, Better Bound • Bounds the timeout by the outliers within the requisite percentile • Example data set: CDF of loss burst durations from a neighbor; target FP rate = 5% • Heartbeat timeout = 12

  13. Testing the Tradeoffs on the Experimental Testbed • Deployed 55-node testbed • 16,076 square feet • Implemented in TinyOS v1.4 • Runs on mica2 motes, crickets, EmStar

  14. Failure Detector Comparison • 45 minutes in duration • Pick X nodes randomly (X ∈ {2, 4, 6, 8}) • Schedule their failure at a random time • sweep = 30 seconds • hb = 10 seconds • Routing stability threshold = 1.5 • Run the same failure schedule for all detector algorithms • Routing = ETX-based tree

  15. Contenders • Direct-Heartbeat • Sends descendant’s liveness bitmaps to the root, with aggregation a la TinyDB • If root hears no update about X, assumes X is dead • Variance-Bound, 1% FP target • Each node monitors its children • Empirical-CDF, 1% FP target • Each node monitors its children • Opportunistic Variance-Bound, 1% FP • Each node monitors any neighbor whose packet loss < 30%

  16. Evaluating Failure Detectors: False Positive Rate

  17. Evaluating Failure Detectors: Detection Time

  18. Explaining the Results • Empirical-CDF has trouble during the learning phase • The learning happens whenever a node gets new children • After another node has failed • After routing reconfiguration • Opportunistic monitoring inflates the detection time • Neighbors with higher loss need more time to achieve confidence in failure

  19. Meeting the False Positive Guarantee • How far can we push our FP target?

  20. Tradeoffs and Limits of Guarantees

  21. Tradeoffs and Limits of Guarantees

  22. Take-Home Lessons • 5x patience gets you 1000x confidence • Neighborhood opportunism is a must to make failure detection practically useful in wireless environments

  23. Part II: Collecting the Network Status • Aggregation: TinyAggregation [TinyDB]

  24. Our Collection Protocol • Parent caches the result • A node sends an update only if its result or parent changes • [Figure: Memento’s cached collection vs. plain aggregation]
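
A minimal sketch of the suppression rule, assuming each node remembers the last result and parent it reported (illustrative names, not the TinyOS implementation):

```python
class CollectionNode:
    """Suppresses redundant updates: transmit only when the aggregated
    result or the chosen parent has changed since the last send."""

    def __init__(self, node_id: int):
        self.node_id = node_id
        self.last_sent_result = None
        self.last_sent_parent = None

    def on_sweep(self, parent: int, result: int, send) -> None:
        if result != self.last_sent_result or parent != self.last_sent_parent:
            send(parent, result)              # parent caches this result
            self.last_sent_result = result
            self.last_sent_parent = parent
        # Otherwise stay silent: the parent's cached copy is still valid.
```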

  25. Collection Protocol Summary • Uses caching to suppress unnecessary communication • Network-associative cache coherence is tricky; we propose mechanisms to maintain it • Saves 80-90% of bandwidth relative to the state of the art • More sensitive to the rate of change of the updates than to routing reconfigurations

  26. Conclusions • Memento collection protocol is very efficient in terms of bandwidth/energy, and well-suited for monitoring • [In paper] Monitoring more neighbors does not lead to better performance • New failure detectors, based on application needs • Need to use neighborhood opportunism to get acceptably low false positive rate

  27. End of Talk • Questions?

  28. Memento’s Approach to Cache Coherence • Children switch away? Snoop routing packets carrying the parent address • Node failures? Failure detectors clear the cache • Parent cache out of sync? Snoop parent updates and check that they are consistent with your results; parents advertise a vector of result sequence numbers • Finite cache slots for child results? Augment routing to subscribe to parents

  29. Collection Protocol Modules

  30. Collection Protocol Evaluation • Sensitivity to the rate of switching parents? • Use ETX, vary the stability threshold (the minimum improvement in “goodness” necessary to switch parents)

  31. Collection Protocol Performance vs Routing Stability

  32. Collection Protocol Evaluation • Sensitivity to the rate of change in node results? • Fix the topology • Vary the fraction of nodes whose result changes every sweep

  33. Collection Protocol Performance vs Rate of Change of State

  34. Status Collection Byte Overhead

  35. Related Work • Sympathy for the Sensor Network Debugger [Ramanathan, Kohler, Estrin | SenSys ’05] • Nucleus [Tolle, Culler | EWSN ’05] • TiNA: Temporal Coherency-Aware In-Network Aggregation [Sharaf, Beaver, Labrinidis, Chrysanthis | MobiDE ’03] • On Failure Detection Algorithms in Overlay Networks [Zhuang, Geels, Stoica, Katz | INFOCOM ’05] • Unreliable Failure Detectors [Chandra, Toueg | JACM ’96] [Gupta, Chandra, Goldszmidt | PODC ’01]

  36. More Memento • Symptom alerts: similar to liveness bitmaps • Watchdogs: core health metrics crossing danger thresholds trigger alarms • Logging: to stable storage or to neighbors • Inspection: cached alert aggregates serve as “breadcrumbs” on the way back to the sources and prune query floods • Example app: detecting network partitioning • Node X dies and becomes the point of fracture • Its parent P sends its bitmap of children as “partitioned”

  37. Future Work • State management in ad-hoc networks • Dynamic, yet stateful protocols • Working on: management of transfers of large samples • Static statistical properties of non-mobile deployments • Leverage models of group sampling to reduce redundancy, provide load-balancing • Working on: statistical modeling, building local models representative of global behavior

  38. Simple Failure Detectors • “Direct-Heartbeat” • A neighbor is alive if one or more of its heartbeats has been received since the last sweep • A neighbor has failed if the failure detector has missed its last K consecutive heartbeats
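
A minimal sketch of this fixed-K rule (illustrative, not the paper's code):

```python
def update_neighbor(missed_heartbeats: int, heartbeat_heard: bool, k: int):
    """One heartbeat interval for one monitored neighbor: reset the miss
    counter on any arrival, declare failure after K consecutive misses."""
    missed_heartbeats = 0 if heartbeat_heard else missed_heartbeats + 1
    return missed_heartbeats, missed_heartbeats >= k
```

The adaptive detectors described earlier replace the fixed, network-wide K with a per-neighbor timeout learned from that neighbor's own loss history.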

  39. Dilemma: False Failure Alarms vs Detection Time • Choose network-wide K given CDF of loss bursts:

  40. Memento Performance Summary • Intended for non-mobile deployments • When node status fluctuates, approaches the costs of the cache-less scheme • Results so far for a long, narrow tree • 6 hops max depth • 2.5 average children

  41. Scope of the Opportunism • Which neighbors are worth monitoring?

  42. Picking Neighbors to Monitor • Candidate scopes: all neighbors, or children only • Pick neighbors whose heartbeat delivery probability > X
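
A one-line illustration of that selection rule (hypothetical names; a 0.7 threshold corresponds to the earlier "packet loss < 30%" criterion):

```python
def monitoring_set(delivery_prob: dict, threshold: float = 0.7) -> set:
    """Opportunistic scope: monitor every neighbor whose heartbeat delivery
    probability exceeds the threshold."""
    return {nbr for nbr, p in delivery_prob.items() if p > threshold}
```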

  43. Tradeoffs in Failure Detection
