Improving Data Quality in Wireless Sensing Systems Matthias Keller, Jan Beutel, Lothar Thiele PermaSense Seminar, 10.08.2011
PermaSense Matterhorn Deployment • August 2008 – today • Single base station • Up to 24 sensor nodes • TinyOS/Dozer [Burri2007] • Constant rate • < 0.1 MByte/node/day
Sensor Data Outlier Filtering • A. Hasler: Threshold-based removal of bogus data, downsampling from a 2-minute to a 10-minute sampling interval • Tolle et al.: Temperature measurements, outlier rejection based on battery voltage level • E. Elnahrawy et al.: Bayesian approach for cleaning noisy sensor data • H. Jeung et al.: Data cleaning with model-based anomaly detector • Necessary step to mitigate artifacts of faulty sensors • Usually done by scientific data user/domain expert • Must assume a certain input data quality
Currently Untouched Artifacts • We can observe • Packet duplicates • Node restarts • Order inconsistencies • Temporal vs. logical
Problem Statement • The observed phenomena … • Modify results derived from the data • Statistics, observed sequences of states, … • Are unacceptable when data quality is key • Scientific modeling, early warning, … • Are difficult to avoid in real sensor networks • Resource scarcity, dynamics, multi-hop routing, … • Data cleaning and system validation on a higher layer • Removal of artifacts threatening data utility • Guarantees on data quality and data ordering
Goals of the Data Analysis • Validate packets based on a model of the real system • For valid packets • Add extra packet ordering information • Provide guarantees on time information • Mark other packets as non-conforming
Related Work • Logical notion of time • Lamport’s clock, vector clocks • Network time synchronization • NTP, FTSP, gradient clock sync, … • Data-driven time synchronization • Using microseismics [Lukac2009] • Using sunlight measurements [Gupchup2009] • Offline time reconstruction with Phoenix [Gupchup2010] • Sensor nodes exchange clock information during runtime
Overview • System Model • Data Analysis • Bounds on packet generation time • Duplicate filtering • Epoch assignment • Forward and backward reasoning • Case Study with Model Validation • Recent Results & Usage Example
Model of Multi-hop Data Collection • Periodic sampling • Sampling period T • Sequencing • Increasing sequence number • Resets on arithmetic overflow • Elapsed time on arrival • Sensor nodes measure packet sojourn time • Base station annotates packets with UTC timestamps • Example: a packet arriving at 2011/04/14 10:03:31 with a total sojourn time of 7 sec was generated at 2011/04/14 10:03:31 - 7 sec = 2011/04/14 10:03:24
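The timestamp arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function name and the per-hop split of the 7 seconds are assumptions; the arrival time and the 7-second total come from the slide.

```python
from datetime import datetime, timedelta


def estimate_generation_time(arrival_utc, hop_sojourn_seconds):
    """Subtract the accumulated sojourn time from the base station arrival time."""
    total_sojourn = sum(hop_sojourn_seconds)
    return arrival_utc - timedelta(seconds=total_sojourn)


# Arrival timestamp taken by the base station (from the slide).
arrival = datetime(2011, 4, 14, 10, 3, 31)
# Hypothetical per-hop sojourn times that add up to the 7 seconds on the slide.
print(estimate_generation_time(arrival, [4, 2, 1]))  # 2011-04-14 10:03:24
```

The estimate is only as good as the sojourn time measured by each node, which is why the following slides bound the error instead of trusting a single value.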
Error Model • Clock drift • Affects measurement of • Sampling period T • Packet sojourn time ts • Indirectly leading to ordering inconsistencies • Temporal vs. logical • Node restarts • Cold restart: Power cycle • Soft restart: Watchdog reset • Shorten the sampling period (< T) • Packet loss • Node restart resets the queue (empty queue) • Packet duplicates • Lost 1-hop ACK triggers a retransmission
Formal System Model (1/2) Considering a single sensor node with source address o: • Abstract sequence counter: i • – at last cold restart: • Packet sequence number: • Sampling period: T • Clock drift and resolution: • Packet generation time:
Formal System Model (2/2) • Estimated sojourn time on node N: • Estimated total sojourn time: • Arrival time at base station: • Estimated generation time: • Maximum network diameter: • Error bounds on generation time calculation:
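To make the role of drift, resolution, and network diameter concrete, here is a minimal sketch of how worst-case bounds on the generation time could be computed. The error model is an assumption chosen for illustration, not the formula from the slides: each hop's clock rate deviates by at most `rho` (relative), each hop adds at most `res` seconds of resolution error, and a packet travels at most `h_max` hops. All names are hypothetical.

```python
def generation_time_bounds(t_b, ts_est, rho, res, h_max):
    """Return (lower, upper) bounds on the generation time in seconds.

    t_b    -- arrival time at the base station
    ts_est -- estimated total sojourn time reported with the packet
    rho    -- assumed maximum relative clock drift (e.g. 1e-4)
    res    -- assumed clock resolution per hop
    h_max  -- assumed maximum network diameter in hops
    """
    # Shortest true sojourn time: fast clocks and resolution error over-reported.
    ts_min = max(0.0, (ts_est - h_max * res) / (1.0 + rho))
    # Longest true sojourn time: slow clocks and resolution error under-reported.
    ts_max = (ts_est + h_max * res) / (1.0 - rho)
    return t_b - ts_max, t_b - ts_min


lower, upper = generation_time_bounds(1000.0, 7.0, 0.0, 1.0, 3)
print(lower, upper)  # 990.0 996.0
```

With zero drift the interval width is driven purely by the resolution term; adding drift widens it in proportion to the sojourn time.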
Data Processing • Input format: • Origin o, sequence number s, total sojourn time, payload p, arrival time tb • Output format: • Unique packet identifier id reflects temporal order of generation • Bounds on packet generation time
Analysis Concepts • Remove uncertainty caused by sequence number (problems: arithmetic overflow, node restarts) • Assign packets to epochs • Determine unique packet id • Determine upper and lower bounds on generation time (problems: clock drift, node restarts) • Use forward and backward reasoning • Remove non-compliant packets • Duplicated packets • Incorrect time information • Behavior not covered by formal model
Bounds on Packet Generation Time • Worst-case bounds for a single packet • Forward and backward reasoning is applied to tighten these bounds • Requirement: Exact ordering information
Duplicate Filtering • We consider packets with • the same source address o • the same sequence number s • an equal payload p • We construct a graph G = (V, E) • A duplicate-free data set is obtained by keeping only the packets in a maximum independent set of G
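The filtering step can be sketched as follows. The slide text does not preserve the edge definition, so this sketch assumes that two packets with equal (o, s, p) are connected when their generation time intervals overlap. Under that assumption G is an interval graph per group, and the classic earliest-end-first greedy yields a maximum independent set. The `Packet` type and field names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    origin: int      # source address o
    seq: int         # sequence number s
    payload: bytes   # payload p
    t_lower: float   # lower bound on generation time
    t_upper: float   # upper bound on generation time


def filter_duplicates(packets):
    """Keep one packet per cluster of overlapping intervals with equal (o, s, p)."""
    groups = defaultdict(list)
    for pkt in packets:
        groups[(pkt.origin, pkt.seq, pkt.payload)].append(pkt)

    kept = []
    for group in groups.values():
        # Earliest-end-first greedy: optimal for interval graphs.
        group.sort(key=lambda p: p.t_upper)
        last_upper = None
        for pkt in group:
            if last_upper is None or pkt.t_lower > last_upper:
                kept.append(pkt)
                last_upper = pkt.t_upper
    return kept


a = Packet(1, 5, b"x", 100.0, 110.0)
b = Packet(1, 5, b"x", 105.0, 115.0)   # overlaps a: treated as a duplicate
c = Packet(1, 5, b"x", 500.0, 510.0)   # same key, far later: a distinct packet
print(filter_duplicates([a, b, c]) == [a, c])  # True
```

Packets that share a key but lie far apart in time (for example, the same sequence number in a later epoch) are kept, because no edge connects them.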
Separate Data into Epochs • Observation: Sequence number s(i) resets to zero • Every smax packets due to arithmetic overflow • After a cold restart due to loss of state • After epoch assignment: • k “generated before” l ⇒ id(k) < id(l) • [Figure: sequence number s(i) over counter i, resetting at smax, with epochs e = 1 to e = 5]
Epoch Assignment (1/3) • For each packet, calculate a reference point TC • Ideal case: Perfect clocks, absence of node restarts • Packets belonging to the same epoch have an equal reference point TC • Real case: Imperfect clocks, node restarts • Epoch assignment based on the bound ΔTC
Epoch Assignment (2/3) Theorem 1 All packets k, l that belong to the same epoch, i.e., e(k) = e(l), satisfy where where is an upper bound on the network sojourn time, i.e., and
Epoch Assignment (3/3) Theorem 2 Suppose that the generation period T satisfies Then all packets k, l that belong to different epochs, e.g., e(k) < e(l), satisfy Where is defined in Theorem 1.
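The idea behind the two theorems can be sketched as a simple clustering step. The formulas are not preserved in this transcript, so the sketch assumes the reference point of a packet is its estimated generation time minus s·T (the time at which sequence number 0 of its epoch would have been generated), and that packets of the same epoch differ by at most a given bound `delta_tc` while packets of different epochs differ by more. Function and parameter names are hypothetical.

```python
def assign_epochs(packets, period, delta_tc):
    """Assign an epoch index (starting at 1) to each (seq, t_gen_est) tuple.

    Returns a list of epoch indices aligned with the input order.
    """
    # Assumed reference point: estimated generation time of sequence number 0.
    refs = [t_gen - seq * period for seq, t_gen in packets]
    order = sorted(range(len(packets)), key=lambda i: refs[i])

    epochs = [0] * len(packets)
    epoch = 0
    anchor = None  # smallest reference point of the current epoch
    for i in order:
        if anchor is None or refs[i] - anchor > delta_tc:
            epoch += 1
            anchor = refs[i]
        epochs[i] = epoch
    return epochs


# Three packets of one epoch with slight clock drift, then two after a restart.
pkts = [(0, 1000.0), (1, 1121.0), (2, 1239.0), (0, 1500.0), (1, 1620.0)]
print(assign_epochs(pkts, period=120.0, delta_tc=10.0))  # [1, 1, 1, 2, 2]
```

Once epochs are known, the pair (epoch, sequence number) could serve as an ordering key from which a unique packet id is derived.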
Forward and Backward Reasoning • Initially set worst-case bounds are often too pessimistic • Given the correct order of packet generation, initially set bounds can be improved by using information from temporally adjacent packets • Example: Packet i cannot be generated earlier than its predecessor i-1
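A minimal sketch of the tightening step, using only the ordering constraint named in the example (a packet is not generated before its predecessor, nor after its successor). The function name is hypothetical, and a fuller version might also exploit the sampling period T, which this sketch leaves out.

```python
def tighten_bounds(lower, upper):
    """Tighten per-packet generation time bounds given the generation order.

    lower, upper -- lists of bounds, indexed in order of packet generation.
    Returns new (lower, upper) lists; the inputs are left unchanged.
    """
    lo = list(lower)
    hi = list(upper)
    # Forward reasoning: a packet cannot be generated before its predecessor.
    for i in range(1, len(lo)):
        lo[i] = max(lo[i], lo[i - 1])
    # Backward reasoning: a packet cannot be generated after its successor.
    for i in range(len(hi) - 2, -1, -1):
        hi[i] = min(hi[i], hi[i + 1])
    return lo, hi


lo, hi = tighten_bounds([100, 90, 130], [140, 125, 160])
print(lo, hi)  # [100, 100, 130] [125, 125, 160]
```

In the example, the second packet's lower bound rises from 90 to 100 and the first packet's upper bound drops from 140 to 125, so both intervals shrink without any new measurement.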
Matterhorn Deployment Data • Three phases of system operation • Initial difficulties with hardware and software • Non-conforming system operation • Sensor nodes subject to a high number of restarts • Daily shut down of base station due to insufficient energy
Model Validation (1/2) • [Figure: two validation paths compared by the number of sequence violations. I) Unfiltered data checked against the model. II) Model-based approach: duplicate filtering and epoch assignment split the data into verified data and violating packets before the check against the model.]
Model Validation (2/2) • Previously “dirty” data set has been restored for use • Appropriate method for continuous system validation
Conclusions • Data integrity testing and order reconstruction based on a system model of a real system • Give guarantees on data quality • Duplicate-free data • Correct temporal order of generation • Correct logical ordering • Proposed intermediate packet filtering step facilitates the usage of wireless sensor networks for applications that require the highest data quality • Reference: Matthias Keller, Lothar Thiele, Jan Beutel: Reconstruction of the Correct Temporal Order of Sensor Network Data, IPSN 2011, April 2011, pp. 282-293
PermaDozer Performance Analysis • The received signal strength indicator (RSSI) is measured for every successful reception of a packet • Ratio between signal strength and noise floor • Higher ratio of duplicates in the more challenging environments at Matterhorn and Jungfraujoch • Reference: Matthias Keller, Matthias Woehrle, Roman Lim, Jan Beutel, Lothar Thiele: Comparative Performance Analysis of the PermaDozer Protocol in Diverse Deployments, SenseApp 2011, October 2011, accepted for publication
Sequence Meta-Data Usage Example
• Query for unfiltered data
SELECT d.GENERATION_TIME, d.TEMPERATURE
FROM nodehealth AS d
ORDER BY d.GENERATION_TIME ASC
• Filtered, ordered data with timestamp guarantees
SELECT d.GENERATION_TIME, d.TEMPERATURE,
       (s.GENERATION_TIME_UPPER - s.GENERATION_TIME_LOWER) AS Q  -- timestamp uncertainty
FROM nodehealth AS d
JOIN nodehealth_sequence AS s USING (PK)  -- inner table join includes only valid data
ORDER BY s.ID ASC  -- discrete index
Outlook • Data cleaning operations within GSN virtual sensors • Visualization with SensorViz plot application • Now live on http://data.permasense.ch Matthias Keller, Jan Beutel: Efficient Data Retrieval for Interactive Browsing of Large Sensor Network Data Sets (Demo), IPSN 2011, April 2011, pp. 139-140
Forward/Backward Reasoning Results • Intervals are tightened for 90% of the packets • Mean interval width is reduced by a factor of almost three
Proof Idea for ΔTC • Calculating TC(i) assumes packet generation every T • In practice, the mean distance over smax packets is • < T in the presence of node restarts and a faster clock • > T in the absence of node restarts and a slower clock • We need to bound • the minimal inter-arrival time of a warm restart • the maximum sojourn time of a packet
Validation Results • Only the model-based approach is able to clean data from the first phase A) of non-conforming system operation