Implementation and Research Issues in Query Processing for Wireless Sensor Networks

Implementation and Research Issues in Query Processing for Wireless Sensor Networks Wei Hong Intel Research, Berkeley whong@intel-research.net Sam Madden MIT madden@csail.mit.edu Adapted by L.B.

Declarative Queries • Programming Apps is Hard • Limited power budget • Lossy, low bandwidth communication • Require long-lived, zero admin deployments • Distributed Algorithms • Limited tools, debugging interfaces • Queries abstract away much of the complexity • Burden on the database developers • Users get: • Safe, optimizable programs • Freedom to think about apps instead of details

TinyDB: Prototype declarativequery processor • Platform: Berkeley Motes + TinyOS • Continuous variant of SQL : TinySQL • Power and data-acquisition based in-network optimization framework • Extensible interface for aggregates, new types of sensors

Sensor Network TinyDB Revisited SELECT MAX(mag) FROM sensors WHERE mag > thresh SAMPLE PERIOD 64ms • High level abstraction: • Data centric programming • Interact with sensor network as a whole • Extensible framework • Under the hood: • Intelligent query processing: query optimization, power efficient execution • Fault Mitigation: automatically introduce redundancy, avoid problem areas App Query, Trigger Data TinyDB

Feature Overview • Declarative SQL-like query interface • Metadata catalog management • Multiple concurrent queries • Network monitoring (via queries) • In-network, distributed query processing • Extensible framework for attributes, commands and aggregates • In-network, persistent storage

Architecture TinyDB GUI JDBC TinyDB Client API DBMS PC side 0 Mote side 0 TinyDB query processor 2 1 3 8 4 5 6 Sensor network 7

Data Model • Entire sensor network as one single, infinitely-long logical table: sensors • Columns consist of all the attributes defined in the network • Typical attributes: • Sensor readings • Meta-data: node id, location, etc. • Internal states: routing tree parent, timestamp, queue length, etc. • Nodes return NULL for unknown attributes • On server, all attributes are defined in catalog.xml • Discussion: other alternative data models?

Query Language (TinySQL) SELECT <aggregates>, <attributes> [FROM {sensors | <buffer>}] [WHERE <predicates>] [GROUP BY <exprs>] [SAMPLE PERIOD <const> | ONCE] [INTO <buffer>] [TRIGGER ACTION <command>]

Comparison with SQL • Single table in FROM clause • Only conjunctive comparison predicates in WHERE and HAVING • No subqueries • No column alias in SELECT clause • Arithmetic expressions limited to column op constant • Only fundamental difference: SAMPLE PERIOD clause

TinySQL Examples SELECT nodeid, nestNo, light FROM sensors WHERE light > 400 EPOCH DURATION 1s “Find the sensors in bright nests.” Sensors 1

2 SELECT AVG(sound) FROM sensors EPOCH DURATION 10s • SELECT region, CNT(occupied) AVG(sound) • FROM sensors • GROUP BY region • HAVINGAVG(sound) > 200 • EPOCH DURATION 10s 3 Regions w/ AVG(sound) > 200 TinySQL Examples (cont.) “Count the number occupied nests in each loud region of the island.”

Event-based Queries • ON event SELECT … • Run query only when interesting events happens • Event examples • Button pushed • Message arrival • Bird enters nest • Analogous to triggers but events are user-defined

Query over Stored Data • Named buffers in Flash memory • Store query results in buffers • Query over named buffers • Analogous to materialized views • Example: • CREATE BUFFER name SIZE x (field1 type1, field2 type2, …) • SELECT a1, a2 FROM sensors SAMPLE PERIOD d INTO name • SELECT field1, field2, … FROM name SAMPLE PERIOD d

SELECT AVG(temp) WHERE light > 400 T:1, AVG: 225 T:2, AVG: 250 Queries Results Aggavg(temp) Name: temp Time to sample: 50 uS Cost to sample: 90 uJ Calibration Table: 3 Units: Deg. F Error: ± 5 Deg F Get f: getTempFunc()… got(‘temp’) get (‘temp’) Tables Samples getTempFunc(…) Inside TinyDB Multihop Network Query Processor ~10,000 Lines Embedded C Code ~5,000 Lines (PC-Side) Java ~3200 Bytes RAM (w/ 768 byte heap) ~58 kB compiled code (3x larger than 2nd largest TinyOS Program) Filterlight > 400 Schema TinyOS TinyDB

Q:SELECT … A Q Q R:{…} R:{…} Q B C Q Q Q Q R:{…} D R:{…} Q R:{…} Q Q Q F E Q Tree-based Routing • Tree-based routing • Used in: • Query delivery • Data collection • In-network aggregation • Relationship to indexing?

Sensor Network Research • Very active research area • Can’t summarize it all • Focus: database-relevant research topics • Some outside of Berkeley • Other topics that are itching to be scratched • But, some bias towards work that we find compelling

Topics • In-network aggregation • Acquisitional Query Processing • Heterogeneity • Intermittent Connectivity • In-network Storage • Statistics-based summarization and sampling • In-network Joins • Adaptivity and Sensor Networks • Multiple Queries

Tiny Aggregation (TAG) • In-network processing of aggregates • Common data analysis operation • Aka gather operation or reduction in || programming • Communication reducing • Operator dependent benefit • Across nodes during same epoch • Exploit query semantics to improve efficiency! Madden, Franklin, Hellerstein, Hong. Tiny AGgregation (TAG), OSDI 2002.

1 2 3 4 5 Basic Aggregation • In each epoch: • Each node samples local sensors once • Generates partial state record (PSR) • local readings • readings from children • Outputs PSR during assigned comm. interval • At end of epoch, PSR for whole network output at root • New result on each successive epoch • Extras: • Predicate-based partitioning via GROUP BY

1 2 3 4 5 Illustration: Aggregation SELECT COUNT(*) FROM sensors Interval 4 Sensor # Epoch Interval # 1

1 2 3 4 5 Illustration: Aggregation SELECT COUNT(*) FROM sensors Interval 3 Sensor # 2 Interval #

1 2 3 4 5 Illustration: Aggregation SELECT COUNT(*) FROM sensors Interval 2 Sensor # 1 3 Interval #

1 2 3 4 5 Illustration: Aggregation SELECT COUNT(*) FROM sensors Interval 1 5 Sensor # Interval #

1 2 3 4 5 Illustration: Aggregation SELECT COUNT(*) FROM sensors Interval 4 Sensor # Interval # 1

Aggregation Framework • As in extensible databases, TinyDB supports any aggregation function conforming to: Aggn={finit, fmerge, fevaluate} Finit {a0}  <a0> Fmerge {<a1>,<a2>}  <a12> Fevaluate {<a1>}  aggregate value Partial State Record (PSR) Example: Average AVGinit {v}  <v,1> AVGmerge {<S1, C1>, <S2, C2>}  < S1 + S2 , C1 + C2> AVGevaluate{<S, C>}  S/C Restriction: Merge associative, commutative

Taxonomy of Aggregates • TAG insight: classify aggregates according to various functional properties • Yields a general set of optimizations that can automatically be applied Drives an API!

R R R B B B C C C c/n c c/n A A A n = 2 Use Multiple Parents • Use graph structure • Increase delivery probability with no communication overhead • For duplicate insensitive aggregates, or • Aggs expressible as sum of parts • Send (part of) aggregate to all parents • In just one message, via multicast • Assuming independence, decreases variance SELECT COUNT(*) # of parents = n E(cnt) = n * (c/n * p2) Var(cnt) = n * (c/n)2 * p2 * (1 – p2) = V/n P(link xmit successful) = p P(success from A->R) = p2 E(cnt) = c * p2 Var(cnt) = c2 * p2 * (1 – p2) V

With Splitting No Splitting Critical Link! Multiple Parents Results • Better than previous analysis expected! • Losses aren’t independent! • Insight: spreads data over many links

Acquisitional Query Processing (ACQP) • TinyDB acquires AND processes data • Could generate an infinite number of samples • An acqusitional query processor controls • when, • where, • and with what frequency data is collected! • Versus traditional systems where data is provided a priori Madden, Franklin, Hellerstein, and Hong. The Design of An Acqusitional Query Processor. SIGMOD, 2003.

ACQP: What’s Different? • How should the query be processed? • Sampling as a first class operation • How does the user control acquisition? • Rates or lifetimes • Event-based triggers • Which nodes have relevant data? • Index-like data structures • Which samples should be transmitted? • Prioritization, summary, and rate control

Correct ordering (unless pred1 is very selective and pred2 is not): Traditional DBMS (pred2) (pred1) (pred1) Costly mag light (pred2) ACQP (pred2) (pred1) light Cheap mag mag light Operator Ordering: Interleave Sampling + Selection SELECT light, mag FROM sensors WHERE pred1(mag) AND pred2(light) EPOCH DURATION 1s At 1 sample / sec, total power savings could be as much as 3.5mW  Comparable to processor! • E(sampling mag) >> E(sampling light) • 1500 uJ vs. 90 uJ

WINMAX (mag>x) Traditional DBMS WINMAX mag (mag>x) ACQP (light > MAX) light mag light Exemplary Aggregate Pushdown SELECT WINMAX(light,8s,8s) FROM sensors WHERE mag > x EPOCH DURATION 1s • Novel, general pushdown technique • Mag sampling is the most expensive operation!

Topics • In-network aggregation • Acquisitional Query Processing • Heterogeneity • Intermittent Connectivity • In-network Storage • Statistics-based summarization and sampling • In-network Joins • Adaptivity and Sensor Networks • Multiple Queries

Heterogeneous Sensor Networks • Leverage small numbers of high-end nodes to benefit large numbers of inexpensive nodes • Still must be transparent and ad-hoc • Key to scalability of sensor networks • Interesting heterogeneities • Energy: battery vs. outlet power • Link bandwidth: Chipcon vs. 802.11x • Computing and storage: ATMega128 vs. Xscale • Pre-computed results • Sensing nodes vs. QP nodes

Computing Heterogeneity with TinyDB • Separate query processing from sensing • Provide query processing on a small number of nodes • Attract packets to query processors based on “service value” • Compare the total energy consumption of the network • No aggregation • All aggregation • Opportunistic aggregation • HSN proactive aggregation Mark Yarvis and York Liu, Intel’s Heterogeneous Sensor Network Project,ftp://download.intel.com/research/people/HSN_IR_Day_Poster_03.pdf.

5x7 TinyDB/HSN Mica2 Testbed

11% aggregators achieve 72% of max data reduction Optimal placement 2/3 distance from sink. Data Packet Saving • How many aggregators are desired? • Does placement matter?

Mobile GTWY Occasionally Connected Sensornets internet TinyDB Server GTWY TinyDB QP Mobile GTWY Mobile GTWY GTWY GTWY TinyDB QP TinyDB QP

Occasionally Connected Sensornets Challenges • Networking support • Tradeoff between reliability, power consumption and delay • Data custody transfer: duplicates? • Load shedding • Routing of mobile gateways • Query processing • Operation placement: in-network vs. on mobile gateways • Proactive pre-computation and data movement • Tight interaction between networking and QP Fall, Hong and Madden, Custody Transfer for Reliable Delivery in Delay Tolerant Networks, http://www.intel-research.net/Publications/Berkeley/081220030852_157.pdf.

Distributed In-network Storage • Collectively, sensornets have large amounts of in-network storage • Good for in-network consumption or caching • Challenges • Distributed indexing for fast query dissemination • Resilience to node or link failures • Graceful adaptation to data skews • Minimizing index insertion/maintenance cost

Functionality Efficient range query for multidimensional data. Approaches Divide sensor field into bins. Locality preserving mapping from m-d space to geographic locations. Use geographic routing such as GPSR. Assumptions Nodes know their locations and network boundary No node mobility E2= <0.6, 0.7> E1 = <0.7, 0.8> Q1=<.5-.7, .5-1> Example: DIM Xin Li, Young Jin Kim, Ramesh Govindan and Wei Hong, Distributed Index for Multi-dimentional Data (DIM) in Sensor Networks, SenSys 2003.

Statistical Techniques • Approximations, summaries, and sampling based on statistics and statistical models • Applications: • Limited bandwidth and large number of nodes -> data reduction • Lossiness -> predictive modeling • Uncertainty -> tracking correlations and changes over time • Physical models -> improved query answering

Correlated Attributes • Data in sensor networks is correlated; e.g., • Temperature and voltage • Temperature and light • Temperature and humidity • Temperature and time of day • etc.

IDSQ • Idea: task sensors in order of best improvement to estimate of some value: • Choose leader(s) • Suppress subordinates • Task subordinates, one at a time • Until some measure of goodness (error bound) is met • E.g. “Mahalanobis Distance” -- Accounts for correlations in axes, tends to favor minimizing principal axis See “Scalable Information-Driven Sensor Querying and Routing for ad hoc Heterogeneous Sensor Networks.” Chu, Haussecker and Zhao. Xerox TR P2001-10113. May, 2001.

Graphical Representation S1 S2 Residual 2 Residual 1 Preferred because it reduces error along principal axis Area of residuals is equal Model location estimate as a point with 2-dimensional Gaussian uncertainty. Principal Axis

MQSN: Model-based Probabilistic Querying over Sensor Networks Joint work with Amol Desphande, Carlos Guestrin, and Joe Hellerstein Query Processor Model 1 3 4 2 5 6 7 8 9

Probabilistic Query select NodeID, Temp ± 0.1C where NodeID in [1..9] with conf(0.95) Observation Plan [Temp, 3], [Temp, 9] MQSN: Model-based Probabilistic Querying over Sensor Networks Query Processor Model Consult Model 1 3 4 2 5 6 7 8 9

Data [Temp, 3] = …, [Temp, 9] = … MQSN: Model-based Probabilistic Querying over Sensor Networks Query Results Query Processor Model Update Model 1 3 4 2 5 6 7 8 9

Implementation and Research Issues in Query Processing for Wireless Sensor Networks