Queries Over Streaming Sensor Data Samuel Madden Qualifying Exam University of California, Berkeley May 14th, 2002
Introduction • Sensor networks are here • Berkeley is on the cutting edge • Data collection and monitoring are driving applications • My research • Query processing for sensor networks • Server (DBMS) side issues • In-network issues • Goal: Understand how to pose, distribute, and process queries over streaming, lossy, wireless, and power-constrained data sources such as sensor networks.
Overview • Introduction • Sensor Networks & TinyOS • Research Goals • Completed Research: Sensor Network QP • Central Query Processor • In Network, on Sensors • Research Plan • Future Implementation & Research Efforts • Time line • Related Work
Sensor Networks & TinyOS • A collection of small, radio-equipped, battery-powered, networked microprocessors • Typically Ad-hoc & Multihop Networks • Single devices unreliable • Very low power; tiny batteries or solar cells power them for months • Berkeley’s Version: ‘Mica Motes’ • TinyOS operating system (services) • 4K RAM, 512K EEPROM, 128K code space • Lossy: 20% loss @ 5m in Ganesan et al. experiments • Communication Very Expensive • ~800 instructions’ worth of energy per bit transmitted • Apps: Environment Monitoring, Personal Nets, Object Tracking • Data processing plays a key role!
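To make the communication cost concrete, a back-of-the-envelope (my arithmetic, assuming a hypothetical 30-byte message and the 800-instructions-per-bit figure above): one transmission costs roughly 800 × 30 × 8 = 192,000 instructions’ worth of energy, which is why filtering and aggregating data inside the network pays off.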
Overview • Introduction • Sensor Networks & TinyOS • Research Goals • Completed Research: Sensor Network QP • Central Query Processor • In Network, on Sensors • Visualizations • Research Plan • Future Implementation & Research Efforts • Time line • Related Work
Motivation • Why apply a database approach to sensor network data processing? • Declarative Queries • Data independence • Optimization opportunities • Hide low-level complexities • Familiar Interface • Work sharing • Adaptivity • Proper interfaces can leverage existing database systems • The TeleTiny architecture offers all of these • Suitable for a variety of lossy, streaming environments (not just TinyOS!) • Sharing & Adaptivity are Themes
Lots of Help! • Fjords: ICDE 2002, with Franklin • CACQ: SIGMOD 2002, with Shah, Hellerstein, Raman, Franklin • TAG: WMCSA 2002, with Szewczyk, Culler, Franklin, Hellerstein, Hong • Catalog: with Hong [Architecture diagram: a user workstation poses queries to and receives answers from the Telegraph query processor (CACQ: long-running queries that share work; Fjords: handle push-based data; disk storage; visualizations + interfaces), which reaches the network through a sensor proxy (mediates between sensors & QP); in the network, the TeleTiny implementation: TAG in-network aggregation, catalog + sensor schema, real-world deployment @ Intel Berkeley. Regions of the diagram are labeled Completed Research and Partially Complete + Future Work.]
Overview • Introduction • Sensor Networks & TinyOS • Research Goals • Completed Research: Sensor Network QP • Central Query Processor • In Network, on Sensors • Research Plan • Future Implementation & Research Efforts • Time line • Related Work
Sensor Network Query Processing Challenges • Query Processor Must Be Able To: • Tolerate lossy data delivery • Handle failure of individual data sources • Conserve power on devices whenever possible • Perhaps by using on-board processing • E.g. applying selection predicates in network • Or by sharing work wherever possible • Handle push-based data • Handle streaming data
Server-Side Sensor QP [Architecture diagram: a user workstation issues queries to and receives answers from the Telegraph + CACQ query processor (long-running queries that share work; Fjords handle push-based data; disk; visualizations + simulations), connected to sensors through a sensor proxy that mediates between sensors & QP.] • Mechanisms • Continuous Queries • Sensor Proxies • Fjord Query Plan Architecture • Stream Sensitive Operators
Continuous Queries (CQ) • Long running queries • User installs a query; receives answers continuously until it is deinstalled • Common in the streaming domain • Instantaneous snapshots don’t tell you much; users may not be interested in history • Monitoring Queries • Examine light levels and locate rooms that are in use • Monitor the temperature in my workspace and adjust it to be in the range (x,y)
Continuously Adaptive Continuous Queries (CACQ) • Given user queries over current sensor data • Expect that many queries will be over the same data sources (e.g. traffic sensors) • Queries over current data are always looking at the same tuples • Those queries can share • Current tuples • Work (e.g. selections) • Sharing reduces computation, communication • Continuously Adaptive • While work is shared, queries come and go • Over long periods of time, selectivities change • Assumptions that were valid at the start of a query are no longer valid
CACQ Overview • Three queries over stream R share one eddy: SELECT * FROM R WHERE s1(a),s2(b); SELECT * FROM R WHERE s3(a),s4(b); SELECT * FROM R WHERE s5(a),s6(b) [Diagram: tuples R1, R2 are routed through grouped filters over attribute a (predicates s1, s3, s5) and attribute b (predicates s2, s4, s6).]
CACQ - Adaptivity • Q1: SELECT * FROM S WHERE A, B, C • Q2: SELECT * FROM S WHERE A, B, D [Diagram compares three plans over data stream S: conventional queries run a separate operator chain per query; NiagaraCQ shares the static prefix A, B and then splits into C and D (with a reject step); CACQ does work sharing via tuple lineage, where each tuple s records which of the predicates (C, D, B, A) it has already passed.]
CACQ Contributions • Continuous adaptivity (operator reordering) via eddies • All queries within the same eddy • Routing policies to enable that reordering • Explicit Tuple Lineage • Within each tuple, store where it has been and where it must go • Maximizes sharing of tuples between queries • Grouped Filter • Predicate index that applies range & equality selections for multiple queries at the same time (see the sketch below)
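A minimal sketch of a grouped filter, assuming a single numeric attribute and only strict < and > predicates (my construction; the CACQ implementation also handles equality and two-sided ranges). Sorted bound lists plus binary search let one pass over a tuple evaluate every query’s predicate on that attribute:

    import bisect

    class GroupedFilter:
        """Predicate index over one attribute for many queries.
        Query ids are assumed to be nonnegative integers."""
        def __init__(self):
            self.lt = []   # sorted (bound, qid): query qid wants attr < bound
            self.gt = []   # sorted (bound, qid): query qid wants attr > bound

        def add_lt(self, bound, qid):
            bisect.insort(self.lt, (bound, qid))

        def add_gt(self, bound, qid):
            bisect.insort(self.gt, (bound, qid))

        def matches(self, value):
            """Return the query ids whose predicate accepts value."""
            qids = set()
            # attr < bound holds for every bound strictly above value
            i = bisect.bisect_right(self.lt, (value, float("inf")))
            qids.update(q for _, q in self.lt[i:])
            # attr > bound holds for every bound strictly below value
            j = bisect.bisect_left(self.gt, (value, -1))
            qids.update(q for _, q in self.gt[:j])
            return qids

    # gf = GroupedFilter(); gf.add_lt(100, 1); gf.add_gt(50, 2)
    # gf.matches(75)  ->  {1, 2}

Intersecting the match sets from the filters a tuple visits, tracked in its lineage, yields the queries the tuple still satisfies.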
CACQ vs. NiagaraCQ • Performance comparable for the one experiment in the NCQ paper • Example where CACQ destroys NCQ: an expensive UDF plus |result| > |stocks|: SELECT stocks.sym, articles.text FROM stocks, articles WHERE stocks.sym = articles.sym AND UDF(stocks) [Diagram: plans for Queries 1-3, each joining Stocks and Articles on sym, with the expensive UDF placed below or above the join.]
CACQ vs. NiagaraCQ #2 • Three queries, each with its own expensive UDF (UDF1, UDF2, UDF3); again |result| > |stocks| [Diagram: three plan alternatives over stream S and the Stocks.sym = Articles.sym join: Niagara Option #1 (split S into S1, S2, S3 and run a separate UDF and join per query), Niagara Option #2 (share the join and apply UDF1-UDF3 above it), and CACQ (one shared, adaptively ordered plan).]
CACQ Review • Many Queries, One Eddy • Fine Grained Adaptivity • Grouped Filter Predicate Index • Tuple Lineage
Sensor Proxy [Dataflow diagram: query registration hands parsed queries [sources, ops] to the query processor; the proxy receives [fields, filters, aggregates, rates] and returns query [tuples].] • CQ is a query processing mechanism; still need to get data from sensors • Mediate between Sensors and Query Processor • Push operators out to sensors • Hide query processing, knowledge of multiple queries from sensors • Hide details of sensors from query processor • Enable power-sensitivity
Fjording The Stream • Sensors, even through the proxy, deliver data in an unusual way: pushed, at unpredictable rates • Query plan implementation • Useful for streams and distributed environments • Combine push (streaming) data and pull (static) data • E.g. traffic sensors with CHP accident reports (see the sketch below)
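A minimal sketch of the push/pull queue idea (my construction, not the actual Fjords API). The operator above a queue calls get() either way; a push queue never blocks, and a pull queue fetches on demand:

    import queue

    class PushQueue:
        """Streaming input: the sensor proxy enqueues tuples as they arrive."""
        def __init__(self):
            self._q = queue.Queue()

        def enqueue(self, tup):        # called by the push-based source
            self._q.put(tup)

        def get(self):                 # None means "nothing yet; try later"
            try:
                return self._q.get_nowait()
            except queue.Empty:
                return None

    class PullQueue:
        """Static input (e.g. stored accident reports), fetched on demand."""
        def __init__(self, scan_iter):
            self._scan = scan_iter     # an iterator over stored tuples

        def get(self):
            return next(self._scan, None)

Because get() never blocks on a push queue, one thread can schedule a whole plan of operators even when sensors deliver at unpredictable rates.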
Summary of Server Side QP • CACQ [SIGMOD 2002] • Enables sharing of work between long running queries • Enables adaptivity for long running queries • Sensor Proxy • Hides QP complexity from sensors, power issues from QP • Fjords [ICDE 2002] • Enable combination of push and pull data • Non-blocking processing integral to the query processor
Sensor-Side Sensor QP [Architecture diagram: user workstation exchanges queries and answers with the Telegraph query processor, which talks to the TeleTiny implementation in the sensor network: TAG in-network aggregation, catalog + sensor schema, real-world deployment @ Intel Berkeley.] • Research thus far allows the central QP to play nice with sensors • Doesn’t address how sensors can help with QP • Use their processors to process queries • Advertise their capabilities and data sources • Control data delivery rates • Detect, report, and mitigate errors and failures • Two pieces thus far: • Tiny Aggregation (TAG): WMCSA Paper, Resubmission in Progress • Catalog • Lots of work in progress!
Catalog • Problem: Given a heterogeneous environment full of motes, how do I know what data they can provide or process? • Solution: Store a small catalog on each device describing its capabilities • Mirror that catalog centrally to avoid overloading sensors • Enables data independence • Catalog Content: • For each attribute: • Name, Type, Size • Units (e.g. Fahrenheit) • Resolution (e.g. 10 bits) • Calibration Information • Accessor functions • Cost information • Power, time, maximum sample rate
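A sketch of what one per-attribute catalog entry might look like, mirroring the content listed above (field names are illustrative, not the actual TeleTiny schema):

    from dataclasses import dataclass

    @dataclass
    class AttributeEntry:
        name: str             # e.g. "temperature"
        type: str             # e.g. "uint16"
        size_bytes: int       # on-the-wire size
        units: str            # e.g. "Fahrenheit"
        resolution_bits: int  # e.g. 10-bit ADC
        calibration: tuple    # e.g. (scale, offset) for raw readings
        accessor: str         # function that samples this attribute
        cost_uj: float        # energy per sample (power cost)
        max_sample_hz: float  # maximum sustainable sample rate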
Tiny Aggregation (TAG) • How can sensors be leveraged in query processing? • Insight: Aggregate queries common case! • Users want summaries of information across hundreds or thousands of nodes • Information from individual nodes: • Often uninteresting • Could be expensive to retrieve at fine granularity • Take advantage of tree-based multihop routing • Common way to collect data at a centralized location • Combine data at each level to compute aggregates in network
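A minimal sketch of combining data at each level, using AVERAGE (my phrasing of TAG’s partial-state records; function names are illustrative): each node merges its children’s partial state with its own reading and forwards one record instead of every raw value:

    def init_avg(reading):
        return (reading, 1)                # partial state: (sum, count)

    def merge_avg(a, b):
        return (a[0] + b[0], a[1] + b[1])  # combine two partial states

    def eval_avg(state):
        return state[0] / state[1]         # final answer, computed at the root

    def aggregate(node):
        """Post-order walk of the routing tree; node.reading and
        node.children are hypothetical fields."""
        state = init_avg(node.reading)
        for child in node.children:
            state = merge_avg(state, aggregate(child))
        return state

For COUNT, the partial state is just a count; that is the query used in the example slides that follow.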
Advantages of TAG • Order of magnitude decrease in communication for some aggregates • Streaming results: • Converge after transient errors • Successive results in half the messages of initial result • Reduces the burden on the upper levels of routing tree • Declarative queries enable: • Optimizations based on a classification of aggregate properties • Very simple to deploy, use
TAG Example SELECT COUNT(*) FROM sensors [Diagram: a multihop routing tree rooted at node 1, with nodes 2-6 below it; the query is flooded down the tree.]
TAG Example (cont.) SELECT COUNT(*) FROM sensors • Partial records are (Sensor ID, Epoch, Count); results flow up one level of the tree per epoch • Epoch 0: leaves report themselves: (4, 0, 1), (5, 0, 1), (6, 0, 1) • Epoch 1: interior nodes merge their children: (2, 0, 2), (3, 0, 2); leaves report (4, 1, 1), (5, 1, 1), (6, 1, 1) • Epoch 2: the root produces the first complete count, (1, 0, 6); below it, (2, 1, 3), (3, 1, 2) and leaf reports (4, 2, 1), (5, 2, 1), (6, 2, 1) • Epoch 3: the root emits (1, 1, 6), with (2, 2, 3), (3, 2, 2), (4, 3, 1), (5, 3, 1), (6, 3, 1) below • Value at Root is (d-1) Epochs Old • New Value Every Epoch • Nodes must cache old values
TAG: Optimizations + Loss Tolerance • Optimizations to Decrease Message Overhead • When computing a MAX, nodes can suppress their own transmissions if they hear neighbors with greater values • Or, root can propagate down a ‘hypothesis’ • Suppress values that don’t change between epochs • Techniques to Handle Lossiness of Network • Cache child results • Send results up multiple paths in the routing tree • Grouping • Techniques for handling too many groups (aka group eviction)
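A hedged sketch of the MAX suppression logic described above (my reading of the slide; names are illustrative): a node stays quiet when the root’s hypothesis, or a snooped neighbor value, already dominates its local maximum:

    def should_transmit(local_max, hypothesis, heard_from_neighbors):
        """Return False to suppress this node's transmission this epoch."""
        if hypothesis is not None and local_max <= hypothesis:
            return False   # cannot change the answer the root already expects
        if any(v >= local_max for v in heard_from_neighbors):
            return False   # a neighbor already reported something as large
        return True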
Experiment: Basic TAG [Simulation results figure: dense packing, ideal communication.]
Sensor QP Summary • In-Sensor Query Processing Consists of • TAG, for in-network aggregation • Order of magnitude reduction in communication costs for simple aggregates. • Techniques for grouping, loss tolerance, and further reduction in costs • Catalog, for tracking queryable attributes of sensors • In upcoming implementation • Selection predicates • Multiplexing multiple queries over network
Overview • Introduction • Sensor Networks & TinyOS • Research Goals • Completed Research: Sensor Network QP • Central Query Processor • In Network, on Sensors • Research Plan • Future Implementation & Research Efforts • Time line • Related Work
What’s Left? • Development Tasks • TeleTiny Implementation • Sensor Proxy Policies & Implementation • Telegraph (or some adaptive QP) Interface • Research Tasks • Publish / Follow-on to TAG • Query Semantics • Real-world Deployment Study • Techniques for Reporting & Managing Resources + Loss
TeleTiny Implementation • In Progress (Goal: Ready for SIGMOD ’02 Demo) • In TinyOS, for Mica Motes, with Wei Hong & JMH • Features: • SELECT and aggregate queries processed in-network • Ability to query arbitrary attributes • Including power, signal strength, etc. • Flexible architecture that can be extended with additional operators • Multiple simultaneous queries • UDFs / UDAs via a VM • Status: • Aggregation & Selection engine built • No UDFs • Primitive routing • No optimizations • Catalog interface designed, stub implementation • 20KB of code space!
Sensor Proxy • Sensor Proxy Issues: • How to choose what runs centrally and what runs on the motes? • Some operators are obvious (e.g. join): • Storage or computation demands preclude running them in-network • For other operators there is a choice: • Limited resources mean motes will not have capacity for all pushable operators • So which subset of operators to push?
Sensor Proxy (cont) • Cost-based query optimization problem; what to optimize? • Power load on network • Central CPU costs • Basic approach: • Push down as much as possible • Push high-update rate, low-state aggregate queries first • Benefit most from TAG • Satisfy other queries by sampling at minimum rate that can satisfy all queries, processing centrally
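An illustrative sketch of that policy (my construction from the bullets above, not a published algorithm): rank pushable queries by how much they benefit from in-network execution, push until mote resources run out, and sample once at a rate that covers everything left for central processing:

    def plan(queries, mote_state_budget):
        """Each query is a hypothetical dict with 'rate_hz' and 'state_bytes'.
        High-rate, low-state aggregates benefit most from TAG."""
        ranked = sorted(queries,
                        key=lambda q: q["rate_hz"] / q["state_bytes"],
                        reverse=True)
        pushed, central, used = [], [], 0
        for q in ranked:
            if used + q["state_bytes"] <= mote_state_budget:
                pushed.append(q)
                used += q["state_bytes"]
            else:
                central.append(q)
        # One shared sample stream must satisfy every central query:
        # sample at the maximum rate any of them requires.
        central_rate = max((q["rate_hz"] for q in central), default=0)
        return pushed, central, central_rate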
Research: Real World Study • Goal: Characterize performance of TeleTiny on a building monitoring network running in the Intel-Research Lab in the PowerBar™ building. • To: • Demonstrate effectiveness of our approach • Derive a number of important workload and real-world parameters that we can only speculate about • Be cool. • Also, Telegraph Integration, which should offer: • CACQ over real sensors • Historical data interface • Queries that combine historical data and streaming sensor data • Fancy adaptive / interactive features • E.g. adjust sample rates on user demand
Real World Study (Cont.) • Measurements to obtain: • Types of queries • Snapshot vs. continuous • Loss + Failure Characteristics • % lost messages, frequency of disconnection • Power Characteristics • Amount of Storage • Server Load • Variability in Data Rates • Is adaptivity really needed? • Lifetime of Queries
Research: Reporting & Mitigating Resource Consumption + Loss • Resource scarcity & loss are endemic to the domain • Problem: What techniques can be used to • Accommodate the desired workload despite limited resources? • Mitigate losses + inform users of them? • Key Issue because: • Dramatically affects usability of the system • Otherwise users will roll their own • Dramatically affects quality of the system • Results are poor without some additional techniques • Within the themes of my research • Sharing of resources • Adaptivity to losses
Some Resource + Loss Tolerance Techniques • Identify locations of loss • E.g. annotate reported values with information about lost children • Provide user with tradeoffs for smoothing loss • TAG • Cache results: temporal smearing (see the sketch below) • Send to multiple parents: more messages, less variance • Or, as in the STREAM project, compute lossy summaries of streams • Offer user alternatives to unanswerable queries • E.g. ask if a lower sample rate would be OK • Or if a nearby set of sensors would suffice • Educate. (Lower expectations!) • Employ Admission Control, Leases
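A minimal sketch of the child-result caching technique from the list above (assumed detail: a fixed max_age in epochs): a parent reuses a child’s last partial result when an epoch’s message is lost, trading staleness (temporal smearing) for smoother answers:

    def merged_child_states(cache, heard, epoch, max_age=2):
        """cache maps child_id -> (epoch_received, partial_state);
        heard holds this epoch's successfully received child states."""
        for child_id, state in heard.items():
            cache[child_id] = (epoch, state)       # refresh with fresh data
        fresh = []
        for child_id, (e, state) in cache.items():
            if epoch - e <= max_age:               # tolerate a few lost epochs
                fresh.append(state)
        return fresh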
Timeline • May - June 2002: • Complete sensor-side software • Schema API • Catalog Server • UDFs • SIGMOD Demo • ICDE Paper on stream semantics • Resubmit TAG (to OSDI, hopefully) • June - August 2002: • Telegraph Integration • Sensor proxy implementation • Instrument + Deploy Lab Monitoring, Begin Data Collection
Timeline (cont.) • August - November 2002 • Telegraph historical results integration / implementation • SIGMOD paper on Lab Monitoring deployment • August - January 2003 • Explore and implement mechanisms for handling resource constraints + faults • February 2003 • VLDB Paper on Resource Constraints • February - June 2003 • Complete Dissertation
Overview • Introduction • Sensor Networks & TinyOS • Research Goals • Completed Research: Sensor Network QP • Central Query Processor • In Network, on Sensors • Research Plan • Future Implementation & Research Efforts • Time line • Related Work
Related Work • Database Research • Cougar (Cornell) • Sequences + Streams • SEQ (Wisconsin) + Temporal Database Systems • Stanford STREAM • Architecture similar to CACQ • State management • Query Semantics • Continuous Queries • NiagaraCQ (Wisconsin) • PSoup (Chandrasekaran & Franklin) • X/YFilter (Altinel & Franklin, Diao & Franklin) • Adaptive / Interactive Query Processing • CONTROL (Hellerstein et al.) • Eddies (Avnur & Hellerstein) • XJoin / Volcano (Urhan & Franklin, Graefe)
Related Work (Cont.) • Sensor / Networking Research • UCLA / ISI / USC (Estrin, Heidemann, et al.) • Diffusion: Sensor-fusion + Routing • Low-level naming: Mechanisms for data collection, joins? • Application specific aggregation • Impact of Network Density on Data Aggregation • Aka Greedy Aggregation, or how to choose a good topology • Network measurements (Ganesan, et al.) • MIT (Balakrishnan, Morris, et al.) • Fancy routing protocols (LEACH / Span) • Insights into data delivery scheduling for power efficiency • Intentional Naming System (INS) • Berkeley / Intel • TinyOS (Hill, et al.), lots of discussion & ideas
Summary • Query processing is a key feature for improving usability of sensor networks • TeleTiny Solution Brings: • On the query processor • Ability to combine + query data as it streams in • Adaptivity and performance • In the sensor network • Power efficiency via in-network evaluation • Catalog • Upcoming research work: • Real world deployment + study • Evaluation of techniques for resource usage + loss mitigation • TAG resubmission • Graduation, Summer 2003!