
Approximate Aggregation Techniques for Sensor Databases

Presentation Transcript


  1. Approximate Aggregation Techniques for Sensor Databases John Byers Department of Computer Science Boston University Joint work with Jeffrey Considine, George Kollios, Feifei Li

  2. Sensor Network Model • Large set of sensors distributed in a sensor field. • Communication via a wireless ad-hoc network. • Nodes and links are failure-prone. • Sensors are resource-constrained • Limited memory, battery-powered, messaging is costly.

  3. Sensor Databases • Treat sensor field as a distributed database • But: data is gathered on demand, not stored or saved. • Perform standard queries over sensor field: • COUNT, SUM, GROUP-BY • Exemplified by work such as TAG and Cougar • For this talk: • One-shot queries • Continuous queries are a natural extension.

  4. Tiny Aggregation (TAG) Approach [Madden, Franklin, Hellerstein, Hong] • Aggregation component of TinyDB • Follows database approach • Uses simple SQL-like language for queries • Power-aware, in-network query processing • Optimizations are transparent to end-user. • TAG supports COUNT, SUM, AVG, MIN, MAX and others

  5. TAG (continued) • Queries proceed in two phases: • Phase 1: • Sink broadcasts desire to compute an aggregate. • Nodes create a routing tree with the sink as the root. • Phase 2: • Nodes start sending back partial results. • Each node receives the partial results of its children and computes a new partial result. • Then it forwards the new partial result to its parent. • Can compute any decomposable function: • f(v1, v2, …, vn) = g(f(v1, …, vk), f(vk+1, …, vn))

  6. Example for SUM • Sink initiates the query • Nodes form a spanning tree • Each node sends its partial result to its parent • Sink computes the total sum (20 for the readings in the slide's figure)
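
To make the decomposable-function idea concrete, here is a minimal Python sketch of tree-based SUM aggregation. The topology, node names, and readings are hypothetical (chosen so the total is 20, echoing the figure); real TAG computes the same total incrementally as messages flow up the routing tree.

    # Minimal sketch of TAG-style in-network SUM aggregation (illustrative only).
    # SUM is decomposable: f(v1..vn) = g(f(v1..vk), f(vk+1..vn)) with g = +,
    # so each node forwards a single running total to its parent.

    def aggregate_sum(children, values, node):
        """Return the partial SUM for the subtree rooted at `node`."""
        partial = values[node]                    # this node's own reading
        for child in children.get(node, ()):      # combine children's partials
            partial += aggregate_sum(children, values, child)
        return partial                            # this is what gets forwarded

    # Hypothetical topology and readings, chosen so the total is 20.
    children = {"sink": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
    values = {"sink": 0, "a": 4, "b": 3, "c": 4, "d": 9, "e": 0}
    assert aggregate_sum(children, values, "sink") == 20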

  7. Classification of Aggregates • TAG classifies aggregates according to • Size of partial state • Monotonicity • Exemplary vs. summary • Duplicate-sensitivity • MIN/MAX (cheap and easy) • Small state, monotone, exemplary, duplicate-insensitive • COUNT/SUM (considerably harder) • Small state and monotone, BUT duplicate-sensitive • Cheap if aggregating over tree without losses • Expensive with multiple paths and losses

  8. Basic approaches to computing SUM • (1) Separate, reliable delivery of every value to the sink • Extremely costly in message traffic and energy consumption • (2) Aggregate values back to the sink along a tree • A single fault eliminates the values of an entire subtree • (3) "Split" values and route fractions separately • Send (value / k) to each of k parents • Better variance, but same expectation as approach (2) • (4) Send values along multiple paths • Duplicates need to be handled. • <ID, value> pairs allow only limited in-network aggregation.

  9. Design Objectives for Robust SUM • Admit in-network aggregation of partial values • Let aggregates be both order-insensitive and duplicate-insensitive • Be agnostic to routing protocol • Trust routing protocol to be best-effort. • Routing and aggregation logically decoupled [NG ’03]. • Some routing algorithms better than others.

  10. Design Objectives (cont) • Final aggregate is exact if at least one representative from each leaf survives to reach the sink. • This won’t happen in practice. • It is reasonable to hope for approximate results. • We argue that it is reasonable to use aggregation methods that are themselves approximate.

  11. Outline • Motivation for sensor databases and aggregation. • COUNT aggregation via Flajolet-Martin • SUM aggregation • Experimental evaluation

  12. Flajolet / Martin sketches [JCSS '85] • Goal: Estimate N, the number of distinct items in a set, from a small-space representation. • The sketch of a union of sets is the OR of their sketches • Insertion order and duplicates don't matter! • Prerequisite: Let h be a random, binary hash function. • Sketch of an item: For each unique item with ID x, compute h(x, i) for each integer 1 ≤ i ≤ k in turn. Stop when h(x, i) = 1, and set bit i.
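
A minimal sketch of the insertion procedure, assuming a k = 16-bit sketch stored in an integer and using SHA-256 as a stand-in for the idealized random binary hash h; the function and constant names here are ours, not from the talk.

    import hashlib

    K = 16  # sketch width in bits (the talk's default)

    def h(x, i):
        """Stand-in for the random binary hash h(x, i): a deterministic
        fair coin per (x, i) pair, derived here from SHA-256."""
        return hashlib.sha256(f"{x}:{i}".encode()).digest()[0] & 1

    def fm_insert(sketch, x):
        """Insert item x: flip coins h(x,1), h(x,2), ... and set the bit
        where the first 1 appears, so bit i is set with probability 2^-i."""
        for i in range(1, K + 1):
            if h(x, i) == 1:
                return sketch | (1 << (i - 1))   # bit i lives at position i-1
        return sketch  # all K coins came up 0 (probability 2^-K)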

  13. Flajolet / Martin sketches (cont) • Estimating COUNT: Take the sketch of a set of N items. Let j be the position of the leftmost zero in the sketch (counting from 0). • j is an estimator of log2 (0.77 N) • Example: S = 1 1 1 0 1 → j = 3 → best guess: COUNT ~ 11 • Fixable drawbacks: • The estimate has a faint bias • The variance of the estimate is large.
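
And the single-sketch estimator, as a hedged sketch: locate the leftmost zero bit (position j, counted from 0 as in the example above) and invert j ≈ log2(0.77 N).

    PHI = 0.77351  # Flajolet-Martin correction constant (0.77 on the slide)

    def fm_estimate(sketch):
        """Estimate COUNT from one sketch: j is the position of the
        leftmost zero bit, and 2^j is roughly 0.77 * N."""
        j = 0
        while sketch & (1 << j):
            j += 1
        return (1 << j) / PHI

    # Slide's example: bits 1 1 1 0 1 (bit 0 written leftmost) -> j = 3.
    print(fm_estimate(0b10111))  # ~10.3, the slide's ballpark of COUNT ~ 11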

  14. Flajolet / Martin sketches (cont) • Standard variance reduction methods apply. • Compute m independent sketches in parallel. • Compute m independent estimates of N. • Take the mean of the estimates. • Provable tradeoffs between m and the variance of the estimator
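
A small self-contained sketch of the averaging step; the example sketch values are hypothetical.

    from statistics import mean

    PHI = 0.77351

    def leftmost_zero(s):
        j = 0
        while s & (1 << j):
            j += 1
        return j

    def fm_estimate_mean(sketches):
        """Variance reduction: average m independent single-sketch estimates."""
        return mean((1 << leftmost_zero(s)) / PHI for s in sketches)

    # Hypothetical: four independent sketches of the same set.
    print(fm_estimate_mean([0b10111, 0b1111, 0b101111, 0b10111]))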

  15. Application to COUNT • Each sensor computes m independent sketches of itself (using its unique ID x) • Coming next: the sensor computes a sketch of its value. • Use a robust routing algorithm to route sketches up to the sink. • Aggregate the m sketches via union en route. • The sink then estimates the count.
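
The union operation that makes this work is plain bitwise OR, which is both order-insensitive and idempotent; a minimal sketch:

    def fm_union(*sketches):
        """Combine sketches with bitwise OR. OR is order-insensitive and
        idempotent, so reordering and duplication en route are harmless."""
        out = 0
        for s in sketches:
            out |= s
        return out

    # Duplicates and reordering do not change the result:
    assert fm_union(0b0111, 0b0011) == fm_union(0b0011, 0b0111, 0b0011) == 0b0111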

  16. Multipath Routing • Braided paths: two paths from the source to the sink that differ in at least two nodes

  17. Routing Methodologies • Considerable work on reliable delivery via multipath routing • Directed diffusion [IGE '00] • "Braided" diffusion [GGSE '01] • GRAdient Broadcast (GRAB) [YZLZ '02] • Broadcast intermediate results along the gradient back to the sink • Can dynamically control the width of the broadcast • Trade off fault tolerance and transmission costs • Our approach is similar to GRAB: • Broadcast; grab if upstream, ignore if downstream • Common goal: try to get at least one copy to the sink

  18. Simple Upstream Routing • Using an expanding ring search, nodes can compute their hop distance from the sink. • Refer to nodes at distance i as level i. • At level i, gather aggregates from level i+1. • Then broadcast aggregates to level i-1 neighbors. • Ignore downstream and sidestream aggregates.
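
A toy simulation of this scheme, under assumptions not in the slides: every node already knows its level, links are lossless here, and sketches are integers combined by OR. The topology and node names are hypothetical.

    def route_up(levels, neighbors, local_sketch):
        """Process nodes from the farthest level inward; each node ORs its
        accumulated sketch into every neighbor one level closer to the sink
        (upstream), ignoring downstream and sidestream neighbors."""
        agg = dict(local_sketch)
        for node in sorted(levels, key=levels.get, reverse=True):
            for nbr in neighbors[node]:
                if levels[nbr] == levels[node] - 1:   # upstream only
                    agg[nbr] |= agg[node]
        sink = min(levels, key=levels.get)
        return agg[sink]

    # Hypothetical 4-node field: c (level 2) reaches the sink via both a and b.
    levels = {"sink": 0, "a": 1, "b": 1, "c": 2}
    neighbors = {"c": {"a", "b"}, "a": {"sink", "c"},
                 "b": {"sink", "c"}, "sink": {"a", "b"}}
    local = {"sink": 0b000, "a": 0b001, "b": 0b010, "c": 0b100}
    assert route_up(levels, neighbors, local) == 0b111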

  19. Extending Flajolet / Martin Sketches • Also interested in approximating SUM • FM sketches can handle this (albeit clumsily): • To insert a value of 500, perform 500 distinct item insertions • Our observation: We can simulate a large number of insertions into an FM sketch more efficiently. • Sensor-net restrictions • No floating point operations • Must keep memory usage and CPU time to a minimum

  20. Simulating a set of insertions • Set all the low-order bits in the "safe" region. • The first S = log c - 2 log log c bits are set to 1 w.h.p. • Statistically estimate the number of trials going beyond the "safe" region • The probability of a trial doing so is simply 2^-S • Number of such trials ~ B(c, 2^-S) [mean = O(log^2 c)] • For trials landing outside the "safe" region, set those bits manually. • Running time is O(1) for each outlying trial. • Expected running time: O(log c) + time to draw from B(c, 2^-S) + O(log^2 c)
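
A hedged Python sketch of the shortcut. The Bernoulli loop below stands in for the draw from B(c, 2^-S) and is O(c) here only for simplicity; the next slides show how Walker's method makes that draw O(1), and real sensor code would also avoid floating point.

    import math, random

    def simulate_insertions(sketch, c):
        """Simulate c distinct insertions into an FM sketch at once.
        Bits 0..S-1 (the "safe" region) are set wholesale; only the
        ~B(c, 2^-S) outlying trials are simulated bit by bit."""
        if c == 0:
            return sketch
        lg = math.log2(c)
        # Asymptotic safe-region size S = log c - 2 log log c; sensible for large c.
        S = max(1, int(lg - 2 * math.log2(max(lg, 2.0))))
        sketch |= (1 << S) - 1                 # set the whole safe region
        # Stand-in for an O(1) draw from B(c, 2^-S):
        outliers = sum(random.random() < 2.0 ** -S for _ in range(c))
        for _ in range(outliers):              # O(1) expected work per outlier
            i = S
            while random.random() < 0.5:       # keep flipping coins past the region
                i += 1
            sketch |= 1 << i
        return sketch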

  21. Fast sampling from discrete pdf's • We need to generate samples from B(n, p). • General problem: sampling from a discrete pdf. • Assume we can draw uniformly at random from [0, 1]. • With an event space of size N, O(log N) lookups per sample are immediate: • Represent the cdf in an array of size N. • Draw from [0, 1] and binary search. • Cleverer methods achieve O(log log N) or O(log* N) time • Amazingly, this can be done in constant time!
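
For reference, the cdf-array baseline is a few lines (a sketch; random.random stands in for the uniform draw):

    import bisect, random
    from itertools import accumulate

    def make_cdf(pdf):
        """Prefix sums of the pdf, e.g. [0.4, 0.3, ...] -> [0.4, 0.7, ...]."""
        return list(accumulate(pdf))

    def sample_cdf(cdf):
        """O(log N) per sample: binary-search the cdf for a uniform draw."""
        return bisect.bisect_left(cdf, random.random() * cdf[-1])

    cdf = make_cdf([0.40, 0.30, 0.15, 0.10, 0.05])
    counts = [0] * 5
    for _ in range(10000):
        counts[sample_cdf(cdf)] += 1
    print(counts)  # roughly 4000, 3000, 1500, 1000, 500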

  22. Constant Time Sampling • Theorem [Walker ’77]: For any discrete pdf D over a sample space of size N, a table of size O(N) can be constructed in O(N) time that enables random variables to be drawn from D using at most two table lookups.

  23. Sampling in O(1) time [Walker '77] • Start with a discrete pdf over {A, B, C, D, E}: {0.40, 0.30, 0.15, 0.10, 0.05} • Construct a table of 2N entries, one pi and one alias Qi per column: • pi: A = 1, B = 1, C = 0.75, D = 0.5, E = 0.25 • Qi: A = -, B = -, C = A, D = B, E = A • Algorithm: Pick a column i uniformly at random. Pick x uniformly from [0, 1]. If x < pi, output i; else output Qi. • In the table above: Pr[B] = 1 * 0.2 + 0.5 * 0.2 = 0.3 (B's own column plus D's alias) and Pr[C] = 0.75 * 0.2 = 0.15

  24. Methods of [Walker '77] (cont.) • OK, but how do you construct the table? • Table construction: Take a "below-average" column i (remaining mass xi < 1/n). Choose pi to satisfy xi = pi / n. Set the j with the largest remaining xj as Qi, and reduce xj by (1 - pi) / n. Repeat. • Linear-time construction.
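
A compact sketch of Walker's method (in the common Vose formulation). Valid alias tables are not unique, so the table built here can differ from the slide's while inducing the same distribution.

    import random

    def build_alias(pdf):
        """O(N) construction of Walker's alias table for a discrete pdf."""
        n = len(pdf)
        prob, alias = [0.0] * n, [0] * n
        scaled = [p * n for p in pdf]           # rescale so average height is 1
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            prob[s], alias[s] = scaled[s], l    # l donates s's missing mass
            scaled[l] -= 1.0 - scaled[s]
            (small if scaled[l] < 1.0 else large).append(l)
        for i in small + large:                 # leftovers are exactly full columns
            prob[i] = 1.0
        return prob, alias

    def alias_sample(prob, alias):
        """At most two table lookups per sample: O(1) time."""
        i = random.randrange(len(prob))         # pick a column at random
        return i if random.random() < prob[i] else alias[i]

    prob, alias = build_alias([0.40, 0.30, 0.15, 0.10, 0.05])  # the slide's pdf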

  25. Back to extending FM sketches • We need to sample from B(c, 2-S) for various values of S. • Using Walker’s method, we can sample from B(c, 2-S) in O(1) time and O(c) space, assuming tables are pre-computed offline.

  26. Back to extending FM sketches (cont) • With more cleverness, we can trade off space for time. Recall that • Running time = time to sample from B + O(log^2 c) • Sampling in O(log^2 c) time leads to O(c / log^2 c) space. • With a max sensor value of 2^16, saving a log^2 c factor is a 256-fold space savings. • Tables for S = 1, 2, …, 16 together take 4600 bytes (without this optimization, the tables would be >1 MB)

  27. Intermission • FM sketches require more work initially. • Need k bits to represent a single bit! • But: • Sketched values can easily be aggregated. • Aggregation operation (OR) is both order-insensitive and duplicate-insensitive. • Result is a natural fit with sensor aggregation.

  28. Outline • Sensor database motivation • COUNT aggregation via Flajolet-Martin • SUM aggregation • Experimental evaluation

  29. Experimental Results • We employ the publicly available TAG simulator. • Basic topologies: grid (2-D lattice) and random • Can modulate: • Grid size [default: 30 by 30] • Node, packet, or link loss rate [default: 5% link loss rate] • Number of bitmaps [default: twenty 16-bit sketches]. • Transmission radius [default: 8 neighbors on the grid]

  30. Experimental Results • We consider four main methods. • TAG: transmit aggregates up a single tree • DAG-based TAG: Send a 1/k fraction of the aggregated values to each of k parents. • SKETCH: broadcast an aggregated sketch to all neighbors at level i –1 • LIST: explicitly enumerate all <key, value> pairs and broadcast to all neighbors at level i – 1. • LIST vs. SKETCH measures the penalty associated with approximate values.

  31. Message Comparison • TAG: transmit aggregates up a single tree • 1 message transmitted per node. • 1 message received per node (on average). • Message size: 16 bits. • SKETCH: broadcast a sketch up the tree • 1 message transmitted per node. • Fanout of k receivers per transmission (constant k). • Message size: 20 16-bit sketches = 320 bits.

  32. COUNT vs Link Loss (Grid)

  33. COUNT vs Link Loss (Grid)

  34. COUNT vs Network Diameter (Grid)

  35. COUNT vs Link Loss (Random)

  36. SUM vs Link Loss

  37. Compressibility • The FM sketches are amenable to compression. • We employ a very basic method: • Run-length encode the initial prefix of ones. • Run-length encode the suffix of zeroes. • Represent the middle explicitly. • The method can be applied to a group of sketches. • This alone buys about a factor of 3. • Better methods exist.
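
A minimal sketch of the prefix/suffix run-length idea, operating on a single sketch written as a bit string; a real implementation would work on packed bits and code the two run lengths compactly.

    def compress(bits):
        """Run-length encode the all-ones prefix and all-zeros suffix;
        keep the (typically short) middle verbatim."""
        ones = len(bits) - len(bits.lstrip("1"))
        zeros = len(bits) - len(bits.rstrip("0"))
        middle = bits[ones:len(bits) - zeros]
        return ones, middle, zeros

    def decompress(ones, middle, zeros):
        return "1" * ones + middle + "0" * zeros

    # e.g. a 16-bit sketch: 3 leading ones, a 2-bit middle, 11 trailing zeros.
    assert compress("1110100000000000") == (3, "01", 11)
    assert decompress(3, "01", 11) == "1110100000000000"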

  38. Compression

  39. Space Usage

  40. Future Directions • Spatio-temporal queries • Restrict queries to specific regions of space, time, or space-time. • Other aggregates • What else needs to be computed or approximated? • Better aggregation methods • FM sketches have rather high variance. • Many other sketching methods can potentially be used.
