ICS280 Presentation by Suraj Nagasrinivasa. (1) Evaluating Probabilistic Queries over Imprecise Data (SIGMOD 2003) by R Cheng, D Kalashnikov, S Prabhakar (2) Model-Driven Data Acquisition in Sensor Networks (VLDB 2004) by A Deshpande, C Guestrin, J Hellerstein, W Hong, S Madden
ICS280 Presentation by Suraj Nagasrinivasa (1) Evaluating Probabilistic Queries over Imprecise Data (SIGMOD 2003) by R Cheng, D Kalashnikov, S Prabhakar (2) Model-Driven Data Acquisition in Sensor Networks (VLDB 2004) by A Deshpande, C Guestrin, J Hellerstein, W Hong, S Madden Acknowledgements: Dmitri Kalashnikov and Michal Kapalka
In typical sensor applications... • Sensors monitor external environment continuously • Sensor readings are sent back to the application • Decisions are often made based on these readings
However, we face uncertainty… • Typically, DB/server collects sensor readings • DB cannot store “true” sensor value at all points in time • Scarce battery power • Limited network bandwidth • So, readings recorded at discrete time points • Value of phenomenon continuously changing • As a result, DB stored reading is mostly obsolete
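Scenario: Answering Minimum Query with discrete DB stored readings • Recorded temperatures: x0 < y0, so x is reported as the minimum • Current temperatures: y1 < x1, so y is actually the minimum • Wrong query result [Figure: recorded temperatures x0, y0 and current temperatures x1, y1 for sensors x and y]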
Scenario: Answering Minimum Query with error-bound readings I • x certainly gives the minimum temperature reading [Figure: recorded temperatures x0, y0 and bounds for the current temperature of sensors x and y]
Scenario: Answering Minimum Query with error-bound readings II • Both x and y have a chance of yielding the minimum value • Which one has a higher probability? [Figure: recorded temperatures x0, y0 and bounds for the current temperature of sensors x and y]
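The question on this slide can be made concrete with a small sketch. It assumes each current reading is uniformly distributed within its bound; the uniform pdf, the function names and the example intervals are illustrative assumptions, not taken from the paper. For sensor i it evaluates P(i is minimum) = ∫ f_i(x) · Π_{j≠i} (1 − F_j(x)) dx with a midpoint rule.

```python
def uniform_cdf(x, lo, hi):
    """CDF of a uniform distribution on [lo, hi] (assumed pdf shape)."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def min_probabilities(intervals, steps=10000):
    """For each interval, the probability that its sensor holds the minimum.

    intervals: list of (lower, upper) bounds, one per sensor.
    Uses a midpoint rule over each sensor's own interval.
    """
    probs = []
    for i, (lo, hi) in enumerate(intervals):
        width = (hi - lo) / steps
        total = 0.0
        for k in range(steps):
            x = lo + (k + 0.5) * width
            # Probability that every other sensor reads above x.
            others_above = 1.0
            for j, (lj, uj) in enumerate(intervals):
                if j != i:
                    others_above *= 1.0 - uniform_cdf(x, lj, uj)
            # Uniform density 1/(hi-lo) times step width equals 1/steps.
            total += others_above / steps
        probs.append(total)
    return probs


# Hypothetical bounds: x in [0, 10], y in [5, 15].
p_x, p_y = min_probabilities([(0.0, 10.0), (5.0, 15.0)])
print(p_x, p_y)
```

With overlapping bounds both sensors receive a non-zero probability, and the two probabilities sum to one; with disjoint bounds the lower sensor receives probability 1, which matches the "certainly gives the minimum" case.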
Probabilistic Queries • Based on variation characteristics of sensor value over time: • Bounds can be estimated for possible values • Probability distribution of values defined within bounds • Evaluate probability for query answers • Probabilistic queries give a correct, though less precise, answer (a result with an associated probability) instead of a potentially incorrect one
Rest of the paper… • Notation & Uncertainty Model • Classification of Probabilistic Queries • Evaluating Probabilistic Queries • Quality of Probabilistic Queries • Object Refreshment Policies • Experimental Results
Notation • T : A set of DB objects (e.g. sensors) • a : Dynamic attribute (e.g. pressure) • Ti : ith object of T • Ti.a(t) : Value of ‘a’ in ‘Ti’ at time ‘t’
Uncertainty Model • Uncertainty interval Ui(t) = [li(t), ui(t)]: Ti.a(t) lies within this interval • Uncertainty pdf fi(x,t): probability density of Ti.a(t) over Ui(t) • Can be extended to ‘n’ dimensions
Classification of Probabilistic Queries • Type of Result • Value-based: returns single value • E.g. Minimum query ([l,u], pdf) • Entity-based: returns set of objects • E.g. Range query ({(Ti, pi), pi>0}) • Aggregation • Non-Aggregate: query result for an object is independent of other objects • E.g. Range query • Aggregate: query result computed from set of objects • E.g. Nearest Neighbor query
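As an illustration of the entity-based range query result {(Ti, pi), pi>0}, the sketch below computes pi as the fraction of each uncertainty interval that overlaps the query range. It assumes a uniform pdf inside each interval; the function names and sensor values are hypothetical.

```python
def uniform_range_probability(li, ui, l, u):
    """P(value in [l, u]) for a value uniform on [li, ui] (assumed pdf)."""
    overlap = max(0.0, min(ui, u) - max(li, l))
    return overlap / (ui - li)


def range_query(intervals, l, u):
    """Return {object: p} for every object with p > 0."""
    result = {}
    for obj, (li, ui) in intervals.items():
        p = uniform_range_probability(li, ui, l, u)
        if p > 0:
            result[obj] = p
    return result


# Hypothetical uncertainty intervals for three sensors.
sensors = {"T1": (0.0, 10.0), "T2": (12.0, 14.0), "T3": (30.0, 40.0)}
print(range_query(sensors, 5.0, 20.0))
```

Each object is handled independently of the others, which is what makes the range query non-aggregate.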
Classification of Probabilistic Queries • Query evaluation algorithms and quality metrics are developed for each class
"Is reading of sensor i in range [l,u] ?" Quality of Probabilistic Result • Introduce a notion of “quality of answer” • Proposed metrics for different classes of queries • regular range query • "yes" or "no" with 100% • probabilistic query ERQ • yes with pi = 95%: OK • yes with pi = 5%: OK (95% it is not in [l, u]) • yes with pi = 50%: NOT OK (not certain!)
Quality for Entity-Aggregate Queries "Which sensor, among n, has the minimum reading?" • Recall • Result set R = {(Ti, pi)} • e.g. {(T1, 30%), (T2, 40%), (T3, 30%)} • B is interval, bounding all possible values • e.g. minimum is somewhere in B = [10,20] • Our metrics for aggregate queries Min, Max, NN • objects cannot be treated independently as in ERQ metric • uniform distribution (in result set) is the worst case • metrics are based on entropy
Quality for Entity-Aggregate Queries • H(X): entropy of a random variable X taking values X1, …, Xn with probabilities p(X1), …, p(Xn) • entropy is smallest (i.e., 0) iff there is an i with p(Xi) = 1 • entropy is largest (i.e., log2(n)) iff all Xi's are equally likely
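The two extreme cases above can be checked numerically with the standard Shannon entropy H(X) = −Σ p(Xi) log2 p(Xi). The sketch below is only an illustration; the function name is an assumption, and the example result set is the one from the previous slide.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Result set from the earlier slide: {(T1, 30%), (T2, 40%), (T3, 30%)}.
print(entropy([0.3, 0.4, 0.3]))
# A certain answer and a uniform answer over four objects.
print(entropy([1.0]), entropy([0.25] * 4))
```

A result set close to uniform has entropy close to log2(n), so it is a low-quality answer under an entropy-based metric.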
Improving Answer Quality • It is important to pick the right update policies to help improve answer quality • Global Choice • Glb_RR (pick random) • Local Choice • Loc_RR (pick random) • MaxUnc (heuristic chooses the object with the maximum uncertainty interval) • MinExpEntropy (heuristic chooses the object with minimum expected entropy)
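The MaxUnc heuristic is simple enough to sketch directly: among the candidate objects, refresh the one whose uncertainty interval is widest. The function name, the dictionary layout and the example values are hypothetical.

```python
def max_unc(intervals):
    """Return the id of the object with the widest uncertainty interval.

    intervals: dict mapping object id to (lower, upper) bound.
    """
    return max(intervals, key=lambda obj: intervals[obj][1] - intervals[obj][0])


# Hypothetical candidates relevant to one query.
candidates = {"T1": (10.0, 12.0), "T2": (8.0, 15.0), "T3": (11.0, 11.5)}
print(max_unc(candidates))  # T2 has the widest interval (width 7.0)
```

MinExpEntropy would instead need the expected entropy of the query result after each possible refresh, which is more expensive to evaluate.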
Experiments: Simulation Set-up • 1 server, 1000 sensors, limited network bandwidth, “Min” queries tested • Query arrivals follow a Poisson distribution • Each query over a random set of 100 sensors
Conclusions • Probabilistic querying handles the inherent uncertainty in sensor DBs • Classification, algorithms and quality-of-answer metrics for various query types • The uncertainty model is very general, so the algorithms are not directly implementable in any particular sensor network • Besides, to achieve reasonable energy efficiency in sensor networks, the application and network requirements that dictate when sensor nodes must be awake have to be tightly coordinated, especially in the case of multi-hop routing
Outline for ‘Model-Driven Data Acquisition in Sensor Networks’ • Introduction • Motivation for Model-Based Queries • Framework Concept • Model Example – Multivariate Gaussian • Algorithm • Resolving Model-Based Queries • Incorporating Dynamicity • Observation Plan / Cost model • Experiments • BBQ System • Results • Conclusions
Motivation for Model-Based Queries • Declarative queries adopted as the key programming paradigm for large sensor nets • However, interpreting sensor nets as databases results in two major problems: • Misinterpretation of Data • The physically observable world is a set of continuous phenomena in both time and space • Sensor readings are UNLIKELY to be random samples • Inefficient approximate queries • If sensor readings are not “true” values, uncertainty must be quantified to provide reliable answers
Motivation for Model-Based Queries • Paper Contribution: To incorporate statistical models of real-world processes into sensor net query processing architecture • Models help in: • Accounting for biases in spatial sampling • Identifying sensors providing faulty data • Extrapolating values for missing sensors
Framework Concept • Goal: Given a query and model, to devise an efficient data acquisition plan to provide “best” possible answer • Major dependencies: • Correlations between sensors captured by the statistical model • Correlation between attributes for given sensor • Correlation between sensors for given attribute • Specific connectivity of the wireless network
Framework Concept: Observation Plan parameters • Correlations in Value • Cost Differential
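Resolving Model-Based Queries (Value Queries) • To compute the value of Xi with maximum error ‘e’ and confidence ‘1-delta’: • Compute the mean of Xi given the observations o, i.e. E[Xi | o] • As in range queries, find the probability that Xi lies in [mean - e, mean + e]; the answer is acceptable if this probability is at least 1-delta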
Range Queries for Gaussian • Projection (marginalization) of a Gaussian is simple: just drop the unneeded entries from the mean vector and the covariance matrix • The integral of the resulting pdf over the query range still has to be computed.
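A two-variable case shows how a value or range query can be answered from a Gaussian model. The sketch below uses the standard formulas for conditioning a bivariate Gaussian on one observed variable and evaluates the range integral with the error function. All names and numbers are illustrative assumptions, not values from the paper.

```python
import math


def condition_bivariate(mu, cov, observed_x2):
    """Mean and variance of X1 given X2 = observed_x2 (bivariate Gaussian)."""
    mu1, mu2 = mu
    (s11, s12), (_, s22) = cov
    mean = mu1 + s12 / s22 * (observed_x2 - mu2)
    var = s11 - s12 * s12 / s22
    return mean, var


def normal_cdf(x, mean, var):
    """CDF of a univariate Gaussian with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def gaussian_range_probability(mean, var, a, b):
    """P(a <= X <= b) for X ~ N(mean, var)."""
    return normal_cdf(b, mean, var) - normal_cdf(a, mean, var)


# Hypothetical model: two correlated temperature sensors.
mu = (20.0, 21.0)
cov = ((4.0, 3.0), (3.0, 4.0))
mean, var = condition_bivariate(mu, cov, observed_x2=23.0)
print(mean, var)  # 21.5 1.75

# Value query with error e = 2: is the confidence high enough?
e = 2.0
print(gaussian_range_probability(mean, var, mean - e, mean + e))
```

Observing the second sensor shrinks the variance of the first from 4.0 to 1.75, which is the effect the framework exploits: a cheap, correlated observation can raise the confidence of an answer about a sensor that is never read.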
Incorporating Dynamicity • Use historical measurements to improve confidence of answers • Given the pdf at time ‘t’ • Compute the pdf at time ‘t+1’
Incorporating Dynamicity • Assumption: Markovian Model • Dynamicity summarized by “transition model”
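For intuition, the step from time t to t+1 can be sketched for a single Gaussian variable under a linear transition model x(t+1) = a·x(t) + noise. This linear-Gaussian form, the parameter names a and q, and the numbers are assumptions chosen for illustration; the slides do not specify the transition model's form.

```python
def predict(mean, var, a=1.0, q=0.5):
    """One Markov prediction step for a scalar Gaussian belief.

    mean, var: belief about x at time t
    a: assumed linear transition coefficient
    q: assumed variance of the transition noise
    Returns the mean and variance of the belief at time t+1.
    """
    return a * mean, a * a * var + q


# Belief at time t, then three steps without any new observation.
belief = (21.5, 1.75)
for _ in range(3):
    belief = predict(*belief)
    print(belief)
```

Without new observations the variance grows at every step, so confidence in answers decays over time until an observation is acquired and conditioned on.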
Observation Plan / Cost Model • What is the cost of making ‘o’ observations? • C(o) = acquisition cost + transmission cost • Acquisition cost: constant for each attribute • Transmission cost: • Network graph • Edge weights (link quality) • Paths taken could be sub-optimal
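A rough sketch of C(o) follows. It charges a constant acquisition cost per observed attribute and approximates the transmission cost as a round trip along the cheapest path from the base station to each observed node. Treating each node independently is a simplification made here (shared path segments are counted more than once); the graph, costs and function names are hypothetical.

```python
import heapq


def shortest_costs(graph, source):
    """Dijkstra: cheapest path cost from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def observation_cost(graph, base, observed, acquisition_cost=1.0):
    """C(o) = acquisition cost + (simplified) transmission cost.

    observed: dict mapping node to the list of attributes read there.
    """
    dist = shortest_costs(graph, base)
    acquisition = acquisition_cost * sum(len(a) for a in observed.values())
    transmission = sum(2.0 * dist[node] for node in observed)
    return acquisition + transmission


# Hypothetical network; edge weights stand in for link quality.
graph = {
    "base": {"A": 1.0, "B": 5.0},
    "A": {"base": 1.0, "B": 2.0},
    "B": {"base": 5.0, "A": 2.0},
}
plan = {"A": ["temp"], "B": ["temp", "voltage"]}
print(observation_cost(graph, "base", plan))  # 3.0 + 2*1.0 + 2*3.0 = 11.0
```

A plan that visits several nodes in one traversal would share path segments, which is why choosing the cheapest plan resembles a traveling salesman problem rather than a set of independent shortest paths.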
Observation Plan / Cost Model • The set of attributes (‘theta’) to observe is determined by computing expected benefit And finding… This problem is similar to the traveling salesman problem, so it is best handled with heuristic algorithms
BBQ System • BBQ: A Tiny-Model Query System • Uses Multivariate Gaussians • Has 24 transition models, one for each hour of the day
Results • Experiment: 11 sensors on a tree, 83000 measurements, 2/3 used for training and 1/3 for tests • Methodology • BBQ builds a model based on training data • One random query is issued per hour; observations are made if needed and the model is updated • The answer is compared to the measured value • Compare with two other methods • TinyDB: Each query is broadcast over the sensor network using an overlay tree • Approximate-Caching: Base station maintains a view of the sensor readings
Conclusion • Approximate queries can be well optimized, but model of physical phenomenon is needed • Defining an appropriate model is a challenge • The framework works well for “fairly steady” sensor data values • Statistical model is largely static with refinements to the model based on incoming queries and observations made as a result