Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data

Wenjie Zhang, University of New South Wales & NICTA, Australia
Joint work with Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)

Outline
• Background and Preliminaries
• Probabilistic Threshold Range Aggregate Query
• Exact query processing
• Approximate query processing: Simple Sampling & Double Sampling
• Experiments
• Conclusion
DB@UNSW

Applications
• Many applications involve data that is imperfect due to
  • data randomness and incompleteness
  • equipment limitations
  • delay or loss in data transfer
  • …
• Applications
  • Sensor networks
  • Environmental surveillance
  • Moving objects
  • Data cleaning and integration
  • …

Applications
• Sensor Networks: sensor readings are often imprecise due to equipment limitations and the periodic reporting mechanism. (Figures are borrowed from Jian et al., SIGMOD08.)

Applications
• Mobile Equipment / Moving Objects: a mobile object reports its location only periodically, so its exact location is often uncertain.

Applications
• Satellite data

Applications
• Data Quality
• Social Data Collection: errors and estimation are inherent in customer surveys and sampling.

Outline
• Background and Preliminaries
• Modeling Uncertainty & Related Work
• Probabilistic Threshold Range Query
• Conclusion

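Modeling Uncertainty (cont.)
• Uncertain Objects Model
  • Continuous case: an object U is described by a probability density function (PDF) $f_U$ such that $\int_{x \in U} f_U(x)\,dx = 1$, where the integral is taken over the region in which U may appear. E.g., uniform distribution, normal distribution.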

Modeling Uncertainty (cont.)
• Uncertain Objects Model
  • Discrete case: an object is described by a set of instances; each instance u has an occurrence probability $p_u$.

Possible World Semantics
• Given a set of uncertain objects {U1, U2, ..., Un}, a possible world W = {u1, u2, ..., un} is a set of n instances, one instance per uncertain object.
• The probability of a possible world is $P(W) = \prod_{i=1}^{n} p_{u_i}$.
• Let Ω be the set of all possible worlds; clearly, $\sum_{W \in \Omega} P(W) = 1$.
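The possible-world definition can be checked on a tiny discrete example. The sketch below is illustrative only: the two objects, their instances and their probabilities are made up, and the product form of P(W) assumes the objects are independent of each other.

```python
from itertools import product
from math import isclose, prod

# Hypothetical discrete uncertain objects: each maps an instance (a 2-D point)
# to its occurrence probability; the probabilities of one object sum to 1.
objects = {
    "U1": {(1, 1): 0.6, (2, 3): 0.4},
    "U2": {(5, 5): 0.5, (6, 4): 0.3, (7, 7): 0.2},
}

def possible_worlds(objs):
    """Yield (world, probability) pairs, choosing one instance per object."""
    names = list(objs)
    for choice in product(*(objs[n].items() for n in names)):
        world = {n: inst for n, (inst, _) in zip(names, choice)}
        # Assumes independent objects: P(W) is the product of the chosen
        # instances' occurrence probabilities.
        yield world, prod(p for _, p in choice)

worlds = list(possible_worlds(objects))
total = sum(p for _, p in worlds)
print(len(worlds), total)
```

With 2 and 3 instances there are 2 × 3 possible worlds, and their probabilities sum to 1 up to floating-point rounding. The number of worlds grows exponentially with the number of objects, which is why queries are not evaluated by enumeration in practice.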

Probabilistic Queries:
• Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07]
• Aggregate Queries [BDJR05, MJ07, CG07]
• Join Queries [CSP06, AW07]
• Top-k Queries [SIC07, YLSK08, RDS07, HJZL08]
• Nearest Neighbor Queries [KKR07, CCMC08]
• Skyline Queries [PJLY07]
• …

Range query
• The objects are uncertain; the query is exact.
• A probability threshold is often assigned to the query.

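Related Work
• Range Queries [TCXNKP05, BPS06, AY08]: given a rectangle r and a probability threshold t, find all objects that appear in r with probability at least t.
• Appearance probability: the probability mass of an object U that falls inside r, $P_{app}(U, r) = \int_{x \in U \cap r} f_U(x)\,dx$.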

U-tree
• Probabilistically Constrained Region (PCR) [TCXNKP05]
[Figures: PCR(0.2); multiple PCRs]

Outline
• Introduction
• Modeling Uncertainty & Related Work
• Probabilistic Threshold Range Aggregate Query (PTRA)
• Conclusion

Contribution
• A formal definition of the PTRA query
• The aU-Tree structure for exact PTRA query processing
• The singleSample and doubleSample techniques for approximate answers

Problem Statement
Given a set of uncertain objects and a range query q with probability threshold pq, return the number of uncertain objects whose appearance probability in q is no less than pq.

Problem Definition
Example: assume the threshold is 0.5. If the appearance probability computed for b is > 0.5 and that for c is < 0.5, then the aggregate returned is 2 (a and b).
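A brute-force reading of the definition, for the discrete model, is sketched below. The objects a, b and c, their instances and the query rectangle are invented to mirror the example; the function names are hypothetical and not taken from the talk.

```python
def appearance_probability(instances, rect):
    """Sum the probabilities of the instances that fall inside rect.

    instances: list of ((x, y), probability); rect: (xlo, ylo, xhi, yhi).
    """
    xlo, ylo, xhi, yhi = rect
    return sum(p for (x, y), p in instances
               if xlo <= x <= xhi and ylo <= y <= yhi)

def ptra_count(objects, rect, threshold):
    """Count the objects whose appearance probability is at least threshold."""
    return sum(1 for instances in objects.values()
               if appearance_probability(instances, rect) >= threshold)

# Made-up data: a lies fully inside the query, b mostly inside, c mostly outside.
objects = {
    "a": [((1, 1), 0.5), ((2, 2), 0.5)],    # appearance probability 1.0
    "b": [((3, 3), 0.75), ((9, 9), 0.25)],  # appearance probability 0.75
    "c": [((4, 4), 0.25), ((8, 8), 0.75)],  # appearance probability 0.25
}
query = (0, 0, 5, 5)
print(ptra_count(objects, query, 0.5))
```

This baseline touches every instance of every object, which is the cost the index and the sampling techniques aim to avoid.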

Exact Query Processing (aU-Tree)
• Main idea: add aggregate information to the U-tree.
• Advantage: the search stops at an intermediate level if a node is pruned or fully covered by the query.
• Disadvantage: otherwise, the search still needs to drill down to the leaf nodes.
  • For a large portion of the uncertain objects, the appearance probability needs to be computed.
  • This is expensive when each object has a massive number of instances!
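The stop-early behaviour can be sketched with a simplified aggregate tree. This is not the aU-Tree itself: the sketch prunes only with minimum bounding rectangles (the U-tree additionally uses PCRs), assumes a threshold in (0, 1], and the `Node` class and the tiny hand-built tree are hypothetical.

```python
from dataclasses import dataclass, field

def contains(q, r):
    """True if rectangle r lies entirely inside query rectangle q."""
    return q[0] <= r[0] and q[1] <= r[1] and r[2] <= q[2] and r[3] <= q[3]

def disjoint(q, r):
    """True if rectangles q and r do not overlap."""
    return r[2] < q[0] or r[0] > q[2] or r[3] < q[1] or r[1] > q[3]

@dataclass
class Node:
    mbr: tuple                                    # (xlo, ylo, xhi, yhi)
    count: int                                    # aggregate: objects below this node
    children: list = field(default_factory=list)
    objects: list = field(default_factory=list)   # leaf only: lists of ((x, y), p)

def app_prob(instances, q):
    return sum(p for (x, y), p in instances
               if q[0] <= x <= q[2] and q[1] <= y <= q[3])

def ptra(node, q, threshold):
    """Count objects under node with appearance probability >= threshold."""
    if disjoint(q, node.mbr):
        return 0              # pruned: every object below has probability 0
    if contains(q, node.mbr):
        return node.count     # fully covered: every object below has probability 1
    if node.children:
        return sum(ptra(c, q, threshold) for c in node.children)
    # Partially overlapping leaf: compute each appearance probability.
    return sum(1 for o in node.objects if app_prob(o, q) >= threshold)

leaf1 = Node((0, 0, 2, 2), 2, objects=[[((1, 1), 1.0)],
                                        [((0, 0), 0.5), ((2, 2), 0.5)]])
leaf2 = Node((4, 4, 9, 9), 2, objects=[[((4, 4), 0.75), ((9, 9), 0.25)],
                                        [((5, 5), 0.25), ((8, 8), 0.75)]])
root = Node((0, 0, 9, 9), 4, children=[leaf1, leaf2])
print(ptra(root, (0, 0, 5, 5), 0.5))
```

In this example `leaf1` is answered from its stored count without reading any instance, while `leaf2` overlaps the query only partially and must be opened.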

Exact Query Processing (aU-Tree)

singleSample
• Sample the instances of each uncertain object.
• If m’ out of the m sampled instances are inside the query region, then the approximate appearance probability is m’/m.
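The m’/m estimate can be sketched as follows. The sketch assumes instances are drawn with replacement according to their occurrence probabilities; the object, the query rectangle and the sample size are invented for illustration.

```python
import random

def single_sample(instances, rect, m, rng):
    """Estimate the appearance probability as m'/m from m sampled instances."""
    points = [pt for pt, _ in instances]
    weights = [p for _, p in instances]
    xlo, ylo, xhi, yhi = rect
    # Draw m instances with replacement, weighted by occurrence probability.
    sample = rng.choices(points, weights=weights, k=m)
    inside = sum(1 for x, y in sample if xlo <= x <= xhi and ylo <= y <= yhi)
    return inside / m

rng = random.Random(0)
# Hypothetical object: 10,000 equally likely instances in a 10 x 10 square.
instances = [((rng.uniform(0, 10), rng.uniform(0, 10)), 1 / 10_000)
             for _ in range(10_000)]
query = (0, 0, 5, 10)
exact = sum(p for (x, y), p in instances
            if query[0] <= x <= query[2] and query[1] <= y <= query[3])
estimate = single_sample(instances, query, 2_000, rng)
print(exact, estimate)
```

The estimate reads 2,000 instances instead of 10,000; the saving grows with the number of instances per object.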

singleSample (cont.)
The accuracy guarantee is an immediate application of the Chernoff-Hoeffding bound.
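The exact statement shown on the slide is not preserved in this transcript. As an illustration, the standard two-sided Hoeffding inequality for the mean of m independent 0/1 draws, P(|m’/m − p| ≥ ε) ≤ 2·exp(−2mε²), yields the sample size computed below; the function name and the parameters `eps` and `delta` are assumptions, not the authors' notation.

```python
import math

def hoeffding_sample_size(eps, delta):
    """Smallest m with 2 * exp(-2 * m * eps**2) <= delta.

    With this many independently sampled instances, the estimate m'/m is
    within eps of the true appearance probability with probability at
    least 1 - delta (standard Hoeffding bound for 0/1 variables).
    """
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(hoeffding_sample_size(0.05, 0.05))
```

The required sample size depends only on ε and δ, not on how many instances the object has.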

doubleSample
• Single sampling is expensive when there is a massive number of objects!
• Sample the uncertain objects as well.
• Naive approach: sample objects uniformly from all uncertain objects.

doubleSample: Accuracy
• Note: saying that the “appearance probability” of each object follows a uniform distribution means that the objects' spatial locations are uniformly distributed.
• The accuracy analysis uses the Chernoff-Hoeffding bound.
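The naive variant, uniform sampling of objects followed by sampling of instances within each sampled object, can be sketched as below. The data set, the sample sizes `n` and `m`, and the scaling of the hit count by N/n are illustrative assumptions rather than the authors' exact estimator.

```python
import random

def inside_fraction(points, rect):
    """Fraction of the given points that fall inside rect."""
    xlo, ylo, xhi, yhi = rect
    return sum(1 for x, y in points
               if xlo <= x <= xhi and ylo <= y <= yhi) / len(points)

def double_sample_count(objects, rect, threshold, n, m, rng):
    """Sample n objects uniformly, then m instances per sampled object."""
    sampled_objects = rng.sample(objects, n)
    hits = 0
    for points in sampled_objects:
        sampled_points = rng.choices(points, k=m)  # equally likely instances
        if inside_fraction(sampled_points, rect) >= threshold:
            hits += 1
    return hits * len(objects) / n                 # scale up to all N objects

rng = random.Random(1)
# Hypothetical data: 1,000 objects, each with 50 instances near a random centre.
objects = []
for _ in range(1_000):
    cx, cy = rng.uniform(0, 100), rng.uniform(0, 100)
    objects.append([(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
                    for _ in range(50)])
query = (0, 0, 50, 100)
exact = sum(1 for o in objects if inside_fraction(o, query) >= 0.5)
estimate = double_sample_count(objects, query, 0.5, n=200, m=30, rng=rng)
print(exact, estimate)
```

Here the object centres are spread uniformly, which is the favourable case for uniform object sampling; skewed data is the motivation for grouping objects before sampling.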

doubleSample: Our Approach
• Skew!
• Aim: select K disjoint groups covering all objects with the minimum “skew”, i.e., the objects in each group follow a “uniform” distribution. (Then sample the objects of each group uniformly.)
• The optimization problem is NP-hard.
• Observation:
  • Min-skew is a good heuristic for constructing such groups.
  • The aU-tree groups objects on a principle similar to that of min-skew.

doubleSample: Our Approach
• Step 1: choose K subtrees to cover all objects with the total minimum skew. NP-hard!
  • Find a level L such that the number of nodes at level L is smaller than K but the number of nodes at level L-1 is larger than K.
  • Feed the min-skew algorithm with the subtrees at level L. (Note: if the number of nodes at a level L equals K, then these K subtrees are chosen.)
• Step 2: sample objects in each subtree.
• Step 3: sample instances in each sampled object.
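Steps 2 and 3 can be sketched once the groups are fixed. The sketch below does not implement Step 1: `groups` stands in for the K subtrees that min-skew would select, the per-group sample size is allocated proportionally to group size (an assumption), and all names and data are hypothetical.

```python
import random

def is_hit(points, rect, threshold, m, rng):
    """Step 3: sample m instances and test the estimated probability."""
    xlo, ylo, xhi, yhi = rect
    sample = rng.choices(points, k=m)
    inside = sum(1 for x, y in sample if xlo <= x <= xhi and ylo <= y <= yhi)
    return inside / m >= threshold

def stratified_count(groups, rect, threshold, n, m, rng):
    """Step 2: sample objects per group, then scale each group's hit count."""
    total = sum(len(g) for g in groups)
    estimate = 0.0
    for g in groups:                       # every group is assumed non-empty
        n_g = min(len(g), max(1, round(n * len(g) / total)))
        sampled = rng.sample(g, n_g)
        hits = sum(is_hit(o, rect, threshold, m, rng) for o in sampled)
        estimate += hits * len(g) / n_g    # scale to the group's size
    return estimate

rng = random.Random(2)
# Two made-up groups with no internal skew: 300 objects inside the query
# region and 100 objects far outside it (5 identical instances each).
group_in = [[(1 + 0.01 * i, 1.0)] * 5 for i in range(300)]
group_out = [[(50.0, 50.0)] * 5 for _ in range(100)]
estimate = stratified_count([group_in, group_out], (0, 0, 10, 10), 0.5,
                            n=40, m=5, rng=rng)
print(estimate)
```

Because each group in this toy data is homogeneous with respect to the query, the scaled estimate equals the true count of 300 whichever objects are drawn, which illustrates why low-skew groups are desirable.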

Experiments
• Algorithms: exact, singleSample, doubleSample
• Data sets:
  • LB: 53k objects in Long Beach County
  • CA: 62k objects in California
  • Synthetic aircraft data set in 3D
  • 10k instances per object, following a Uniform or constrained-Gaussian distribution
• Setting: C++, P4 2.8 GHz, 2 GB memory, Debian Linux, page size 8K

Efficiency

Accuracy

Accuracy (cont.)

Conclusion
• Definition of PTRA
• aU-Tree technique
• Sampling techniques
• Future work: is there an approach with theoretical guarantees?

Thanks

Min-Skew technique
