Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)
Outline • Background and Preliminaries • Probabilistic Threshold Range Aggregate Query • Exact query processing • Approximate query processing: Simple Sampling & Double Sampling • Experiments • Conclusion DB@UNSW
Applications • Many applications involve data that is imperfect due to • data randomness and incompleteness • limitations of equipment • delay or loss in data transfer • … • Applications: • Sensor networks • Environmental surveillance • Moving objects • Data cleaning and integration • …
Applications • Sensor Networks: • Sensor readings are often imprecise due to equipment limitations and the periodic reporting mechanism. (Figures are borrowed from Jian et al., SIGMOD08.)
Applications • Mobile Equipment / Moving Objects • A mobile object reports its location only periodically, so its exact location is often uncertain.
Applications • Satellite data
Applications • Data Quality • Social Data Collection: errors and estimation inherent in customer surveys and sampling
Outline • Background and Preliminaries • Modeling Uncertainty & Related Work • Probabilistic Threshold Range Aggregate Query • Conclusion
Modeling Uncertainty (cont.) • Uncertain Objects Model • Continuous case: an uncertain object $U$ is described using a probability density function (PDF) $f_U$ such that $\int_{x \in U} f_U(x)\,dx = 1$. E.g., uniform distribution, normal distribution.
Modeling Uncertainty (cont.) • Uncertain Objects Model • Discrete case: an uncertain object $U$ is described using a set of instances; each instance $u$ has an occurrence probability $p_u$, with $\sum_{u \in U} p_u = 1$.
Possible World Semantics • Given a set of uncertain objects $\{U_1, U_2, \ldots, U_n\}$, a possible world $W = \{u_1, u_2, \ldots, u_n\}$ is a set of $n$ instances, one instance per uncertain object. • The probability of a possible world is $P(W) = \prod_{i=1}^{n} p_{u_i}$. • Let $\Omega$ be the set of all possible worlds; clearly, $\sum_{W \in \Omega} P(W) = 1$.
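To make the possible world semantics concrete, the following minimal sketch enumerates every possible world for two small discrete uncertain objects and checks that the world probabilities sum to 1. The object contents, the instance names, and the helper `possible_worlds` are hypothetical illustrations, and the product form assumes the objects are independent.

```python
from itertools import product
from math import isclose, prod

# Hypothetical discrete uncertain objects: each is a list of
# (instance name, occurrence probability) pairs that sum to 1.
objects = [
    [("u1a", 0.5), ("u1b", 0.5)],
    [("u2a", 0.2), ("u2b", 0.3), ("u2c", 0.5)],
]

def possible_worlds(objects):
    """Yield (world, probability) with exactly one instance per object."""
    for combo in product(*objects):
        world = tuple(name for name, _ in combo)
        # Assumption: objects are independent, so P(W) is a product.
        yield world, prod(p for _, p in combo)

worlds = list(possible_worlds(objects))
total = sum(p for _, p in worlds)
```

The number of worlds is the product of the instance counts (here 2 × 3), which is why enumerating worlds directly does not scale and query-specific techniques are needed.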
Probabilistic Queries: • Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07] • Aggregate Queries [BDJR05, MJ07, CG07] • Join Queries [CSP06, AW07] • Top-k Queries [SIC07, YLSK08, RDS07, HJZL08] • Nearest Neighbor Queries [KKR07, CCMC08] • Skyline Queries [PJLY07] • …
Range query • Uncertain objects, exact query • A probability threshold is often assigned
Related Work • Range Queries [TCXNKP05, BPS06, AY08] • Given a rectangle $r$ and a probabilistic threshold $t$, find all objects that appear in $r$ with probability at least $t$. • Appearance probability: $P_{app}(U, r) = \int_{x \in U \cap r} f_U(x)\,dx$
U-tree • Probabilistically Constrained Region (PCR) [TCXNKP05] (figure panels: PCR(0.2); multiple PCRs)
Outline • Introduction • Modeling Uncertainty & Related Work • Probabilistic Threshold Range Aggregate Query (PTRA) • Conclusion
Contributions • Formally define the PTRA query • aU-Tree structure for exact PTRA query processing • singleSample and doubleSample techniques for approximate answers
Problem Statement • Given a set of uncertain objects and a query $q$, return the number of uncertain objects whose appearance probability is no less than the threshold $p_q$.
Problem Definition • Assume the threshold is 0.5. If the appearance probability computed for b is greater than 0.5 and that for c is less than 0.5, then the aggregate returned is 2 (a and b).
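The definition can be illustrated with a brute-force sketch for discrete objects. The coordinates and probabilities below are hypothetical (the slide's figure is not preserved); they are chosen so that a lies fully inside the query, b has appearance probability 0.75, and c has 0.25, mirroring the example above.

```python
def appearance_probability(instances, rect):
    """Sum the occurrence probabilities of the instances inside rect."""
    (x_lo, y_lo), (x_hi, y_hi) = rect
    return sum(p for (x, y), p in instances
               if x_lo <= x <= x_hi and y_lo <= y <= y_hi)

def ptra_count(objects, rect, threshold):
    """Count objects whose appearance probability is at least threshold."""
    return sum(1 for instances in objects.values()
               if appearance_probability(instances, rect) >= threshold)

# Hypothetical data: each object is a list of ((x, y), probability) pairs.
objects = {
    "a": [((1.0, 1.0), 0.5), ((2.0, 2.0), 0.5)],    # both instances inside
    "b": [((3.0, 3.0), 0.75), ((9.0, 9.0), 0.25)],  # 0.75 inside
    "c": [((4.0, 4.0), 0.25), ((8.0, 8.0), 0.75)],  # 0.25 inside
}
query = ((0.0, 0.0), (5.0, 5.0))
count = ptra_count(objects, query, threshold=0.5)
```

This scan touches every instance of every object, which is the cost the index-based and sampling-based techniques aim to avoid.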
Exact Query Processing (aU-Tree) • Main idea: add aggregate information to the U-tree • Advantage: stop at an intermediate level if a node is pruned or fully covered by the query • Disadvantage: otherwise, still need to drill down to the leaf nodes. • For a large portion of uncertain objects, the appearance probability needs to be computed • Expensive for a massive number of instances per object!
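The traversal idea can be sketched as follows. This is a simplified illustration, not the aU-tree itself: it stores only a bounding rectangle and an object count per node and omits the PCR-based pruning of the U-tree. The `Node` class, its field names, and the sample tree are assumptions made for the example, and the disjoint case assumes a threshold greater than 0.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mbr: tuple    # ((x_lo, y_lo), (x_hi, y_hi)) bounding the subtree
    count: int    # aggregate: number of uncertain objects below this node
    children: list = field(default_factory=list)  # empty for a leaf
    objects: list = field(default_factory=list)   # leaf: instance lists

def covers(q, r):
    return (q[0][0] <= r[0][0] and q[0][1] <= r[0][1]
            and r[1][0] <= q[1][0] and r[1][1] <= q[1][1])

def disjoint(q, r):
    return (r[1][0] < q[0][0] or q[1][0] < r[0][0]
            or r[1][1] < q[0][1] or q[1][1] < r[0][1])

def appearance_probability(instances, q):
    (x_lo, y_lo), (x_hi, y_hi) = q
    return sum(p for (x, y), p in instances
               if x_lo <= x <= x_hi and y_lo <= y <= y_hi)

def aggregate_count(node, q, threshold):
    if disjoint(q, node.mbr):
        return 0              # pruned: every probability is 0 < threshold
    if covers(q, node.mbr):
        return node.count     # fully covered: use the stored aggregate
    if node.children:
        return sum(aggregate_count(c, q, threshold) for c in node.children)
    # Partially overlapping leaf: compute each appearance probability.
    return sum(1 for inst in node.objects
               if appearance_probability(inst, q) >= threshold)

leaf1 = Node(((0, 0), (2, 2)), 2, objects=[[((1, 1), 1.0)], [((2, 2), 1.0)]])
leaf2 = Node(((4, 4), (9, 9)), 2, objects=[
    [((4, 4), 0.75), ((9, 9), 0.25)],
    [((5, 5), 0.25), ((8, 8), 0.75)],
])
leaf3 = Node(((20, 20), (30, 30)), 3)  # never opened by the query below
root = Node(((0, 0), (30, 30)), 7, children=[leaf1, leaf2, leaf3])
result = aggregate_count(root, ((0, 0), (6, 6)), threshold=0.5)
```

In this example only `leaf2` requires per-object probability computation; `leaf1` contributes its stored count and `leaf3` is skipped.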
singleSample • Sample the instances of each uncertain object. • If m' out of m sampled instances are inside the query region, then the approximate appearance probability is m'/m.
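A minimal sketch of the m'/m estimate for one discrete object is given below. The function name `single_sample_estimate`, the data, and the choice to draw instances with replacement weighted by occurrence probability are assumptions; the slides do not specify the sampling scheme.

```python
import random

def single_sample_estimate(instances, rect, m, rng):
    """Estimate an object's appearance probability as m'/m."""
    points = [pt for pt, _ in instances]
    weights = [p for _, p in instances]
    # Assumption: draw m instances with replacement, weighted by probability.
    sampled = rng.choices(points, weights=weights, k=m)
    (x_lo, y_lo), (x_hi, y_hi) = rect
    m_inside = sum(1 for x, y in sampled
                   if x_lo <= x <= x_hi and y_lo <= y <= y_hi)
    return m_inside / m

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical object whose true appearance probability in the query is 0.75.
instances = [((3.0, 3.0), 0.75), ((9.0, 9.0), 0.25)]
estimate = single_sample_estimate(instances, ((0.0, 0.0), (5.0, 5.0)), 10_000, rng)
```

The object is then counted if the estimate is at least the threshold, so estimation errors matter most for objects whose true probability lies close to the threshold.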
singleSample (cont.) • An immediate application of the Chernoff-Hoeffding bound
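The exact statement on this slide is not preserved. As a hedged illustration, the standard Hoeffding inequality for m independent samples of a 0/1 indicator gives P(|m'/m − p| ≥ ε) ≤ 2·exp(−2mε²), so m ≥ ln(2/δ) / (2ε²) samples suffice for error at most ε with probability at least 1 − δ. The function name and the parameters `eps` and `delta` are illustrative, not taken from the slides.

```python
import math

def hoeffding_sample_size(eps, delta):
    """Smallest m with 2 * exp(-2 * m * eps**2) <= delta."""
    if not (0.0 < eps < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

# Samples per object for error at most 0.05 with probability at least 0.99.
m = hoeffding_sample_size(eps=0.05, delta=0.01)
```

The required m depends only on ε and δ, not on the number of instances per object, which is what makes sampling attractive when each object has very many instances.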
doubleSample • Single sampling is expensive when there is a massive number of objects! • Sample the uncertain objects as well. • Naive: uniformly sample objects from all uncertain objects.
doubleSample: Accuracy • Note: assuming that the "appearance probability" of each object follows a uniform distribution amounts to assuming that spatial locations are uniformly distributed. • The accuracy bound again uses the Chernoff-Hoeffding bound.
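The naive variant can be sketched as two nested sampling steps: sample k objects uniformly, estimate each sampled object's appearance probability from m sampled instances, and scale the number of qualifying objects by n/k. The names, the parameters `k` and `m`, and the test data are hypothetical.

```python
import random

def inside(pt, rect):
    (x_lo, y_lo), (x_hi, y_hi) = rect
    return x_lo <= pt[0] <= x_hi and y_lo <= pt[1] <= y_hi

def double_sample_count(objects, rect, threshold, k, m, rng):
    """Estimate the PTRA count from k sampled objects, m instances each."""
    sampled_objects = rng.sample(objects, k)  # uniform, without replacement
    hits = 0
    for instances in sampled_objects:
        points = [pt for pt, _ in instances]
        weights = [p for _, p in instances]
        drawn = rng.choices(points, weights=weights, k=m)
        estimate = sum(inside(pt, rect) for pt in drawn) / m
        if estimate >= threshold:
            hits += 1
    return hits * len(objects) / k  # scale the sample count up to n objects

rng = random.Random(1)
query = ((0.0, 0.0), (5.0, 5.0))
inside_obj = [((1.0, 1.0), 1.0)]   # always inside the query
outside_obj = [((9.0, 9.0), 1.0)]  # always outside the query
objects = [inside_obj] * 40 + [outside_obj] * 60
full = double_sample_count(objects, query, 0.5, k=100, m=20, rng=rng)
half = double_sample_count(objects, query, 0.5, k=50, m=20, rng=rng)
```

With k equal to the number of objects the estimate reduces to singleSample over all objects; with smaller k the result depends on which objects happen to be drawn, which is where skew in the data hurts.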
doubleSample: Our Approach • Skew! • Aim: select K disjoint groups covering all objects with the minimum "skew", i.e., objects in each group follow a near-"uniform" distribution. (Then do uniform sampling of objects in each group.) • The optimization problem is NP-hard. • Observation: • Min-skew is a good heuristic for constructing such groups. • The aU-tree groups objects with a principle similar to min-skew.
doubleSample: Our Approach • Step 1: choose K subtrees to cover all objects with the minimum total skew. NP-hard! • Find a level L such that the number of nodes at level L is smaller than K but the number of nodes at level L-1 is larger than K. • Feed the min-skew algorithm with the subtrees at level L. (Note: if the number of nodes at some level L equals K, then these K subtrees are chosen.) • Step 2: sample objects in each subtree. • Step 3: sample instances in each sampled object.
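The level selection and the per-group sampling in the steps above can be sketched as follows. The min-skew refinement itself is not implemented; the groups are taken as given. The level numbering (leaves at level 0), the `rate` parameter, and the `qualifies` callback, which stands in for Step 3's instance sampling and threshold test, are all assumptions for illustration.

```python
import random

def pick_level(nodes_per_level, k):
    """Return the lowest level whose node count does not exceed k.

    nodes_per_level[i] is the number of nodes at level i (leaves at 0).
    """
    for level, n in enumerate(nodes_per_level):
        if n <= k:
            return level
    raise ValueError("no level has at most k nodes")

def stratified_count(groups, qualifies, rate, rng):
    """Sample objects per group and scale each group's hits separately."""
    total = 0.0
    for group in groups:
        if not group:
            continue
        k = max(1, round(rate * len(group)))
        sample = rng.sample(group, k)
        hits = sum(1 for obj in sample if qualifies(obj))
        total += hits * len(group) / k
    return total

level = pick_level([400, 40, 5, 1], k=10)

# Hypothetical objects tagged with whether they satisfy the threshold.
group_a = [("a%d" % i, True) for i in range(30)]
group_b = [("b%d" % i, False) for i in range(70)]
rng = random.Random(2)
estimate = stratified_count([group_a, group_b], lambda obj: obj[1], 0.1, rng)
```

When every group is homogeneous, as in this toy data, the stratified estimate is exact regardless of which objects are drawn, which is the motivation for choosing groups with minimum skew.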
Experiments • Algorithms: exact, singleSample, doubleSample • Data sets: • LB: 53k objects in Long Beach County • CA: 62k objects in California • Synthetic aircraft dataset in 3D • 10k instances for each point, following a Uniform or constrained-Gaussian distribution • Setting: C++, P4 2.8 GHz, 2 GB memory, Debian Linux, page size 8 KB
Efficiency
Accuracy
Accuracy (cont.)
Conclusion • Definition of PTRA • aU-Tree technique • Sampling techniques • Future work: any approach with a theoretical guarantee?
Thanks
Min-Skew technique