Top-K Query Evaluation on Probabilistic Data. Christopher Ré, Nilesh Dalvi and Dan Suciu, University of Washington. High Level Overview. DBMS: Precise answers over clean data. Data are often imprecise: Information Integration, Information Extraction.
Top-K Query Evaluation on Probabilistic Data
Christopher Ré, Nilesh Dalvi and Dan Suciu
University of Washington
High Level Overview
• DBMS: Precise answers over clean data
• Data are often imprecise
  • Information Integration
  • Information Extraction
• Probabilistic DBs (PDBs) handle imprecision
  • Many low-quality answers
  • Top-K ranked by probability
This talk: Compute Top-K Efficiently
Evaluating Complex SQL on PDBs
Overview
• Motivating Example
• Query Processing Background
• Multisimulation
• Experimental Results
Example Application
Query: Find all years where ‘Anthony Hopkins’ starred in a good movie
IMDB
• Lots of interesting data about movies (e.g. actors, directors)
• Well maintained and clean
• But no reviews!
On the web there are lots of reviews
• How will I know which movie they are about? Alice needs to do information extraction and object reconciliation.
• Is a movie good or bad? Alice wants to do sentiment analysis.
A probabilistic database can help Alice store and query her uncertain data.
Imprecision is out there… Object Reconciliation
• Inputs: clean IMDB data and data extracted from reviews
• Output: (RID, MID) pairs
• Fellegi-Sunter approach: score (s) each (RID, MID) pair
• Our approach: convert scores to probabilities
[Figure: score axis with thresholds t' and t, labeled "No Match" and "Match"]
12/8/2006
Query Processing Background
• Intensional Query Processing [FR97]
  • Associate an event with each tuple
  • Query processing builds an event expression
  • Probability that the event is satisfied = query value
• Technical point: with projection as the last operator, the result is a DNF
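A minimal sketch of how a join followed by a final projection yields a DNF event expression per output tuple. The table layouts, event names, and the function `lineage` are hypothetical, chosen only to mirror the movie-review example; they are not taken from the talk.

```python
from collections import defaultdict

def lineage(matches, good):
    """Join `matches` with `good` on review id, then project onto movie id.

    matches: rows (event, review_id, movie_id), one event per uncertain tuple
    good:    rows (event, review_id)
    Each joined row contributes a conjunction of its two input events.
    Projecting onto movie_id collects those conjunctions into a disjunction,
    so every output tuple carries a DNF (a list of clauses).
    """
    dnf = defaultdict(list)
    for e_match, rid, mid in matches:
        for e_good, rid2 in good:
            if rid == rid2:
                dnf[mid].append(frozenset({e_match, e_good}))
    return dict(dnf)

matches = [("m1", "r1", "Movie A"), ("m2", "r2", "Movie A"), ("m3", "r2", "Movie B")]
good = [("g1", "r1"), ("g2", "r2")]
events = lineage(matches, good)
# events["Movie A"] is the DNF (m1 AND g1) OR (m2 AND g2)
```

The probability of each output tuple is then the probability that its DNF is satisfied, given the probabilities of the input events.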
DNF Sampling at a High Level
• Estimate p(t), the probability that the DNF is satisfied
• Do this for each output tuple, t
• #P-hard [Valiant79], even for conjunctive queries only [RDS06,DS04]
• Randomized approximation [LK84]
• Simulation reduces uncertainty about p(t)
[Figure: interval of uncertainty about p(t) on the 0.0 to 1.0 axis]
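A sketch of one standard randomized estimator for the probability of a DNF, in the Karp-Luby style. It assumes a monotone DNF (clauses are sets of positive events) over independent events; the function name `karp_luby` and the data layout are illustrative, and the talk does not say that its simulation steps look exactly like this.

```python
import random
from math import prod

def karp_luby(clauses, prob, n_samples, rng):
    """Estimate P(C1 or C2 or ...) for a monotone DNF over independent events.

    clauses: list of frozensets of event names (each clause is a conjunction)
    prob:    dict mapping event name -> probability that the event is true
    """
    weights = [prod(prob[v] for v in c) for c in clauses]  # P(clause holds)
    total = sum(weights)
    if total == 0:
        return 0.0
    variables = sorted(set().union(*clauses))
    acc = 0.0
    for _ in range(n_samples):
        # Pick a clause proportionally to its probability ...
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # ... and sample a world conditioned on that clause being true.
        world = {v: (v in clauses[i]) or (rng.random() < prob[v]) for v in variables}
        # Divide by the number of satisfied clauses to undo double counting.
        covered = sum(all(world[v] for v in c) for c in clauses)
        acc += 1.0 / covered
    return total * acc / n_samples

rng = random.Random(1)
prob = {"a": 0.5, "b": 0.5, "c": 0.5}
estimate = karp_luby([frozenset({"a", "b"}), frozenset({"b", "c"})], prob, 20000, rng)
# The exact value here is P(b and (a or c)) = 0.5 * 0.75 = 0.375.
```

Each additional sample tightens the confidence interval around p(t), which is the sense in which simulation reduces uncertainty.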
Naïve Query Processing
• Naïve algorithm (PTIME): simulate until all intervals are small (“epsilon”-small)
• Can we do better?
[Figure: confidence intervals on the 0.0 to 1.0 axis for Christopher Walken, Samuel L. Jackson, Harvey Keitel, and Bruce Willis]
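A sketch of the naïve strategy: simulate every tuple equally until every interval has half-width at most epsilon, then rank. The Hoeffding bound used for the interval width is an assumption made for this illustration (the talk does not specify the bound), and `samplers` is a hypothetical stand-in for one simulation step per tuple.

```python
import math
import random

def hoeffding_half_width(n, delta):
    """Half-width of a (1 - delta) Hoeffding interval after n 0/1 samples."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def naive_top_k(samplers, k, eps, delta, batch=100):
    """Simulate all tuples until every interval is eps-small, then take the top k.

    samplers: dict mapping tuple name -> zero-argument function returning 0/1
    Returns (top-k names, total number of simulation steps spent).
    """
    hits = {t: 0 for t in samplers}
    n = 0
    while n == 0 or hoeffding_half_width(n, delta) > eps:
        for t, draw in samplers.items():
            hits[t] += sum(draw() for _ in range(batch))
        n += batch
    estimates = {t: hits[t] / n for t in samplers}
    ranked = sorted(estimates, key=estimates.get, reverse=True)
    return ranked[:k], n * len(samplers)

rng = random.Random(0)
true_p = {"a": 0.9, "b": 0.5, "c": 0.1}
samplers = {t: (lambda p=p: rng.random() < p) for t, p in true_p.items()}
top, cost = naive_top_k(samplers, k=1, eps=0.05, delta=0.01)
```

The cost grows with the number of tuples even when most of them are clearly outside the top k, which is the inefficiency the question "Can we do better?" points at.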
A Better Method: Multisimulation
• Separate Top-K with few simulations
  • Concentrate on intervals in Top-K
• Asymptotically, confidence intervals are nested
• Compare against OPT
  • “knows” which intervals to simulate
[Figure: confidence intervals on the 0.0 to 1.0 axis for Christopher Walken, Samuel L. Jackson, Harvey Keitel, and Bruce Willis]
The Critical Region
• The critical region is the interval (k-th highest min, (k+1)-st highest max)
[Figure: intervals on the 0.0 to 1.0 axis with the critical region shown for k = 2]
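A small sketch of the critical region computed from a list of confidence intervals, following the definition on this slide. The function names and the example intervals are illustrative.

```python
def critical_region(intervals, k):
    """Return (c, d): c is the k-th highest lower bound and
    d is the (k+1)-st highest upper bound. Requires len(intervals) > k."""
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]

def top_k_separated(intervals, k):
    """The top k are separated from the rest once the critical region is empty."""
    c, d = critical_region(intervals, k)
    return c >= d

intervals = [(0.1, 0.5), (0.4, 0.9), (0.6, 0.8), (0.2, 0.7)]
c, d = critical_region(intervals, k=2)
# Lower bounds sorted descending: 0.6, 0.4, 0.2, 0.1, so c = 0.4.
# Upper bounds sorted descending: 0.9, 0.8, 0.7, 0.5, so d = 0.7.
```

While c < d, some interval still straddles the boundary between the top k and the rest, so more simulation is needed.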
Three Simple Rules: Rule 1
• Pick a “Double Crosser”
• OPT must pick this too
Three Simple Rules: Rule 2
• If all crossers are lower crossers, or all are upper crossers, pick the maximal one
• OPT must pick this too
Three Simple Rules: Rule 3
• Pick an upper and a lower crosser
• OPT may pick only one of these two
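The three rules can be sketched as a single selection step. The precise crosser definitions below (strict versus non-strict inequalities), the choice of the widest interval as "maximal", and the tie fallback are assumptions made for this illustration; the slides name the rules but do not spell these details out.

```python
def critical_region(intervals, k):
    """(k-th highest lower bound, (k+1)-st highest upper bound)."""
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]

def choose_intervals(intervals, k):
    """Return the indices of the intervals to simulate next ([] when done)."""
    c, d = critical_region(intervals, k)
    if c >= d:                      # empty critical region: top k is separated
        return []
    width = lambda i: intervals[i][1] - intervals[i][0]
    double, lower, upper = [], [], []
    for i, (lo, hi) in enumerate(intervals):
        if lo < c and hi > d:       # sticks out of both ends of the region
            double.append(i)
        elif lo < c and hi > c:     # enters the region from below
            lower.append(i)
        elif lo < d and hi > d:     # leaves the region at the top
            upper.append(i)
    if double:                      # Rule 1: a double crosser
        return [max(double, key=width)]
    if lower and upper:             # Rule 3: one upper and one lower crosser
        return [max(upper, key=width), max(lower, key=width)]
    crossers = lower or upper       # Rule 2: only one kind, take the maximal
    if not crossers:                # assumed fallback for ties on c or d
        crossers = [i for i, (lo, hi) in enumerate(intervals) if lo < d and hi > c]
    return [max(crossers, key=width)]

picked = choose_intervals([(0.1, 0.5), (0.4, 0.9), (0.6, 0.8), (0.2, 0.7)], k=2)
# With c = 0.4 and d = 0.7 there is no double crosser; intervals 1 and 2 are
# upper crossers and intervals 0 and 3 are lower crossers, so Rule 3 applies.
```

A driver loop would call `choose_intervals`, run one simulation step on each returned interval to shrink it, and repeat until the result is empty.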
Multisimulation is a 2-Approx
• Thm: Multisimulation performs at most twice as many simulations as OPT
• And no deterministic algorithm can do better on every instance
• Extensions
  • Top-K Set (shown)
  • Anytime (produce from 1 to k)
  • Rank (produce top k ranked)
  • All (rank all intervals)
Experiment Details: Uncertain tuples
Running Time
Running Time (SS)
“Find all years in which Anthony Hopkins was in a highly rated movie”
• Small number of output tuples (33)
• Small DNFs per output (avg. 20.4, max. 63)
Running Time (LL)
“Find all directors who have a highly rated drama but low rated comedy”
• Large number of output tuples (1415)
• Large DNFs per output (avg. 234.8, max. 9088)
Conclusions
• Mystiq is a general purpose probabilistic database
• Multisimulation and Logical Optimization
  • key to performance on large data sets
• Advert: Demo on my laptop
Running Time (SL)
“Find all actors in Pulp Fiction who appeared in two very bad movies in the five years before appearing in Pulp Fiction”
• Small number of output tuples (33)
• Large DNFs per output (avg. 117.7, max. 685)
Running Time (LS)
“Find all directors in the 80s who had a highly rated movie”
• Large number of output tuples (3259)
• Small DNFs per output (avg. 3.03, max. 30)