An overview of the Mystiq System. Christopher Ré, Dan Suciu and the Mystiq Team, University of Washington. One-slide overview. Data are uncertain in many applications. Business: Dedup, Info. Extraction. Data from the physical world: RFID. Probabilistic DBs (pDBs) manage uncertainty.
An overview of the Mystiq System Christopher Ré, Dan Suciu and the Mystiq Team University of Washington
One-slide overview • Data are uncertain in many applications • Business: Dedup, Info. Extraction • Data from the physical world: RFID • Probabilistic DBs (pDBs) manage uncertainty • Query and build applications on uncertain data • Value: higher recall, without loss of precision • This talk: an overview of Mystiq • DEMO
Outline • Motivation • Mystiq’s Data Model • 3 Techniques used by Mystiq: 1. Generic SELECT-FROM-WHERE (SFW) queries 2. Safe Queries 3. Materialized Views
[R, Dalvi & S ’07] Example: Alice Looks for Movies • I’d like to know which movies are really good… • Internet Movie Database (IMDB): • Lots of data! • Well maintained and clean • But no reviews! • Think: Enterprise Data
On the web there are lots of reviews… • “…Pulp Fiction was a great..” • “…Pul Fiction was awful…” • What is the title of the movie in the review? • Which movie does that title match in my DB? • Is the review positive or negative? • Should I trust the reviewer? • Alice may need (Buzzwords): • Information Extraction • Fuzzy joins • Sentiment analysis • Social networks • Alice is forced to deal with uncertainty
“Find actors in Pulp Fiction who appeared in two bad movies five years earlier” • The answer combines uncertainty from information extraction, fuzzy joins, etc. • A probabilistic database helps Alice store and query uncertain data • Alice may need (Buzzwords): • Information Extraction • Fuzzy joins • Sentiment analysis • Social networks
Alice needs Fuzzy Joins • Titles don’t match between the clean database (IMDB) and the reviews
[Gravano et al. ’01, Arasu ’06] Result of a Fuzzy Join • Table: TitleReviewMatchp • Higher scores, more likely to match
Queries over Fuzzy Joins • Tables: IMDB, TitleReviewMatchp, Reviews • Answers are ranked!
• Who reviewed movies made in 1935? Answer: SELECT DISTINCT z.By FROM IMDB x, TitleReviewMatchp y, Amazon z WHERE x.title = y.title AND x.year = 1935 AND y.review = z.review
• Find movies reviewed by Jim and Joe. Answer: SELECT DISTINCT x.Title FROM IMDB x, TitleReviewMatchp y1, Amazon z1, TitleReviewMatchp y2, Amazon z2 WHERE . . . z1.By = 'Joe' . . . z2.By = 'Jim' . . .
Hasn’t this been solved? (an analogy to keep in mind) SCALE Impact: Fortune 500 companies rely on DBs, but how many have theorem provers?
Mystiq Design Goals: scale. • Middleware/Query rewriting system. • RDBMS does heavy lifting. • In apps, lots of certain data. • Research Focus: Efficient query evaluation • Philosophy: Change as little as possible. • Restricted inference at large scales • Use DB tricks: static analysis, data complexity, materialized views.
Outline • Motivation • Mystiq’s Data Model • 3 Techniques used by Mystiq: 1. Generic SFW queries 2. Safe Queries 3. Materialized Views
[Barbara et al. ’92] Mystiq’s BID tables • Table HasObjectp has three column groups: Keys, Non-keys, Probability • What does it mean? • NB: Probabilities need not add to 1
[Fagin, Halpern, Megiddo ’90] Possible Worlds Semantics • The pDB table HasObjectp defines a distribution over possible worlds of HasObject • 3 * 4 = 12 worlds • Example world probabilities: p1p3, p1p4, p1(1 - p3 - p4 - p5)
Possible Worlds Query Semantics • Q = “John has laptop77 and doesn’t have book302” • Worlds that satisfy Q have probabilities p1p3, p1p5 and p1(1 - p3 - p4 - p5) • P[Q] = p1p3 + p1p5 + p1(1 - p3 - p4 - p5) = p1(1 - p4) • Goal: compute P[Q] without expanding all worlds
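The possible-worlds semantics can be checked by brute force on a toy instance. The sketch below is illustrative only: the block layout (two alternatives in one block, three in the other), the tuple names `t1`..`t5` and the probability values are assumptions chosen to match the slide's "3 * 4 = 12 worlds" and P[Q] = p1(1 - p4), not the contents of the original HasObjectp table.

```python
from itertools import product

# Hypothetical BID table: each block lists (tuple_id, probability).
# Alternatives inside a block are mutually exclusive; blocks are independent.
blocks = [
    [("t1", 0.5), ("t2", 0.3)],               # block 1: 2 alternatives + "none"
    [("t3", 0.2), ("t4", 0.4), ("t5", 0.1)],  # block 2: 3 alternatives + "none"
]

def possible_worlds(blocks):
    """Yield (world, probability) for every possible world."""
    choices = []
    for block in blocks:
        rest = 1.0 - sum(p for _, p in block)  # probabilities need not add to 1
        choices.append(block + [(None, rest)])
    for combo in product(*choices):
        world = frozenset(t for t, _ in combo if t is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

def query_probability(blocks, q):
    """Sum the probabilities of the worlds in which q holds."""
    return sum(p for w, p in possible_worlds(blocks) if q(w))

worlds = list(possible_worlds(blocks))
# Q: tuple t1 is present and tuple t4 is absent.
q = lambda w: "t1" in w and "t4" not in w
p_q = query_probability(blocks, q)
```

With these assumed numbers `p_q` agrees with p1(1 - p4). The enumeration grows exponentially with the number of blocks, which is why the goal is to avoid expanding the worlds.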
Outline • Motivation • Mystiq’s Data Model • 3 Techniques used by Mystiq: 1. Generic SFW queries 2. Safe Queries 3. Materialized Views
[Fuhr & Roellke ’97, Graedel et al. ’98, Dalvi & S ’04] SFW Query via Intensional Evaluation • Goal: make relational operators (duplicate-removing Π, σ, ⋈) compute a Boolean expression f • Tuples = variables in the expression • Pr[q] is reduced to Pr[f is SAT] • Approximate Pr[f is SAT] • NB: f is also known as lineage
Approximating Tuple Answers • Q = “Find actors in Pulp Fiction who appeared in two bad movies five years earlier” • SQL queries have provably fast approximate inference (LK) • We don’t know the probability p (between 0.0 and 1.0) that ‘Christopher Walken’ is in the output of Q • Run many “simulations” to reduce the uncertainty
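A minimal sketch of simulation-based inference on a lineage formula, assuming "LK" refers to the Karp-Luby estimator for DNF formulas and assuming the tuples are independent. The lineage, the tuple names and the probabilities are invented for illustration; `exact_prob` is only a brute-force check and is not part of the technique.

```python
import random
from itertools import product

def clause_prob(clause, p):
    """Probability that every tuple of a conjunctive clause is present."""
    prob = 1.0
    for v in clause:
        prob *= p[v]
    return prob

def karp_luby(clauses, p, n_samples, rng):
    """Estimate Pr[C1 or ... or Cm] for a monotone DNF over independent tuples."""
    weights = [clause_prob(c, p) for c in clauses]
    total = sum(weights)
    variables = sorted(p)
    hits = 0
    for _ in range(n_samples):
        # Pick a clause proportionally to its probability.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample a world conditioned on that clause being true.
        world = {v: (v in clauses[i]) or (rng.random() < p[v]) for v in variables}
        # Count the sample only if clause i is the first satisfied clause.
        first = next(j for j, c in enumerate(clauses) if all(world[v] for v in c))
        if first == i:
            hits += 1
    return total * hits / n_samples

def exact_prob(clauses, p):
    """Brute-force check: enumerate all truth assignments."""
    variables = sorted(p)
    total = 0.0
    for bits in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, bits))
        if any(all(world[v] for v in c) for c in clauses):
            weight = 1.0
            for v in variables:
                weight *= p[v] if world[v] else 1.0 - p[v]
            total += weight
    return total

# Hypothetical lineage: (x1 and y1) or (x1 and y2) or (x2 and y2).
p = {"x1": 0.5, "x2": 0.5, "y1": 0.5, "y2": 0.5}
lineage = [("x1", "y1"), ("x1", "y2"), ("x2", "y2")]
estimate = karp_luby(lineage, p, 20000, random.Random(0))
exact = exact_prob(lineage, p)
```

More samples shrink the confidence interval around the estimate, which is the "run many simulations" step on the slide.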
[R, Dalvi & S ’07] Motivation for Top-K for SFW queries • “Find the top (most-likely) actor in Pulp Fiction who appeared in two bad movies five years earlier” • Naïve: run LK until all intervals are small • LK is fast in theory… • Lots of wasted effort. Can we do better? • Figure: “confidence intervals” on [0.0, 1.0] for Christopher Walken, Samuel L. Jackson, Harvey Keitel and Bruce Willis; each interval contains the true probability
[R, Dalvi & S ’07] A Better Method: Multisimulation • Goal: Separate Top-K with few simulations • LK is more expensive than SQL, reduce this cost • Ranking is all that is important • Intuition: • Concentrate LK on intervals in Top-K • View intervals as “nested” or “shrinking”
[R, Dalvi & S ’07] Key Idea: Critical Region • The critical region is the interval (k-th highest min, (k+1)-st highest max) • Figure: intervals on [0.0, 1.0], for k = 2
[R, Dalvi & S ’07] Key Idea: Critical Region • The critical region is the interval (k-th highest min, (k+1)-st highest max) • Figure: intervals on [0.0, 1.0], for k = 2 • Separated the top 2
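The definition above can be sketched directly from the interval endpoints. The interval values below are invented for illustration, and the stopping test (the region is empty once its lower end reaches its upper end) is an assumption about how "separated" is decided; it ignores ties between degenerate intervals.

```python
def critical_region(intervals, k):
    """Return (k-th highest lower bound, (k+1)-st highest upper bound).

    intervals maps an answer to its (lo, hi) confidence interval and
    must contain more than k answers.
    """
    lows = sorted((lo for lo, _ in intervals.values()), reverse=True)
    highs = sorted((hi for _, hi in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]

def top_k_separated(intervals, k):
    """The top k are separated once the critical region is empty."""
    c, d = critical_region(intervals, k)
    return c >= d

# Hypothetical intervals before and after further simulation.
before = {
    "Christopher Walken": (0.55, 0.95),
    "Samuel L. Jackson": (0.40, 0.80),
    "Harvey Keitel": (0.20, 0.60),
    "Bruce Willis": (0.05, 0.35),
}
after = {
    "Christopher Walken": (0.70, 0.90),
    "Samuel L. Jackson": (0.62, 0.78),
    "Harvey Keitel": (0.30, 0.58),
    "Bruce Willis": (0.05, 0.35),
}
region_before = critical_region(before, 2)
region_after = critical_region(after, 2)
```

In `before`, the region is non-empty, so more simulation is needed. In `after`, the two best lower bounds lie above every other upper bound, so the top 2 are fixed even though their exact probabilities are still unknown.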
DEMO See how Mystiq uses the critical region to reduce unnecessary simulations.
Three Simple Rules: Rule 1 • Pick a “Double Crosser” • OPT must pick this too • Compare vs. OPT, which “knows” which intervals to simulate
Three Simple Rules: Rule 2 • All lower/upper crossers then maximal • OPT must pick this too • Compare vs. OPT, which “knows” which intervals to simulate
Three Simple Rules: Rule 3 • Pick an upper and a lower crosser • OPT may only pick 1 of these two • Compare vs. OPT, which “knows” which intervals to simulate
[R, Dalvi & S ’07] Multisimulation Performance • Thm: Multisimulation performs at most twice as many simulations as OPT • And no deterministic algorithm can do better on every instance • Practice: very slow without low-level optimization • Still slow with current techniques • Open question! • Slow vs. SQL, not vs. inference
Outline • Motivation/Type of Apps considered • Mystiq’s Data Model • 3 Techniques used by Mystiq: 1. Generic SFW queries 2. Safe Queries 3. Materialized Views
[Fuhr & Roellke ’97, Dalvi & S ’04] Extensional Query Evaluation • Goal: make relational operators (Π, σ, ⋈) compute probabilities • Π removes duplicates: the output probability is the probability that “not all are false” • Why? It’s SQL-scale and SQL-fast
[Fuhr & Roellke ’97, Dalvi & S ’04] Extensional Plan to SQL • Query: SELECT DISTINCT loc FROM Reviewers • Plan: Π{loc}(Reviewers) • Translation: SELECT loc, 1 - PRODUCT(1-p) AS p FROM Reviewers GROUP BY loc • Important point: extensional evaluation is SQL, so it is SQL-fast • So pDBs are just SQL, but…
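The translated query computes, per group, one minus the product of (1 - p). A small sketch of that aggregate outside the database follows, assuming the tuples of Reviewers are independent; the rows are invented for illustration.

```python
from collections import defaultdict

def independent_project(rows):
    """Duplicate-removing projection over independent tuples.

    rows is an iterable of (loc, p). For each loc the result is
    1 - PRODUCT(1 - p), the probability that not all tuples are false.
    """
    none_prob = defaultdict(lambda: 1.0)
    for loc, p in rows:
        none_prob[loc] *= 1.0 - p
    return {loc: 1.0 - q for loc, q in none_prob.items()}

# Hypothetical Reviewers table, already restricted to (loc, p).
reviewers = [("Seattle", 0.5), ("Seattle", 0.5), ("Portland", 0.9)]
result = independent_project(reviewers)
```

Two independent Seattle tuples with probability 0.5 each give 1 - 0.5 * 0.5 = 0.75 for Seattle.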
“Cities where someone reviewed ‘Iron Man’”: SELECT DISTINCT x.City FROM Personp x, Reviewedp y WHERE x.Name = y.Reviewer AND y.Movie = 'Iron Man' • Two plans that order Π and ⋈ differently: one is wrong (its inputs are not independent!), one is correct • The result depends on the plan!
[Dalvi & S ’04] Safe Plans • A plan that correctly computes probabilities (extensionally) is called a safe plan • Query Compilation = finding this condition • i.e., it is a syntactic condition • Intuition: a plan is safe if it only multiplies independent probabilities
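To illustrate why the plan matters, the sketch below compares two extensional plans against a brute-force possible-worlds computation. Everything here is an assumption for illustration: the tables are tuple-independent, names are unique in `person`, a reviewer may have several uncertain matching reviews, and the rows and probabilities are invented.

```python
from itertools import product

# Hypothetical independent tuples: (name, city, p) and (reviewer, review_id, p).
person = [("Ann", "Seattle", 0.5), ("Bob", "Seattle", 0.4)]
reviewed = [("Ann", "r1", 0.5), ("Ann", "r2", 0.5), ("Bob", "r3", 0.5)]

def unsafe_plan(person, reviewed, city):
    """Join first, then project: treats join tuples as independent (wrong)."""
    none = 1.0
    for name, c, pp in person:
        for reviewer, _, pr in reviewed:
            if c == city and name == reviewer:
                none *= 1.0 - pp * pr
    return 1.0 - none

def safe_plan(person, reviewed, city):
    """Project reviews onto the reviewer first, then join, then project."""
    none = 1.0
    for name, c, pp in person:
        if c != city:
            continue
        no_review = 1.0
        for reviewer, _, pr in reviewed:
            if reviewer == name:
                no_review *= 1.0 - pr
        none *= 1.0 - pp * (1.0 - no_review)
    return 1.0 - none

def exact(person, reviewed, city):
    """Ground truth by enumerating every possible world."""
    tuples = [("P",) + t for t in person] + [("R",) + t for t in reviewed]
    total = 0.0
    for present in product([True, False], repeat=len(tuples)):
        prob = 1.0
        for t, here in zip(tuples, present):
            prob *= t[-1] if here else 1.0 - t[-1]
        live = [t for t, here in zip(tuples, present) if here]
        people = {t[1] for t in live if t[0] == "P" and t[2] == city}
        reviewers = {t[1] for t in live if t[0] == "R"}
        if people & reviewers:
            total += prob
    return total

p_unsafe = unsafe_plan(person, reviewed, "Seattle")
p_safe = safe_plan(person, reviewed, "Seattle")
p_exact = exact(person, reviewed, "Seattle")
```

The unsafe plan multiplies the two join tuples for Ann as if they were independent, although both depend on the same Person tuple. The safe plan only ever multiplies probabilities of disjoint sets of base tuples, so it matches the enumeration.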
DEMO See how safe plans allow query answering at SQL speed
[Dalvi & S ’04] Thm: The algorithm is complete • Theorem: the following are equivalent (no self-joins): • Q has PTIME data complexity • Q admits an extensional plan (and one finds it in PTIME) • Q does not have Qbad as a subquery • Qbad :- R(x), S(x,y), T(y); its data complexity is #P-complete • Bottom line: if there is a plan, we find it. If we don’t find a plan, the query is provably hard • NB: the algorithm never looks at the data, so this is query compilation
Outline • Motivation/Type of Apps considered • Mystiq’s Data Model • 3 Techniques used by Mystiq: 1. Generic SFW queries 2. Safe Queries 3. Materialized Views
[R&S 07] Views in Block-based pDBs by example • Tables: R(Chef, Dish, Rate) Rated; W(Chef, Restaurant) WorksAt; S(Restaurant, Dish) Serves • Tuple probabilities: p1, q1, p2, q2 • “Chef and restaurant pairs where chef serves a highly rated dish”: V(c,r) :- W(c,r), S(r,d), R(c,d,'High') • Example: {c → 'Tom', r → 'D. Lounge', d → 'Crab'} gives 0.72 = 0.9 * 0.8
[R&S 07] Eager Materialization of BID Views (example coming…) • Idea: throw away the lineage, process views • Why? • Lineage can be much larger than the view • Can do expensive prob. computations off-line • Use the view directly in the safe-plan optimizer • Interleave Monte-Carlo sampling with safe plans • pDB analog of Materialized Views • Allows GB-scale pDB processing • Catch: the tuples in the view must be independent for any instance
[R&S 07] Eager Materialization of pDB Views • Tables: R(Chef, Dish, Rate) Rated; W(Chef, Restaurant) WorksAt; S(Restaurant, Dish) Serves • Tuple probabilities: p1, q1, p2, q2 • “Chef and restaurant pairs where chef serves a highly rated dish”: V(c,r) :- W(c,r), S(r,d), R(c,d,'High') • Can we understand V without lineage? • Not every probabilistic view is good for materialization!
[R&S 07] Is a view a good candidate for materialization? • Thm: Deciding if a view is representable as a BID is decidable & NP-hard (complete for Π2) • Good news: there is a simple but cautious PTIME test, i.e., a sufficient condition • In the wild, the simple test works, i.e., it is necessary as well • NB: Can take into account a query q, i.e., can we use V1 without the lineage to answer q?
Uses for Views • Precomputation • Can make a #P-hard query safe • No magic: precompute the hard part • Intermediate Results • Approximate Views • All views have small lineage [R&S 08]
Conclusions • Discussed the Mystiq System • http://mystiq.cs.washington.edu • 3 strategies for processing: • Multisimulation • Safe Plans • Materialized Views • Allows interesting, large-scale applications
[R&S 07] Eager Materialization of pDB Views • Tables: R(Chef, Dish, Rate) Rated; W(Chef, Restaurant) WorksAt; S(Restaurant, Dish) Serves • Tuple probabilities: p1, q1, p2, q2 • “Chefs that serve a highly rated dish”: V2(c) :- W(c,r), S(r,d), R(c,d,'High') • Can we understand V2 without lineage? • Obs: if no probabilistic tuple is shared by two chefs, then they are independent • Where could such a tuple live? • V2 is a good choice for materialization
[R&S 07] Views in BID pDBs • Tables: R(Chef, Dish, Rate) Rated; W(Chef, Restaurant) WorksAt; S(Restaurant, Dish) Serves • Tuple probabilities: p1, q1, p2, q2 • “Chef and restaurant pairs where chef serves a highly rated dish”: V(c,r) :- W(c,r), S(r,d), R(c,d,'High') • The view has correlations • Thm [R, Dalvi, S ’07]: BID tables are complete with the addition of views