Similarity Joins for Strings and Sets William Cohen
WHIRL approach: SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a=S.a and S.b=T.b
Query Q: SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a~S.a and S.b~T.b (~ means TFIDF-similar)
WHIRL approach:
• Link items as needed by Q.
• Incrementally produce a ranked list of possible links, with “best matches” first.
• The user (or downstream process) decides how much of the list to generate and examine.
WHIRL queries
• Assume two relations:
  review(movieTitle, reviewText): archive of reviews
  listing(theatre, movieTitle, showTimes, …): now showing
WHIRL queries
• “Find reviews of sci-fi comedies” [movie domain]
  FROM review as r SELECT * WHERE r.text~’sci fi comedy’
  (like standard ranked retrieval of “sci-fi comedy”)
• “Where is [that sci-fi comedy] playing?”
  FROM review as r, listing as s SELECT * WHERE r.title~s.title and r.text~’sci fi comedy’
  (best answers: the titles are similar to each other, e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005”, and the review text is similar to “sci-fi comedy”)
WHIRL queries
• Similarity is based on TFIDF: rare words are most important.
  (Years are common in the review archive, so they have low weight.)
• Search for high-ranking answers uses inverted indices…
  - It is easy to find the (few) items that match on “important” terms
  - Search for strong matches can prune “unimportant terms”
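To make the weighting concrete, here is a minimal sketch of TFIDF-weighted cosine similarity in Python. The weighting formula (log(1 + tf) times log(N / df)), the whitespace tokenizer, and the function names tfidf_vectors and cosine are assumptions made for illustration; the slides do not specify WHIRL's exact formula. The toy titles are made up so that “2005” is common and the title words are rare.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one unit-length {term: weight} dict per document.

    Assumed weighting (not taken from the slides): log(1 + tf) * log(N / df).
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: math.log(1 + c) * math.log(n / df[t]) for t, c in tf.items()}
        # Normalize to unit length; fall back to 1.0 if every weight is zero.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v[t] for t, w in u.items() if t in v)

titles = [
    "the hitchhikers guide to the galaxy 2005",
    "hitchhikers guide to the galaxy",
    "the matrix 2005",
    "the pianist 2005",
]
vecs = tfidf_vectors(titles)
print(cosine(vecs[0], vecs[1]))  # share the rare title words
print(cosine(vecs[0], vecs[2]))  # share only common words ("the", "2005")
```

In this toy collection the first pair scores far higher than the second, because the shared terms “the” and “2005” carry little or no weight.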
A* (best-first) search
• Find shortest path between start n0 and goal ng: goal(ng)
• Define f(n) = g(n) + h(n)
  • g(n) = MinPathLength(n0,n)
  • h(n) = estimate of path length from n to ng
• Algorithm:
  • OPEN = {n0}
  • While OPEN is not empty:
    • remove “best” (minimal f) node n from OPEN
    • if goal(n), output path n0→n and stop
    • otherwise, add CHILDREN(n) to OPEN
    • and record their MinPathLength parents
empty circles = open set, filled = closed set; color = distance from the start (the greener, the further)
A* (best-first) search
• Find shortest path between start n0 and goal ng: goal(ng)
• Define f(n) = g(n) + h(n)
  • g(n) = MinPathLength(n0,n)
  • h(n) = lower-bound of path length from n to ng
• Algorithm:
  • OPEN = {n0}
  • While OPEN is not empty:
    • remove “best” (minimal f) node n from OPEN
    • if goal(n), output path n0→n and stop
    • otherwise, add CHILDREN(n) to OPEN
    • and record their MinPathLength parents
    • …note this is easy for a tree
With a lower-bound h, h is “admissible” and A* will always return the lowest-cost path.
A* (best-first) search for best K paths
• Find shortest path between start n0 and goal ng: goal(ng)
• Define f(n) = g(n) + h(n)
  • g(n) = MinPathLength(n0,n)
  • h(n) = lower-bound of path length from n to ng
• Algorithm:
  • OPEN = {n0}
  • While OPEN is not empty:
    • remove “best” (minimal f) node n from OPEN
    • if goal(n), output path n0→n
      • and stop if you’ve output K answers
    • otherwise, add CHILDREN(n) to OPEN
    • and record their MinPathLength parents
    • …note this is easy for a tree
With a lower-bound h, h is “admissible” and A* will always return the K lowest-cost paths.
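The best-K loop can be sketched in a few lines of Python. This is an illustrative sketch, not WHIRL's implementation: it assumes the search space is a tree (so no closed set is needed and each node carries its own path), and the names best_k_paths, children, is_goal, and the toy tree are all hypothetical.

```python
import heapq
from itertools import count

def best_k_paths(start, children, is_goal, h, k):
    """Best-first search for the k lowest-cost goal paths in a tree.

    children(n) returns (child, edge_cost) pairs; h(n) must be a lower
    bound on the remaining cost from n to a goal (admissible).
    """
    tie = count()  # tie-breaker so the heap never compares nodes or paths
    open_heap = [(h(start), next(tie), 0, start, [start])]
    results = []
    while open_heap and len(results) < k:
        f, _, g, node, path = heapq.heappop(open_heap)  # minimal f first
        if is_goal(node):
            results.append((g, path))  # output this path, keep searching
            continue
        for child, cost in children(node):
            g2 = g + cost
            heapq.heappush(open_heap,
                           (g2 + h(child), next(tie), g2, child, path + [child]))
    return results

# Hypothetical toy tree: leaves are goals, edge costs are arbitrary.
tree = {
    "root": [("a", 1), ("b", 4)],
    "a": [("a1", 5), ("a2", 2)],
    "b": [("b1", 1)],
}
top2 = best_k_paths(
    "root",
    children=lambda n: tree.get(n, []),
    is_goal=lambda n: n not in tree,
    h=lambda n: 0,  # trivially admissible lower bound
    k=2,
)
print(top2)
```

With h fixed at 0 this degenerates to uniform-cost search; a tighter lower bound would expand fewer nodes while returning the same answers.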
Inference in WHIRL
• “Best-first” search: pick state s that is “best” according to f(s)
• Suppose graph is a tree, and for all s, s’, if s’ is reachable from s then f(s)>=f(s’).
• Then A* outputs the globally best goal state s* first, and then next best, ...
Using A* For WHIRL Queries
• Assume two relations:
  review(movieTitle, reviewText): archive of reviews
  listing(theatre, movieTitle, showTimes, …): now showing
A* search to solve WHIRL queries
• “Find reviews of sci-fi comedies” [movie domain]
  FROM review as r SELECT * WHERE r.text~’sci fi comedy’
  As a conjunctive query: review(Title,Text), Text ~ “sci fi comedy”
• “Where is [that sci-fi comedy] playing?”
  FROM review as r, listing as s SELECT * WHERE r.title~s.title and r.text~’sci fi comedy’
  (best answers: the titles are similar to each other, e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005”, and the review text is similar to “sci-fi comedy”)
  As a conjunctive query: review(TitleA,Text), listing(TitleB,Where,When), Text ~ “sci fi comedy”, TitleA ~ TitleB
• Answer to Q is an assignment Θ to all the variables in Q
Using A* For WHIRL Queries review(TitleA,Text), listing(Where, TitleB,When), Text ~ “sci fi comedy”, TitleA ~ TitleB
Using A* For WHIRL Queries review(TitleA,Text), listing(TitleB,Where,When), Text ~ “sci fi comedy”, TitleA ~ TitleB
A* (best-first) search for best K paths
• Find shortest path between start n0 and goal ng: goal(ng)
• Define f(n) = g(n) + h(n)
  • g(n) = MinPathLength(n0,n)
  • h(n) = lower-bound of path length from n to ng
• Algorithm:
  • OPEN = {n0}, where n0 is an empty assignment to variables
  • While OPEN is not empty:
    • remove “best” (minimal f) node n from OPEN
    • if goal(n), output path n0→n
      • and stop if you’ve output K answers
    • otherwise, add CHILDREN(n) to OPEN
      • where CHILDREN(n) binds a few more variables
A* (best-first) search for best K paths
• Find shortest path between start n0 and goal ng: allVarsBound(ng)
• Define f(n) = g(n) + h(n)
  • g(n) = MinPathLength(n0,n)
  • h(n) = lower-bound of path length from n to ng
• Algorithm:
  • OPEN = {n0}, where n0 is an empty assignment to variables
  • While OPEN is not empty:
    • remove “best” (minimal f) node n from OPEN
    • if allVarsBound(n), output path n0→n
      • and stop if you’ve output K answers
    • otherwise, add CHILDREN(n) to OPEN
      • where CHILDREN(n) binds a few more variables
A* (best-first) search for best K paths
• Find shortest path between start n0 and goal ng: allVarsBound(ng)
• Define f(n) = g(n) + h(n)
  • g(n) = MinPathLength(n0,n)
  • h(n) = lower-bound of path length from n to ng
• Algorithm:
  • OPEN = {n0}, where n0 is an empty assignment θ and an empty “exclusion list” E
  • While OPEN is not empty:
    • remove “best” (minimal f) node n from OPEN
    • if allVarsBound(n), output path n0→n
      • and stop if you’ve output K answers
    • otherwise, add CHILDREN(n) to OPEN
      • where CHILDREN(n) binds a few more variables
Inference in WHIRL
• Explode p(X1,X2,X3): find all DB tuples <p,a1,a2,a3> for p and bind Xi to ai.
• Constrain X~Y: if X is bound to a and Y is unbound,
  • find DB column C to which Y should be bound
  • pick a term t in X, find proper inverted index for t in C, and bind Y to something in that index
• Keep track of t’s used previously, and don’t allow Y to contain one.
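A single “constrain” step might look roughly like the Python sketch below. It is an illustration under assumptions, not WHIRL's actual code: documents are hand-made {term: weight} dicts, the term is chosen greedily by x's weight times the largest weight in the index (an assumed selection rule), and the names build_index and constrain are hypothetical. The function returns the candidate bindings for Y plus the exclusion list a sibling state would carry forward.

```python
from collections import defaultdict

def build_index(column_vecs):
    """Inverted index for one column: term -> list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(column_vecs):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def constrain(x_vec, index, column_vecs, excluded):
    """One constrain step for X~Y with X bound to x_vec and Y unbound.

    Returns (candidate doc ids for Y, exclusion list extended with the
    term that was just used).
    """
    usable = [t for t in x_vec if t in index and t not in excluded]
    if not usable:
        return [], excluded
    # Assumed selection rule: the term with the largest possible contribution.
    t = max(usable, key=lambda term: x_vec[term] * max(w for _, w in index[term]))
    # Y may not contain any previously used (excluded) term.
    candidates = [doc_id for doc_id, _ in index[t]
                  if excluded.isdisjoint(column_vecs[doc_id])]
    return candidates, excluded | {t}

# Hypothetical listing titles (already weighted) and a bound review title.
column = [
    {"hitchhiker": 0.8, "galaxy": 0.6},
    {"galaxy": 0.7, "quest": 0.7},
    {"matrix": 1.0},
]
x = {"hitchhiker": 0.7, "galaxy": 0.5, "2005": 0.1}
index = build_index(column)

step1 = constrain(x, index, column, frozenset())
step2 = constrain(x, index, column, step1[1])
step3 = constrain(x, index, column, step2[1])
print(step1, step2, step3)
```

Because each later step skips documents containing an already-used term, no document is generated twice, which keeps the search space a tree.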
Inference in WHIRL
• Adding to exclusions E means that upper bound h decreases
• Looking at maxweight means that partial assignments that can’t match well are penalized
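One way to read these two bullets is as an optimistic score for a constraint X~Y where only X is bound: each non-excluded term of X contributes its weight times the largest weight that term has anywhere in Y's column. The Python sketch below assumes that form; the function name upper_bound, the maxweight table, and the numbers are hypothetical and only meant to show the direction of the effect.

```python
def upper_bound(x_vec, maxweight, excluded):
    """Optimistic similarity for X~Y when X is bound and Y is not.

    maxweight[t] is the largest weight of term t in Y's column; terms
    missing from the column contribute nothing. Excluded terms are
    skipped, so growing the exclusion list can only lower the bound.
    """
    return sum(w * maxweight.get(t, 0.0)
               for t, w in x_vec.items() if t not in excluded)

x = {"hitchhiker": 0.7, "galaxy": 0.5, "2005": 0.1}
maxweight = {"hitchhiker": 0.8, "galaxy": 0.7}  # assumed column statistics

before = upper_bound(x, maxweight, excluded=set())
after = upper_bound(x, maxweight, excluded={"hitchhiker"})
print(before, after)
```

A partial assignment whose bound is already low sinks in the OPEN list, so the search tends to expand it only after more promising states are exhausted.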
Aside: Why WHIRL’s query language was cool
• Combination of cascading queries and getting top-k answers is very useful
  • Highly selective queries:
    • system can apply lots of constraints
    • user can pick from a small set of well-constrained potential answers
  • Very broad queries:
    • system can cherry-pick and get the easiest/most obvious answers
    • most of what the user sees is correct
• Similar to joint inference schemes
• Can handle lots of problems
  • classification,
Epilogue
• A few followup query systems to WHIRL
  • ELIXIR, iSPARQL, …
• “Joint inference” trick mostly ignored
  • and/or rediscovered over and over
• Lots and lots of work on similarity/distance metrics and efficient similarity joins
  • much of which rediscovers A*-like tricks
Outline
• Why joins are important
• Why similarity joins are important
• Useful similarity metrics for sets and strings
• Fast methods for K-NN and similarity joins
  • Blocking
  • Indexing
  • Short-cut algorithms
  • Parallel implementation