SPARK: Top-k Keyword Query in Relational Database
Wei Wang, University of New South Wales, Australia
Outline • Demo & Introduction • Ranking • Query Evaluation • Conclusions
SPARK I • Searching, Probing & Ranking Top-k Results • Thesis project (2004 – 2005) with Nino Svonja • Taste of Research Summer Scholarship (2005) • Finally, a CISRA prize winner • http://www.computing.unsw.edu.au/softwareengineering.php
SPARK II • Continued as a research project with PhD student Yi Luo • 2005 – 2006 • SIGMOD 2007 paper • Still under active development
A Motivating Example … • Top-3 results in our system
Improving the Effectiveness • Three factors contribute to the final score of a search result (a joined tuple tree): • the (modified) IR ranking score • the completeness factor • the size normalization factor
Preliminaries • Data Model • Relation-based • Query Model • Joined tuple trees (JTTs) • Sophisticated ranking • address one flaw in previous approaches • unify AND and OR semantics • alternative size normalization
Virtual Document • Combine tf contributions before tf normalization / attenuation.
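The deck does not spell out score_a here, but a pivoted-normalization style sketch consistent with the A/B bounds on the later "Upper Bounding Function" slide (s is the normalization slope, dl_T the virtual document length, avdl the average length for its CN's collection):

$$\mathrm{score}_a(T, Q) = \sum_{w \in T \cap Q} \frac{1 + \ln\bigl(1 + \ln \mathrm{tf}_w(T)\bigr)}{(1 - s) + s \cdot \frac{dl_T}{avdl}} \cdot \mathrm{idf}_w, \qquad \mathrm{tf}_w(T) = \sum_{t \in T} \mathrm{tf}_w(t)$$

That is, term frequencies from all tuples of the JTT are summed first, and only the combined tf is attenuated and length-normalized.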
Virtual Document Collection • Collection: 3 results • idf_netvista = ln(4/3) • idf_maxtor = ln(4/2) • Estimate idf: idf_netvista = …, idf_maxtor = … • Estimate avdl = avdl_C + avdl_P
Completeness Factor • For "short queries", users prefer results matching more keywords • Derive the completeness factor from the extended Boolean model • Measure the L_p distance to the ideal position • [Figure: candidates plotted in the (netvista, maxtor) plane against the ideal position (1,1); example L2 distances: d = 0.5 for (c2, p2), d = 1.41 for (c1, p1), d = 1 for another candidate]
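A sketch of the p-norm (extended Boolean) completeness factor alluded to here, assuming x_w is the normalized match score of keyword w and m = |Q|:

$$\mathrm{score}_b = 1 - \left( \frac{\sum_{w \in Q} (1 - x_w)^p}{m} \right)^{1/p}$$

With p = 2 this equals 1 − d/√m for the figure's two-keyword example; larger p moves toward strict AND semantics, smaller p toward OR.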
Size Normalization • Results from large CNs tend to have more matches to the keywords, so raw scores favor them • score_c = (1 + s₁ − s₁·|CN|) · (1 + s₂ − s₂·|CN_nf|), where |CN| is the number of tuple sets in the CN and |CN_nf| the number of non-free (keyword-matching) tuple sets • Empirically, s₁ = 0.15 and s₂ = 1/(|Q| + 1) work well
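As a worked check (not on the slide): for a two-keyword query, s₂ = 1/3; a CN with |CN| = 3 and |CN_nf| = 2 gets score_c = (1 + 0.15 − 0.45) · (1 + 1/3 − 2/3) = 0.70 · 2/3 ≈ 0.47, whereas a single matching tuple (|CN| = |CN_nf| = 1) keeps score_c = 1.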
Putting 'em Together • score(JTT) = score_a · score_b · score_c • score_a: IR score of the virtual document • score_b: completeness factor • score_c: size normalization factor
Comparing Top-1 Results • DBLP; Query = “nikos clique”
#Rel and R-Rank Results • DBLP; 18 queries; Union of top-20 results • Mondial; 35 queries; Union of top-20 results
Query Processing … 3 Steps • 1. Generate candidate tuples in every relation in the schema (using full-text indexes) • 2. Enumerate all possible Candidate Networks (CNs) • 3. Execute the CNs • Most algorithms differ in step 3; the key is how to optimize for top-k retrieval (see the pipeline sketch below)
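A minimal sketch of the pipeline, with the heavy lifting (probe, enumerate_cns, execute_cn) passed in as caller-supplied callables, since their details are exactly what the following slides vary:

```python
def keyword_search(relations, probe, enumerate_cns, execute_cn, k=10):
    """Sketch of the 3-step pipeline; probe/enumerate_cns/execute_cn are
    caller-supplied stand-ins for the components the next slides detail."""
    # Step 1: candidate tuples per relation, found via full-text indexes.
    tuple_sets = {rel: probe(rel) for rel in relations}
    # Step 2: enumerate candidate networks (join trees over the schema
    # connecting non-empty tuple sets, up to a size limit).
    cns = enumerate_cns(tuple_sets)
    # Step 3: execute each CN against a shared top-k buffer so a CN can
    # stop early once its score upper bound falls below the k-th score.
    topk = []  # filled by execute_cn with (score, result) pairs
    for cn in cns:
        execute_cn(cn, topk, k)
    return sorted(topk, reverse=True)[:k]
```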
Monotonic Scoring Function • Execute a CN; assume idf_netvista > idf_maxtor and k = 1 • CN: P^Q ⋈ C^Q • [Figure: tuple lists P = {P1, P2, …} and C = {C1, C2, …} sorted by decreasing score; with a monotonic scoring function, as in DISCOVER2, the frontier of the sorted lists bounds every unseen combination, enabling early stopping]
Non-Monotonic Scoring Function • Execute a CN; assume idf_netvista > idf_maxtor and k = 1 • CN: P^Q ⋈ C^Q • [Figure: under SPARK's non-monotonic scoring function, the sorted order of P and C no longer bounds unseen combinations, so the stopping test becomes "?"] • SPARK must re-establish the early stopping criterion and check candidates in an optimal order
Upper Bounding Function • Idea: use a monotonic and tight upper bounding function for SPARK's non-monotonic scoring function • Details: • sumidf = Σ_w idf_w • watf(t) = (1/sumidf) · Σ_w tf_w(t) · idf_w • A = sumidf · (1 + ln(1 + ln(Σ_t watf(t)))) • B = sumidf · Σ_t watf(t) • Then score_a ≤ uscore_a = (1/(1−s)) · min(A, B), which is monotonic w.r.t. watf(t) • score_b and score_c are constants given the CN, so score ≤ uscore
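A runnable sketch of the bound (the slope s = 0.2 is an assumed default, not stated on the slide):

```python
import math

def uscore_a(watf_values, sumidf, s=0.2):
    """Monotonic upper bound on score_a.

    watf_values: watf(t) for each tuple t feeding the CN
    sumidf:      sum of idf_w over the query keywords
    s:           pivoted-normalization slope (0.2 is an assumption)
    """
    total = sum(watf_values)              # sum_t watf(t); grows monotonically
    B = sumidf * total                    # linear bound, tighter for small totals
    if total > math.exp(-1):              # ensures 1 + ln(total) > 0, so A is defined
        A = sumidf * (1 + math.log(1 + math.log(total)))
        return min(A, B) / (1 - s)
    return B / (1 - s)                    # for tiny totals, B is the tighter bound
```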
Early Stopping Criterion • Execute a CN; assume idf_netvista > idf_maxtor and k = 1 • CN: P^Q ⋈ C^Q • [Figure: candidates are examined in decreasing uscore order; once score(current top-1) ≥ uscore(next candidate), execution stops] • SPARK thereby re-establishes the early stopping criterion; the remaining question is checking candidates in an optimal order
Query Processing … • Execute the CNs [VLDB 03] • {P1, P2, …} and {C1, C2, …} have been sorted by their IR relevance scores • Score(Pi ⋈ Cj) = Score(Pi) + Score(Cj) • CN: P^Q ⋈ C^Q • Operations: [P1, P1] ⋈ [C1, C1] → C.get_next(): [P1, P1] ⋈ C2 → P.get_next(): P2 ⋈ [C1, C2] → P.get_next(): P3 ⋈ [C1, C2] → … • Each operation sends a parametric SQL query to the DBMS
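For illustration only, such a parametric probe might look like the snippet below; the P/C tables and the C.p_tid foreign key are hypothetical, not the actual benchmark schema:

```python
# Hypothetical probe: join one newly fetched P tuple against all C tuples
# seen so far (the [C1, Cj] prefix), returning joinable pairs.
PROBE_SQL = """
    SELECT P.tid, C.tid
    FROM   P JOIN C ON C.p_tid = P.tid   -- assumed foreign key C.p_tid
    WHERE  P.tid = ?                     -- the tuple just returned by get_next()
      AND  C.tid IN (/* ids of C tuples seen so far */)
"""
```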
Skyline Sweeping Algorithm • Execute the CNs • Dominance: uscore(⟨Pi, Cj⟩) > uscore(⟨Pi+1, Cj⟩) and uscore(⟨Pi, Cj⟩) > uscore(⟨Pi, Cj+1⟩) • Keep candidates in a priority queue ordered by uscore; after checking ⟨Pi, Cj⟩, push only its two dominated neighbours • Priority queue evolution: ⟨P1, C1⟩ → ⟨P2, C1⟩, ⟨P1, C2⟩ → ⟨P3, C1⟩, ⟨P1, C2⟩, ⟨P2, C2⟩ → ⟨P1, C2⟩, ⟨P2, C2⟩, ⟨P4, C1⟩, ⟨P3, C2⟩ → … • Re-establishes the early stopping criterion and checks candidates in a (sort of) optimal order
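A minimal sketch of skyline sweeping for one two-relation CN, with uscore and real_score supplied by the caller (real_score returning None stands for a failed join; all names are illustrative):

```python
import heapq

def skyline_sweep(P, C, uscore, real_score, k):
    """Sweep candidate pairs of one CN P x C in decreasing uscore order.

    P, C:       tuple lists sorted by decreasing IR score
    uscore:     monotonic upper bound for the pair (i, j)
    real_score: actual (non-monotonic) score, or None if Pi, Cj do not join
    """
    topk = []                                  # min-heap of (score, (i, j))
    heap = [(-uscore(0, 0), 0, 0)]             # start at <P1, C1>
    seen = {(0, 0)}
    while heap:
        neg_u, i, j = heapq.heappop(heap)
        if len(topk) == k and -neg_u <= topk[0][0]:
            break                              # early stop: no unseen pair can win
        s = real_score(P[i], C[j])             # probe (e.g., a parametric SQL query)
        if s is not None:
            if len(topk) < k:
                heapq.heappush(topk, (s, (i, j)))
            elif s > topk[0][0]:
                heapq.heapreplace(topk, (s, (i, j)))
        for ni, nj in ((i + 1, j), (i, j + 1)):  # the two dominated neighbours
            if ni < len(P) and nj < len(C) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-uscore(ni, nj), ni, nj))
    return sorted(topk, key=lambda x: -x[0])
```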
Block Pipeline Algorithm • There is an inherent deficiency in bounding a non-monotonic function with (a few) monotonic upper bounding functions • Many candidates with high uscores turn out to have much lower real scores • unnecessary (expensive) checking • cannot stop early • Idea: partition the space into blocks and derive a tighter upper bound for each block • be "unwilling" to check a candidate until we are quite sure about its "prospect" (bscore)
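A sketch of the bscore idea under the score_a formulation sketched earlier (function and variable names are illustrative; tf values are assumed ≥ 1 when a keyword occurs):

```python
import math

def score_a(tfs, idfs, dl, avdl, s=0.2):
    """Non-monotonic virtual-document score, per the earlier formula sketch."""
    norm = (1 - s) + s * dl / avdl
    return sum((1 + math.log(1 + math.log(tf))) * idf   # tf >= 1 when present
               for tf, idf in zip(tfs, idfs) if tf > 0) / norm

def bscore(block_max_tfs, block_min_dl, idfs, avdl):
    """Per-block bound: evaluate the real formula at the block's most
    favourable point (max per-keyword tfs, min length) instead of using
    one loose global monotonic envelope."""
    return score_a(block_max_tfs, idfs, block_min_dl, avdl)
```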
Block Pipeline Algorithm … • Execute a CN; assume idf_n > idf_m and k = 1 • CN: P^Q ⋈ C^Q • [Figure: P and C are split into blocks by keyword signature, (n:1, m:0) vs (n:0, m:1); bound values shown include 2.74, 2.41, 2.38, 2.63, 1.05, and execution stops as soon as no remaining block can beat the current top-1] • Block Pipeline re-establishes the early stopping criterion and checks candidates in an optimal order
Efficiency • DBLP: ~0.9M tuples in total • k = 10 • PC: 1.8 GHz, 512 MB RAM
Efficiency … • DBLP, DQ13
Conclusions • A system that performs effective and efficient keyword search on relational databases • Meaningful query results with appropriate rankings • Second-level response times on a ~10M-tuple database (IMDB data) on a commodity PC
Q&A Thank you.