SPARK: Top-k Keyword Query in Relational Database
Wei Wang, University of New South Wales, Australia
Outline • Demo & Introduction • Ranking • Query Evaluation • Conclusions
SPARK I • Searching, Probing & Ranking Top-k Results • Thesis project (2004 – 2005) with Nino Svonja • Taste of Research Summer Scholarship (2005) • Finally, a CISRA prize winner • http://www.computing.unsw.edu.au/softwareengineering.php
SPARK II • Continued as a research project with PhD student Yi Luo • 2005 – 2006 • SIGMOD 2007 paper • Still under active development
A Motivating Example … • Top-3 results in our system
Improving the Effectiveness • Three factors contribute to the final score of a search result (a joined tuple tree): • the (modified) IR ranking score • the completeness factor • the size normalization factor
Preliminaries • Data Model • Relation-based • Query Model • Joined tuple trees (JTTs) • Sophisticated ranking • address one flaw in previous approaches • unify AND and OR semantics • alternative size normalization
Virtual Document • Combine tf contributions before tf normalization / attenuation.
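The deck does not spell out score_a here, but a pivoted-normalization style sketch consistent with the A/B bounds on the later "Upper Bounding Function" slide (s is the normalization slope, dl_T the virtual document length, avdl the average length for its CN's collection):

$$\mathrm{score}_a(T, Q) = \sum_{w \in T \cap Q} \frac{1 + \ln\bigl(1 + \ln \mathrm{tf}_w(T)\bigr)}{(1 - s) + s \cdot \frac{dl_T}{avdl}} \cdot \mathrm{idf}_w, \qquad \mathrm{tf}_w(T) = \sum_{t \in T} \mathrm{tf}_w(t)$$

That is, term frequencies from all tuples of the JTT are summed first, and only the combined tf is attenuated and length-normalized.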
Virtual Document Collection • Collection: 3 results • idf_netvista = ln(4/3) • idf_maxtor = ln(4/2) • Estimate idf: idf_netvista = …, idf_maxtor = … • Estimate avdl = avdl_C + avdl_P
Completeness Factor • For "short queries", users prefer results matching more keywords • Derive the completeness factor from the extended Boolean model • Measure the L_p distance to the ideal position • [Figure: candidates plotted in the (netvista, maxtor) plane against the ideal position (1,1); example L2 distances: d = 0.5 for (c2, p2), d = 1.41 for (c1, p1), d = 1 for another candidate]
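A sketch of the p-norm (extended Boolean) completeness factor alluded to here, assuming x_w is the normalized match score of keyword w and m = |Q|:

$$\mathrm{score}_b = 1 - \left( \frac{\sum_{w \in Q} (1 - x_w)^p}{m} \right)^{1/p}$$

With p = 2 this equals 1 − d/√m for the figure's two-keyword example; larger p moves toward strict AND semantics, smaller p toward OR.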
Size Normalization • Results from large CNs tend to have more matches to the keywords, so raw scores favor them • score_c = (1 + s₁ − s₁·|CN|) · (1 + s₂ − s₂·|CN_nf|), where |CN| is the number of tuple sets in the CN and |CN_nf| the number of non-free (keyword-matching) tuple sets • Empirically, s₁ = 0.15 and s₂ = 1/(|Q| + 1) work well
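As a worked check (not on the slide): for a two-keyword query, s₂ = 1/3; a CN with |CN| = 3 and |CN_nf| = 2 gets score_c = (1 + 0.15 − 0.45) · (1 + 1/3 − 2/3) = 0.70 · 2/3 ≈ 0.47, whereas a single matching tuple (|CN| = |CN_nf| = 1) keeps score_c = 1.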
Putting 'em Together • score(JTT) = score_a · score_b · score_c • score_a: IR score of the virtual document • score_b: completeness factor • score_c: size normalization factor
Comparing Top-1 Results • DBLP; Query = “nikos clique”
#Rel and R-Rank Results • DBLP; 18 queries; Union of top-20 results • Mondial; 35 queries; Union of top-20 results
Query Processing … 3 Steps • 1. Generate candidate tuples in every relation in the schema (using full-text indexes) • 2. Enumerate all possible Candidate Networks (CNs) • 3. Execute the CNs • Most algorithms differ in step 3; the key is how to optimize for top-k retrieval (see the pipeline sketch below)
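A minimal sketch of the pipeline, with the heavy lifting (probe, enumerate_cns, execute_cn) passed in as caller-supplied callables, since their details are exactly what the following slides vary:

```python
def keyword_search(relations, probe, enumerate_cns, execute_cn, k=10):
    """Sketch of the 3-step pipeline; probe/enumerate_cns/execute_cn are
    caller-supplied stand-ins for the components the next slides detail."""
    # Step 1: candidate tuples per relation, found via full-text indexes.
    tuple_sets = {rel: probe(rel) for rel in relations}
    # Step 2: enumerate candidate networks (join trees over the schema
    # connecting non-empty tuple sets, up to a size limit).
    cns = enumerate_cns(tuple_sets)
    # Step 3: execute each CN against a shared top-k buffer so a CN can
    # stop early once its score upper bound falls below the k-th score.
    topk = []  # filled by execute_cn with (score, result) pairs
    for cn in cns:
        execute_cn(cn, topk, k)
    return sorted(topk, reverse=True)[:k]
```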
Monotonic Scoring Function • Execute a CN; assume idf_netvista > idf_maxtor and k = 1 • CN: P^Q ⋈ C^Q • [Figure: tuple lists P = {P1, P2, …} and C = {C1, C2, …} sorted by decreasing score; with a monotonic scoring function, as in DISCOVER2, the frontier of the sorted lists bounds every unseen combination, enabling early stopping]
Non-Monotonic Scoring Function • Execute a CN; assume idf_netvista > idf_maxtor and k = 1 • CN: P^Q ⋈ C^Q • [Figure: under SPARK's non-monotonic scoring function, the sorted order of P and C no longer bounds unseen combinations, so the stopping test becomes "?"] • SPARK must re-establish the early stopping criterion and check candidates in an optimal order
Upper Bounding Function • Idea: use a monotonic and tight upper bounding function for SPARK's non-monotonic scoring function • Details: • sumidf = Σ_w idf_w • watf(t) = (1/sumidf) · Σ_w tf_w(t) · idf_w • A = sumidf · (1 + ln(1 + ln(Σ_t watf(t)))) • B = sumidf · Σ_t watf(t) • Then score_a ≤ uscore_a = (1/(1−s)) · min(A, B), which is monotonic w.r.t. watf(t) • score_b and score_c are constants given the CN, so score ≤ uscore
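A runnable sketch of the bound (the slope s = 0.2 is an assumed default, not stated on the slide):

```python
import math

def uscore_a(watf_values, sumidf, s=0.2):
    """Monotonic upper bound on score_a.

    watf_values: watf(t) for each tuple t feeding the CN
    sumidf:      sum of idf_w over the query keywords
    s:           pivoted-normalization slope (0.2 is an assumption)
    """
    total = sum(watf_values)              # sum_t watf(t); grows monotonically
    B = sumidf * total                    # linear bound, tighter for small totals
    if total > math.exp(-1):              # ensures 1 + ln(total) > 0, so A is defined
        A = sumidf * (1 + math.log(1 + math.log(total)))
        return min(A, B) / (1 - s)
    return B / (1 - s)                    # for tiny totals, B is the tighter bound
```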
Early Stopping Criterion • Execute a CN; assume idf_netvista > idf_maxtor and k = 1 • CN: P^Q ⋈ C^Q • [Figure: candidates are examined in decreasing uscore order; once score(current top-1) ≥ uscore(next candidate), execution stops] • SPARK thereby re-establishes the early stopping criterion; the remaining question is checking candidates in an optimal order
Query Processing … • Execute the CNs [VLDB 03] • {P1, P2, …} and {C1, C2, …} have been sorted by their IR relevance scores • Score(Pi ⋈ Cj) = Score(Pi) + Score(Cj) • CN: P^Q ⋈ C^Q • Operations: [P1, P1] ⋈ [C1, C1] → C.get_next(): [P1, P1] ⋈ C2 → P.get_next(): P2 ⋈ [C1, C2] → P.get_next(): P3 ⋈ [C1, C2] → … • Each operation sends a parametric SQL query to the DBMS
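For illustration only, such a parametric probe might look like the snippet below; the P/C tables and the C.p_tid foreign key are hypothetical, not the actual benchmark schema:

```python
# Hypothetical probe: join one newly fetched P tuple against all C tuples
# seen so far (the [C1, Cj] prefix), returning joinable pairs.
PROBE_SQL = """
    SELECT P.tid, C.tid
    FROM   P JOIN C ON C.p_tid = P.tid   -- assumed foreign key C.p_tid
    WHERE  P.tid = ?                     -- the tuple just returned by get_next()
      AND  C.tid IN (/* ids of C tuples seen so far */)
"""
```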
Skyline Sweeping Algorithm • Execute the CNs • Dominance: uscore(⟨Pi, Cj⟩) > uscore(⟨Pi+1, Cj⟩) and uscore(⟨Pi, Cj⟩) > uscore(⟨Pi, Cj+1⟩) • Keep candidates in a priority queue ordered by uscore; after checking ⟨Pi, Cj⟩, push only its two dominated neighbours • Priority queue evolution: ⟨P1, C1⟩ → ⟨P2, C1⟩, ⟨P1, C2⟩ → ⟨P3, C1⟩, ⟨P1, C2⟩, ⟨P2, C2⟩ → ⟨P1, C2⟩, ⟨P2, C2⟩, ⟨P4, C1⟩, ⟨P3, C2⟩ → … • Re-establishes the early stopping criterion and checks candidates in a (sort of) optimal order
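A minimal sketch of skyline sweeping for one two-relation CN, with uscore and real_score supplied by the caller (real_score returning None stands for a failed join; all names are illustrative):

```python
import heapq

def skyline_sweep(P, C, uscore, real_score, k):
    """Sweep candidate pairs of one CN P x C in decreasing uscore order.

    P, C:       tuple lists sorted by decreasing IR score
    uscore:     monotonic upper bound for the pair (i, j)
    real_score: actual (non-monotonic) score, or None if Pi, Cj do not join
    """
    topk = []                                  # min-heap of (score, (i, j))
    heap = [(-uscore(0, 0), 0, 0)]             # start at <P1, C1>
    seen = {(0, 0)}
    while heap:
        neg_u, i, j = heapq.heappop(heap)
        if len(topk) == k and -neg_u <= topk[0][0]:
            break                              # early stop: no unseen pair can win
        s = real_score(P[i], C[j])             # probe (e.g., a parametric SQL query)
        if s is not None:
            if len(topk) < k:
                heapq.heappush(topk, (s, (i, j)))
            elif s > topk[0][0]:
                heapq.heapreplace(topk, (s, (i, j)))
        for ni, nj in ((i + 1, j), (i, j + 1)):  # the two dominated neighbours
            if ni < len(P) and nj < len(C) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-uscore(ni, nj), ni, nj))
    return sorted(topk, key=lambda x: -x[0])
```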
Block Pipeline Algorithm • There is an inherent deficiency in bounding a non-monotonic function with (a few) monotonic upper bounding functions • Many candidates with high uscores turn out to have much lower real scores • unnecessary (expensive) checking • cannot stop early • Idea: partition the space into blocks and derive a tighter upper bound for each block • be "unwilling" to check a candidate until we are quite sure about its "prospect" (bscore)
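A sketch of the bscore idea under the score_a formulation sketched earlier (function and variable names are illustrative; tf values are assumed ≥ 1 when a keyword occurs):

```python
import math

def score_a(tfs, idfs, dl, avdl, s=0.2):
    """Non-monotonic virtual-document score, per the earlier formula sketch."""
    norm = (1 - s) + s * dl / avdl
    return sum((1 + math.log(1 + math.log(tf))) * idf   # tf >= 1 when present
               for tf, idf in zip(tfs, idfs) if tf > 0) / norm

def bscore(block_max_tfs, block_min_dl, idfs, avdl):
    """Per-block bound: evaluate the real formula at the block's most
    favourable point (max per-keyword tfs, min length) instead of using
    one loose global monotonic envelope."""
    return score_a(block_max_tfs, idfs, block_min_dl, avdl)
```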
Block Pipeline Algorithm … • Execute a CN; assume idf_n > idf_m and k = 1 • CN: P^Q ⋈ C^Q • [Figure: P and C are split into blocks by keyword signature, (n:1, m:0) vs (n:0, m:1); bound values shown include 2.74, 2.41, 2.38, 2.63, 1.05, and execution stops as soon as no remaining block can beat the current top-1] • Block Pipeline re-establishes the early stopping criterion and checks candidates in an optimal order
Efficiency • DBLP: ~0.9M tuples in total • k = 10 • PC: 1.8 GHz, 512 MB RAM
Efficiency … • DBLP, DQ13
Conclusions • A system that performs effective and efficient keyword search on relational databases • Meaningful query results with appropriate rankings • Second-level response times on a ~10M-tuple database (IMDB data) on a commodity PC
Q&A Thank you.