Efficient IR-Style Keyword Search over Relational Databases

Efficient IR-Style Keyword Searchover Relational Databases • Vagelis Hristidis University of California, San Diego • Luis Gravano Columbia University • Yannis Papakonstantinou University of California, San Diego

Motivation • Keyword search is the dominant information discovery method in documents • Increasing amount of data is stored in databases • Plain text coexists with structured data

Motivation • Up until recently, information discovery in databases required: • Knowledge of schema • Knowledge of a query language (e.g., SQL) • Knowledge of the role of the keywords • Goal: Enable IR-style keyword search over DBMSs without the above requirements

IR-Style Search over DBMSs • IR keyword search well developed for document search • Modern DBMSs offer IR-style keyword search over individual text attributes • What is equivalent to document in databases?

Example – Complaints Database Schema

Example - Complaints Database Data Complaints Customers Products

Example – Keyword Query [Maxtor Netvista] Complaints Customers Products

Keyword Query Semantics (definition of “document” in databases) • Keywords are: • in same tuple • in same relation • in tuples connected through primary-foreign key relationships • Score of result: • distance of keywords within a tuple • distance between keywords in terms of primary-foreign key connections • IR-style score of result tree

Example – Keyword Query [Maxtor Netvista] Complaints Customers Products Results: (1) c3, (2) p2c3, (3) p1c1

Result of Keyword Query • Result is tree T of tuples where: • each edge corresponds to a primary-foreign key relationship • no tuple of T is redundant (minimality) • - “AND” query semantics: Every query keyword appears in T • - “OR” query semantics: Some query keywords might be missing from T

Score of Result T • Combining function Score combines scores of attribute values of T • One reasonable choice: Score=aTScore(a)/size(T) • Attribute value scores Score(a)calculated using the DBMS's IR “datablades”

Shortcomings of Prior Work • Simplistic ranking methods (e.g., based only on size of connecting tree), ignoring well-studied IR ranking strategies • No straightforward extension to improve efficiency by returning just top-k results • Not good in handling free-text attributes [DBXplorer,DISCOVER]

Example – Keyword Query [Maxtor Netvista] Complaints Score(c3) = 4/3 Score(p1 c1) = (1+1/3)/2 = 4/6 Customers Products Score(p2 c3) = (1+4/3)/2 = 7/6 Results: (1) c3, (2) p2c3, (3) p1c1

Architecture ComplaintsQ = [(c3,comments,1.33), (c1,comments,0.33), (c2,comments,0.33)] ProductsQ = [(p1,manufacturer,1), (p2,model,1)] [Maxtor Netvista] ComplaintsQ ProductsQ ComplaintsQ ProductsQ ComplaintsQ Customer{}ComplaintsQ ComplaintsQ Product{} ComplaintsQ ... SELECT * FROM ComplaintsQ c, ProductsQ p WHERE c.prodId = p.prodId AND c.prodId=? AND c.custId = ?; ... c3 p2  c3 p1  c2

Candidate Network Generator • Find all trees of tuple sets (free or non-free) that may produce a result, based on DISCOVER's CN generator [VLDB 2002] • Use single non-free tuple set for each relation • allows “OR” semantics • fewer CNs are generated • extra filtering step required for “AND” semantics

Candidate Network Generator Example For query [Maxtor Netvista], CNs: • ComplaintsQ • ProductsQ • ComplaintsQ  ProductsQ • ComplaintsQ  Customer{}ComplaintsQ • ComplaintsQ Product{} ComplaintsQ Non-CNs: • ComplaintsQ  Customer{}Complaints{} • Product Q Complaints{} ProductQ

Architecture ComplaintsQ = [(c3,comments,1.33), (c1,comments,0.33), (c2,comments,0.33)] ProductsQ = [(p1,manufacturer,1), (p2,model,1)] [Maxtor Netvista] ComplaintsQ ProductsQ ComplaintsQ ProductsQ ComplaintsQ Customer{}ComplaintsQ ComplaintsQ Product{} ComplaintsQ ... SELECT * FROM ComplaintsQ c, ProductsQ p WHERE c.prodId = p.prodId AND c.prodId=? AND c.custId = ?; ... c3 p2  c3 p1  c2

Execution Algorithms • Users usually want top-k results. • Hence, submitting to DBMS a SQL query for each CN (Naïve algorithm) is inefficient. • When queries produce at most very few results, Naïve algorithm is efficient, since it fully exploits DBMS. • Monotonic combining functions: if results T, T' have same schema and for every attribute Score(ai)≤Score(a'i) then Score(T)≤Score(T')

Sparse Algorithm: Example Execution p1 9 9 c2 7 7 c1  p1 (9+5)/2=7 (9+7)/2 = 8 • Best when query produces at most a few results

Single Pipelined Algorithm: Example Execution CN: ComplaintsQ ProductsQ Results queue Get next tuple from most promising non-free tuple set MPFS = Max[(5+9)/2, (7+6)/2]=7 Max[(1+9)/2, (7+6)/2]=6.5 Max[(1+9)/2, (7+1)/2]=5 p1→c1 7 p2→c2 6.5 Output: p1→c1 p2→c2

Global Pipelined Algorithm : Example Execution global MPFS=max(MPFSi) over all CNs Ci • Best when query produces many results.

Hybrid Algorithm • Estimate number of results. • For “OR”-semantics, use DBMS estimator • For “AND”-semantics, probabilistically adjust DBMS estimator. • If at most a few query results expected, then use Sparse Algorithm. • If many query results expected, then use Global Pipelined Algorithm.

Related Work • DBXplorer [ICDE 2002], DISCOVER [VLDB 2002] • Similar three-step architecture • Score = 1/size(T) • Only AND semantics • No straightforward extension for efficient top-k execution • BANKS [ICDE 2002], Goldman et al. [VLDB 1998] • Database viewed as graph • No use of schema • Florescu et al. [WWW 2000], XQuery Full-Text • Ilyas et al. [VLDB 2003], J* algorithm [VLDB 2001] • Top-k algorithms for join queries

Experiments – DBLP Dataset C: Conference Y: Year P: Paper A: Author DBLP contains few citation edges. Synthetic citation edges were added such that average # citations is 20. Final dataset is 56MB. Experiments run over state-of-the-art commercial RDBMS.

OR Semantics: Effect of Maximum Allowed CN Size Average execution time of 100 2-keyword top-10 queries

OR Semantics: Effect of Number of Objects Requested k Average execution time of 100 2-keyword queries with maximum candidate-network size of 6

OR Semantics: Effect of Number of Query Keywords Average execution time of 100 top-10 queries with maximum candidate-network size of 6

Conclusions • Extend IR-style ranking to databases. • Exploit text-search capabilities of modern DBMSs, to generate results of higher quality. • Support both “AND” and “OR” semantics. • Achieve substantial speedup over prior work via pipelined top-k query processing algorithms.

Questions?

Compare algorithms wrt Result size OR-semantics AND-semantics Max CN size = 6, top-10, 2 keywords, OR-semantics

Ranking Functions • Proposed algorithms support tuple monotone combining functions • That is, if results T, T' have same schema and for every attribute Score(ai)≤Score(a'i) then Score(T)≤Score(T')

Efficient IR-Style Keyword Search over Relational Databases