1 / 32

Efficient IR-Style Keyword Search over Relational Databases

Enable information retrieval-style keyword search over databases without the need for schema knowledge or query language expertise.

cestrella
Download Presentation

Efficient IR-Style Keyword Search over Relational Databases

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficient IR-Style Keyword Searchover Relational Databases • Vagelis Hristidis University of California, San Diego • Luis Gravano Columbia University • Yannis Papakonstantinou University of California, San Diego

  2. Motivation • Keyword search is the dominant information discovery method in documents • Increasing amount of data is stored in databases • Plain text coexists with structured data

  3. Motivation • Up until recently, information discovery in databases required: • Knowledge of schema • Knowledge of a query language (e.g., SQL) • Knowledge of the role of the keywords • Goal: Enable IR-style keyword search over DBMSs without the above requirements

  4. IR-Style Search over DBMSs • IR keyword search well developed for document search • Modern DBMSs offer IR-style keyword search over individual text attributes • What is equivalent to document in databases?

  5. Example – Complaints Database Schema

  6. Example - Complaints Database Data Complaints Customers Products

  7. Example – Keyword Query [Maxtor Netvista] Complaints Customers Products

  8. Keyword Query Semantics (definition of “document” in databases) • Keywords are: • in same tuple • in same relation • in tuples connected through primary-foreign key relationships • Score of result: • distance of keywords within a tuple • distance between keywords in terms of primary-foreign key connections • IR-style score of result tree

  9. Example – Keyword Query [Maxtor Netvista] Complaints Customers Products Results: (1) c3, (2) p2c3, (3) p1c1

  10. Result of Keyword Query • Result is tree T of tuples where: • each edge corresponds to a primary-foreign key relationship • no tuple of T is redundant (minimality) • - “AND” query semantics: Every query keyword appears in T • - “OR” query semantics: Some query keywords might be missing from T

  11. Score of Result T • Combining function Score combines scores of attribute values of T • One reasonable choice: Score=aTScore(a)/size(T) • Attribute value scores Score(a)calculated using the DBMS's IR “datablades”

  12. Shortcomings of Prior Work • Simplistic ranking methods (e.g., based only on size of connecting tree), ignoring well-studied IR ranking strategies • No straightforward extension to improve efficiency by returning just top-k results • Not good in handling free-text attributes [DBXplorer,DISCOVER]

  13. Example – Keyword Query [Maxtor Netvista] Complaints Score(c3) = 4/3 Score(p1 c1) = (1+1/3)/2 = 4/6 Customers Products Score(p2 c3) = (1+4/3)/2 = 7/6 Results: (1) c3, (2) p2c3, (3) p1c1

  14. Architecture ComplaintsQ = [(c3,comments,1.33), (c1,comments,0.33), (c2,comments,0.33)] ProductsQ = [(p1,manufacturer,1), (p2,model,1)] [Maxtor Netvista] ComplaintsQ ProductsQ ComplaintsQ ProductsQ ComplaintsQ Customer{}ComplaintsQ ComplaintsQ Product{} ComplaintsQ ... SELECT * FROM ComplaintsQ c, ProductsQ p WHERE c.prodId = p.prodId AND c.prodId=? AND c.custId = ?; ... c3 p2  c3 p1  c2

  15. Architecture ComplaintsQ = [(c3,comments,1.33), (c1,comments,0.33), (c2,comments,0.33)] ProductsQ = [(p1,manufacturer,1), (p2,model,1)] [Maxtor Netvista] ComplaintsQ ProductsQ ComplaintsQ ProductsQ ComplaintsQ Customer{}ComplaintsQ ComplaintsQ Product{} ComplaintsQ ... SELECT * FROM ComplaintsQ c, ProductsQ p WHERE c.prodId = p.prodId AND c.prodId=? AND c.custId = ?; ... c3 p2  c3 p1  c2

  16. Candidate Network Generator • Find all trees of tuple sets (free or non-free) that may produce a result, based on DISCOVER's CN generator [VLDB 2002] • Use single non-free tuple set for each relation • allows “OR” semantics • fewer CNs are generated • extra filtering step required for “AND” semantics

  17. Candidate Network Generator Example For query [Maxtor Netvista], CNs: • ComplaintsQ • ProductsQ • ComplaintsQ  ProductsQ • ComplaintsQ  Customer{}ComplaintsQ • ComplaintsQ Product{} ComplaintsQ Non-CNs: • ComplaintsQ  Customer{}Complaints{} • Product Q Complaints{} ProductQ

  18. Architecture ComplaintsQ = [(c3,comments,1.33), (c1,comments,0.33), (c2,comments,0.33)] ProductsQ = [(p1,manufacturer,1), (p2,model,1)] [Maxtor Netvista] ComplaintsQ ProductsQ ComplaintsQ ProductsQ ComplaintsQ Customer{}ComplaintsQ ComplaintsQ Product{} ComplaintsQ ... SELECT * FROM ComplaintsQ c, ProductsQ p WHERE c.prodId = p.prodId AND c.prodId=? AND c.custId = ?; ... c3 p2  c3 p1  c2

  19. Execution Algorithms • Users usually want top-k results. • Hence, submitting to DBMS a SQL query for each CN (Naïve algorithm) is inefficient. • When queries produce at most very few results, Naïve algorithm is efficient, since it fully exploits DBMS. • Monotonic combining functions: if results T, T' have same schema and for every attribute Score(ai)≤Score(a'i) then Score(T)≤Score(T')

  20. Sparse Algorithm: Example Execution p1 9 9 c2 7 7 c1  p1 (9+5)/2=7 (9+7)/2 = 8 • Best when query produces at most a few results

  21. Single Pipelined Algorithm: Example Execution CN: ComplaintsQ ProductsQ Results queue Get next tuple from most promising non-free tuple set MPFS = Max[(5+9)/2, (7+6)/2]=7 Max[(1+9)/2, (7+6)/2]=6.5 Max[(1+9)/2, (7+1)/2]=5 p1→c1 7 p2→c2 6.5 Output: p1→c1 p2→c2

  22. Global Pipelined Algorithm : Example Execution global MPFS=max(MPFSi) over all CNs Ci • Best when query produces many results.

  23. Hybrid Algorithm • Estimate number of results. • For “OR”-semantics, use DBMS estimator • For “AND”-semantics, probabilistically adjust DBMS estimator. • If at most a few query results expected, then use Sparse Algorithm. • If many query results expected, then use Global Pipelined Algorithm.

  24. Related Work • DBXplorer [ICDE 2002], DISCOVER [VLDB 2002] • Similar three-step architecture • Score = 1/size(T) • Only AND semantics • No straightforward extension for efficient top-k execution • BANKS [ICDE 2002], Goldman et al. [VLDB 1998] • Database viewed as graph • No use of schema • Florescu et al. [WWW 2000], XQuery Full-Text • Ilyas et al. [VLDB 2003], J* algorithm [VLDB 2001] • Top-k algorithms for join queries

  25. Experiments – DBLP Dataset C: Conference Y: Year P: Paper A: Author DBLP contains few citation edges. Synthetic citation edges were added such that average # citations is 20. Final dataset is 56MB. Experiments run over state-of-the-art commercial RDBMS.

  26. OR Semantics: Effect of Maximum Allowed CN Size Average execution time of 100 2-keyword top-10 queries

  27. OR Semantics: Effect of Number of Objects Requested k Average execution time of 100 2-keyword queries with maximum candidate-network size of 6

  28. OR Semantics: Effect of Number of Query Keywords Average execution time of 100 top-10 queries with maximum candidate-network size of 6

  29. Conclusions • Extend IR-style ranking to databases. • Exploit text-search capabilities of modern DBMSs, to generate results of higher quality. • Support both “AND” and “OR” semantics. • Achieve substantial speedup over prior work via pipelined top-k query processing algorithms.

  30. Questions?

  31. Compare algorithms wrt Result size OR-semantics AND-semantics Max CN size = 6, top-10, 2 keywords, OR-semantics

  32. Ranking Functions • Proposed algorithms support tuple monotone combining functions • That is, if results T, T' have same schema and for every attribute Score(ai)≤Score(a'i) then Score(T)≤Score(T')

More Related