Efficient Keyword Search over DBLife & DBLP Data

CS511 (In-progress) Project Presentation, Dec-09-2005 • Mayssam Sayyadian, Nhung Nguyen, Hieu Li

Introduction • DBLife: manages unstructured data • People are familiar with keyword search over unstructured data • … but DBLife → ER graph • Entities, mentions, etc.: structured data extracted • DBLP: well-known, available, enriched database of publications • DBLife does not cover all the data in DBLP

Assumptions • Data is in relational format, not XML • The DBMS provides text indexing at the column level • Oracle, SQL Server, DB2, MySQL, PostgreSQL • Support for XML data is a subject of future work

Basic Model • Database: modeled as a graph • Nodes = tuples • Edges = references between tuples • foreign keys, inclusion dependencies, … • Edges are directed • [Figure: example graph over the relations "paper", "writes", and "author"; paper nodes "eTuner: Tuning Schema …" and "iMAP: Discovering …"; author nodes Mayssam Sayyadian, AnHai Doan, Pedro Domingos]
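The graph model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the key values, the column names `author_id` and `paper_id`, the helper `build_graph`, and the choice of which author wrote which paper are all hypothetical, and edges are assumed to point from the referencing tuple to the referenced tuple (the slide only says that edges are directed).

```python
from collections import defaultdict

# Hypothetical toy tuples, keyed by (relation, primary key).
# Titles are kept elided exactly as they appear on the slide.
tuples = {
    ("paper", 1): {"title": "eTuner: Tuning Schema …"},
    ("paper", 2): {"title": "iMAP: Discovering …"},
    ("author", 1): {"name": "Mayssam Sayyadian"},
    ("author", 2): {"name": "AnHai Doan"},
    ("author", 3): {"name": "Pedro Domingos"},
    # Assumed authorship links, for illustration only.
    ("writes", 1): {"author_id": 1, "paper_id": 1},
    ("writes", 2): {"author_id": 2, "paper_id": 1},
    ("writes", 3): {"author_id": 2, "paper_id": 2},
    ("writes", 4): {"author_id": 3, "paper_id": 2},
}

# Assumed foreign keys: relation -> {column: referenced relation}.
foreign_keys = {"writes": {"author_id": "author", "paper_id": "paper"}}


def build_graph(tuples, foreign_keys):
    """Nodes are tuples; each foreign-key reference becomes a directed edge."""
    graph = defaultdict(list)
    for (relation, key), row in tuples.items():
        for column, target in foreign_keys.get(relation, {}).items():
            graph[(relation, key)].append((target, row[column]))
    return dict(graph)


graph = build_graph(tuples, foreign_keys)
```

With this orientation only the "writes" tuples have outgoing edges. A search procedure that roots a tree at a paper would also need the reverse direction, which is a separate design choice not shown here.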

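Answer Example • Query: Mayssam AnHai • [Figure: answer tree over the relations "paper", "writes", and "author"; paper "eTuner: Tuning Schema …" linked through two "writes" tuples to the author nodes Mayssam and AnHai Doan]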

Answer Model • Query: set of keywords {k1, k2, …, kn} • Each keyword ki matches a set of nodes Si • Answer: rooted, directed tree connecting nodes, with one node from each Si • Root node (we call it an information node) has special significance, may be restricted to some relations • E.g. relations representing entities, not relationships • Multiple answers ranked by a scoring function
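As a rough illustration of the answer model, the sketch below builds one answer tree for a fixed root by following directed edges until one node from each keyword set Si is reached. It is a minimal sketch, not the system's search algorithm: the node names, the helpers `bfs_parents`, `depth`, and `answer_tree`, and the toy adjacency list are hypothetical. The toy graph stores every reference in both directions, an assumption made so that the paper node can reach both authors.

```python
from collections import deque


def bfs_parents(root, graph):
    """Breadth-first search along directed edges; returns parent pointers."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent


def depth(node, parent):
    """Number of edges between the root and `node` in the BFS tree."""
    steps = 0
    while parent[node] is not None:
        node = parent[node]
        steps += 1
    return steps


def answer_tree(root, keyword_matches, graph):
    """Edges of a tree rooted at `root` that reaches one node per keyword set.

    Returns None when some keyword set is unreachable from `root`.
    """
    parent = bfs_parents(root, graph)
    edges = set()
    for matches in keyword_matches:
        reachable = [n for n in matches if n in parent]
        if not reachable:
            return None
        node = min(reachable, key=lambda n: depth(n, parent))
        while parent[node] is not None:  # walk back up to the root
            edges.add((parent[node], node))
            node = parent[node]
    return edges


# Hypothetical toy graph with references stored in both directions.
graph = {
    "paper:eTuner": ["writes:1", "writes:2"],
    "writes:1": ["paper:eTuner", "author:Mayssam Sayyadian"],
    "writes:2": ["paper:eTuner", "author:AnHai Doan"],
    "author:Mayssam Sayyadian": ["writes:1"],
    "author:AnHai Doan": ["writes:2"],
}
# S1 and S2: the nodes matched by the keywords "Mayssam" and "AnHai".
keyword_matches = [{"author:Mayssam Sayyadian"}, {"author:AnHai Doan"}]
tree = answer_tree("paper:eTuner", keyword_matches, graph)
```

Because all paths are taken from a single BFS tree, their union is itself a tree rooted at the chosen node. Trying every allowed root and ranking the resulting trees is left out of this sketch.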

Score of Result T • Combining function Score combines the scores of the attribute values of T • One reasonable choice: $\text{Score}(T) = \sum_{a \in T} \text{Score}(a) / \text{size}(T)$ • Attribute value scores Score(a) are calculated using the DBMS's IR index
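The combining function above is simple enough to state directly in code. In this minimal sketch the function name `tree_score` and the sample numbers are hypothetical, and size(T) is taken to be the number of tuples in T, which is an assumption since the slide does not define it. The attribute scores stand in for values that would come from the DBMS's IR index.

```python
def tree_score(attribute_scores, tree_size):
    """Sum of attribute-value scores of a result tree, divided by its size."""
    if tree_size <= 0:
        raise ValueError("tree_size must be positive")
    return sum(attribute_scores) / tree_size


# Two hypothetical trees with the same attribute matches but different sizes.
small_tree = tree_score([0.5, 0.25], tree_size=3)
large_tree = tree_score([0.5, 0.25], tree_size=6)
```

Dividing by the size means that, for equal attribute scores, a more tightly connected tree ranks above a larger one.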

Implementation • [Figure: EasyDB components. Boxes: Browser / Client, Web Server, JSPs, Servlets, Java Beans, DBLP, DBLife. Connection labels: Http, Java API, JDBC]

Searching over Multiple Databases: System Architecture • [Figure: architecture with two phases, "Preprocessing: Offline" and "Querying: Online". Components shown: User, query Q, Index Builder, IR Engine, DBLife IR Index, DBLP IR Index, Tuplesets, Foreign-Key Joins, Join Discovery (Schema Matching +), Top-k Generator, SQL Queries, Distributed SQL Query Processor, and the DBLP and DBLife databases]

Top-K Generator • Contributions: • Iterative Refinement Algorithm (IRA) • A unifying framework to search for the top-K best tuple trees • Cast previous algorithms into IRA • Improve them substantially

IRA Framework • Concepts: abstract state, concrete state, score interval • IRA algorithm: branch-and-bound search
1. Abstraction: create the initial abstract states
2. While fewer than k states have been output, iterate:
   (a) Evaluation: update the score intervals
   (b) Elimination: eliminate (prune) the space of states
   (c) Refinement: select an abstract state and refine it
   (d) If the goal state (the top-1 state) is found: output it and remove it

IRA - Example • [Figure: iterations 1 to 3] • Abstract states and score intervals: P [0.6, 1], Q [0.5, 0.7], R [0.4, 0.9] • Refinements of P: P1 [0.6, 0.8], P2 0.9, P3 0.7 • Refinements of R: R1 [0.4, 0.6], R2 0.85 • K = {P2, P3}, min score = 0.7 • Res = {P2, R2}, min score = 0.85
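The branch-and-bound loop of the IRA framework can be sketched on the states and score intervals from the IRA example slide. This is a simplified sketch under several assumptions, not the presented implementation: the `State` class and `ira_top_k` function are hypothetical, refinements are stored up front as `children` instead of being generated on demand, a state counts as concrete when its interval has collapsed to a single score, and the refinement rule (refine the abstract state with the largest upper bound) is one plausible choice that the slides do not specify.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class State:
    name: str
    lo: float  # lower bound of the score interval
    hi: float  # upper bound of the score interval
    children: List["State"] = field(default_factory=list)

    @property
    def is_concrete(self) -> bool:
        # Assumption: a concrete state has a single exact score.
        return self.lo == self.hi


def ira_top_k(initial_states, k):
    """Return up to k concrete states in descending score order."""
    frontier = list(initial_states)  # Abstraction: initial abstract states
    output = []
    while len(output) < k and frontier:
        # Evaluation: intervals are precomputed on each state in this sketch.
        # Elimination: drop states whose upper bound is below the
        # needed-th largest lower bound; they cannot reach the top-k.
        needed = k - len(output)
        lower_bounds = sorted((s.lo for s in frontier), reverse=True)
        if len(lower_bounds) >= needed:
            threshold = lower_bounds[needed - 1]
            frontier = [s for s in frontier if s.hi >= threshold]
        # Pick the state with the largest upper bound, preferring concrete ones.
        best = max(frontier, key=lambda s: (s.hi, s.is_concrete))
        frontier.remove(best)
        if best.is_concrete:
            output.append(best)  # goal state found: output it and remove it
        elif best.children:
            frontier.extend(best.children)  # Refinement
        else:
            raise ValueError(f"no refinement available for {best.name}")
    return output


# States and intervals taken from the example slide.
P = State("P", 0.6, 1.0, [State("P1", 0.6, 0.8),
                          State("P2", 0.9, 0.9),
                          State("P3", 0.7, 0.7)])
Q = State("Q", 0.5, 0.7)
R = State("R", 0.4, 0.9, [State("R1", 0.4, 0.6),
                          State("R2", 0.85, 0.85)])

result = ira_top_k([P, Q, R], k=2)
result_names = [s.name for s in result]
```

On this input the sketch refines P, outputs P2, refines R, and then outputs R2, which agrees with the final result set Res = {P2, R2} shown on the slide. Q, P1, P3, and R1 are pruned or left unexplored without being refined further.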

IRA Algorithms • Kite: straightforward adaptation of the state-of-the-art algorithm (hybrid) to IRA • aKite: adaptive Kite → able to change and adapt over time • daKite: adaptive Kite algorithm armed with more sophisticated refinement rules (read: more cost-effective search heuristics)

Preliminary Experiments • Currently: experiments over DBLP data

Future Work • Better UI & Browsing facilities • User feedback • Extend to handle XML data

References • V. Hristidis, L. Gravano, Y. Papakonstantinou, “Efficient IR-Style Keyword Search over Relational Databases” • S. Agrawal, S. Chaudhuri, G. Das, “DBXplorer: A System for Keyword Search over Relational Databases” • G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, “Keyword Searching and Browsing in Databases using BANKS”
