Towards Unifying Database Systems and Information Retrieval Systems

1. Towards Unifying Database Systems and Information Retrieval Systems. Jayavel Shanmugasundaram, Cornell University

2. 10000 foot view of Data Management

3. 10000 foot view of Data Management

4. Case Study: Internet Archive

5. Internet Archive Database

6. Main Issue: Traditional IR ranking methods would rank the two movies about the same. Example: TF-IDF. "Golden Gate" appears exactly once in both descriptions, and the text fields are about the same length; hence the same normalized TF-IDF score. Larger issue: traditional IR scoring methods were developed for stand-alone document collections.
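The TF-IDF argument above can be checked with a small sketch. The two descriptions, the extra corpus document, and the smoothed IDF variant are all assumptions made for illustration (the slides do not give the actual movie descriptions or the exact formula), and the phrase is treated as a single token `golden_gate` for simplicity.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Length-normalized term frequency times a smoothed IDF (one common variant)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf

# Hypothetical descriptions: same length, each mentioning the term exactly once.
popular = "a classic film shot near the golden_gate bridge".split()
obscure = "a home video shot near the golden_gate bridge".split()
corpus = [popular, obscure, "an unrelated clip about trains in winter".split()]

score_popular = tf_idf("golden_gate", popular, corpus)
score_obscure = tf_idf("golden_gate", obscure, corpus)

# Text statistics alone cannot separate the two results.
print(score_popular == score_obscure)
```

Because the score depends only on term counts and document length, nothing in it reflects structured values such as visits or downloads.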

7. Internet Archive Database

8. Structured Value Ranking (Guo et al., 2005): Use structured data values associated with text columns to score results. Main technical challenge: structured data values (and hence scores) change frequently and possibly dramatically! Examples: number of visits, downloads, award announcements; the "SlashDot effect"; bursts and rapidly changing popularity [Kleinberg]. Users still want to see results ordered by the latest score values. Current focus: design efficient inverted lists.

9. System Architecture

10. Index Operations: (1) Document score updates: handle frequent updates to scores. (2) Top-k keyword queries: conjunctive and disjunctive keyword queries; include IR-style (TF-IDF) scores; top-k query results. (3) Content updates, insertions and deletions: updates to document content; document insertions and deletions.

11. Naïve Approach 1: ID Method. Score updates: efficient (just update the score table). Top-k queries: inefficient (scan all of the inverted list).
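A minimal sketch of the ID Method's cost profile, assuming in-memory Python structures rather than the on-disk lists of the real system. The class name `IDMethodIndex` and its methods are hypothetical names chosen for this illustration.

```python
import bisect
import heapq

class IDMethodIndex:
    """Inverted lists ordered by document ID; scores live in a separate table."""

    def __init__(self):
        self.postings = {}  # term -> list of doc IDs, sorted by ID
        self.scores = {}    # doc ID -> current score

    def add_document(self, doc_id, terms, score):
        self.scores[doc_id] = score
        for term in set(terms):
            bisect.insort(self.postings.setdefault(term, []), doc_id)

    def update_score(self, doc_id, new_score):
        # Cheap: only the score table changes, no inverted list is touched.
        self.scores[doc_id] = new_score

    def top_k(self, term, k):
        # Expensive: every posting for the term must be examined.
        ids = self.postings.get(term, [])
        return heapq.nlargest(k, ids, key=lambda d: self.scores[d])
```

For example, after indexing three documents under one term, a score update changes the query result without modifying the posting list itself.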

12. Naïve Approach 2: Score Method. Top-k queries: efficient (read the top part of the inverted list). Score updates: inefficient (reorganize many lists).
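The opposite cost profile can be sketched the same way, again assuming in-memory structures. `ScoreMethodIndex` is a hypothetical name, and `update_score` returns the number of lists it had to reorganize purely to make the cost visible.

```python
import bisect

class ScoreMethodIndex:
    """Inverted lists ordered by descending score."""

    def __init__(self):
        self.postings = {}   # term -> sorted list of (-score, doc ID)
        self.doc_terms = {}  # doc ID -> set of terms it contains
        self.scores = {}     # doc ID -> current score

    def add_document(self, doc_id, terms, score):
        self.doc_terms[doc_id] = set(terms)
        self.scores[doc_id] = score
        for term in self.doc_terms[doc_id]:
            bisect.insort(self.postings.setdefault(term, []), (-score, doc_id))

    def update_score(self, doc_id, new_score):
        # Expensive: the posting moves in every list that contains the document.
        old_score = self.scores[doc_id]
        touched = 0
        for term in self.doc_terms[doc_id]:
            entries = self.postings[term]
            entries.remove((-old_score, doc_id))
            bisect.insort(entries, (-new_score, doc_id))
            touched += 1
        self.scores[doc_id] = new_score
        return touched

    def top_k(self, term, k):
        # Cheap: the first k postings are already the answer.
        return [doc_id for _, doc_id in self.postings.get(term, [])[:k]]
```

A document containing many distinct terms therefore makes every one of its score updates proportionally more expensive.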

13. Dilemma: We want inverted lists ordered by score, for top-k query performance (as in the Score Method). But we do not want to touch the inverted lists on every score update, for score update performance (as in the ID Method). How can we address this apparent dilemma?

14. Score-Threshold Method: Extends the Score Method in two key aspects. (1) Allow inverted list scores to be out of date by up to a threshold. This avoids having to update the inverted list frequently (better score update performance), but queries need to scan more of the inverted list (by up to the threshold) to correct for out-of-date scores (slightly reduced query performance). (2) Use a "short" inverted list for scores that exceed the threshold, which is more efficient than updating the large inverted list.
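The two aspects above can be illustrated with a simplified sketch. This is one possible reading of the slide, not necessarily the authors' exact algorithm: it assumes non-negative scores, a threshold of the form `ratio * score`, in-memory structures, and that only score increases beyond the threshold need special handling. All class, method, and attribute names are hypothetical.

```python
import bisect
import heapq

class ScoreThresholdIndex:
    """Long lists keep possibly stale scores; short lists hold the exceptions."""

    def __init__(self, ratio):
        self.ratio = ratio     # threshold(score) = ratio * score (assumed form)
        self.long = {}         # term -> sorted list of (-list_score, doc ID)
        self.short = {}        # term -> set of doc IDs that outgrew the threshold
        self.in_short = set()  # doc IDs currently held in short lists
        self.list_score = {}   # doc ID -> score recorded in the long lists
        self.current = {}      # doc ID -> latest score
        self.doc_terms = {}    # doc ID -> set of terms it contains

    def _bound(self, list_score):
        # Highest current score a document can have without being in a short list.
        return list_score + self.ratio * list_score

    def add_document(self, doc_id, terms, score):
        self.doc_terms[doc_id] = set(terms)
        self.list_score[doc_id] = self.current[doc_id] = score
        for term in self.doc_terms[doc_id]:
            bisect.insort(self.long.setdefault(term, []), (-score, doc_id))

    def update_score(self, doc_id, new_score):
        # The long lists are never touched; most updates only write the score table.
        self.current[doc_id] = new_score
        exceeds = new_score > self._bound(self.list_score[doc_id])
        if exceeds and doc_id not in self.in_short:
            self.in_short.add(doc_id)
            for term in self.doc_terms[doc_id]:
                self.short.setdefault(term, set()).add(doc_id)
        elif not exceeds and doc_id in self.in_short:
            self.in_short.discard(doc_id)
            for term in self.doc_terms[doc_id]:
                self.short[term].discard(doc_id)
        return exceeds

    def top_k(self, term, k):
        """Return (ranked doc IDs, number of long-list postings scanned)."""
        if k <= 0:
            return [], 0
        seen = set(self.short.get(term, ()))
        heap = [(self.current[d], d) for d in seen]  # short-list candidates first
        heapq.heapify(heap)
        while len(heap) > k:
            heapq.heappop(heap)
        scanned = 0
        for neg_score, doc_id in self.long.get(term, []):
            if len(heap) == k and self._bound(-neg_score) <= heap[0][0]:
                break  # no unseen document can beat the current k-th best
            scanned += 1
            if doc_id in seen:
                continue
            seen.add(doc_id)
            heapq.heappush(heap, (self.current[doc_id], doc_id))
            if len(heap) > k:
                heapq.heappop(heap)
        return [d for _, d in sorted(heap, reverse=True)], scanned
```

In this sketch a query stops as soon as the stale list score plus its threshold can no longer reach the current k-th best score, which is where the extra scanning "by up to a threshold" comes from.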

15. Score-Threshold Method





20. Query-Update Tradeoff: Choice of threshold function. If threshold(score) = 0, every update results in an update to the inverted list (similar to the Score Method). If threshold(score) = infinity, there are no inverted list updates, but queries scan all of the list (similar to the ID Method). The query-update tradeoff can be controlled using the threshold function threshold(score) = r * score, r >= 0, where r is the threshold ratio.
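The effect of the threshold ratio can be seen in a few lines. The helper names below are hypothetical, and the sketch assumes positive scores and considers only score increases as the trigger for inverted list work, which is a simplification of the slide's "every update".

```python
import math

def threshold(score, r):
    """threshold(score) = r * score, with r >= 0 the threshold ratio."""
    return r * score

def needs_list_update(list_score, new_score, r):
    """True when the new score exceeds the stale list score by more than the threshold."""
    return new_score > list_score + threshold(list_score, r)

def scan_bound(list_score, r):
    """Upper bound a query must assume for a posting with this stale list score."""
    return list_score + threshold(list_score, r)

# r = 0 behaves like the Score Method: any increase forces inverted list work.
print(needs_list_update(100, 101, 0))
# r = infinity behaves like the ID Method: no increase ever forces list work,
# but the scan bound is unbounded, so queries cannot stop early.
print(needs_list_update(100, 10**9, math.inf), scan_bound(100, math.inf))
# An intermediate ratio absorbs moderate changes and only reacts to large ones.
print(needs_list_update(100, 120, 0.25), needs_list_update(100, 130, 0.25))
```

A larger `r` shifts cost from updates to queries, and a smaller `r` does the reverse.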

21. Experimental Setup: Two primary performance metrics: time for a score update and time for a top-k query. Data sets: real (Internet Archive), 60MB, thanks to Brewster Kahle and Jon Aizen; synthetic, 805MB. Compared alternatives were implemented in C++ on top of BerkeleyDB. 2.7GHz 1GB processor.

22. Varying # Updates

23. 10000 foot view of Data Management

24. XML Keyword Search: Example applications: accident reports, Shakespeare's plays. XRank: keyword search over semi-structured XML documents. Extends keyword search to work over both structured and unstructured data. SIGMOD 2003 [Guo, Shao, Botev, Shanmugasundaram].

25. 10000 foot view of Data Management

26. Towards Unifying DB and IR: Example applications: content management, web querying. TeXQuery: query language for structured and unstructured data, structured and keyword queries. Precursor to W3C XQuery Full-Text. WWW 2004 [Amer-Yahia, Botev, Shanmugasundaram].

27. Related Work: Integrating DB and IR systems: for the most part, these treat individual systems as "black boxes"; our goal is to unify DB and IR systems. Search over semi-structured data: specialized techniques for searching semi-structured data; our goal is to generalize DB and IR techniques. Keyword search and ranking in databases: BANKS, DBXplorer, DISCOVER.

28. Summary: Many emerging applications require a unification of DB and IR techniques (e-commerce, content management, ...). This argues for a new generation of systems and techniques that seamlessly provide this capability (SVR, XRank, TeXQuery, ...). Educational benefit: present a unified view of data management; currently at the graduate level, eventually introducing concepts at the undergraduate level.
