Towards Unifying Database Systems and Information Retrieval Systems

10000 foot view of Data Management. Data: Structured, Unstructured. Queries: Complex and Structured, Ranked Keyword Search. Systems: Database Systems, Information Retrieval Systems.


Presentation Transcript


    1. Towards Unifying Database Systems and Information Retrieval Systems. Jayavel Shanmugasundaram, Cornell University

    2. 10000 foot view of Data Management

    3. 10000 foot view of Data Management

    4. Case Study: Internet Archive

    5. Internet Archive Database

    6. Main Issue
        - Traditional IR ranking methods would rank the two movies about the same
        - Example: TF-IDF
            - “Golden Gate” appears exactly once in both descriptions
            - Lengths of the text fields are about the same
            - Hence: same normalized TF-IDF score
        - Larger issue: traditional IR scoring methods were developed for stand-alone document collections
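The tie described on this slide can be reproduced with a minimal sketch. It assumes one common TF-IDF variant (term count divided by document length, times log(N/df)); the slides do not say which variant was meant. The three descriptions are invented placeholders, not Internet Archive data.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Length-normalized TF times IDF for one term (an assumed, common variant)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


# Hypothetical descriptions: equal length, each mentions "golden gate" once.
movie_a = "amateur film of the golden gate bridge".split()
movie_b = "classic film of the golden gate opening".split()
other = "a short film about city traffic today".split()
corpus = [movie_a, movie_b, other]


def score(doc_tokens):
    """Sum the TF-IDF contributions of the two query keywords."""
    return sum(tf_idf(t, doc_tokens, corpus) for t in ("golden", "gate"))


print(score(movie_a), score(movie_b))
```

Both movies get exactly the same score, so text statistics alone cannot separate them. Any preference between them has to come from data outside the text field.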

    7. Internet Archive Database

    8. Structured Value Ranking (Guo et al., 2005)
        - Use structured data values associated with text columns to score results
        - Main technical challenge: structured data values (and hence scores) change frequently and possibly dramatically!
            - Number of visits, downloads, award announcements
            - “SlashDot effect”
            - Bursts and rapidly changing popularity [Kleinberg]
        - Users still want to see results ordered by latest score values
        - Current focus: design efficient inverted lists

    9. System Architecture

    10. Index Operations
        - Document score updates
            - Handle frequent updates to scores
        - Top-k keyword queries
            - Conjunctive and disjunctive keyword queries
            - Include IR-style (TF-IDF) scores
            - Top-k query results
        - Content updates, insertions and deletions
            - Update to document content
            - Document insertions and deletions

    11. Naïve Approach 1: ID Method
        - Score updates: efficient (just update score table)
        - Top-k queries: inefficient (scan all of inverted list)
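A minimal in-memory sketch of the ID Method's tradeoff follows. The class and method names are hypothetical, and Python lists and dicts stand in for the on-disk structures of the real system. Inverted lists are ordered by document ID and scores live in a separate table.

```python
import bisect
import heapq


class IdMethodIndex:
    """Inverted lists ordered by document ID; scores kept in a separate table."""

    def __init__(self):
        self.postings = {}   # term -> sorted list of doc IDs
        self.scores = {}     # doc ID -> current score (the score table)

    def add_document(self, doc_id, terms, score):
        self.scores[doc_id] = score
        for term in set(terms):
            bisect.insort(self.postings.setdefault(term, []), doc_id)

    def update_score(self, doc_id, new_score):
        # Only the score table changes; no inverted list is touched.
        self.scores[doc_id] = new_score

    def top_k(self, term, k):
        # The list is not in score order, so every posting must be examined.
        docs = self.postings.get(term, [])
        return heapq.nlargest(k, docs, key=lambda d: self.scores[d])
```

A score update is a single table write. A top-k query reads the whole inverted list for the term, however small k is.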

    12. Naïve Approach 2: Score Method
        - Top-k queries: efficient (top part of inverted list)
        - Score updates: inefficient (reorganize many lists)
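The opposite tradeoff of the Score Method can be sketched the same way. The names are again hypothetical and the structures in-memory. The `list_writes` counter is an illustrative addition that makes the update cost visible.

```python
import bisect


class ScoreMethodIndex:
    """Inverted lists ordered by descending score."""

    def __init__(self):
        self.postings = {}    # term -> sorted list of (-score, doc_id)
        self.doc_terms = {}   # doc_id -> set of terms in the document
        self.scores = {}      # doc_id -> score currently stored in the lists
        self.list_writes = 0  # number of inverted lists reorganized so far

    def add_document(self, doc_id, terms, score):
        self.doc_terms[doc_id] = set(terms)
        self.scores[doc_id] = score
        for term in self.doc_terms[doc_id]:
            bisect.insort(self.postings.setdefault(term, []), (-score, doc_id))

    def update_score(self, doc_id, new_score):
        # Every list containing the document must be reorganized.
        old_score = self.scores[doc_id]
        for term in self.doc_terms[doc_id]:
            plist = self.postings[term]
            plist.remove((-old_score, doc_id))
            bisect.insort(plist, (-new_score, doc_id))
            self.list_writes += 1
        self.scores[doc_id] = new_score

    def top_k(self, term, k):
        # The best-scoring documents are simply the first k postings.
        return [doc_id for _, doc_id in self.postings.get(term, [])[:k]]
```

Here a query reads only a prefix of the list. A single score update rewrites one list per distinct term in the document.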

    13. Dilemma
        - Want inverted lists ordered by score
            - For top-k query performance
            - Like in Score Method
        - But do not want to touch inverted lists for every score update
            - For score update performance
            - Like in ID Method
        - How can we address this apparent dilemma?

    14. Score-Threshold Method
        - Extends Score Method in two key aspects
        - Allow inverted list scores to be out-of-date by up to a threshold
            - Avoids having to frequently update inverted list
                - Better score update performance
            - Need to scan more of inverted list (by up to a threshold) to correct for out-of-date score
                - Slightly reduced query performance
        - Use “short” inverted list for scores that exceed threshold
            - More efficient than updating large inverted list
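The sketch below is one plausible reading of this slide, not the authors' exact algorithm. The following are assumptions made for illustration:

- Scores are non-negative.
- A document enters the short list only when its current score rises more than threshold(list score) above the score stored in the long list.
- Score decreases are left to the score table, since queries always read current scores.
- The short list is scanned in full.
- Merging short lists back into the long list is omitted.

All class, method, and attribute names are hypothetical.

```python
import bisect
import heapq


class ScoreThresholdIndex:
    """Sketch: long lists hold possibly stale scores; short lists hold big movers."""

    def __init__(self, ratio):
        self.ratio = ratio      # threshold(score) = ratio * score
        self.long = {}          # term -> sorted list of (-list_score, doc_id)
        self.short = {}         # term -> set of doc_ids that outgrew their list score
        self.doc_terms = {}     # doc_id -> set of terms
        self.list_score = {}    # doc_id -> score recorded in the long lists
        self.score = {}         # doc_id -> current score (the score table)
        self.last_scanned = 0   # long-list postings read by the latest query

    def threshold(self, score):
        return self.ratio * score

    def add_document(self, doc_id, terms, score):
        self.doc_terms[doc_id] = set(terms)
        self.list_score[doc_id] = self.score[doc_id] = score
        for term in self.doc_terms[doc_id]:
            bisect.insort(self.long.setdefault(term, []), (-score, doc_id))

    def update_score(self, doc_id, new_score):
        self.score[doc_id] = new_score   # always a cheap table write
        stale = self.list_score[doc_id]
        if new_score > stale + self.threshold(stale):
            # Too far above the list score: note it in the short lists
            # instead of reorganizing the long ones.
            for term in self.doc_terms[doc_id]:
                self.short.setdefault(term, set()).add(doc_id)

    def top_k(self, term, k):
        if k <= 0:
            return []
        heap, seen = [], set()   # min-heap of (current score, doc_id)

        def consider(doc_id):
            if doc_id in seen:
                return
            seen.add(doc_id)
            item = (self.score[doc_id], doc_id)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)

        for doc_id in self.short.get(term, ()):
            consider(doc_id)
        self.last_scanned = 0
        for neg_score, doc_id in self.long.get(term, []):
            bound = -neg_score + self.threshold(-neg_score)
            if len(heap) == k and bound < heap[0][0]:
                break            # no later posting can reach the top k
            self.last_scanned += 1
            consider(doc_id)
        return [doc_id for _, doc_id in sorted(heap, reverse=True)]
```

The scan stops once a posting's list score plus its threshold falls below the current k-th best score. Under the assumptions above, every document outside the short list has a current score no higher than that bound. This is the extra scanning "by up to a threshold" that the slide mentions.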

    15. Score-Threshold Method

    16. Score-Threshold Method

    17. Score-Threshold Method

    18. Score-Threshold Method

    19. Score-Threshold Method

    20. Query-Update Tradeoff
        - Choice of threshold function
        - If threshold(score) = 0
            - Every update results in update to inverted list
            - Similar to Score Method
        - If threshold(score) = infinity
            - No inverted list update, but scan all of list
            - Similar to ID Method
        - Can control query-update tradeoff using threshold function
            - threshold(score) = r * score, r >= 0
            - r: threshold ratio
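A small worked example shows how the threshold ratio r moves between the two extremes on this slide. It assumes that a list update is triggered only when the new score exceeds the list score by more than threshold(list score), and that the list score is refreshed at that point. The download counts are made up, and the function names are hypothetical.

```python
import math


def threshold(score, r):
    """threshold(score) = r * score, with r >= 0 (the threshold ratio)."""
    return r * score


def list_updates(initial_score, new_scores, r):
    """Count how many score updates would force an inverted-list change."""
    list_score = initial_score
    touched = 0
    for score in new_scores:
        if score > list_score + threshold(list_score, r):
            list_score = score   # the list copy of the score is refreshed
            touched += 1
    return touched


downloads = [110, 125, 160, 400, 410]   # hypothetical, steadily rising counts
for r in (0.0, 0.5, math.inf):
    print(r, list_updates(100, downloads, r))
```

With these numbers, r = 0 touches the list on all five updates (Score Method behavior), r = 0.5 on two, and r = infinity on none (ID Method behavior).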

    21. Experimental Setup
        - Two primary performance metrics
            - Time for a score update
            - Time for a top-k query
        - Data sets
            - Real (Internet Archive): 60MB (thanks to Brewster Kahle and Jon Aizen)
            - Synthetic: 805MB
        - Compared alternatives
            - Implemented in C++ on top of BerkeleyDB
            - 2.7GHz processor, 1GB memory

    22. Varying # Updates

    23. 10000 foot view of Data Management

    24. XML Keyword Search
        - Example applications
            - Accident reports, Shakespeare’s plays
        - XRank: keyword search over semi-structured XML documents
            - Extends keyword search to work over both structured and unstructured data
            - SIGMOD 2003 [Guo, Shao, Botev, Shanmugasundaram]

    25. 10000 foot view of Data Management

    26. Towards Unifying DB and IR
        - Example applications
            - Content management, web querying
        - TeXQuery: query language for structured and unstructured data, structured and keyword queries
            - Precursor to W3C XQuery Full-Text
            - WWW 2004 [Amer-Yahia, Botev, Shanmugasundaram]

    27. Related Work
        - Integrating DB and IR systems
            - For the most part, treat individual systems as “black boxes”
            - Our goal is to unify DB and IR systems
        - Search over semi-structured data
            - Specialized techniques for searching semi-structured data
            - Our goal is to generalize DB and IR techniques
        - Keyword search and ranking in databases
            - BANKS, DBXplorer, DISCOVER

    28. Summary
        - Many emerging applications require a unification of DB and IR techniques
            - E-commerce, content management, …
        - Argues for a new generation of systems and techniques that seamlessly provide this capability
            - SVR, XRank, TeXQuery, …
        - Educational benefit: present unified view of data management
            - Currently at graduate level
            - Eventually introduce concepts at undergraduate level
