1. Towards Unifying Database Systems and Information Retrieval Systems Jayavel Shanmugasundaram
Cornell University
2. 10000 foot view of Data Management
3. 10000 foot view of Data Management
4. Case Study: Internet Archive
5. Internet Archive Database
6. Main Issue Traditional IR ranking methods would rank the two movies about the same
Example: TF-IDF
“Golden Gate” appears exactly once in both descriptions
Lengths of the text fields are about the same
Hence: same normalized TF-IDF score
Larger issue: Traditional IR scoring methods developed for stand-alone document collections
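The TF-IDF argument above can be checked with a small sketch. The two descriptions below are hypothetical (not taken from the Internet Archive), "Golden Gate" is treated as a single token for simplicity, and the smoothed IDF is just one common variant, so this illustrates the point rather than any particular system's scoring.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Length-normalized term frequency times a smoothed IDF (one common variant)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf

# Hypothetical movie descriptions of equal length (8 tokens each).
movie_a = "amateur footage of the golden_gate bridge in fog".split()
movie_b = "classic documentary about the golden_gate bridge construction era".split()
other = "a short film about steam trains".split()
corpus = [movie_a, movie_b, other]

score_a = tf_idf("golden_gate", movie_a, corpus)
score_b = tf_idf("golden_gate", movie_b, corpus)
print(score_a == score_b)  # same term count, same length, same IDF
```

Nothing in the text fields distinguishes the two movies, which is why structured values such as visit or download counts are needed to rank them.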
7. Internet Archive Database
8. Structured Value Ranking (Guo et al., 2005) Use structured data values associated with text columns to score results
Main technical challenge
Structured data values (and hence scores) change frequently and possibly dramatically!
Number of visits, downloads, award announcements
“SlashDot effect”
Bursts and rapidly changing popularity [Kleinberg]
Users still want to see results ordered by latest score values
Current focus: design efficient inverted lists
9. System Architecture
10. Index Operations Document score updates
Handle frequent updates to scores
Top-k keyword queries
Conjunctive and disjunctive keyword queries
Include IR-style (TF-IDF) scores
Top-k query results
Content updates, insertions and deletions
Update to document content
Document insertions and deletions
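As a rough illustration of the query side of these operations, conjunctive queries intersect the posting lists of the query terms and disjunctive queries union them, after which the k highest-scoring documents are returned. The function names and the plain dictionaries below are assumptions made for this sketch, and it ranks by a single score per document rather than combining structured and TF-IDF scores.

```python
import heapq

def candidates(postings, terms, mode):
    """Documents matching all terms (mode="and") or any term (mode="or")."""
    lists = [set(postings.get(t, ())) for t in terms]
    if not lists:
        return set()
    return set.intersection(*lists) if mode == "and" else set.union(*lists)

def top_k(postings, scores, terms, k, mode="and"):
    """Return the k matching document ids with the highest scores."""
    docs = candidates(postings, terms, mode)
    return heapq.nlargest(k, docs, key=lambda d: scores[d])

# Hypothetical index: term -> document ids, and document id -> score.
postings = {"golden": [1, 2, 3], "gate": [2, 3, 4]}
scores = {1: 50, 2: 10, 3: 30, 4: 40}

print(top_k(postings, scores, ["golden", "gate"], 2, mode="and"))
print(top_k(postings, scores, ["golden", "gate"], 2, mode="or"))
```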
11. Naïve Approach 1: ID Method Score updates: efficient (just update score table)
Top-k queries: inefficient (scan all of inverted list)
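A minimal sketch of the ID Method, assuming in-memory dictionaries in place of real on-disk inverted lists; the class and method names are hypothetical. Posting lists are ordered by document id and scores live in a separate table, so a score update is a single table write while a top-k query must visit every posting for the term.

```python
import bisect
import heapq

class IdMethodIndex:
    def __init__(self):
        self.postings = {}  # term -> list of doc ids, kept sorted by id
        self.scores = {}    # doc id -> current score (the score table)

    def add_document(self, doc_id, terms, score):
        self.scores[doc_id] = score
        for term in set(terms):
            bisect.insort(self.postings.setdefault(term, []), doc_id)

    def update_score(self, doc_id, new_score):
        # Efficient: only the score table changes, no inverted list is touched.
        self.scores[doc_id] = new_score

    def top_k(self, term, k):
        # Inefficient: every posting is scanned, since list order says nothing about score.
        docs = self.postings.get(term, [])
        return heapq.nlargest(k, docs, key=lambda d: self.scores[d])

id_index = IdMethodIndex()
id_index.add_document(1, ["golden", "gate"], 5)
id_index.add_document(2, ["golden"], 9)
id_index.add_document(3, ["golden"], 1)
id_index.update_score(3, 20)
print(id_index.top_k("golden", 2))
```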
12. Naïve Approach 2: Score Method Top-k queries: efficient (top part of inverted list)
Score updates: inefficient (reorganize many lists)
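A minimal sketch of the Score Method, again with hypothetical names and in-memory lists standing in for on-disk structures. Posting lists are kept in descending score order, so a top-k query reads only a prefix, but each score update must reposition the document in the list of every term it contains.

```python
import bisect

class ScoreMethodIndex:
    def __init__(self):
        self.postings = {}   # term -> list of (-score, doc_id); ascending order = descending score
        self.doc_terms = {}  # doc_id -> set of terms in the document
        self.scores = {}     # doc_id -> current score

    def add_document(self, doc_id, terms, score):
        self.doc_terms[doc_id] = set(terms)
        self.scores[doc_id] = score
        for term in self.doc_terms[doc_id]:
            bisect.insort(self.postings.setdefault(term, []), (-score, doc_id))

    def update_score(self, doc_id, new_score):
        """Reposition the document in every list it appears in; return lists touched."""
        old_score = self.scores[doc_id]
        for term in self.doc_terms[doc_id]:
            entries = self.postings[term]
            entries.remove((-old_score, doc_id))
            bisect.insort(entries, (-new_score, doc_id))
        self.scores[doc_id] = new_score
        return len(self.doc_terms[doc_id])

    def top_k(self, term, k):
        # Efficient: the answer is simply the first k entries of the list.
        return [doc_id for _, doc_id in self.postings.get(term, [])[:k]]

score_index = ScoreMethodIndex()
score_index.add_document(1, ["golden", "gate"], 5)
score_index.add_document(2, ["golden"], 9)
score_index.add_document(3, ["golden"], 1)
lists_touched = score_index.update_score(1, 20)
print(lists_touched, score_index.top_k("golden", 2))
```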
13. Dilemma Want inverted lists ordered by score
For top-k query performance
Like in Score Method
But do not want to touch inverted lists for every score update
For score update performance
Like in ID Method
How can we address this apparent dilemma?
14. Score-Threshold Method Extends Score Method in two key aspects
Allow inverted list scores to be out-of-date by up to a threshold
Avoids having to frequently update inverted list
Better score update performance
Need to scan more of the inverted list (by up to a threshold) to correct for out-of-date scores
Slightly reduced query performance
Use “short” inverted list for scores that exceed threshold
More efficient than updating large inverted list
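The two ideas above can be sketched as follows. This is a simplified reading of the slide, not the published algorithm: it assumes non-negative scores, single-term queries, a threshold of the form ratio * score, and in-memory lists, and all class and method names are hypothetical. Each document is filed under a "list score"; an update touches an inverted list only when the new score exceeds the filed score plus the threshold, in which case the document is added to a short list. Queries merge both lists and keep scanning until no unseen document could still beat the current k-th result.

```python
import bisect
import heapq

class ScoreThresholdIndex:
    def __init__(self, ratio):
        self.ratio = ratio     # r in threshold(score) = r * score
        self.long = {}         # term -> [(-filed_score, doc_id)], built at load time
        self.short = {}        # term -> [(-filed_score, doc_id)], docs that outgrew the threshold
        self.doc_terms = {}    # doc_id -> set of terms
        self.current = {}      # doc_id -> latest score
        self.filed = {}        # doc_id -> score under which the doc is currently filed
        self.in_short = set()  # docs that already have a short-list entry

    def threshold(self, score):
        return self.ratio * score

    def add_document(self, doc_id, terms, score):
        self.doc_terms[doc_id] = set(terms)
        self.current[doc_id] = self.filed[doc_id] = score
        for term in self.doc_terms[doc_id]:
            bisect.insort(self.long.setdefault(term, []), (-score, doc_id))

    def update_score(self, doc_id, new_score):
        """Record the new score; return True only if an inverted list was touched."""
        self.current[doc_id] = new_score
        filed = self.filed[doc_id]
        if new_score <= filed + self.threshold(filed):
            return False  # within the allowed staleness: score table only
        for term in self.doc_terms[doc_id]:
            entries = self.short.setdefault(term, [])
            if doc_id in self.in_short:
                entries.remove((-filed, doc_id))
            bisect.insort(entries, (-new_score, doc_id))
        self.in_short.add(doc_id)
        self.filed[doc_id] = new_score  # the stale long-list entry is left in place
        return True

    def top_k(self, term, k):
        merged = heapq.merge(self.long.get(term, []), self.short.get(term, []))
        best, seen = [], set()  # best is a min-heap of (current_score, doc_id)
        for neg_filed, doc_id in merged:
            filed = -neg_filed
            # Unseen docs are filed at or below `filed`, so their current
            # score cannot exceed filed + threshold(filed).
            if len(best) == k and best[0][0] >= filed + self.threshold(filed):
                break
            if doc_id in seen:
                continue  # stale duplicate of a doc already reached via the short list
            seen.add(doc_id)
            heapq.heappush(best, (self.current[doc_id], doc_id))
            if len(best) > k:
                heapq.heappop(best)
        return [doc_id for _, doc_id in sorted(best, reverse=True)]
```

With ratio 0.5, for example, a document filed at score 80 can rise to 120 without any list being modified; the query compensates by scanning slightly past the point where an exact score-ordered list would have stopped.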
15. Score-Threshold Method
16. Score-Threshold Method
17. Score-Threshold Method
18. Score-Threshold Method
19. Score-Threshold Method
20. Query-Update Tradeoff Choice of threshold function
If threshold(score) = 0
Every update results in update to inverted list
Similar to Score Method
If threshold(score) = infinity
No inverted list update, but scan all of list
Similar to ID Method
Can control query-update tradeoff using threshold function
threshold(score) = r * score, r >= 0
r: threshold ratio
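A small sketch of how the threshold ratio controls the tradeoff, under the assumption (not stated on the slide) that an inverted list is touched only when a new score exceeds the filed score plus threshold(filed score). The function names and the sample score sequence are hypothetical.

```python
import math

def threshold(score, r):
    """threshold(score) = r * score, with r >= 0."""
    return r * score

def needs_list_update(filed_score, new_score, r):
    """Assumed rule: touch the list only when the score outgrows its threshold."""
    return new_score > filed_score + threshold(filed_score, r)

def count_list_updates(initial_score, updates, r):
    """Count how many of a sequence of score updates would touch an inverted list."""
    filed, touched = initial_score, 0
    for new_score in updates:
        if needs_list_update(filed, new_score, r):
            touched += 1
            filed = new_score
    return touched

updates = [105, 120, 200]  # hypothetical rising popularity, starting from 100
for r in (0, 0.5, math.inf):
    print(r, count_list_updates(100, updates, r))
```

With r = 0 every increase touches the list (as in the Score Method), with an infinite r none does (as in the ID Method), and intermediate ratios fall in between.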
21. Experimental Setup Two primary performance metrics
Time for a score update
Time for a top-k query
Data sets
Real (Internet Archive): 60MB
Thanks to Brewster Kahle and Jon Aizen
Synthetic: 805MB
Compared alternatives
Implemented in C++ on top of BerkeleyDB
2.7 GHz processor, 1 GB memory
22. Varying # Updates
23. 10000 foot view of Data Management
24. XML Keyword Search Example applications
Accident reports, Shakespeare’s plays
XRank: Keyword search over semi-structured XML documents
Extends keyword search to work over both structured and unstructured data
SIGMOD 2003 [Guo, Shao, Botev, Shanmugasundaram]
25. 10000 foot view of Data Management
26. Towards Unifying DB and IR Example applications
Content management, web querying
TeXQuery: Query language for structured and unstructured data, structured and keyword queries
Precursor to W3C XQuery Full-Text
WWW 2004 [Amer-Yahia, Botev, Shanmugasundaram]
27. Related Work Integrating DB and IR systems
For the most part, treat individual systems as “black boxes”
Our goal is to unify DB and IR systems
Search over Semi-Structured Data
Specialized techniques for searching semi-structured data
Our goal is to generalize DB and IR techniques
Keyword search and ranking in databases
BANKS, DBXplorer, DISCOVER
28. Summary Many emerging applications require a unification of DB and IR techniques
E-commerce, content management, …
Argues for a new generation of systems and techniques that seamlessly provide this capability
SVR, XRank, TeXQuery, …
Educational benefit: present unified view of data management
Currently at graduate level
Eventually introduce concepts at undergraduate level