Slide 1:CAREER: Towards Unifying Database Systems and Information Retrieval Systems
NSF IDM Workshop
10 Oct 2004
Jayavel Shanmugasundaram
Cornell University
Slide 2:10000 foot view of Data Management
Slide 3:10000 foot view of Data Management
Slide 4:Internet Archive Database
Slide 5:Internet Archive Database
Slide 6:Structured Value Ranking
Use structured data values associated with text columns to score results
Main technical challenge
Need to produce top-k results efficiently
Order inverted lists by score
But scores change frequently [Aizen et al., 2004]
Flash crowds on Internet
Recent award announcements
How can we process top-k results efficiently while allowing frequent score updates?
Slide 7:Solution Overview
Order inverted lists by score
Queries efficient
Score updates slow
Order inverted lists by document id
Queries slow
Score updates efficient
Hybrid solution: order inverted lists by chunk
Order chunks by score
Order documents within chunk by id
Guo et al. [ICDE 2005]
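The hybrid layout above can be illustrated with a minimal sketch. This is not the algorithm from Guo et al.; it assumes fixed-width score buckets as chunks, conjunctive keyword queries, and an in-memory index. The names ChunkedIndex, chunk_width, update_score, and top_k are hypothetical and chosen only for illustration.

```python
import bisect
import heapq
from collections import defaultdict


def intersect_sorted(a, b):
    """Merge-intersect two lists that are both sorted by document id."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


class ChunkedIndex:
    """Toy index: chunks ordered by score, documents ordered by id inside a chunk."""

    def __init__(self, chunk_width=10):
        self.chunk_width = chunk_width  # assumption: fixed-width score buckets
        self.score = {}                 # doc id -> current score
        self.terms = {}                 # doc id -> set of terms
        # term -> chunk number -> id-sorted list of doc ids
        self.postings = defaultdict(lambda: defaultdict(list))

    def _chunk(self, score):
        return int(score // self.chunk_width)

    def add(self, doc_id, terms, score):
        self.score[doc_id] = score
        self.terms[doc_id] = set(terms)
        for t in self.terms[doc_id]:
            bisect.insort(self.postings[t][self._chunk(score)], doc_id)

    def update_score(self, doc_id, new_score):
        """Return True only if postings had to move between chunks."""
        old_chunk = self._chunk(self.score[doc_id])
        new_chunk = self._chunk(new_score)
        self.score[doc_id] = new_score
        if old_chunk == new_chunk:
            return False  # cheap path: no inverted list is touched
        for t in self.terms[doc_id]:
            self.postings[t][old_chunk].remove(doc_id)
            bisect.insort(self.postings[t][new_chunk], doc_id)
        return True

    def top_k(self, query_terms, k):
        """Top-k documents containing every query term, as (doc_id, score)."""
        lists = [self.postings.get(t) for t in query_terms]
        if not lists or any(l is None for l in lists):
            return []
        common = set(lists[0]).intersection(*lists[1:])
        best = []  # min-heap of (score, doc_id)
        for c in sorted(common, reverse=True):
            if len(best) == k:
                break  # every remaining chunk holds strictly lower scores
            ids = lists[0][c]
            for l in lists[1:]:
                ids = intersect_sorted(ids, l[c])
            for d in ids:
                heapq.heappush(best, (self.score[d], d))
                if len(best) > k:
                    heapq.heappop(best)
        return [(d, s) for s, d in sorted(best, reverse=True)]
```

In this sketch a score change that stays inside its chunk only rewrites one entry in the score table, while a query can stop as soon as it has k results from the highest-scoring chunks. The chunk width is the tuning knob: wider chunks make updates cheaper and queries scan more documents.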
Slide 8:10000 foot view of Data Management
Slide 9:Applications
Content management
Mix of structured and unstructured data
Database with date and time of accident (structured data) and accident description (unstructured data)
Semi-structured data
Scientific documents, Shakespeare’s plays, …
Support flexible keyword search interface over mix of structured and unstructured data
XRANK [Guo et al., SIGMOD 2003]
Slide 10:XML Keyword Search
Slide 11:10000 foot view of Data Management
Slide 12:Applications
The Internet is enabling end-users to directly ask queries and explore results
E.g., Used car marketplace
Find all “bright red ford mustangs” that cost less than 20% of the average price of cars in their class
Characteristics of queries
Keyword search (for ease of use)
Complex query operations (information synthesis)
Want to see ranked results!
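The used-car query above mixes a keyword predicate, an aggregate over structured data, and ranking. A minimal sketch of that combination follows; the record fields (id, description, car_class, price), the function name search_cars, and the ranking rule (keyword occurrence count, then lower price) are all assumptions made for illustration, not part of any system named in the talk.

```python
from collections import defaultdict
from statistics import mean


def search_cars(cars, keywords, fraction=0.2):
    """Ids of cars that contain every keyword and cost less than
    `fraction` of the average price in their class, best match first."""
    # Structured part: average price per class (a plain aggregate).
    prices = defaultdict(list)
    for car in cars:
        prices[car["car_class"]].append(car["price"])
    avg = {cls: mean(vals) for cls, vals in prices.items()}

    kws = [k.lower() for k in keywords]
    hits = []
    for car in cars:
        words = car["description"].lower().split()
        counts = [words.count(k) for k in kws]
        if not all(counts):
            continue  # keyword part: every keyword must occur
        if car["price"] >= fraction * avg[car["car_class"]]:
            continue  # structured part: price below the class threshold
        hits.append((sum(counts), car))

    # Ranking (assumed): more keyword occurrences first, then lower price.
    hits.sort(key=lambda h: (-h[0], h[1]["price"]))
    return [car["id"] for _, car in hits]
```

The point of the sketch is that neither half alone answers the query: a keyword engine has no notion of a per-class average, and a plain relational filter has no notion of match quality to rank by.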
Slide 13:Towards Unifying DB and IR
No standard query language for both DB and IR
SQL, XQuery mostly “database query languages”
Have developed TeXQuery: a full-text search extension to XQuery
Amer-Yahia et al. (WWW 2004)
Full composability of database and IR primitives, ranking
Adopted as the precursor to the XQuery full-text extensions currently being developed by the W3C
Come see demo tomorrow
Slide 14:Related Work
Integrating DB and IR systems
For the most part, treat individual systems as “black boxes”
Our goal is to unify DB and IR systems
Search over Semi-Structured Data
Specialized techniques for searching semi-structured data
Our goal is to generalize DB and IR techniques
Keyword search and ranking in databases
Slide 15:Summary
Many emerging applications require a unification of DB and IR techniques
E-commerce applications
Semi-structured documents
Content management
Argues for a new generation of systems and techniques that seamlessly provide this capability
SVR, XRANK, TeXQuery, …
Educational benefit: present unified view of data management
Currently at graduate level
Eventually introduce concepts at undergraduate level