CAREER: Towards Unifying Database Systems and Information Retrieval Systems
NSF IDM Workshop, 10 Oct 2004
Jayavel Shanmugasundaram, Cornell University
10000 foot view of Data Management
[Figure: diagram with a Queries axis (Ranked Keyword Search; Complex and Structured) and a Data axis (Structured; Unstructured), locating Information Retrieval Systems and Database Systems.]
10000 foot view of Data Management
[Figure: the same Queries/Data diagram, with the region between Information Retrieval Systems and Database Systems labelled: Text search in databases; Ranking based on structured values.]
Internet Archive Database

Movies
Mid | Name            | Description
10  | Amateur Film    | … they stand on the golden gate bridge and …
20  | American Thrift | … golden gate bridge with statue of liberty …
…   | …               | …

    SELECT * FROM Movies M
    ORDER BY score(M.description, “golden gate”)
    FETCH TOP 10 RESULTS ONLY

• Traditional IR scoring methods (e.g., TF*IDF) are often not very meaningful in this context
    • They were developed for stand-alone document collections
Internet Archive Database

Movies: as on the previous slide (Mid, Name, Description)

Reviews
Rid | Mid | Name     | Rating
901 | 10  | bleblanc | 2
902 | 10  | harry    | 1
903 | 20  | cooker   | 4
904 | 20  | alice    | 5
…   | …   | …        | …

Statistics
Sid | Mid | Visits | Downloads
81  | 10  | 285    | 90
82  | 20  | 927    | 247
…   | …   | …      | …

Structured Value Ranking (SVR)
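As a rough illustration of the idea, the sketch below ranks the movies that match a keyword query by a score computed from their structured values rather than from TF*IDF. The rows mirror the slide's tables as far as they can be read. The scoring formula (a weighted sum of average rating, visits and downloads), its weights, and the names `svr_score` and `top_k` are assumptions made for illustration, not the scoring function used in SVR.

```python
# Rows transcribed from the slide's tables; treat the values as illustrative.
movies = {
    10: ("Amateur Film", "they stand on the golden gate bridge and"),
    20: ("American Thrift", "golden gate bridge with statue of liberty"),
}
# (Rid, Mid, Name, Rating)
reviews = [(901, 10, "bleblanc", 2), (902, 10, "harry", 1),
           (903, 20, "cooker", 4), (904, 20, "alice", 5)]
# Mid -> (Visits, Downloads)
statistics = {10: (285, 90), 20: (927, 247)}


def svr_score(mid, w_rating=100.0, w_visits=1.0, w_downloads=2.0):
    """Hypothetical score: weighted sum of average rating, visits and downloads."""
    ratings = [r for (_, m, _, r) in reviews if m == mid]
    avg_rating = sum(ratings) / len(ratings) if ratings else 0.0
    visits, downloads = statistics.get(mid, (0, 0))
    return w_rating * avg_rating + w_visits * visits + w_downloads * downloads


def top_k(query, k=10):
    """Return up to k (mid, name, score) tuples for movies whose
    description contains every query term, highest score first."""
    terms = query.lower().split()
    hits = [(mid, name, svr_score(mid))
            for mid, (name, desc) in movies.items()
            if all(t in desc.lower().split() for t in terms)]
    return sorted(hits, key=lambda h: h[2], reverse=True)[:k]


print(top_k("golden gate"))
```

Both movies match "golden gate" equally well as text, so the order is decided entirely by the structured values: the better-reviewed, more visited movie comes first.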
Structured Value Ranking
• Use structured data values associated with text columns to score results
• Main technical challenge
    • Need to produce top-k results efficiently
    • Order inverted lists by score
    • But scores change frequently [Aizen et al., 2004]
        • Flash crowds on the Internet
        • Recent award announcements
• How can we process top-k results efficiently while allowing frequent score updates?
Solution Overview
• Order inverted lists by score
    • Queries efficient
    • Score updates slow
• Order inverted lists by document id
    • Queries slow
    • Score updates efficient
• Hybrid solution: order inverted lists by chunk
    • Order chunks by score
    • Order documents within chunk by id
    • Guo et al. [ICDE 2005]
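The hybrid layout can be sketched as follows. This is a minimal toy index, not the algorithm of Guo et al.: the fixed-width score ranges (`chunk_width`), the class name `ChunkedIndex`, and the conjunctive query semantics are all assumptions chosen to keep the example short. It shows the two properties the slide names: a score update that stays inside a chunk does not touch the inverted lists, and a top-k query can stop as soon as the higher-scoring chunks have produced k results.

```python
import bisect
import heapq


class ChunkedIndex:
    """Toy inverted index: chunks ordered by score range, doc ids sorted within a chunk."""

    def __init__(self, chunk_width=100.0):
        self.chunk_width = chunk_width   # assumed fixed-width score ranges
        self.scores = {}                 # doc id -> current score
        self.terms = {}                  # doc id -> set of terms
        self.lists = {}                  # term -> {chunk number -> sorted doc ids}

    def _chunk(self, score):
        return int(score // self.chunk_width)

    def add(self, doc_id, terms, score):
        self.scores[doc_id] = score
        self.terms[doc_id] = set(terms)
        for t in self.terms[doc_id]:
            chunks = self.lists.setdefault(t, {})
            bisect.insort(chunks.setdefault(self._chunk(score), []), doc_id)

    def update_score(self, doc_id, new_score):
        """Return True only if the document had to move to another chunk."""
        old, new = self._chunk(self.scores[doc_id]), self._chunk(new_score)
        self.scores[doc_id] = new_score
        if old == new:
            return False                 # cheap update: inverted lists untouched
        for t in self.terms[doc_id]:
            chunks = self.lists[t]
            chunks[old].remove(doc_id)
            if not chunks[old]:
                del chunks[old]
            bisect.insort(chunks.setdefault(new, []), doc_id)
        return True

    def top_k(self, query_terms, k):
        """Conjunctive query; returns (score, doc id) pairs, highest score first."""
        if not query_terms:
            return []
        lists = [self.lists.get(t, {}) for t in query_terms]
        # A document sits in the same chunk number in every one of its lists.
        chunk_nos = sorted(set.intersection(*(set(l) for l in lists)), reverse=True)
        best = []                        # min-heap of (score, doc id)
        for c in chunk_nos:
            if len(best) == k:
                break                    # remaining chunks hold only lower scores
            # A real system would merge the id-sorted lists; sets keep this short.
            for d in set.intersection(*(set(l[c]) for l in lists)):
                heapq.heappush(best, (self.scores[d], d))
                if len(best) > k:
                    heapq.heappop(best)
        return sorted(best, reverse=True)


idx = ChunkedIndex()
idx.add(1, ["golden", "gate"], 615.0)
idx.add(2, ["golden", "gate", "liberty"], 1871.0)
print(idx.top_k(["golden", "gate"], 1))
print(idx.update_score(1, 690.0))        # stays in the same chunk
```

Because documents inside a chunk are ordered by id rather than score, a chunk that is visited must be scanned in full. Wider chunks make updates cheaper and queries slower, which is the trade-off the hybrid design exposes.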
Applications
• Content management
    • Mix of structured and unstructured data
    • Database with date and time of accident (structured data) and accident description (unstructured data)
• Semi-structured data
    • Scientific documents, Shakespeare’s plays, …
• Support flexible keyword search interface over mix of structured and unstructured data
    • XRANK [Guo et al., SIGMOD 2003]
XML Keyword Search

    <workshop date="28 July 2000">
      <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
      <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
      <proceedings>
        <paper id="1">
          <title> XQL and Proximal Nodes </title>
          <author> Ricardo Baeza-Yates </author>
          <author> Gonzalo Navarro </author>
          <abstract> We consider the recently proposed language … </abstract>
          <section name="Introduction">
            Searching on structured text is becoming more important with XML …
          </section>
          …
          <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
        </paper>
        …

• Most specific results (exploits structure!)
• Ranking at granularity of elements (generalizes PageRank)
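To make "most specific results" concrete, here is a small sketch over a cut-down, well-formed version of the document above. It returns the deepest elements that contain every query keyword. This is a simplification made for illustration: it does not reproduce XRANK's full result semantics, and it does no element-level ranking at all. The names `words` and `most_specific` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the slide's document; elided parts are left out.
DOC = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
      <abstract>We consider the recently proposed language</abstract>
    </paper>
  </proceedings>
</workshop>
"""


def words(elem):
    """Lower-cased words in the text of elem and all of its descendants."""
    return set(" ".join(elem.itertext()).lower().split())


def most_specific(elem, keywords):
    """Elements under elem that contain every keyword (a set of lower-case
    words) and have no child element that also contains them all."""
    if not keywords <= words(elem):
        return []
    from_children = [r for child in elem for r in most_specific(child, keywords)]
    return from_children or [elem]


root = ET.fromstring(DOC)
for hit in most_specific(root, {"xql", "language"}):
    print(hit.tag, hit.attrib)
```

The query "xql language" is answered with the enclosing `<paper>` element instead of the whole `<workshop>` document, because no single child of the paper contains both words. A single-keyword query such as "xql" narrows further, to the paper's `<title>`.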
Applications
• The Internet is enabling end-users to directly ask queries and explore results
    • E.g., used car marketplace
    • Find all “bright red ford mustangs” that cost less than 20% of the average price of cars in their class
• Characteristics of queries
    • Keyword search (for ease of use)
    • Complex query operations (information synthesis)
    • Want to see ranked results!
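The used-car query combines all three characteristics, which a short sketch can show. Everything here is hypothetical: the listings, the `search` function, and the choice to rank by the fraction of query keywords a description contains. The structured part (price below 20% of the class average) follows the slide's example.

```python
from statistics import mean

# Hypothetical listings: (id, class, price, description)
cars = [
    (1, "sports", 3000, "bright red ford mustang convertible"),
    (2, "sports", 30000, "red ford mustang low mileage"),
    (3, "sports", 27000, "blue chevrolet camaro"),
    (4, "sedan", 900, "bright red ford taurus"),
    (5, "sedan", 9000, "silver honda accord"),
]


def search(query, fraction=0.2):
    """Return (keyword overlap, id) pairs, best match first, for listings
    priced below `fraction` of the average price of their class."""
    terms = set(query.lower().split())
    classes = {cls for _, cls, _, _ in cars}
    # Complex query operation: an aggregate computed per class.
    avg = {c: mean(p for _, cls, p, _ in cars if cls == c) for c in classes}
    results = []
    for cid, cls, price, desc in cars:
        overlap = len(terms & set(desc.lower().split()))
        if overlap and price < fraction * avg[cls]:
            results.append((overlap / len(terms), cid))
    # Ranked results: highest keyword overlap first, ties broken by id.
    return sorted(results, key=lambda r: (-r[0], r[1]))


print(search("bright red ford mustang"))
```

A pure SQL filter would handle the price condition but return an unordered set, and a pure keyword engine would rank the descriptions but could not express the per-class average. The sketch needs both in one query.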
Towards Unifying DB and IR
• No standard query language for both DB and IR
    • SQL, XQuery mostly “database query languages”
• Have developed TeXQuery: a full-text search extension to XQuery
    • Amer-Yahia et al. (WWW 2004)
    • Full composability of database and IR primitives, ranking
    • Adopted as the precursor to the XQuery full-text extensions currently being developed by the W3C
• Come see demo tomorrow
Related Work
• Integrating DB and IR systems
    • For the most part, treat individual systems as “black boxes”
    • Our goal is to unify DB and IR systems
• Search over semi-structured data
    • Specialized techniques for searching semi-structured data
    • Our goal is to generalize DB and IR techniques
• Keyword search and ranking in databases
Summary
• Many emerging applications require a unification of DB and IR techniques
    • E-commerce applications
    • Semi-structured documents
    • Content management
• Argues for a new generation of systems and techniques that seamlessly provide this capability
    • SVR, XRANK, TeXQuery, …
• Educational benefit: present unified view of data management
    • Currently at graduate level
    • Eventually introduce concepts at undergraduate level