TopX 2.0 at the INEX 2009 Ad-hoc and Efficiency tracks. Ablimit Aji Emory University. Martin Theobald Max Planck Institute Informatics. Ralf Schenkel Saarland University. Outline. Ad-hoc Focused. Query rewriting Data & scoring model Distributed indexing (new for 2009!)
TopX 2.0 at the INEX 2009 Ad-hoc and Efficiency tracks
Ablimit Aji, Emory University
Martin Theobald, Max Planck Institute Informatics
Ralf Schenkel, Saarland University
Outline
• Query rewriting
• Data & scoring model
• Distributed indexing (new for 2009!)
• Query processing
• Results
  • Ad-hoc
  • Efficiency
[Figure labels: Ad-hoc Focused; Efficiency Focused]
Query Rewriting I (NEXI/XPath-FT)
• CAS Queries
  • //article//(sec|p)[(about(.//header, “Yoga Lessons”) or about(.//title, +Yoga -history)) and about(.//figure, exercise)]
• Query DAGs
  • tag-term pairs as leaves
  • navigational tags as support elements
• Discard all Boolean constraints, “andish” mode for both CO and CAS
[Figure: query DAG with navigational tags article, sec, p (connected by // and self edges) and leaf tag-term pairs header$yoga, header$lesson, title$yoga, figure$exercise]
Query Rewriting II (NEXI)
• CO Queries
  • “Yoga Lessons” +Yoga -history exercise
  • //*[about(., “Yoga Lessons” +Yoga -history exercise)]
• Virtual * tag, fully pre-computed and materialized in inverted lists as *-term pairs
• Can be generalized to specific tag classes (e.g. <article|sec|p>)
[Figure: query DAG with leaf pairs *$yoga, *$lesson, *$exercise connected by self edges]
Data Model
• XML Trees (no XLink/ID/IDRef)
• Pre-/post-order ranges for the structure
• Redundant full-content text nodes
Example document:
<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>
Elements with (pre, post) labels and stemmed full-content text:
• article (1, 6): “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
• title (2, 1): “xml data manage”
• abs (3, 2): “xml manage system vary wide expressive power”
• sec (4, 5): “native xml data base native xml data base system store schemaless data”
• title (5, 3): “native xml data base”
• par (6, 4): “native xml data base system store schemaless data”
Full-content term frequencies: ftf(“xml”, article1) = 4, ftf(“xml”, sec4) = 2
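The labels and frequencies in this example can be reproduced with a short sketch. It assumes 1-based pre-/post-order ranks and plain lowercase tokenization in place of stemming (enough for the term “xml”). The helper names `label`, `ftf`, and `is_ancestor` are hypothetical, not TopX functions.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>"""

def label(root):
    """Return {element: (pre, post)} with 1-based pre- and post-order ranks."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1
        pre = counters["pre"]          # rank assigned when the element is entered
        for child in elem:
            visit(child)
        counters["post"] += 1          # rank assigned when the element is left
        labels[elem] = (pre, counters["post"])

    visit(root)
    return labels

def ftf(term, elem):
    """Full-content term frequency: count of `term` in the element's whole subtree."""
    text = " ".join(elem.itertext()).lower()
    return re.findall(r"[a-z0-9]+", text).count(term)

def is_ancestor(labels, a, d):
    """a is an ancestor of d iff pre(a) < pre(d) and post(a) > post(d)."""
    return labels[a][0] < labels[d][0] and labels[a][1] > labels[d][1]

root = ET.fromstring(DOC)
labels = label(root)
sec = root.find("sec")
print(labels[root], labels[sec], ftf("xml", root), ftf("xml", sec))
```

The ancestor test needs only two integer comparisons per element pair, which is the usual reason for storing pre-/post-order ranges alongside the content.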
Scoring Model [TopX @ INEX ’05–’09]
• XML-specific variant of Okapi BM25 (aka E-BM25, Robertson et al. [INEX ’05])
  • k1 = 2.0, b = 0.75
  • decay factor for ftf of 0.925
• Scores are computed per tag-term pair, e.g. author[“gates”] vs. section[“gates”]
[Formula figure with callouts: Content Index (Tag-Term Pairs), Element Freq., Element Statistics]
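To make the role of the parameters concrete, here is a minimal sketch of a generic element-level BM25 score. It follows the standard Okapi BM25 shape with per-tag statistics and is an assumption, not the exact formula from the slide. The function name and argument names are hypothetical, and the 0.925 ftf decay factor is not modeled.

```python
import math

def element_bm25(ftf, n_elems, elem_freq, length, avg_length, k1=2.0, b=0.75):
    """Generic BM25-style score for one tag-term pair (illustrative only).

    ftf        -- full-content term frequency of the term in the element
    n_elems    -- number of elements carrying this tag in the collection
    elem_freq  -- number of those elements that contain the term
    length     -- length of the element's full content
    avg_length -- average full-content length over elements with this tag
    """
    # Length normalization: longer-than-average elements are damped.
    K = k1 * ((1 - b) + b * length / avg_length)
    tf_part = (k1 + 1) * ftf / (K + ftf)
    # Inverse element frequency, computed per tag rather than per collection.
    idf_part = math.log((n_elems - elem_freq + 0.5) / (elem_freq + 0.5))
    return tf_part * idf_part

# Same term, different tags: the statistics differ per tag, so the scores do too.
score_author = element_bm25(ftf=1, n_elems=1000, elem_freq=5, length=3, avg_length=3)
score_section = element_bm25(ftf=1, n_elems=1000, elem_freq=200, length=3, avg_length=3)
```

Because element counts and frequencies are kept per tag, a term that is rare among author elements but common among section elements gets a higher weight in the former, which is the contrast the author[“gates”] vs. section[“gates”] example points at.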
How to create a full CAS index for a large XML collection efficiently?
• TopX index statistics for Wikipedia 2009 (55 GB XML sources)
• Go distributed!
Distributed Indexing I
• Two-level hashing:
  • At query processing time: hash(ti) → NodeId|FileId|ByteOffset (64-bit dictionary)
  • At indexing time: FileId(ti) = hash(ti) mod f, NodeId(ti) = FileId(ti) mod p
[Figure: a top-k engine on top of p nodes. Node1 is shown with File[1] … File[f/p] and Docs[1, …, n/p]; Node2 with File[(f/p)+1] … File[2f/p] and Docs[(n/p)+1, …, 2n/p]; Nodep with File[(p-1)(f/p)+1] … File[f] and Docs[(p-1)(n/p)+1, …, n]. Each file holds inverted lists keyed by tag$term.]
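The indexing-time assignment can be sketched as follows. The formulas are taken as written on the slide. The 64-bit hash built from SHA-1 is an assumption chosen only to get a stable value, and `hash64`, `file_id`, and `node_id` are hypothetical names.

```python
import hashlib

def hash64(key: str) -> int:
    """Stable 64-bit hash of a tag$term key (assumed; the real hash function is not given)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def file_id(key: str, f: int) -> int:
    """FileId(t) = hash(t) mod f, with f index files in total."""
    return hash64(key) % f

def node_id(key: str, f: int, p: int) -> int:
    """NodeId(t) = FileId(t) mod p, with p nodes in total."""
    return file_id(key, f) % p

# Hypothetical cluster: 16 nodes sharing 4,096 index files.
for key in ("sec$xml", "title$xml", "par$retrieval"):
    print(key, file_id(key, 4096), node_id(key, 4096, 16))
```

Since both levels depend only on the key, every indexer can decide locally where an inverted list belongs, without coordination between nodes.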
Distributed Indexing II
• Shared dictionary maps 64-bit keys → 64-bit values
  • Using hash(ti) as keys
  • Using 8 bits/NodeId, 12 bits/FileId, 44 bits/ByteOffset as values
• Max. distributed index size: 4,096 files × 2^44 bytes per file, i.e. 16 terabytes per file
• (Dictionary itself takes ~4 GB for 200 million keys)
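A minimal sketch of how such a 64-bit dictionary value could be packed and unpacked. The field widths come from the slide, and the field order NodeId|FileId|ByteOffset (most significant first) is assumed from the notation on the previous slide. `pack_value` and `unpack_value` are hypothetical names.

```python
NODE_BITS, FILE_BITS, OFFSET_BITS = 8, 12, 44  # 8 + 12 + 44 = 64 bits

def pack_value(node: int, file: int, offset: int) -> int:
    """Pack NodeId|FileId|ByteOffset into one 64-bit integer."""
    if not 0 <= node < (1 << NODE_BITS):
        raise ValueError("NodeId does not fit in 8 bits")
    if not 0 <= file < (1 << FILE_BITS):
        raise ValueError("FileId does not fit in 12 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("ByteOffset does not fit in 44 bits")
    return (node << (FILE_BITS + OFFSET_BITS)) | (file << OFFSET_BITS) | offset

def unpack_value(value: int):
    """Inverse of pack_value: return (node, file, offset)."""
    offset = value & ((1 << OFFSET_BITS) - 1)
    file = (value >> OFFSET_BITS) & ((1 << FILE_BITS) - 1)
    node = value >> (FILE_BITS + OFFSET_BITS)
    return node, file, offset

value = pack_value(node=3, file=1027, offset=122_564)
print(hex(value), unpack_value(value))
```

With this layout a single dictionary lookup yields everything needed to open the right file on the right node and seek to the start of the inverted list.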
Index Files: Inverted Block Structure for CAS Queries
• Group element blocks with similar Max-Score into document blocks of fixed length (e.g. 256 KB)
• Sort element blocks within each document block by Doc-ID
• Supports
  • Sequential (“sorted”) access by descending max(Max-Score)
  • Merge-joins by Doc-ID
• Dynamic top-k pruning, efficient merge-joins over large blocks
[Figure: inverted lists for sec[“xml”] (labelled 0) and title[“xml”] (labelled 122,564). Each element block holds (pre, post, score) entries for one Doc-ID; each document block (≤ 256 KB) carries a Max-Score and is read by sorted access (SA).]
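The grouping described above can be sketched with a simple greedy layout. This is an illustration under stated assumptions, not the TopX index builder: `ElementBlock`, its fields, and `build_document_blocks` are hypothetical names, and "similar Max-Score" is approximated by sorting on Max-Score before cutting fixed-capacity blocks.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ElementBlock:
    doc_id: int       # document the element entries belong to
    max_score: float  # highest element score inside this block
    size: int         # serialized size in bytes

def build_document_blocks(element_blocks: List[ElementBlock],
                          capacity: int = 256 * 1024) -> List[List[ElementBlock]]:
    """Greedily pack element blocks into document blocks of bounded size."""
    blocks, current, used = [], [], 0
    # Descending Max-Score keeps similarly scored element blocks together.
    for eb in sorted(element_blocks, key=lambda e: -e.max_score):
        if current and used + eb.size > capacity:
            blocks.append(sorted(current, key=lambda e: e.doc_id))
            current, used = [], 0
        current.append(eb)
        used += eb.size
    if current:
        blocks.append(sorted(current, key=lambda e: e.doc_id))
    return blocks

ebs = [ElementBlock(1, 0.9, 100), ElementBlock(5, 0.3, 100),
       ElementBlock(2, 1.0, 100), ElementBlock(3, 0.6, 100)]
doc_blocks = build_document_blocks(ebs, capacity=200)
block_ids = [[eb.doc_id for eb in blk] for blk in doc_blocks]
block_max = [max(eb.max_score for eb in blk) for blk in doc_blocks]
```

Reading the resulting blocks front to back gives sorted access by descending max(Max-Score), while the Doc-ID order inside each block is what makes merge-joins between lists possible.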
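Merging Blocks Incrementally
Query: //sec[about(.//, “XML”)] //par[about(.//, “retrieval”)]
• Sorted access and efficient merge-joins on top of large document blocks from disk
[Figure: index lists sec[“xml”] and par[“retrieval”], each a sequence of document blocks read by sorted access (SA) and annotated with Max(Max-Score) values (0.9 and 1.0 for the leading blocks, 0.6 and 0.8 for the following ones); Doc-IDs are sorted within each block.]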
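A minimal sketch of a merge-join by Doc-ID over two document blocks, assuming each block is already sorted by Doc-ID and reduced to (doc_id, score) pairs. The data is made up for illustration, and this inner join keeps only documents present in both blocks. It does not model the incremental, score-bounded processing of the actual engine.

```python
def merge_join(left, right):
    """Join two Doc-ID-sorted lists of (doc_id, score); sum scores on matches."""
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_doc, left_score = left[i]
        right_doc, right_score = right[j]
        if left_doc == right_doc:
            joined.append((left_doc, left_score + right_score))
            i += 1
            j += 1
        elif left_doc < right_doc:
            i += 1  # advance the side with the smaller Doc-ID
        else:
            j += 1
    return joined

# Hypothetical block contents for sec["xml"] and par["retrieval"].
sec_xml = [(1, 0.5), (2, 0.75), (5, 0.25)]
par_retrieval = [(2, 1.0), (5, 0.5), (6, 0.25)]
candidates = merge_join(sec_xml, par_retrieval)
```

Each side is scanned once, so the join costs time linear in the block sizes, which is what makes large disk-resident blocks practical.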
Some more tricks…
• Dump leading histogram blocks directly into index list headers
• Histograms only for index lists that exceed one document block (<5% of all lists)
• Supports probabilistic pruning and cost-based index access scheduling [Prob-Top-K, VLDB ’04; IO-Top-K, VLDB ’06]
• Efficient on-the-fly index decompression (S16), internal caching of decompressed index lists
• Incrementally read & process precomputed memory images for fast top-k queries on top of large disk blocks
[Figure: layout of the sec[“xml”] index list: a header part labelled ~36 bytes, a histogram block (freq over score, from 1 down to 0), then document blocks DB1 (256 KB), DB2 (256 KB), …, each holding element blocks EB 1, EB 2, …, EB k.]
Runs
• Ad-hoc Track (Article-Only, CO & CAS)
  • Focused
  • Best-In-Context
  • Thorough
• Efficiency
  • Type (A) Focused (same as Ad-hoc Focused)
    • Top-15, Top-150, Top-1500, Article-Only, CO & CAS
  • Type (B) Focused, CO only
    • Top-15 only, but up to 96 keywords/query
Future Work
• Phrase-matching & proximity ranking (non-monotonic!)
• “Holistic” Top-k for XQuery
• Multiple XPaths per XQuery
• Efficient inter-document retrieval
• Complex Boolean constraints among paths
• Updates!
• Full-fledged open-source platform for W3C XQuery Full-Text