
TopX: Efficient & Versatile Top-k Query Processing for Text, Semistructured & Structured Data
Martin Theobald
Max-Planck-Institut Informatik
Stanford University

[Figure: two example XML article trees built from article, title, sec, par, bib, item, inproc, and url elements, with text such as “Current Approaches to XML Data Management”, “The XML Files”, “Native XML Data Bases.”, “The Ontology Game”, and “XML-QL: A Query Language for XML.”; the slide is annotated with the three challenges RANKING, VAGUENESS, PRUNING.]
Example query:
//article[.//bib[about(.//item, “W3C”)]]//sec[about(.//, “XML retrieval”)]//par[about(.//, “native XML databases”)]

[Figure: TopX architecture]
• Frontends: Web Interface, Web Service, API
• TopX Query Processor (query processing time): Scan Threads for sequential access (SA), random access (RA), Candidate Queue, Candidate Cache, Top-k Queue
• Probabilistic Index Access Scheduling
• Probabilistic Candidate Pruning
• Dynamic Query Expansion
• Incremental XPath Engine
• Auxiliary Predicates
• Index Metadata: Selectivities, Histograms, Correlations
• Thesaurus: WordNet, OpenCyc, etc.
• Indexer/Crawler (indexing time): Unified Text & XML Schema, stored in DBMS / Inverted Lists

Data Model
• XML trees (no XLinks or ID/IDref attributes)
• Pre-/postorder node labels
• Redundant full-content text nodes
Example document:
<article> <title>XML Data Management</title> <abs>XML management systems vary widely in their expressive power.</abs> <sec> <title>Native XML Data Bases.</title> <par>Native XML data base systems can store schemaless data.</par> </sec> </article>
[Figure: the document tree with (pre, post) labels article (1, 6), title (2, 1), abs (3, 2), sec (4, 5), title (5, 3), par (6, 4). Each node carries its stemmed full content, e.g. “xml data manage” for the first title, “native xml data base native xml data base system store schemaless data” for sec, and the concatenation of all descendant text for article.]
ftf(“xml”, article1) = 4
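
The labeling and full-content scheme on this slide can be sketched in a few lines of Python. This is an illustration only, not TopX's indexer: the name `index_elements` is made up here, and tokenization is plain lowercasing without the stemming and stopword removal visible in the figure.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# The example document from the slide.
DOC = """<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>"""


def index_elements(root):
    """Return (tag, pre, post, ftf) for every element, in postorder.

    ftf counts terms over the element's full content, i.e. its own text
    plus the text of all descendants (hypothetical helper, not TopX code).
    """
    rows = []
    counter = {"pre": 0, "post": 0}

    def visit(elem):
        counter["pre"] += 1
        pre = counter["pre"]          # assigned on entering the node
        for child in elem:
            visit(child)
        counter["post"] += 1          # assigned on leaving the node
        full_text = " ".join(elem.itertext())
        ftf = Counter(re.findall(r"[a-z0-9]+", full_text.lower()))
        rows.append((elem.tag, pre, counter["post"], ftf))

    visit(root)
    return rows


rows = index_elements(ET.fromstring(DOC))
for tag, pre, post, ftf in rows:
    print(tag, pre, post, ftf["xml"])
```

For the article root the sketch counts "xml" four times, which agrees with ftf(“xml”, article1) = 4 on the slide.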

Scoring Model [INEX ’06/’07]
• XML-specific extension to Okapi BM25 (originating from probabilistic text IR)
• ftf (full-content term frequency) instead of tf
• ef (element frequency) instead of df
• Element type-specific length normalization
• Tunable parameters k1 and b
• Example: bib[“transactions”] vs. par[“transactions”]
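
The slide does not spell out the formula, so the following is only a BM25-shaped sketch of what the bullets describe: standard Okapi BM25 with tf replaced by ftf, df replaced by ef, and both the collection size and the average length taken per element type. The function name, the parameter names, and the defaults k1 = 1.2 and b = 0.75 (common BM25 defaults) are assumptions, not values from the talk.

```python
import math


def bm25_element_score(ftf, ef, n_elems, length, avg_length, k1=1.2, b=0.75):
    """BM25-style score of one term for one element (hypothetical sketch).

    ftf        -- full-content term frequency of the term in the element
    ef         -- number of elements of this tag that contain the term
    n_elems    -- number of elements of this tag in the collection
    length     -- full-content length of the element
    avg_length -- average full-content length over elements of this tag
    """
    if ftf <= 0:
        return 0.0
    # Length normalization relative to elements of the same type.
    norm = k1 * ((1.0 - b) + b * length / avg_length)
    tf_part = (k1 + 1.0) * ftf / (norm + ftf)
    # Inverse element frequency, computed per tag.
    idf_part = math.log((n_elems - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part


# Made-up statistics: the same term is rare in bib elements, common in par.
print(bm25_element_score(ftf=2, ef=10, n_elems=1000, length=50, avg_length=50))
print(bm25_element_score(ftf=2, ef=400, n_elems=1000, length=50, avg_length=50))
```

With statistics kept per tag, the same keyword can weigh differently in bib[“transactions”] than in par[“transactions”], which is the contrast the slide points at.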

TopX Query Processing [VLDB ’05]
Example query: //sec[about(.//, “XML”) and about(.//title, “native”)]//par[about(.//, “retrieval”)]
[Figure: animated top-2 example over the inverted lists sec[“xml”], title[“native”], and par[“retrieval”], each sorted by descending score. Candidates in the Top-2 and Candidate Queue carry worstscore bounds (e.g. worst=0.5, 0.9, 1.0, 1.6, 1.7, 2.2); the threshold min-2 rises from 0.0 through 0.5, 0.9, and 1.0 to 1.6, and the bound labelled max-q takes values between 1.6 and 3.0 as the scan proceeds.]
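
The bookkeeping in this example (worstscore, a best-possible score, and the min-k threshold) can be illustrated with a small sorted-access-only top-k loop in the style of the NRA threshold algorithm. It is a simplified sketch, not the TopX engine: it leaves out random accesses, scheduling, probabilistic pruning, and all XPath structure, and the function name and the three lists are made up.

```python
def topk_sorted_access(lists, k):
    """Round-robin sorted access with worstscore/bestscore bounds.

    lists maps a list name to (doc, score) pairs in descending score order.
    Returns the top-k (doc, worstscore) pairs once no other document can
    still overtake them.
    """
    names = list(lists)
    pos = {n: 0 for n in names}
    high = {n: (lists[n][0][1] if lists[n] else 0.0) for n in names}
    seen = {}  # doc -> {list name: score seen so far}

    while True:
        progressed = False
        for n in names:
            if pos[n] < len(lists[n]):
                doc, score = lists[n][pos[n]]
                pos[n] += 1
                seen.setdefault(doc, {})[n] = score
                high[n] = score   # upper bound for anything not yet read
                progressed = True
            else:
                high[n] = 0.0     # exhausted list contributes nothing more

        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[n] for n in names if n not in seen[d])
                for d in seen}
        ranked = sorted(seen, key=lambda d: worst[d], reverse=True)
        top, rest = ranked[:k], ranked[k:]
        min_k = worst[top[-1]] if len(top) == k else 0.0

        unseen_best = sum(high.values())  # bound for never-seen documents
        done = (len(top) == k and unseen_best <= min_k
                and all(best[d] <= min_k for d in rest))
        if done or not progressed:
            return [(d, worst[d]) for d in top]


# Hypothetical index lists, named after the query conditions on the slide.
LISTS = {
    "sec[xml]": [("d1", 1.0), ("d2", 0.9), ("d3", 0.8), ("d4", 0.5)],
    "title[native]": [("d2", 1.0), ("d3", 0.9), ("d1", 0.2), ("d5", 0.1)],
    "par[retrieval]": [("d3", 1.0), ("d2", 0.85), ("d6", 0.75), ("d1", 0.1)],
}
result = topk_sorted_access(LISTS, k=2)
print(result)
```

On this data the loop stops after three rounds, before the last entry of each list is read, because neither a queued candidate nor an unseen document can still exceed min-2.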

Index Access Scheduling [VLDB ’06]
[Figure: Inverted Block Index with per-block scores and look-ahead score reductions, e.g. Δ1,3 = 0.8 and Δ3,3 = 0.2; sorted accesses (SA) on the lists, followed by random accesses (RA).]
• SA Scheduling
  • Look-ahead Δi through precomputed score histograms
  • Knapsack-based optimization of Score Reduction
• RA Scheduling
  • 2-phase probing: Schedule RAs “late & last”
  • Extended probabilistic cost model for integrating SA & RA scheduling
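
One way to read “knapsack-based optimization of score reduction” is as a small allocation problem: given a budget of blocks to scan and, per list, the score reduction expected after scanning x further blocks, pick the allocation with the largest total reduction. The sketch below solves that reading with dynamic programming. The function name and the intermediate reduction values are hypothetical; only Δ1,3 = 0.8 and Δ3,3 = 0.2 are taken from the slide, and TopX's actual scheduler may be formulated differently.

```python
def schedule_sorted_accesses(reductions, budget):
    """Choose how many blocks to scan per list (hypothetical sketch).

    reductions[i][x] is the expected drop of list i's high score after
    scanning x more blocks (reductions[i][0] == 0.0).
    Returns (total_reduction, blocks_per_list) with sum(blocks) <= budget.
    """
    # best[u] = best (total reduction, allocation) using exactly u blocks
    best = {0: (0.0, [])}
    for per_list in reductions:
        nxt = {}
        for used, (total, alloc) in best.items():
            for x, gain in enumerate(per_list):
                u = used + x
                if u > budget:
                    break
                cand = (total + gain, alloc + [x])
                if u not in nxt or cand[0] > nxt[u][0]:
                    nxt[u] = cand
        best = nxt
    return max(best.values(), key=lambda entry: entry[0])


# Look-ahead tables for three lists; index = number of blocks scanned.
REDUCTIONS = [
    [0.0, 0.1, 0.1, 0.8],  # list 1: a large drop only after 3 blocks
    [0.0, 0.1, 0.2, 0.3],  # list 2: made-up values
    [0.0, 0.0, 0.1, 0.2],  # list 3
]
total, blocks = schedule_sorted_accesses(REDUCTIONS, budget=3)
print(total, blocks)
```

With a budget of three blocks, spending all of them on list 1 wins here, since no mixed allocation reaches its reduction of 0.8.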

Probabilistic Pruning [VLDB ’04]
• Convolutions of score distributions (assuming independence) yield P[d gets in the final top-k]
• Probabilistic candidate pruning: drop d from the candidate queue if P[d gets in the final top-k] < ε
• With probabilistic guarantees for precision & recall
[Figure: score densities f1 over [0, high1] and f2 over [0, high2] for the lists title[“native”] and par[“retrieval”], with the score gap δ(d) marked on the combined distribution; labels: sampling, Indexing Time, Query Processing Time.]
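
The pruning test can be sketched with discrete histograms. The code assumes that δ(d) is the gap between min-k and worstscore(d), that each remaining list is summarized by an equi-width histogram of the scores d could still obtain there, and that list scores are independent, as the slide states. All function names and the bucket layout are made up for illustration.

```python
def convolve(f1, f2):
    """Distribution of the sum of two independent bucketized scores."""
    out = [0.0] * (len(f1) + len(f2) - 1)
    for i, p in enumerate(f1):
        for j, q in enumerate(f2):
            out[i + j] += p * q
    return out


def prob_exceeds(histograms, delta, width):
    """P[sum of remaining scores > delta].

    Each histogram is a list of probabilities; bucket i stands for the
    score i * width (a coarse approximation of the true density).
    """
    dist = [1.0]  # distribution of an empty sum: score 0 with probability 1
    for hist in histograms:
        dist = convolve(dist, hist)
    return sum(p for i, p in enumerate(dist) if i * width > delta)


def should_prune(histograms, worstscore, min_k, width, epsilon):
    """Drop the candidate if it is unlikely to reach the top-k."""
    delta = min_k - worstscore  # assumed meaning of δ(d)
    return prob_exceeds(histograms, delta, width) < epsilon


# Two unevaluated lists, each giving score 0 or 1 with equal probability.
F1 = [0.5, 0.5]
F2 = [0.5, 0.5]
print(prob_exceeds([F1, F2], delta=1.0, width=1.0))
print(should_prune([F1, F2], worstscore=0.0, min_k=1.0, width=1.0, epsilon=0.3))
```

Here the candidate needs more than 1.0 from the two remaining lists, which only happens when both contribute their top bucket, so the estimated probability is 0.25 and the candidate is dropped for ε = 0.3.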

Dynamic Query Expansion [SIGIR ’05]
[Figure: TREC Robust Topic #363, Top-k (transport, tunnel, ~disaster). The lists for transport and tunnel are scanned by sorted access (SA); ~disaster is expanded into lists such as accident, disaster, and fire, which feed an Incr. Merge operator.]
• Incrementally merge inverted lists for expansion terms ti,1 ... ti,m in descending order of s(tij, d)
• Best-match score aggregation
• Specialized expansion operators
  • Incremental Merge operator
  • Nested Top-k operator (efficient phrase matching)
  • Boolean (but ranked) retrieval mode
• Supports any sorted inverted index for text, structured records & XML

Incremental Merge Operator
• Expansion terms ~t = { t1, t2, t3 }, from thesaurus lookups / relevance feedback
• Expansion similarities from large corpus term correlations: sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5
• Initial high-scores from index list metadata (e.g., histograms)
• Index lists (document, score), in descending score order:
  t1: d78 0.9, d23 0.8, d10 0.8, d1 0.4, d88 0.3, ...
  t2: d64 0.8, d23 0.8, d10 0.7, d12 0.2, d78 0.1, ...
  t3: d11 0.9, d78 0.9, d64 0.7, d99 0.7, d34 0.6, ...
• Merged list for ~t, read by sorted access (SA), with scores weighted by sim: d78 0.9, d23 0.8, d10 0.8, d64 0.72, d23 0.72, d10 0.63, d11 0.45, d78 0.45, d1 0.4, ..., d88 0.3, ...
• Weighted high-scores shown in the figure during the scan: 0.9, 0.72, 0.45, 0.4, 0.35, 0.18
• Meta histograms seamlessly integrate Incremental Merge into probabilistic scheduling and candidate pruning
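
The merge on this slide can be reproduced with a lazy k-way merge. The sketch below uses Python's `heapq.merge` over the three example lists, each weighted by its expansion similarity; t2 is written in descending score order, as a sorted index list requires. It illustrates the merge order only. The names `weighted` and `incremental_merge` are made up, and the best-match aggregation per document is assumed to happen downstream.

```python
import heapq

# Index lists and similarities from the slide.
LISTS = {
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)],
    "t2": [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)],
    "t3": [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)],
}
SIMS = {"t1": 1.0, "t2": 0.9, "t3": 0.5}


def weighted(postings, sim):
    """Lazily scale one list's scores by the term's expansion similarity."""
    for doc, score in postings:
        yield doc, sim * score


def incremental_merge(lists, sims):
    """Yield (doc, sim * score) over all lists in descending score order.

    Entries are pulled from the underlying lists only as the consumer
    asks for them, so a top-k scan can stop early.
    """
    streams = [weighted(lists[t], sims[t]) for t in lists]
    return heapq.merge(*streams, key=lambda entry: entry[1], reverse=True)


merged = list(incremental_merge(LISTS, SIMS))
for doc, score in merged[:9]:
    print(doc, round(score, 2))
```

The first nine entries come out as d78, d23, d10, d64, d23, d10, d11, d78, d1, the same order as the merged list on the slide. A document can appear more than once (d23 via t1 and t2); with best-match aggregation only its highest weighted score would count.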

Some Experiments
• New XML-ified Wikipedia corpus (INEX 2006)
  • 660,000 documents w/ 130,000,000 elements
• 125 INEX queries, each as content-only (CO) and content-and-structure (CAS) formulation
  • CO: +“state machine” figure Mealy Moore
  • CAS: //article[about(., “state machine”)]//figure[about(., Mealy) or about(., Moore)]
• Primary cost metric: Cost = #SA + (cR/cS) · #RA
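
As a small worked example of the cost metric, the helper below weighs random accesses by the cost ratio cR/cS. The function name and the numbers in the call are made up for illustration; they are not measurements from the talk.

```python
def access_cost(num_sa, num_ra, cr_over_cs):
    """Cost = #SA + (cR/cS) * #RA, in units of one sorted access."""
    return num_sa + cr_over_cs * num_ra


# Hypothetical run: 1,000 sorted accesses, 20 random accesses, cR/cS = 100.
print(access_cost(1000, 20, 100))
```

At a ratio of 100, the 20 random accesses cost twice as much as the 1,000 sorted accesses, which is why the scheduling of random accesses matters for this metric.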

TopX vs. Full-Merge
• Significant cost savings for large ranges of k
• CAS cheaper than CO!

Efficiency vs. Effectiveness
• Very good precision/runtime ratio for probabilistic pruning

Static vs. Dynamic Expansions
• Query expansions with up to m=292 keywords & phrases
• Balanced amount of sorted vs. random disk access
• Adaptive scheduling wrt. cR/cS cost ratio
• Dynamic expansions superior to static expansions & full-merge in both efficiency & effectiveness

Thanks…
Gerhard Weikum, Ralf Schenkel, Norbert Fuhr, Michalis Vazirgiannis, Holger Bast, Debapriyo Majumdar
All the MPI & INEX folks
topx.sourceforge.net See our Sigmod’07 demo!