TopX: Efficient & Versatile Top-k Query Processing for Semistructured Data
Martin Theobald
Max Planck Institute for Computer Science
Stanford University
Joint work with Ralf Schenkel and Gerhard Weikum
[Figure: two example XML article trees with elements such as article, title, sec, par, bib, item, inproc, and url, holding text such as “Current Approaches to XML Data Management”, “The XML Files”, “The Ontology Game”, “Native XML Data Bases.”, and “XML-QL: A Query Language for XML.”]
Example query: //article[.//bib[about(.//item, “W3C”)]]//sec[about(.//, “XML retrieval”)]//par[about(.//, “native XML databases”)]
Figure annotations: RANKING, VAGUENESS, PRUNING
Goal: Efficiently retrieve the best (top-k) results of a similarity query
• Extend existing threshold algorithms for inverted lists [Güntzer, Balke & Kießling, VLDB ’00; Fagin, PODS ’01] to XML data and XPath-like full-text search
• Non-schematic, heterogeneous data sources
• Efficiently support IR-style vague search
• Combined inverted index for content & structure
• Avoid full index scans, postpone expensive random accesses to large disk-resident data structures
• Exploit cheap disk space for redundant index structures
XML-IR: History and Related Work
[Timeline figure spanning 1995, 2000, and 2005]
• IR on structured docs (SGML): OED etc. (U Waterloo), HySpirit (U Dortmund), HyperStorM (GMD Darmstadt), WHIRL (CMU)
• Web query languages: W3QS (Technion Haifa), Araneus (U Roma), Lorel (Stanford U), WebSQL (U Toronto)
• IR on XML: XIRQL & HyRex (U Dortmund), XXL & TopX (U Saarland / MPII), ApproXQL (U Berlin / U Munich), ELIXIR (U Dublin), JuruXML (IBM Haifa), XSearch (Hebrew U), Timber (U Michigan), XRank & Quark (Cornell U), FleXPath (AT&T Labs), XKeyword (UCSD)
• XML query languages: XML-QL (AT&T Labs), XPath 1.0 (W3C), NEXI (INEX Benchmark), XPath 2.0 & XQuery 1.0 Full-Text (W3C), XPath 2.0 (W3C), XQuery 1.0 (W3C), TeXQuery (AT&T Labs)
• Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...
[Figure: TopX system architecture]
• Frontends: Web Interface, Web Service, API
• TopX Query Processor (query processing time):
  1 Top-k XPath Processing (Candidate Queue, Candidate Cache, Top-k Queue, Scan Threads)
  2 Probabilistic Index Access Scheduling (Sequential Access, SA; Random Access, RA)
  3 Probabilistic Candidate Pruning
  4 Dynamic Query Expansion
• Auxiliary Predicates
• Index Metadata: Selectivities, Histograms, Correlations
• Ontology / Large Thesaurus: WordNet, OpenCyc, etc.
• DBMS / Inverted Lists: Unified Text & XML Schema
• Indexer / Crawler (indexing time)
1 Top-k XPath Processing
2 Probabilistic Index Access Scheduling
3 Probabilistic Candidate Pruning
4 Dynamic Query Expansion
5 Experiments: TREC & INEX Benchmarks
Data Model
Example document:
<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>
Element labels (pre, post) and full-content text:
• article (1, 6): “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
• title (2, 1): “xml data manage”
• abs (3, 2): “xml manage system vary wide expressive power”
• sec (4, 5): “native xml data base native xml data base system store schemaless data”
• title (5, 3): “native xml data base”
• par (6, 4): “native xml data base system store schemaless data”
Full-content term frequencies: ftf(“xml”, article1) = 4, ftf(“xml”, sec4) = 2
• XML trees (no XLinks or ID/IDref attributes)
• Pre-/postorder node labels
• Redundant full-content text nodes (w/ stemming, no stopwords)
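The labeling and the full-content term frequencies from this slide can be reproduced with a small sketch. The `Element` class and the `label` and `ftf` helpers are hypothetical names chosen for illustration, not TopX's API, and the terms are entered already stemmed, exactly as they appear on the slide.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    tag: str
    terms: List[str] = field(default_factory=list)        # own, already stemmed terms
    children: List["Element"] = field(default_factory=list)
    pre: int = 0
    post: int = 0
    full_terms: List[str] = field(default_factory=list)   # own terms + all descendants' terms


def label(root: Element) -> None:
    """Assign pre-/postorder ranks and build the redundant full-content term lists."""
    counter = {"pre": 0, "post": 0}

    def visit(e: Element) -> None:
        counter["pre"] += 1
        e.pre = counter["pre"]
        e.full_terms = list(e.terms)
        for child in e.children:
            visit(child)
            e.full_terms.extend(child.full_terms)
        counter["post"] += 1
        e.post = counter["post"]

    visit(root)


def ftf(term: str, e: Element) -> int:
    """Full-content term frequency of `term` in element `e`."""
    return Counter(e.full_terms)[term]


# The example document from the slide.
title = Element("title", "xml data manage".split())
abstract = Element("abs", "xml manage system vary wide expressive power".split())
sec_title = Element("title", "native xml data base".split())
par = Element("par", "native xml data base system store schemaless data".split())
sec = Element("sec", children=[sec_title, par])
article = Element("article", children=[title, abstract, sec])
label(article)
```

After `label(article)`, `ftf("xml", article)` and `ftf("xml", sec)` give the two values shown on the slide, and each element carries its (pre, post) pair.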
Scoring Model [INEX ’06/’07]
• XML-specific extension to Okapi BM25 (originating from probabilistic IR on unstructured text)
• ftf instead of tf
• ef instead of df
• Element-type specific length normalization
• Tunable parameters k1 and b
• Example: bib[“transactions”] vs. par[“transactions”]
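The slide's formula is not part of this transcript. As a rough sketch, the substitutions listed above can be plugged into the standard Okapi BM25 shape. The function name, the argument names, and the defaults k1 = 1.2 and b = 0.75 (customary BM25 values) are assumptions, and the exact TopX formula may differ in detail.

```python
import math


def element_score(ftf: int, ef: int, n_tag: int, length: float, avg_length: float,
                  k1: float = 1.2, b: float = 0.75) -> float:
    """BM25-style score of one element for one tag[term] condition.

    ftf        full-content term frequency of the term in the element (replaces tf)
    ef         number of elements with this tag that contain the term (replaces df)
    n_tag      total number of elements with this tag
    length     length of the element; avg_length is the average for this tag
    """
    # Element-type specific length normalization.
    K = k1 * ((1.0 - b) + b * length / avg_length)
    tf_part = (k1 + 1.0) * ftf / (K + ftf)
    # Inverse element frequency, computed per tag.
    ief_part = math.log((n_tag - ef + 0.5) / (ef + 0.5))
    return tf_part * ief_part
```

Because ef and n_tag are kept per tag, the same term can score differently under bib[“transactions”] and par[“transactions”], which is the point of the slide's example.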
Fagin’s NRA [PODS ’01] at a Glance
Corpus: d1, …, dn
Query: q = (t1, t2, t3), k = 1
Find the top-k documents that maximize s(t1,dj) + s(t2,dj) + ... + s(tm,dj) (non-conjunctive, “andish” evaluation)
NRA(q, L):
  scan all lists Li (i = 1..m) in parallel & consider doc d at position posi
    E(d) := E(d) ∪ {i};
    highi := s(ti,d);
    worstscore(d) := ∑ {s(ti,d) | i ∈ E(d)};
    bestscore(d) := worstscore(d) + ∑ {highi | i ∉ E(d)};
    if worstscore(d) > min-k then
      add d to top-k
      min-k := min{worstscore(d’) | d’ ∈ top-k};
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d};
    if max{bestscore(d’) | d’ ∈ candidates} ≤ min-k then
      return top-k;
Inverted index (example):
• t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
• t2: d64 0.8, d23 0.6, d10 0.6, d13 0.2, d78 0.1, …
• t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …
For d10: s(t1,d10) = 0.8, s(t2,d10) = 0.6, s(t3,d10) = 0.7
The scan proceeds through scan depths 1, 2, and 3, then stops (“STOP!”).
Naive “Merge-then-Sort” approach: between O(mn) and O(mn²) runtime and O(mn) access cost
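A minimal Python sketch of the NRA loop, run on the slide's example lists (with the t2 list in descending score order). It is a simplification under stated assumptions: the function name `nra` is illustrative, lists are read in lock-step rounds of one sorted access each, and all bounds are recomputed after every round instead of being maintained incrementally.

```python
from typing import Dict, List, Tuple

Posting = Tuple[str, float]  # (doc id, score); every list is sorted by descending score


def nra(lists: List[List[Posting]], k: int) -> Tuple[List[Posting], int]:
    """Return (top-k as (doc, worstscore) pairs, scan depth at which the scan stopped)."""
    m = len(lists)
    seen: Dict[str, Dict[int, float]] = {}  # doc -> {i in E(d): s(t_i, d)}
    high = [0.0] * m                        # last score read from each list
    topk: List[str] = []
    worst: Dict[str, float] = {}
    max_depth = max((len(lst) for lst in lists), default=0)
    for depth in range(max_depth):
        # One round of sorted accesses: read position `depth` of every list.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # an exhausted list cannot contribute any more
        worst = {d: sum(e.values()) for d, e in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in seen[d])
                for d in seen}
        ranked = sorted(worst, key=lambda d: (-worst[d], d))
        topk, rest = ranked[:k], ranked[k:]
        if len(topk) == k:
            min_k = worst[topk[-1]]
            best_candidate = max((best[d] for d in rest), default=0.0)
            best_unseen = sum(high)  # bound for documents not seen in any list yet
            if max(best_candidate, best_unseen) <= min_k:
                return [(d, worst[d]) for d in topk], depth + 1
    return [(d, worst[d]) for d in topk], max_depth


t1 = [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.7), ("d88", 0.2)]
t2 = [("d64", 0.8), ("d23", 0.6), ("d10", 0.6), ("d13", 0.2), ("d78", 0.1)]
t3 = [("d10", 0.7), ("d78", 0.5), ("d64", 0.4), ("d99", 0.2), ("d34", 0.1)]
result, depth = nra([t1, t2, t3], k=1)
```

On these lists the sketch returns d10 as the single top result and stops after the third round, in line with the “STOP!” marker at scan depth 3.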
Inverted Block-Index for Content & Structure
Query: //sec[about(.//, “XML”) and about(.//title, “native”)]//par[about(.//, “retrieval”)]
Index lists: sec[“xml”], title[“native”], par[“retrieval”] (each with sorted access, SA, and random access, RA)
• Mostly sorted (= sequential) access to large element blocks on disk
• Group elements in descending order of (maxscore, docid)
• Block-scan all elements per doc for a given (tag, term) key
• Stored as inverted files or database tables
• Two B+tree indexes over the full range of attributes (IOTs in Oracle)
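The block layout described above can be sketched as follows, assuming one in-memory list of entries per (tag, term) key. The tuple layout, the name `build_blocks`, and the sample entries are made up for illustration; an actual deployment would keep the blocks in inverted files or database tables.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[int, int, int, float]                    # (docid, pre, post, score)
Block = Tuple[int, float, List[Tuple[int, int, float]]]  # (docid, maxscore, elements)


def build_blocks(entries: Iterable[Entry]) -> List[Block]:
    """Group the entries of one (tag, term) key into per-document element blocks."""
    per_doc: Dict[int, List[Tuple[int, int, float]]] = defaultdict(list)
    for docid, pre, post, score in entries:
        per_doc[docid].append((pre, post, score))
    blocks = [(docid, max(s for _, _, s in elems), sorted(elems))
              for docid, elems in per_doc.items()]
    # Descending order of (maxscore, docid), as stated on the slide.
    blocks.sort(key=lambda blk: (blk[1], blk[0]), reverse=True)
    return blocks


# Hypothetical entries for the key sec["xml"].
entries = [(17, 4, 5, 0.5), (2, 3, 8, 0.9), (17, 9, 12, 1.0), (5, 2, 2, 0.9), (2, 6, 4, 0.2)]
blocks = build_blocks(entries)
```

A sorted access then reads one whole block, that is, all matching elements of one document, before moving to the document with the next-highest maxscore.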
Navigational Element Index
Query: //sec[about(.//title, “native”)]//par[about(.//, “retrieval”)]
Index lists: sec, title[“native”], par[“retrieval”] (SA and RA)
• Additional index for tag paths
• RAs on B+tree index using (docid, tag) as key
• Few & judiciously scheduled “expensive predicate” probes
• Schema-oblivious indexing & querying
• Non-schematic XML data (no DTD required)
• Supports full NEXI syntax & all 13 XPath axes (+level)
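A structural probe against such an index can be sketched with the pre-/postorder labels from the data model: an element is a descendant of another exactly when its preorder rank is larger and its postorder rank is smaller. The dictionary below only stands in for the B+tree keyed by (docid, tag), and the function names are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Label = Tuple[int, int]  # (pre, post)


def is_descendant(desc: Label, anc: Label) -> bool:
    """True if `desc` lies strictly below `anc` in the same document tree."""
    return anc[0] < desc[0] and desc[1] < anc[1]


def has_ancestor_with_tag(nav_index: Dict[Tuple[int, str], List[Label]],
                          docid: int, tag: str, elem: Label) -> bool:
    """One random access on (docid, tag), then a label comparison per candidate."""
    return any(is_descendant(elem, cand) for cand in nav_index.get((docid, tag), []))


# Labels of the example document from the data model slide, stored as document 1.
nav_index = {(1, "article"): [(1, 6)], (1, "sec"): [(4, 5)]}
par_label = (6, 4)
title_label = (2, 1)
```

Here `has_ancestor_with_tag(nav_index, 1, "sec", par_label)` holds, while the same probe for `title_label` (the article title) does not.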
TopX Query Processing Example
Query: //sec[about(.//, “XML”) and about(.//title, “native”)]//par[about(.//, “retrieval”)]
[Figure: animated walkthrough of top-2 processing over the index lists sec[“xml”], title[“native”], and par[“retrieval”]. The candidate queue holds a [worstscore, bestscore] interval for each candidate from doc1, doc2, doc3, doc5, and doc17, plus a pseudo-doc for unseen documents. The threshold min-2 takes the values 0.0, 0.5, 0.9, 1.0, and 1.6 over the course of the run, and the final top-2 results are shown.]
Index Access Scheduling [VLDB ’06]
[Figure: inverted block index with three lists read by SA (plus RA) and look-ahead score reductions, e.g., Δ1,3 = 0.8 and Δ3,3 = 0.2]
• SA Scheduling
  • Look-ahead Δi through precomputed score histograms
  • Knapsack-based optimization of Score Reduction
• RA Scheduling
  • 2-phase probing: Schedule RAs “late & last”, i.e., clean up the queue if …
  • Extended probabilistic cost model for integrating SA & RA scheduling
Probabilistic Candidate Pruning [VLDB ’04]
[Figure: score densities f1 over [0, high1] and f2 over [0, high2] for the lists title[“native”] and par[“retrieval”], their convolution over [0, 2], and the score gap δ(d); further labels: sampling, Indexing Time, Query Processing Time]
• Convolutions of score distributions (assuming independence) to estimate P[d gets in the final top-k]
• Probabilistic candidate pruning: Drop d from the candidate queue if P[d gets in the final top-k] < ε
• With probabilistic guarantees for precision & recall
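The pruning test can be sketched with discretized score distributions. Two assumptions go beyond the slide: δ(d) is taken to be the gap min-k − worstscore(d) that the still-unknown scores of d would have to exceed, and each unknown score is modeled as uniform on a grid over [0, highi], where a real system would use precomputed histograms. All function names are illustrative.

```python
from typing import List


def convolve(p: List[float], q: List[float]) -> List[float]:
    """Distribution of the sum of two independent grid-valued scores."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def uniform_pmf(high: float, step: float) -> List[float]:
    """Hypothetical score model: uniform over the grid 0, step, ..., high."""
    n = int(round(high / step)) + 1
    return [1.0 / n] * n


def prob_in_topk(pmfs: List[List[float]], delta: float, step: float) -> float:
    """P[sum of the still-unknown scores > delta]; index i stands for score i * step."""
    total = [1.0]  # point mass at score 0
    for pmf in pmfs:
        total = convolve(total, pmf)
    return sum(p for i, p in enumerate(total) if i * step > delta)


def should_prune(pmfs: List[List[float]], delta: float, step: float, epsilon: float) -> bool:
    """Drop the candidate if its chance of reaching the top-k falls below epsilon."""
    return prob_in_topk(pmfs, delta, step) < epsilon


# Two unevaluated lists with high1 = high2 = 1.0 on a coarse grid of step 0.5.
pmfs = [uniform_pmf(1.0, 0.5), uniform_pmf(1.0, 0.5)]
```

With this toy model, a candidate that still needs more than 1.0 has a one-in-three chance, and one that needs more than 1.9 has a one-in-nine chance, so it is dropped for ε = 0.2.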
Dynamic Query Expansion [SIGIR ’05]
Example: TREC Robust Topic #363, Top-k (transport, tunnel, ~disaster)
[Figure: inverted lists for transport and tunnel, plus an Incremental Merge over the lists of the expansion terms accident, disaster, and fire for ~disaster, all read by sorted access (SA)]
• Incremental merging of inverted lists for expansion terms ti,1...ti,m in descending order of s(tij, d)
• Best-match score aggregation
• Specialized expansion operators
• Incremental Merge operator
• Nested Top-k operator (efficient phrase matching)
• Boolean (but ranked) retrieval mode
• Supports any sorted inverted index for text, structured records & XML
Incremental Merge Operator
Expansion terms ~t = {t1, t2, t3}, obtained from thesaurus lookups / relevance feedback
Expansion similarities (from large corpus term correlations): sim(t, t1) = 1.0, sim(t, t2) = 0.9, sim(t, t3) = 0.5
Index lists (initial high-scores come from index list metadata, e.g., histograms):
• t1: d78 0.9, d23 0.8, d10 0.8, d1 0.4, d88 0.3, ...
• t2: d64 0.8, d23 0.8, d10 0.7, d12 0.2, d78 0.1, ...
• t3: d11 0.9, d78 0.9, d64 0.7, d99 0.7, d34 0.6, ...
Merged list for ~t, read by SA, with scores weighted by similarity (e.g., 0.8 · 0.9 = 0.72): d78 0.9, d23 0.8, d10 0.8, d64 0.72, d23 0.72, d10 0.63, d11 0.45, d78 0.45, d1 0.4, ..., d88 0.3, ...
Meta histograms seamlessly integrate Incremental Merge operators into probabilistic scheduling and candidate pruning
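The merge on this slide can be reproduced with a lazy k-way merge over similarity-weighted lists. This is only a sketch: `incremental_merge` and `best_match` are illustrative names, the t2 list is entered in descending score order, and Python's `heapq.merge` stands in for whatever merge structure the real operator uses.

```python
import heapq
from typing import Dict, Iterable, Iterator, List, Tuple

Posting = Tuple[str, float]  # (doc id, score); every list is sorted by descending score


def incremental_merge(lists: Dict[str, List[Posting]],
                      sim: Dict[str, float]) -> Iterator[Posting]:
    """Lazily merge the expansion lists in descending order of sim(t, t_j) * s(t_j, d)."""
    def weighted(term: str) -> Iterator[Posting]:
        for doc, score in lists[term]:
            yield doc, score * sim[term]

    streams = [weighted(term) for term in lists]
    # Weighted lists stay sorted, so a lazy k-way merge reads only what is needed.
    return heapq.merge(*streams, key=lambda posting: posting[1], reverse=True)


def best_match(stream: Iterable[Posting]) -> Iterator[Posting]:
    """Best-match aggregation: keep only the first (highest) score per document."""
    seen = set()
    for doc, score in stream:
        if doc not in seen:
            seen.add(doc)
            yield doc, score


lists = {
    "t1": [("d78", 0.9), ("d23", 0.8), ("d10", 0.8), ("d1", 0.4), ("d88", 0.3)],
    "t2": [("d64", 0.8), ("d23", 0.8), ("d10", 0.7), ("d12", 0.2), ("d78", 0.1)],
    "t3": [("d11", 0.9), ("d78", 0.9), ("d64", 0.7), ("d99", 0.7), ("d34", 0.6)],
}
sim = {"t1": 1.0, "t2": 0.9, "t3": 0.5}
merged = list(incremental_merge(lists, sim))
```

The merged stream starts with d78, d23, d10, d64, d23, d10, d11, d78, d1, matching the slide. Wrapping it in `best_match` drops the later, lower-scored duplicates of d23, d10, and d78.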
TREC Terabyte Benchmark ’05/’06
• Extensive crawl over the .gov domain (2004)
• 25 million documents, 426 GB text data
• 50 ad-hoc-style keyword queries, e.g.:
  • reintroduction of gray wolves
  • Massachusetts textile mills
• Primary cost metrics
  • Cost = #SA + (cR/cS) · #RA
  • Wall clock runtime
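The abstract cost metric is plain arithmetic: every random access is charged as cR/cS sorted accesses. A small sketch, where the function name and the sample numbers are made up for illustration and are not measurements from the benchmark:

```python
def access_cost(num_sa: int, num_ra: int, cr_over_cs: float) -> float:
    """Cost = #SA + (cR/cS) * #RA, expressed in units of one sorted access."""
    return num_sa + cr_over_cs * num_ra


# Hypothetical run: 1000 sorted accesses, 10 random accesses, cR/cS = 100.
example_cost = access_cost(1000, 10, 100.0)
```

With a ratio of 100, the ten random accesses in this made-up run weigh as much as all thousand sorted accesses, which is why random accesses are worth scheduling sparingly.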
TREC Terabyte: Cost comparison of scheduling strategies [VLDB ’06]
INEX Benchmark ’06/’07
• New XMLified Wikipedia corpus
• 660,000 documents w/ 130,000,000 elements, 6.6 GB XML data
• 125 NEXI queries, each as content-only (CO) and content-and-structure (CAS) formulation
  • CO: +“state machine” figure Mealy Moore
  • CAS: //article[about(., “state machine”)]//figure[about(., Mealy) or about(., Moore)]
• Primary cost metric
  • Cost = #SA + (cR/cS) · #RA
TopX vs. Full-Merge
• Significant cost savings for large ranges of k
• CAS cheaper than CO!
Static vs. Dynamic Expansions
• Query expansions with up to m = 292 keywords & phrases
• Balanced amount of sorted vs. random disk access
• Adaptive scheduling w.r.t. the cR/cS cost ratio
• Dynamic expansions outperform static expansions & full-merge in both efficiency & effectiveness
Efficiency vs. Effectiveness
• Very good precision/runtime ratio for probabilistic pruning
Official INEX ’06 Results
Retrieval effectiveness (rank 3-5 out of ~60 submitted runs)
Conclusions & Outlook
• Scalable XML-IR and vague search
• Mature system, reference engine for INEX topic development & interactive tracks
• Efficient and versatile Java prototype for text, XML, and structured data (Oracle backend)
• Very efficient prototype reimplementation for text data in C++ (over own file structures)
• C++ version for XML currently in production at MPI
• More features
  • Graph top-k, proximity search, XQuery subset, …