230 likes | 338 Views
TopX 2.0 — A (Very) Fast Object-Store for Top-k XPath Query Processing. Martin Theobald Stanford University. Mohammed AbuJarour Hasso-Plattner Institute. Ralf Schenkel Max-Planck Institute. article. article. title. title. “ Current Approaches to XML Data Manage-
E N D
TopX 2.0—A (Very) Fast Object-Store for Top-k XPath Query Processing Martin Theobald Stanford University Mohammed AbuJarour Hasso-Plattner Institute Ralf Schenkel Max-Planck Institute
article article title title “Current Approaches to XML Data Manage- ment” “The XML Files” bib sec sec sec sec bib title title “The Ontology Game” title “Native XML Data Bases.” item “The Dirty Little Secret” par par item title “Native XML data base systems can store schemaless data ... ” “XML queries with an expres- sive power similar to that of Datalog …” par “XML” par “Sophisticated technologies developed by smart people.” url “There, I've said it - the "O" word. If anyone is thinking along ontology lines, I would like to break some old news …” “w3c.org/xml” par inproc par title “XML-QL: A Query Language for XML.” “Proc. Query Languages Workshop, W3C,1998.” “What does XML add for retrieval? It adds formal ways …” “Data management systems control data acquisition, storage, and retrieval. Systems evolved from flat files …” //article[.//bib[about(.//item, “W3C”)] ]//sec[about(.//, “XML retrieval”)] //par[about(.//, “native XML databases”)] RANKING VAGUENESS EARLY PRUNING From the INEX ’03-’05 IEEE Collection
Frontends • Web Interface • Web Service • API 2.0 TopX 1.0 Query Processor Non-conjunctive Top-k XPath Query Processing Top-k Queue Candidate Queue SA SA SA Probabilistic Index Access Scheduling Random Access Scan Threads Sequential Access Probabilistic Candidate Pruning Expensive Predicates • Path Conditions • Phrases & Proximity • Other Full-Text Op’s RA Dynamic Query Expansion JDBC Index Metadata • Selectivities • Histograms • Correlations Ontology/ Large Thesaurus WordNet, OpenCyc, etc. Relational DBMS Backend Unified Text & XML Schema RA Indexer/Crawler
article 1 6 title abs sec 2 2 1 3 4 5 “xml data manage” “xml manage system vary wide expressive power“ title par 5 3 6 4 “native xml data base” “native xml data base system store schemaless data“ ftf (“xml”, article1 ) = 4 Data Model “xml data manage xmlmanage system vary wide expressive power native xml native xmldata base system store schemaless data“ “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data“ <article> <title>XML Data Management </title> <abs>XML management systems vary widely in their expressive power. </abs> <sec> <title>Native XML Data Bases. </title> <par>Native XML data base systems can store schemaless data. </par> </sec> </article> “native xml data base native xml data base system store schemaless data“ “native xml data base native xml data base system store schemaless data“ ftf (“xml”, sec4 ) = 2 • XML trees (no XLink/ID/IDRef) • Pre-/postorder ranges for the structural index • Redundant full-content text nodes
Scoring Model [INEX ‘05/’06/’07] Content Index (Tag-Term Pairs) Element Freq. Element Statistics • XML-specific variant of Okapi BM25 (originating from probabilistic IR on unstructured text) bib[“transactions”] vs. par[“transactions”]
TopX 1.0: Relational Schema • Precompute & materialize scoring model into combined inverted index over tag-term pairs • Supports sortedaccess (by MaxScore) and random access (by DocID) sec[“xml”] Two B+trees Select DocID, Pre, Post, Score From TagTermIndex Where tag=‘sec’ and term=‘xml’ Order by MaxScoredesc, DocIDdesc Pre asc, Post Desc SA Select Pre, Post, Score From TagTermIndex Where DocID=3 and tag=‘sec’ and term=‘xml’ Order by Pre Asc, Post Desc RA
Top-k XPath on a Relational Schema [VLDB ’05] • Content-only (CO) & “structure enriched” queries: //sec[about(.//, “XML”) and about(.//title, “native”]//par[about(.//, “retrieval”)] sec[“xml”] title[“native”] par[“retrieval”] • Sequentially (mostly) scan each index list in desc. order of MaxScore • Hash-join element blocks by DocIDin-memory • Do “some” incremental XPath evaluation using Pre/Post indices • Aggregate Score along connected path fragments • Use variant of Fagin’s threshold algorithm for top-k-style early termination
Top-k XPath on a Relational Schema [VLDB ’05] • Content-and-structure (CAS) queries: //article//sec[about(.//, “XML”)] sec[“xml”] article 1.0 SA RA • Expensive predicate probes (RA)to the structure index (3rd B+tree) • Non-conjunctive XPathevaluations • Dynamically relax content- & structure-related query conditions • (top-k results entirely driven by score aggregations for content & structure cond.’s) Select Pre, Post From TagIndex Where DocID=2123 and Tag=‘article’ Order by Pre asc, Post desc
Relational Schema (cont’d) • No shredding into DTD-specific relational schema! • No DTD at all for INEX Wikipedia! sec[“xml”] article 1,107 distinct tags 20,810,942 distinct tag-term pairs for 4.38 GB Wikipedia collection
Relational Schema (cont’d) Content Index Structure Index (4+4+4+4+4+4+4) bytes X 567,262,445 tag-term pairs (4+4+4+4) bytes X 52,561,559tags 16 GB 0.85 GB • 2-dimensional source of redundancy • Full-content scoring model (#terms times avg. depth of a text node 6.7 for INEX Wiki) • De-normalized relational schema • High overhead in the architecture (Java->JDBC->DBMS & back) • Element-block sizes are data-driven, not easy to control layout on disk • Hashing too slow compared to very efficient in-memory merge-joins
TopX 2.0: Object-Oriented Storage sec[“xml”] Binary file 0 2 DocID MaxSore DocID B 1 MaxSore title[“xml”] 122,564 L 17 (4+4+4+4+4+4+4) X 567,262,445 Relational: 16 GB 3 B par[“xml”] 4 X 456,466,649 432,534 L + (4+4+4) X 567,262,445 … Object-oriented: 8.6 GB B– Element block separator L– Index list separator (+ (4+4) X 20,810,942 = 166 MB for the offset index)
Object-Oriented Storage w/Block-Merging sec[“xml”] 0 1 • Group element blocks with similar MaxScore into document blocks of fixed length (e.g. 256KB) • Sort element blocks within each document block by DocID • Supports • Sorted access by MaxScore • Merge-joins by DocID • Raw disk access B 2 MaxSore Document Block B 5 title[“xml”] B B B… B… B B 122,564 3 B 6 MaxSore … L …
//sec[about(.//, “XML”)] //par[about(.//, “retrieval”)] Merging Document Blocks sec[“xml”] par[“retrieval”] 0.8 1.0 1 2 Sequential access and efficient merge-joins on top of large document blocks SA B B 2 5 B B 7 5 B B B B B… B… B… B… B B B B 0.7 0.8 3 6 B B 6 9 … …
Compressed Number Encoding • Multi-attribute (4), double-nested block-index structure • Delta encoding only works for DocID(and to some extent for Pre) • No specific assumptions on distributions of Pre/Post or Score • No Unary or Huffman coding (prefix-free but additional coding table) • Sophisticated compression schemes may be expensive to decode • No Zip, etc. • But known number ranges • DocID [1, 659,388] -> 3 bytes (2543 = 16,387,064, lossless) • Pre/Post [1, 43,114] -> 2 bytes (2542 = 64,516, lossless) • Score [0,1] -> rounded to 1 byte (254 buckets, lossy) • Variable-length byte encoding w/leading length-indicator byte 5 bytes 10 bytes
Some more tricks… • Dump leading histogram blocks into index list headers • Histograms only for index lists that exceed one document block (<5% of all lists) • Own native compare methods for DocID, Pre/Post • Decode only Score for arithmetic op’s ( Mostly perform pointer operations at qp time) • Incrementally read & process precomputed memory image for fast top-k queries on top of large disk blocks 36 bytes sec[“xml”] DB1(256 KB) DB2(256 KB) DBl(256 KB) Histogram Block freq EB 1 EB 2 … EB k … … 1 0 score
1.0 1.0 1.0 0.9 0.9 0.9 Δ3,3 = 0.2 0.7 0.9 0.8 0.2 0.6 0.8 … … … Block Access Scheduling [VLDB ’06] • SA Scheduling • Look-ahead Δi through precomputed score histograms • Knapsack-based optimization of Score Reduction • RA Scheduling • 2-phase probing: Schedule RAs “late & last” i.e., cleanup the queue if • Extended probabilistic cost model for integrated SA & RA scheduling Inverted Block-Index (256KB doc-blocks) SA SA SA Δ1,3 = 0.8 RA
Object Storage Summary • 567,262,445 tag-term pairs • 20,810,942 distinct tag-term pairs • 20,815,884 document blocks (256KB) • 456,466,649 element blocks • 4,703,385,686 total bytes (8.3 bytes/tag-term pair) Content Index (incl. histograms) Structure Index • 52,561,559 tags (elements) • 1,107 distinct tags • 2,323 document blocks (256KB) • 8,999,193 element blocks • 246,601,752 total bytes • (4.7 bytes/tag) 4.38 GB Wikipedia XML sources
Preliminary Runtime Experiments CO (top-10, non-conjunctive)
Preliminary Runtime Experiments CAS (top-10, non-conjunctive)
Some INEX Results CAS (top-1,500, non-conjunctive)
Some INEX Results CAS (top-1,500, non-conjunctive)
Conclusions & Outlook • Scalable and efficient XML-IR with vague search • Mature system, reference engine for INEX topic development & interactive tracks [VLDB Special Issue on DB&IR Integration ‘08] • Brand-new TopX 2.0 prototype • Very efficient reimplementation in C++ • Object-oriented XML storage, moderate compression rates • 10—20 times better sequential throughput than relational • More features • Generalized proximity search, graph top-k • Updates (gaps within document blocks) • XQuery Full-Text (top-k-style bounds over IF, For-Let) • …