
TopX 2.0 — A (Very) Fast Object-Store for Top-k XPath Query Processing



Presentation Transcript


  1. TopX 2.0—A (Very) Fast Object-Store for Top-k XPath Query Processing Martin Theobald Max-Planck Institute Mohammed AbuJarour Hasso-Plattner Institute Ralf Schenkel Max-Planck Institute

  2. Example data and query (from the INEX ’03-’05 IEEE Collection)
     • [Figure: two example article trees with elements such as article, title, sec, par, bib, item, inproc, and url, and text nodes such as “Current Approaches to XML Data Management”, “The XML Files”, “The Ontology Game”, “Native XML Data Bases.”, “The Dirty Little Secret”, and “XML-QL: A Query Language for XML.”]
     • Example query: //article[about(.//bib//item, “W3C”)] //sec[about(.//, “XML retrieval”)] //par[about(.//, “native XML databases”)]
     • Labels on the figure: RANKING, VAGUENESS, EARLY PRUNING

  3. Architecture
     • Frontends: Web Interface, Web Service, API
     • TopX 1.0 / 2.0 Query Processor: non-conjunctive top-k XPath query processing with a top-k queue, a candidate queue, and scan threads performing sequential accesses (SA)
     • Probabilistic index access scheduling: sequential access and random access (RA)
     • Probabilistic candidate pruning
     • Expensive predicates: path conditions, phrases & proximity, other full-text operators
     • Dynamic query expansion with an ontology / large thesaurus (WordNet, OpenCyc, etc.)
     • Index metadata: selectivities, histograms, correlations
     • Relational DBMS backend (accessed via JDBC) with a unified text & XML schema
     • Indexer/Crawler

  4. Data Model
     • XML trees (no XLink/ID/IDRef)
     • Pre-/postorder ranges for the structural index
     • Redundant full-content text nodes
     • Example document: <article> <title>XML Data Management</title> <abs>XML management systems vary widely in their expressive power.</abs> <sec> <title>Native XML Data Bases.</title> <par>Native XML data base systems can store schemaless data.</par> </sec> </article>
     • Pre/post labels: article (1, 6), title (2, 1), abs (3, 2), sec (4, 5), sec/title (5, 3), sec/par (6, 4)
     • Full content of title: “xml data manage”; of abs: “xml manage system vary wide expressive power”; of sec/title: “native xml data base”; of sec/par: “native xml data base system store schemaless data”
     • Full content of sec: “native xml data base native xml data base system store schemaless data”
     • Full content of article: “xml data manage xml manage system vary wide expressive power native xml data base native xml data base system store schemaless data”
     • ftf(“xml”, article1) = 4 and ftf(“xml”, sec4) = 2
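
The labels and term frequencies on this slide can be reproduced with a few lines of Python. This is only a sketch of the idea, not the TopX indexer: the helper names `label_tree` and `ftf` are made up for illustration, and the tokenizer skips stemming and stopword removal (which does not change the counts for “xml”).

```python
import re
import xml.etree.ElementTree as ET

DOC = """<article>
  <title>XML Data Management</title>
  <abs>XML management systems vary widely in their expressive power.</abs>
  <sec>
    <title>Native XML Data Bases.</title>
    <par>Native XML data base systems can store schemaless data.</par>
  </sec>
</article>"""

def label_tree(root):
    """Return {element: (pre, post)} with 1-based pre-/postorder ranks."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1            # rank assigned when the element is entered
        pre = counters["pre"]
        for child in elem:
            visit(child)
        counters["post"] += 1           # rank assigned when the element is left
        labels[elem] = (pre, counters["post"])

    visit(root)
    return labels

def ftf(term, elem):
    """Full-content term frequency: occurrences of term in all text below elem."""
    text = "".join(elem.itertext()).lower()
    return re.findall(r"[a-z0-9]+", text).count(term)

root = ET.fromstring(DOC)
sec = root.find("sec")
labels = label_tree(root)
print(labels[root], labels[sec])            # (1, 6) (4, 5)
print(ftf("xml", root), ftf("xml", sec))    # 4 2
```

Because every element is scored on its full content, the text of a deeply nested paragraph is counted once for each of its ancestors, which is the redundancy the slide refers to.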

  5. Scoring Model [INEX ’05/’06/’07/’08]
     • XML-specific variant of Okapi BM25 (originating from probabilistic IR on unstructured text)
     • Formula components: content index (tag-term pairs), element frequency, element statistics
     • Example: author[“gates”] vs. section[“gates”]
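
The transcript does not preserve the scoring formula itself, so the following is only the textbook Okapi BM25 shape with the statistics kept per tag, which is one plausible reading of "XML-specific variant". The function name, the parameter names, the defaults `k1=1.2` and `b=0.75`, and all numbers in the example are assumptions for illustration, not values from the slides.

```python
import math

def bm25_element_score(ftf, elem_len, avg_len, n_elems, elem_freq,
                       k1=1.2, b=0.75):
    """Textbook BM25 for one (tag, term, element) combination.

    ftf       -- full-content term frequency of the term in the element
    elem_len  -- length of the element's full content (in terms)
    avg_len   -- average length over all elements with the same tag
    n_elems   -- number of elements with this tag (assumed statistic)
    elem_freq -- number of those elements that contain the term
    """
    length_norm = k1 * ((1.0 - b) + b * elem_len / avg_len)
    tf_part = (k1 + 1.0) * ftf / (length_norm + ftf)
    idf_part = math.log((n_elems - elem_freq + 0.5) / (elem_freq + 0.5))
    return tf_part * idf_part

# Hypothetical statistics: "gates" occurs in few author elements but in many sections.
author_gates = bm25_element_score(ftf=1, elem_len=2, avg_len=2,
                                  n_elems=1000, elem_freq=10)
section_gates = bm25_element_score(ftf=1, elem_len=2, avg_len=2,
                                   n_elems=1000, elem_freq=400)
print(author_gates > section_gates)
```

With tag-specific statistics the same term weighs more under a tag where it is rare, which is the contrast the author[“gates”] vs. section[“gates”] example points at.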

  6. TopX 1.0: Relational Schema
     • Precompute & materialize the scoring model into a combined inverted index over tag-term pairs, e.g., sec[“xml”]
     • Supports sorted access (by descending MaxScore) and random access (by DocID)
     • Typically two B+trees in a DBMS
     • Sorted access (SA): SELECT DocID, Pre, Post, Score FROM TagTermIndex WHERE tag = 'sec' AND term = 'xml' ORDER BY MaxScore DESC, DocID DESC, Pre ASC, Post DESC
     • Random access (RA): SELECT Pre, Post, Score FROM TagTermIndex WHERE DocID = 3 AND tag = 'sec' AND term = 'xml' ORDER BY Pre ASC, Post DESC
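
The two access paths can be tried out against an in-memory SQLite table. This is a sketch under assumptions: the slide only shows the queries, so the column types, the sample rows, and the reading of MaxScore as the best score of a tag-term pair within one document are illustrative guesses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE TagTermIndex (
    DocID INTEGER, Tag TEXT, Term TEXT,
    Pre INTEGER, Post INTEGER, Score REAL, MaxScore REAL)""")

# Hypothetical rows; MaxScore repeats the best Score of the pair in that document.
rows = [
    (3, "sec", "xml", 4, 5, 0.9, 0.9),
    (3, "sec", "xml", 9, 8, 0.4, 0.9),
    (7, "sec", "xml", 2, 3, 0.6, 0.6),
    (7, "title", "xml", 3, 1, 0.8, 0.8),
]
conn.executemany("INSERT INTO TagTermIndex VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Sorted access: scan the sec["xml"] list in descending MaxScore order.
sorted_access = conn.execute(
    """SELECT DocID, Pre, Post, Score FROM TagTermIndex
       WHERE Tag = 'sec' AND Term = 'xml'
       ORDER BY MaxScore DESC, DocID DESC, Pre ASC, Post DESC""").fetchall()

# Random access: look up all sec["xml"] elements of one document.
random_access = conn.execute(
    """SELECT Pre, Post, Score FROM TagTermIndex
       WHERE DocID = ? AND Tag = 'sec' AND Term = 'xml'
       ORDER BY Pre ASC, Post DESC""", (3,)).fetchall()

print(sorted_access)
print(random_access)
```

Ordering by MaxScore first keeps all elements of a document together as one block, so a scan delivers whole documents in descending order of their best element.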

  7. Top-k XPath over a Relational Schema [TopX, VLDB ’05 & VLDB-J(1) ’08]
     • Content-only (CO) & “structure enriched” queries: //sec[about(.//, “XML”) and about(.//title, “native”)]//par[about(.//, “retrieval”)]
     • Resulting index lists: sec[“xml”], title[“native”], par[“retrieval”]
     • Sequentially scan each index list in descending order of MaxScore
     • Hash-join element blocks by DocID in-memory
     • Do “some” incremental XPath evaluation using Pre/Post indices
     • Aggregate Score along connected path fragments
     • Use a variant of Fagin’s threshold algorithm for top-k-style early termination
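
The early-termination idea in the last bullet can be illustrated with a small threshold-style loop over score-sorted lists. This is a deliberately simplified sketch and not the TopX algorithm: it aggregates one score per document instead of per element path, uses sorted accesses only, and the function name `top_k_nra` and the sample lists are made up.

```python
def top_k_nra(lists, k):
    """Threshold-style top-k over score-sorted lists, sorted accesses only.

    lists -- one list per query condition, each [(doc_id, score), ...]
             in descending score order.
    Returns [(doc_id, worstscore), ...]; worstscore is a lower bound that is
    exact once the document has been seen in every list.
    """
    seen = {}                                   # doc_id -> {list index: score}
    depth = 0
    top = []
    max_len = max(len(lst) for lst in lists)
    while depth < max_len:
        for i, lst in enumerate(lists):         # one round of sorted accesses
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
        depth += 1
        # high[i]: upper bound for any entry of list i that is still unread
        high = [lst[depth][1] if depth < len(lst) else 0.0 for lst in lists]
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(h for i, h in enumerate(high) if i not in s)
                for d, s in seen.items()}
        ranked = sorted(worst, key=lambda d: (-worst[d], d))
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            min_k = worst[top[-1]]
            # Stop when neither an unseen document nor a seen candidate
            # outside the current top-k can still beat the k-th result.
            if min_k >= sum(high) and all(best[d] <= min_k for d in rest):
                break
    return [(d, worst[d]) for d in top]

sec_xml = [(1, 0.9), (2, 0.8), (3, 0.3), (4, 0.1)]
par_retrieval = [(2, 0.7), (1, 0.2), (5, 0.2), (3, 0.1)]
print(top_k_nra([sec_xml, par_retrieval], k=1))
```

In this example the loop stops after reading two of the four entries in each list, because document 2 already has a score that no other candidate can reach. The evaluation is non-conjunctive: a document missing from one list simply contributes nothing for that condition.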

  8. Top-k XPath over a Relational Schema [TopX, VLDB ’05 & VLDB-J(1) ’08]
     • Content-and-structure (CAS) queries: //article//sec[about(.//, “XML”)], with sec[“xml”] read via SA and article (1.0) probed via RA
     • Expensive predicate probes (RA) to the structure index (3rd B+tree)
     • Non-conjunctive XPath evaluations
     • Dynamically relax content- & structure-related query conditions (top-k results entirely driven by score aggregations for content & structure conditions)
     • Structure probe: SELECT Pre, Post FROM TagIndex WHERE DocID = 2123 AND Tag = 'article' ORDER BY Pre ASC, Post DESC
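
Once the (Pre, Post) pairs of the article elements have been fetched by such a probe, the descendant test is a pair of integer comparisons. The sketch below uses the labels of the small example document from the Data Model slide; the function names and the constant bonus of 1.0 for a satisfied structural condition are assumptions for illustration.

```python
def is_descendant(elem, anc):
    """elem, anc: (pre, post) labels of two elements of the same document.

    elem lies below anc exactly if it starts later in preorder
    and finishes earlier in postorder.
    """
    return anc[0] < elem[0] and elem[1] < anc[1]

def structure_score(sec, article_labels, bonus=1.0):
    """Non-conjunctive check of //article//sec: a matching ancestor adds a
    bonus, a missing one lowers the score but does not discard the element."""
    return bonus if any(is_descendant(sec, a) for a in article_labels) else 0.0

article, abstract, sec, par = (1, 6), (3, 2), (4, 5), (6, 4)
print(is_descendant(sec, article))       # True
print(is_descendant(abstract, sec))      # False
print(structure_score(sec, [article]))   # 1.0
```

Treating the structural condition as a score contribution rather than a filter is what lets the content and structure conditions be relaxed dynamically.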

  9. Relational Schema (ct’d)
     • No shredding into DTD-specific relational schema!
     • No DTD at all for INEX Wikipedia!
     • 1,107 distinct tags (e.g., article) and 20,810,942 distinct tag-term pairs (e.g., sec[“xml”]) for the 4.38 GB Wikipedia collection

  10. TopX 1.0: Top-k XPath over a Relational Schema
     • Content index: (4+4+4+4+4+4+4) bytes × 567,262,445 tag-term pairs ≈ 16 GB
     • Structure index: (4+4+4+4) bytes × 52,561,559 tags ≈ 0.85 GB
     • 2-dimensional source of redundancy: a full-content scoring model (redundancy factor ≈ avg. depth of a text node ≈ 6.7 for INEX-Wiki) and a de-normalized relational schema with many redundant attributes
     • High overhead in the architecture (Java -> JDBC -> DBMS & back)
     • Element-block sizes are data-driven; the layout on disk is not easy to control
     • Hashing is too slow compared to very efficient in-memory merge-joins

  11. TopX 2.0: Object-Oriented Storage
     • [Figure: one binary file holding the index lists sec[“xml”], title[“xml”], par[“xml”], … back to back; each list is a sequence of element blocks labeled with DocID and MaxScore]
     • B: element block separator; L: index list separator
     • Relational: (4+4+4+4+4+4+4) bytes × 567,262,445 ≈ 16 GB
     • Object-oriented: 4 bytes × 456,466,649 + (4+4+4) bytes × 567,262,445 ≈ 8.6 GB (still uncompressed)
     • Plus (4+4) bytes × 20,810,942 = 166 MB for the offset index
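
The size estimates on this slide follow directly from the counts it quotes. The short calculation below only redoes that arithmetic; the variable names are made up, and sizes are reported in decimal GB and MB, which is an assumption about the units used on the slide.

```python
TAG_TERM_PAIRS = 567_262_445     # entries in the content index
ELEMENT_BLOCKS = 456_466_649     # per-document groups of entries
DISTINCT_PAIRS = 20_810_942      # number of index lists

# Relational layout: seven 4-byte attributes per tag-term pair.
relational = (4 + 4 + 4 + 4 + 4 + 4 + 4) * TAG_TERM_PAIRS

# Object-oriented layout: 4 bytes per element block plus three 4-byte
# attributes per tag-term pair.
object_store = 4 * ELEMENT_BLOCKS + (4 + 4 + 4) * TAG_TERM_PAIRS

# Offset index: two 4-byte values per index list.
offset_index = (4 + 4) * DISTINCT_PAIRS

print(f"relational:      {relational / 1e9:.1f} GB")
print(f"object-oriented: {object_store / 1e9:.1f} GB")
print(f"offset index:    {offset_index / 1e6:.0f} MB")
print(f"ratio:           {relational / object_store:.2f}")
```

Most of the saving comes from storing the list key and the block-level values once per list or block instead of repeating them in every row.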

  12. Object-Oriented Storage w/ Block-Merging
     • Group element blocks with similar MaxScore into document blocks of bounded length (e.g., < 256 KB)
     • Sort element blocks within each document block by DocID
     • Supports sorted access by MaxScore, merge-joins by DocID, and raw disk access
     • [Figure: index lists sec[“xml”], title[“xml”], … in which consecutive element blocks (separator B) are grouped into document blocks of < 256 KB that are read by sequential access (SA); lists end with separator L]
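
One simple way to realize the grouping described above is a greedy pass over the element blocks in descending MaxScore order. The slide does not say how TopX 2.0 chooses block boundaries, so the function below is a hypothetical sketch; the byte budget is scaled down from 256 KB to 100 to keep the example readable.

```python
def build_document_blocks(element_blocks, max_bytes):
    """Greedy grouping of element blocks into bounded document blocks.

    element_blocks -- [(doc_id, max_score, size_bytes), ...]
    Returns a list of document blocks; each one is sorted by doc_id and
    stays within max_bytes unless a single element block is larger.
    """
    by_score = sorted(element_blocks, key=lambda eb: -eb[1])
    doc_blocks, current, used = [], [], 0
    for eb in by_score:
        if current and used + eb[2] > max_bytes:
            doc_blocks.append(sorted(current))   # tuples sort by doc_id first
            current, used = [], 0
        current.append(eb)
        used += eb[2]
    if current:
        doc_blocks.append(sorted(current))
    return doc_blocks

element_blocks = [(5, 0.9, 40), (2, 1.0, 40), (7, 0.3, 50),
                  (1, 0.8, 30), (6, 0.6, 50), (3, 0.6, 20)]
for block in build_document_blocks(element_blocks, max_bytes=100):
    print([doc_id for doc_id, _, _ in block], max(s for _, s, _ in block))
```

Across document blocks the best MaxScore is non-increasing, which preserves sorted access at block granularity, while the DocID order inside each block is what makes merge-joins possible.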

  13. Merging Document Blocks Incrementally
     • Query: //sec[about(.//, “XML”)]//par[about(.//, “retrieval”)]
     • Sequential access and efficient merge-joins on top of large document blocks
     • [Figure: document blocks of the index lists sec[“xml”] and par[“retrieval”], each read by sequential access (SA) and labeled with its Max(MaxScore); the values shown range from 1.0 down to 0.6]
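
Because each document block is sorted by DocID, two blocks can be joined in a single linear pass. The sketch below is a generic sorted-merge outer join, not code from TopX; it assumes one entry per DocID in each block, and the sample blocks are invented.

```python
def merge_join(block_a, block_b):
    """Outer merge-join of two document blocks sorted by ascending doc_id.

    block_a, block_b -- [(doc_id, score), ...], one entry per doc_id.
    Returns [(doc_id, score_a, score_b), ...] with None for a missing side,
    so documents matching only one condition are kept (non-conjunctive).
    """
    out, i, j = [], 0, 0
    while i < len(block_a) and j < len(block_b):
        doc_a, doc_b = block_a[i][0], block_b[j][0]
        if doc_a == doc_b:
            out.append((doc_a, block_a[i][1], block_b[j][1]))
            i += 1
            j += 1
        elif doc_a < doc_b:
            out.append((doc_a, block_a[i][1], None))
            i += 1
        else:
            out.append((doc_b, None, block_b[j][1]))
            j += 1
    out.extend((d, s, None) for d, s in block_a[i:])
    out.extend((d, None, s) for d, s in block_b[j:])
    return out

sec_block = [(1, 1.0), (2, 0.9)]
par_block = [(2, 1.0), (5, 0.9)]
print(merge_join(sec_block, par_block))
```

Each input entry is touched once and no hash table is built, which is the advantage over the hash-joins of the relational version.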

  14. Compressed Number Encoding
     • Multi-attribute (=4) double-nested block-index structure
     • Delta encoding only works for DocID (and to some extent for Pre)
     • No specific assumptions on distributions of Pre/Post or Score
     • No unary or Huffman coding (prefix-free, but needs an additional coding table)
     • Sophisticated compression schemes may be expensive to decode
     • No Zip, etc.; not even PFor-Delta (needs a second pass for each attribute type)
     • But the number ranges are known: DocID [1, 659,388] -> 3 bytes (254³ = 16,387,064, lossless); Pre/Post [1, 43,114] -> 2 bytes (254² = 64,516, lossless); Score [0, 1] -> rounded to 1 byte (256 buckets, lossy)
     • Variable-length byte encoding w/ leading length-indicator byte -> 4+1 = 5 bytes -> 9+1 = 10 bytes
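
The fixed byte widths derived from the known number ranges can be sketched as follows. This is not the TopX 2.0 file format: it uses plain base-256 bytes, leaves out the separators and the length-indicator byte, and all function names are hypothetical. It only shows delta-encoded DocIDs in 3 bytes, Pre and Post in 2 bytes each, and a score rounded into 1 byte.

```python
ENTRY_BYTES = 3 + 2 + 2 + 1   # DocID delta, Pre, Post, quantized Score

def encode_entry(doc_delta, pre, post, score):
    """Pack one entry into 8 bytes; the score is rounded to 256 buckets (lossy)."""
    bucket = min(255, max(0, round(score * 255)))
    return (doc_delta.to_bytes(3, "big") + pre.to_bytes(2, "big")
            + post.to_bytes(2, "big") + bytes([bucket]))

def decode_entry(buf):
    """Inverse of encode_entry; returns (doc_delta, pre, post, approx_score)."""
    return (int.from_bytes(buf[0:3], "big"), int.from_bytes(buf[3:5], "big"),
            int.from_bytes(buf[5:7], "big"), buf[7] / 255)

def encode_list(entries):
    """entries: [(doc_id, pre, post, score), ...] sorted by ascending doc_id."""
    out, prev = bytearray(), 0
    for doc_id, pre, post, score in entries:
        out += encode_entry(doc_id - prev, pre, post, score)   # delta on DocID only
        prev = doc_id
    return bytes(out)

def decode_list(data):
    entries, prev = [], 0
    for off in range(0, len(data), ENTRY_BYTES):
        delta, pre, post, score = decode_entry(data[off:off + ENTRY_BYTES])
        prev += delta
        entries.append((prev, pre, post, score))
    return entries

entries = [(17, 4, 5, 0.9), (17, 9, 8, 0.4), (659_388, 43_114, 43_114, 1.0)]
data = encode_list(entries)
print(len(data), decode_list(data))
```

Decoding needs no coding table and no second pass: every field sits at a fixed offset, which matches the slide's preference for cheap decoding over maximal compression.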

  15. Some more tricks…
     • Dump leading histogram blocks into index list headers
     • Histograms only for index lists that exceed one document block (<5% of all lists)
     • Supports probabilistic pruning and cost-based index access scheduling [IO-Top-K, VLDB ’06]
     • Incrementally read & process precomputed memory images for fast top-k queries on top of large disk blocks
     • [Figure: index list sec[“xml”] starting with a 36-byte histogram block (frequency over score, from 1 down to 0), followed by document blocks DB1, DB2, …, DBl of 256 KB each, each consisting of element blocks EB 1, EB 2, …, EB k]

  16. Block Access Scheduling [IO-Top-K, VLDB ’06]
     • SA scheduling: look-ahead Δi through precomputed score histograms; knapsack-based optimization of score reduction
     • RA scheduling: 2-phase probing, i.e., schedule RAs “late & last” and clean up the queue if …
     • Extended probabilistic cost model for integrated SA & RA scheduling
     • [Figure: inverted block-index (256 KB doc-blocks) with three index lists under sequential access (SA) and one random access (RA); score bounds per row of blocks: 1.0 1.0 1.0 / 0.9 0.9 0.9 / 0.7 0.9 0.8 / 0.2 0.6 0.8; Δ1,3 = 0.8, Δ3,3 = 0.2]

  17. Object Storage Summary (from 4.38 GB Wikipedia XML sources)
     • Content index (incl. histograms): 567,262,445 tag-term pairs; 20,810,942 distinct tag-term pairs; 20,815,884 document blocks (<256 KB); 456,466,649 element blocks; 3,729,714,594 total bytes (3.47 GB), i.e., 6.57 bytes per tag-term pair on avg.
     • Structure index: 52,561,559 tags (elements); 1,107 distinct tags; 2,323 document blocks (<256 KB); 8,999,193 element blocks; 205,021,938 total bytes (195 MB), i.e., 3.9 bytes per tag on avg.

  18. Efficiency Track Results – Focused, All All experiments: AMD Opteron quad-core 2.6 GHz, 16 GB RAM, RAID 5, Windows Server 2003 566/568 efficiency topics (CO & CAS)

  19. Efficiency Track Results – Focused, Type (A) 538/540 type (A) efficiency topics (CO & CAS)

  20. Efficiency Track Results – Focused, Type (B) 21/21 type (B) efficiency topics (CO & CAS)

  21. Efficiency Track Results – Focused, Type (C) 7/7 type (C) efficiency topics (CO & CAS)

  22. Efficiency Track Results – Thorough, All Note: top-15 only! 566/568 efficiency topics (CO & CAS)

  23. Conclusions & Outlook
     • Scalable and efficient XML-IR with vague search
     • TopX 1.0 is our mature system and the default engine for INEX topic development & interactive tracks [VLDB-J Special Issue on DB&IR Integration ’08]
     • Brand-new TopX 2.0 prototype
     • Efficient reimplementation in C++, object-oriented XML storage, moderate compression rates
     • 20 to 30 times better sequential throughput than the relational version
     • Can do CAS in 0.05 sec avg. & CO in 0.02 sec avg. (classic ad-hoc topics), and CAS in 0.09 sec avg. & CO in 0.05 sec avg. (incl. difficult topics)
     • More features: generalized proximity search, graph top-k; updates (gaps within document blocks); XQuery Full-Text (top-k-style bounds over IF, For-Let); …

  24. http://www.inex.otago.ac.nz/efficiency/efficiency.asp
