Managing XML and Semistructured Data

Managing XML and Semistructured Data. Part 4: Compressing XML Data. In this section: XML compression; motivation; the state of the art; queriable compressors; non-queriable compressors. Resources: XMILL: An Efficient Compressor for XML Data, by Liefke and Suciu, in SIGMOD 2000.

hedya

Presentation Transcript


  1. Managing XML and Semistructured Data Part 4: Compressing XML Data

  2. In this section • XML Compression • Motivation • The State-of-the-Art • Queriable compressors • Non-queriable compressors • Resources • XMILL: An Efficient Compressor for XML Data by Liefke and Suciu, in SIGMOD 2000 • Others: XGrind, XPress, XQuec, XMLzip, … • XCQ: From my publications • XQZip: From my publications • MQX: From my publications

  3. Introduction • More and more XML data is created • Duplicate structures (tags, paths …) • Data inflation: data in XML is much larger than raw data • Compression: storage and data transfer • General-purpose compressor (e.g. gzip) • Characteristics of XML data not utilized • Unqueriable

  4. Compression: The Problem • XML for exchange (space or time) • But XML is verbose and inflated due to • Duplicated tags and paths • Users prefer application-specific formats: • E.g. Web server logs • Is XML doomed to fail? • Solution: XML-specific compressor • Non-queriable: XMill • Queriable: XQzip

  5. XML-Specific Compressors • Unqueriable Compression (e.g. XMill): • Full-chunked: data commonalities eliminated • Very good compression ratio • Queriable Compression (e.g. XGrind, XPRESS): • Fine-grained: data commonalities ignored • Inadequate compression ratio and time • Support simple path queries with atomic predicate

  6. Issues in XML Compression • Compression ratios, Compression time, Query Coverage, Memory Usage…(see my survey paper in WWWJ) Comparison of existing technologies

  7. An Example: Web Server Logs • ASCII file, 15.9 MB (gzipped 1.6 MB): 202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I) • XML-ized Apache web log inflates to 24.2 MB (gzipped 2.1 MB): <apache:entry> <apache:host> 202.239.238.16 </apache:host> <apache:requestLine> GET / HTTP/1.0 </apache:requestLine> <apache:contentType> text/html </apache:contentType> <apache:statusCode> 200</apache:statusCode> <apache:date> 1997/10/01-00:00:02</apache:date> <apache:byteCount> 4478</apache:byteCount> <apache:referer> http://www.net.jp/ </apache:referer> <apache:userAgent> Mozilla/3.1[ja](I)</apache:userAgent> </apache:entry>

  8. XMill • First specialized compressor for XML data • SAX parser for parsing XML data • Still uses gzip as its underlying compressor • Clever grouping of data into containers for compression • Compresses XML via three basic techniques: • Compress the structure separately from the data • Group the data values according to their types • Apply semantic (specialized) compressors • Downloadable: www.cs.washington.edu/homes/suciu/XMILL

  9. XMill Architecture:

  10. How XMill Works: Three Ideas • Idea 1: compress the structure separately from the data • gzip(Structure) + gzip(Data) = 1.75 MB • Structure: <apache:entry> <apache:host> </apache:host> . . . </apache:entry> • Data: 202.239.238.16, GET / HTTP/1.0, text/html, 200, …

  11. How XMill Works: Three Ideas • Idea 2: group the data values according to their types • gzip(Structure) + gzip(Data1) + gzip(Data2) + ... = 1.33 MB • Structure: <apache:entry> . . . </apache:entry> • Data1: 202.23.23.16, 224.42.24.55, … • Data2: GET / HTTP/1.0, GET / HTTP/1.1, …
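The grouping idea can be sketched in a few lines of Python. This is only an illustration, not XMill's code: the function names are hypothetical, records are assumed to be already parsed into field/value pairs, values are assumed not to contain newlines, and Python's `zlib` module (the DEFLATE algorithm that gzip also uses) stands in for gzip.

```python
import zlib
from collections import defaultdict

def compress_by_container(records):
    """Group values by field name, then compress each group on its own.

    `records` is a list of dicts mapping a field (e.g. "host") to its value.
    Returns a dict mapping each field to the compressed bytes of its container.
    """
    containers = defaultdict(list)
    for record in records:
        for field, value in record.items():
            containers[field].append(value)
    # One compressed stream per container; similar values sit next to each other.
    return {
        field: zlib.compress("\n".join(values).encode("utf-8"))
        for field, values in containers.items()
    }

def decompress_container(blob):
    """Recover the list of values stored in one compressed container."""
    return zlib.decompress(blob).decode("utf-8").split("\n")

if __name__ == "__main__":
    log = [
        {"host": "202.23.23.16", "requestLine": "GET / HTTP/1.0"},
        {"host": "224.42.24.55", "requestLine": "GET / HTTP/1.1"},
    ]
    blobs = compress_by_container(log)
    print(decompress_container(blobs["host"]))
```

Keeping each container separate is what lets a type-specific compressor be applied to it afterwards.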

  12. How XMill Works: Three Ideas • Idea 3: apply semantic (specialized) compressors • gzip(Structure) + gzip(c1(Data1)) + gzip(c2(Data2)) + ... = 0.82 MB • Examples: • 8, 16, 32-bit integer encoding (signed/unsigned) • differential compression (e.g. 1999, 1995, 2001, 2000, 1995, ...) • compress lists, records (e.g. 104.32.23.1 → 4 bytes) • Need user input to select the semantic compressor
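Two of the semantic compressors listed above are easy to illustrate. The sketch below is an assumption about how such encoders could look, not XMill's actual implementation; all function names are hypothetical.

```python
def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    values = []
    for d in deltas:
        values.append(d if not values else values[-1] + d)
    return values

def pack_ipv4(address):
    """Store a dotted IPv4 address as 4 raw bytes instead of up to 15 characters."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers: %r" % address)
    return bytes(parts)  # bytes() itself rejects numbers outside 0..255

def unpack_ipv4(packed):
    """Turn 4 raw bytes back into dotted notation."""
    return ".".join(str(b) for b in packed)

if __name__ == "__main__":
    print(delta_encode([1999, 1995, 2001, 2000, 1995]))
    print(len(pack_ipv4("104.32.23.1")))
```

Small differences and fixed-width binary values give the underlying gzip pass more regular input than the original text does.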

  13. Path Processor – structure container • Replace tags/attributes with positive integers • Replace end tags with 0 • Replace data values with container numbers (negative integers) • Dictionary: one more entry for each new tag or attribute name: Book = 1, Title = 2, @lang = 3, Author = 4 • Input: <Book><Title lang="English">Data Compression</Title> <Author>Gray</Author> <Author>Reiter</Author> </Book> • Data values replaced: <Book><Title lang=-1>-2</Title> <Author>-3</Author> <Author>-3</Author> </Book> • End tags replaced: <Book><Title lang=-1 0>-2 0 <Author>-3 0 <Author>-3 0 0 • Tags/attributes replaced: 1 2 3 -1 0 -2 0 4 -3 0 4 -3 0 0 • Less storage: 14 bytes • Repeated structure entries can be compressed effectively
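The encoding on this slide can be reproduced with Python's standard SAX parser. This is a minimal sketch, not XMill's path processor: the class and function names are hypothetical, one container per distinct path is assumed, and whitespace-only text is ignored.

```python
import xml.sax

class StructureEncoder(xml.sax.ContentHandler):
    """Turn an XML document into a token list plus per-path data containers."""

    def __init__(self):
        super().__init__()
        self.tags = {}        # tag or @attribute name -> positive integer
        self.containers = {}  # path -> (container number, list of values)
        self.tokens = []      # the structure container
        self.path = []
        self.text = []

    def _tag_id(self, name):
        # A new dictionary entry is created the first time a name is seen.
        return self.tags.setdefault(name, len(self.tags) + 1)

    def _store(self, value):
        path = "/".join(self.path)
        number, values = self.containers.setdefault(
            path, (len(self.containers) + 1, []))
        values.append(value)
        self.tokens.append(-number)  # data value -> negative container number

    def _flush_text(self):
        value = "".join(self.text).strip()
        self.text = []
        if value:
            self._store(value)

    def startElement(self, name, attrs):
        self._flush_text()
        self.path.append(name)
        self.tokens.append(self._tag_id(name))
        for attr in attrs.getNames():
            self.path.append("@" + attr)
            self.tokens.append(self._tag_id("@" + attr))
            self._store(attrs.getValue(attr))
            self.tokens.append(0)    # end of attribute
            self.path.pop()

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.tokens.append(0)        # end tag -> 0
        self.path.pop()

def encode_structure(xml_text):
    handler = StructureEncoder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler

if __name__ == "__main__":
    doc = ('<Book><Title lang="English">Data Compression</Title>'
           '<Author>Gray</Author><Author>Reiter</Author></Book>')
    print(encode_structure(doc).tokens)
    # prints [1, 2, 3, -1, 0, -2, 0, 4, -3, 0, 4, -3, 0, 0]
```

Both Author values land in the same container (number 3), which is why the token -3 appears twice.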

  14. XML Compression XMill Evaluation using XML datasets

  15. Queriable Compressors • XQzip: queriable XML compressor (our work [EDBT04]) • Existing XML compressors (survey in [WWWJ05]): • Unqueriable (e.g. XMill [SIGMOD00]): exploit data commonalities → better compression rate than gzip • Queriable (e.g. XGrind [ICDE02], XPRESS [SIGMOD03], XQueC, XQzip [EDBT04], XCQ [KAISJ05]): compress data individually → inadequate compression rate and time • Features of XQzip: • Use the SIT to aid query evaluation • Block-compression: allow data commonalities to be exploited and used as buffers to reduce decompression overhead

  16. Structure Index Tree (SIT) • Effective elimination of duplicate structures in the XML data • Merging of nodes that have • the same incoming path • the same ordered set of paths of their descendants • SIT Construction • A linear scan of the XML document • Merging of the subtree that we are constructing into its equivalent subtree in the base tree
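The merging rule can be illustrated with a simplified, batch version in Python. This sketch is not the XQzip algorithm itself: the slide describes merging during a single linear scan, whereas here a complete tree is built first and sibling subtrees with identical structure are merged afterwards. The `Node` class and function names are hypothetical.

```python
class Node:
    """An element node; `ids` collects the ids of all document nodes merged into it."""

    def __init__(self, tag, node_id, children=()):
        self.tag = tag
        self.ids = [node_id]
        self.children = list(children)

def signature(node):
    """Tag plus the ordered signatures of the children: equal signatures mean equal structure."""
    return (node.tag, tuple(signature(c) for c in node.children))

def absorb(target, other):
    """Fold `other` into `target`; both are known to have the same signature."""
    target.ids.extend(other.ids)
    for t_child, o_child in zip(target.children, other.children):
        absorb(t_child, o_child)

def merge_duplicates(node):
    """Merge sibling subtrees that share a signature, bottom-up."""
    kept, seen = [], {}
    for child in node.children:
        merge_duplicates(child)
        key = signature(child)
        if key in seen:
            absorb(seen[key], child)   # duplicate structure: keep only the ids
        else:
            seen[key] = child
            kept.append(child)
    node.children = kept
    return node

if __name__ == "__main__":
    root = Node("a", 1, [
        Node("b", 2, [Node("d", 3), Node("e", 4)]),
        Node("c", 5),
        Node("c", 6),
        Node("b", 7, [Node("d", 8), Node("e", 9)]),
    ])
    merge_duplicates(root)
    print([(c.tag, c.ids) for c in root.children])
```

Siblings share their incoming path by construction, so comparing signatures covers both conditions listed on the slide for this simplified case.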

  17. SIT Construction (figure: a document tree rooted at / with node ids 0 to 10 and tags a, b, c, d, e is merged step by step into its SIT; merged nodes carry id lists such as 2,7 and 5,6)

  18. XQzip Architecture • Index Constructor: construct the SIT • Compressor • Group semantically related items in blocks • Compress each block by gzip • Query Processor: evaluate query • Parser • Executor: apply the SIT to evaluate query • Buffer Manager (By LRU)

  19. SIT Construction Complexity • N: total number of elements in the input XML document • Time Complexity: • Worst-case: O(N·|SIT|) • Average-case: O(N) • Space Complexity: • Base tree and the subtree being merged: ≤ 2|SIT| • Space for storing ids of eliminated nodes: O(N)

  20. Data Compression • A balance between full-chunked and fine-grained compression • A distinct data container for each distinct element • Each container compressed (using gzip) into many smaller blocks • Block size? • Too small: query time ↑, compression ratio ↓ • Too large: query time ↓, compression ratio ↑ • Can only be determined by an empirical study
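Block-wise compression of one container can be sketched as follows. The function names and the newline separator are assumptions made for illustration, and `zlib` stands in for gzip; the point is only that reading one item requires decompressing a single block rather than the whole container.

```python
import zlib

def compress_blocks(items, items_per_block):
    """Split a container into fixed-size groups of items and compress each group separately."""
    blocks = []
    for start in range(0, len(items), items_per_block):
        chunk = items[start:start + items_per_block]
        blocks.append(zlib.compress("\n".join(chunk).encode("utf-8")))
    return blocks

def read_item(blocks, items_per_block, index):
    """Fetch one item by decompressing only the block that holds it."""
    block_no, offset = divmod(index, items_per_block)
    chunk = zlib.decompress(blocks[block_no]).decode("utf-8").split("\n")
    return chunk[offset]

if __name__ == "__main__":
    container = ["item%d" % i for i in range(10)]
    blocks = compress_blocks(container, items_per_block=4)
    print(len(blocks), read_item(blocks, 4, 9))
```

The `items_per_block` parameter is the knob the slide refers to as block size.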

  21. Block Size Representative datasets and queries: • Datasets: • Heavy text • Light text • A mix of heavy text and light text • Queries: • High Selectivity • Medium Selectivity • Low Selectivity

  22. Block Size

  23. Structure of Compressed Data • Block size? • Determined by an empirical study • Querying Time • near-optimal range: 600-1000 data items/block (average optimal: 950) • Compression Ratio • Not improved much after 150 KB/block (usually contains more than 1000 items) • ≥ 1000 data items/block

  24. Outline • Introduction • XQzip [EDBT 2004] • Indexing • Data Compression • Query Evaluation • Performance Evaluation • Conclusion

  25. XQzip Query Coverage • All XPath axes except the sideways axes (e.g. preceding-sibling, following-sibling) • Multiple and nested predicates • and / or / not expressions • Aggregations: sum, count, average, max, min • Group queries: e.g. (L1 (L2 + L3 + L4)) • L1: //a[b = "Crete"] (prefix) • L2: c • L3: d[f/count() > 100] • L4: e[//g]

  26. Query Evaluation • Depth-first traversal of the index tree • Buffer Management (LRU) • Why buffering? Decompression time dominates • Decompression avoidance
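An LRU buffer of decompressed blocks might look like the sketch below. The `BlockBuffer` class and its `decompressions` counter are hypothetical and not taken from XQzip; `zlib` again stands in for gzip. A buffer hit skips decompression entirely, which is the effect the slide calls decompression avoidance.

```python
import zlib
from collections import OrderedDict

class BlockBuffer:
    """Keep the most recently used decompressed blocks in memory."""

    def __init__(self, blocks, capacity):
        self.blocks = blocks          # list of zlib-compressed byte strings
        self.capacity = capacity      # maximum number of decompressed blocks kept
        self.cache = OrderedDict()    # block number -> decompressed bytes, oldest first
        self.decompressions = 0

    def get(self, block_no):
        if block_no in self.cache:
            self.cache.move_to_end(block_no)   # mark as most recently used
            return self.cache[block_no]
        data = zlib.decompress(self.blocks[block_no])
        self.decompressions += 1
        self.cache[block_no] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the least recently used block
        return data

if __name__ == "__main__":
    blocks = [zlib.compress(b) for b in (b"alpha", b"beta", b"gamma")]
    buf = BlockBuffer(blocks, capacity=2)
    for n in (0, 1, 0, 2, 1):
        buf.get(n)
    print(buf.decompressions)
```

In the demo, the second request for block 0 is served from the buffer, while block 1 has been evicted by the time it is requested again.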

  27. Outline • Introduction • XQzip • Indexing • Data Compression • Query Evaluation • Performance Evaluation • Conclusion

  28. Effectiveness of the SIT

  29. Effectiveness of the SIT • Index Size: less than 1% of original size • Load Time: a fraction of a second • Node Selection Acceleration: twice as fast as F&B-Index • Construction Time: more than 3 times faster than F&B-Index

  30. Compression Ratio XQzip is comparable to XMill and gzip, 17% better than XGrind with index size included, 42% better than XGrind without index.

  31. Compression/Decompression Time • XQzip (compression + index construction) is more than 5 times better than XGrind, 1.5 times worse than XMill • XQzip (index-loading + decompression) is more than 3 times better than XGrind, 1.4 times worse than XMill

  32. Query Performance • Cold buffer-pool evaluation • 13 times better than XGrind • Warm buffer-pool evaluation • 80 times better than XGrind • Impressive buffer effect!

  33. Lessons on XML Compression • Good compression ratio and time • Comparable to that of XMill • Much better than that of XGrind (and XPRESS) • Support a very practical set of queries • A much wider range of queries than XGrind and XPRESS • Very competitive querying time with buffer • 13 times better than XGrind with cold buffer • 80 times better than XGrind with warm buffer • Limitations • Cost of building and maintaining complex indexes • No theoretical foundation for block size

  34. XCQ • XCQ Framework • Experimental Results • Compression Performance • Query Performance • Lessons and Development

  35. XCQ • Objectives: • Achieve Good Compression ratio • Comparable to XMill • Better than XGrind • Achieve Good Query performance • More efficient than XGrind • Querying compressed documents with block-based partial decompression • But addressing issues different from XQzip • Adopt minimal indexing • Establish theory between selectivity and block size

  36. XCQ Strategy • Based on four techniques • DTD Tree and SAX Event Stream Parsing (DSP) • Partition Path-Based Data Grouping (PPG) Format • Block-Statistic Signature (BSS) Indexing • Access Methods (figure: XCQ Compression Engine and XCQ Querying Engine, with XML Document, DTD, XPath Queries, Compressed Document and Query Results; components DSP, PPG format, BSS indexing, Access Methods)

  37. Technique 1 – DTD Tree and SAX Event Stream Parsing (DSP) (figure: the XCQ architecture diagram from the previous slide)

  38. Technique 1 – DTD Tree and SAX Event Stream Parsing (DSP) • Purpose: • To utilize information in the associated DTD of the document • Benefits: • Only encode the information that cannot be inferred from the DTD • Precise path-based grouping of data items • Runs in an automated manner

  39. DSP – Input and Output • Input to the DSP module: a DTD tree and a stream of SAX events • Output: a structure stream and data streams

  40. DSP Step 1 – Creating a DTD Tree • DTD: <!ELEMENT library (entry*)> <!ELEMENT entry (author, title, year, publisher?, (paper|course_note|book), num_copy)> <!ELEMENT author EMPTY> <!ATTLIST author name CDATA> <!ELEMENT title (#PCDATA)> <!ELEMENT year (#PCDATA)> <!ELEMENT publisher (#PCDATA)> <!ELEMENT paper EMPTY> <!ELEMENT course_note EMPTY> <!ELEMENT book EMPTY> <!ELEMENT num_copy (#PCDATA)> • DTD tree (figure): library, with child entry*; entry has children author (name), title, year, publisher?, a choice node | over paper, course_note and book, and num_copy; the key marks PCDATA nodes

  42. DSP Step 2 – Processing in DSP Module • How does the DSP module process the following XML document? <library> <entry> <author name="Tom"/> <title>Introduction to &#34;OS&#34;</title> <year>2003</year> <course_note/> <num_copy>3</num_copy> </entry> </library>

  43. DSP walkthrough (document from slide 42) • SAX event: start element "library" • Structure stream: (empty) • Data streams: (empty) • Figure key: traversal path, PCDATA node, DTD tree node being processed

  44. DSP walkthrough (document from slide 42) • SAX event: start element "entry" • Match with entry* in the DTD tree • Structure stream: T • Data streams: (empty)

  45. DSP walkthrough (document from slide 42) • SAX events: start element "author", att0: name="Tom"; end element "author" • Match with author (name) in the DTD tree • Structure stream: T, d0 • Data streams: d0: Tom

  46. DSP walkthrough (document from slide 42) • SAX events: start element "title"; PCDATA "Introduction to &#34;OS&#34;"; end element "title" • Structure stream: T, d0, d1 • Data streams: d0: Tom • d1: Introduction to &#34;OS&#34;

  47. DSP walkthrough (document from slide 42) • SAX events: start element "year"; PCDATA "2003"; end element "year"; start element "course_note" • "course_note" does not match publisher?, so F is emitted • Structure stream: T, d0, d1, d2, F • Data streams: d0: Tom • d1: Introduction to &#34;OS&#34; • d2: 2003

  48. DSP walkthrough (document from slide 42) • SAX events: start element "course_note"; end element "course_note" • Choice node: no match with paper (p0), match with course_note (p1); book is p2 • Structure stream: T, d0, d1, d2, F, p1 • Data streams: d0: Tom • d1: Introduction to &#34;OS&#34; • d2: 2003

  49. DSP walkthrough (document from slide 42) • SAX events: start element "num_copy"; PCDATA "3"; end element "num_copy"; end element "entry" • Structure stream (as shown on the slide): T, d0, d1, d2, F, p1 • Data streams: d0: Tom • d1: Introduction to &#34;OS&#34; • d2: 2003 • d4: 3
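The walkthrough can be condensed into a small Python sketch. It is a simplification under stated assumptions, not the XCQ implementation: the content model of entry is hard-coded in the hypothetical `ENTRY_MODEL`, each entry arrives as a list of (element name, value) pairs rather than real SAX events, the input is assumed to be valid against the DTD, and the T emitted for a present optional element, the d4 token, and the final F that closes entry* are assumptions, since the slides stop before showing them.

```python
# Content model of <entry>, in DTD order. Stream names d0..d4 follow the slides.
ENTRY_MODEL = [
    ("data", "author", "d0"),        # value of the name attribute
    ("data", "title", "d1"),
    ("data", "year", "d2"),
    ("optional", "publisher", "d3"),
    ("choice", ["paper", "course_note", "book"]),
    ("data", "num_copy", "d4"),
]

def encode_library(entries):
    """Emit only what the DTD cannot predict: repetitions, options, choices, data."""
    structure, data = [], {}
    for entry in entries:
        structure.append("T")                      # one more entry* repetition
        pos = 0
        for particle in ENTRY_MODEL:
            kind = particle[0]
            name = entry[pos][0] if pos < len(entry) else None
            if kind == "choice":
                # Record which branch of (paper|course_note|book) was taken.
                structure.append("p%d" % particle[1].index(name))
                pos += 1
            elif kind == "optional" and name != particle[1]:
                structure.append("F")              # optional element is absent
            else:
                if kind == "optional":
                    structure.append("T")          # assumed marker for "present"
                stream = particle[2]
                structure.append(stream)
                data.setdefault(stream, []).append(entry[pos][1])
                pos += 1
    structure.append("F")                          # assumed end of entry*
    return structure, data

if __name__ == "__main__":
    entry = [
        ("author", "Tom"),
        ("title", "Introduction to &#34;OS&#34;"),
        ("year", "2003"),
        ("course_note", None),
        ("num_copy", "3"),
    ]
    print(encode_library([entry]))
```

For the document from slide 42, the first six tokens come out as T, d0, d1, d2, F, p1, matching the structure stream shown on slides 44 to 48; tag names never appear in the output because the DTD tree already supplies them.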
