
XML Data Compression

XML Data Compression. Greg Leighton, Jim Diamond, Tomasz Müldner February 18, 2005. Overview. A (brief) introduction to data compression XML lossless data compression New XML Compression Programs AXECHOP TREECHOP. XML Data Compression. A (brief) introduction to XML


Presentation Transcript


  1. XML Data Compression Greg Leighton, Jim Diamond, Tomasz Müldner February 18, 2005

  2. Overview • A (brief) introduction to data compression • XML lossless data compression • New XML Compression Programs • AXECHOP • TREECHOP

  3. XML Data Compression • A (brief) introduction to XML • Techniques for achieving XML compression • Traditional approaches – Huffman, LZ • Specialized approaches • XML Compression Programs • XMill • XGrind • XPRESS

  4. eXtensible Markup Language • separate syntax from semantics • support semi-structured data • support internationalization and platform independence • is self-describing (through labeling of the tree)

  5. eXtensible Markup Language : 2 XML is a framework for defining markup languages: • no fixed collection of markup tags • each XML language is specialized for its own application domain • a common set of generic tools supports processing documents XML: textual convention to represent tagged trees

  6. eXtensible Markup Language : 3
     <?xml version="1.0" encoding="UTF-8"?>
     <Employees>
       <Employee id="123456">
         <Name>Homer Simpson</Name>
         <Department>Sector 7-G</Department>
       </Employee>
       <Employee id="123457">
         <Name>Frank Grimes</Name>
         <Department>Sector 7-G</Department>
       </Employee>
       …
     </Employees>
     Callouts on the slide: Element, Attribute, Data Value

  7. eXtensible Markup Language : 4 • Correctness of an XML document: • Well-formed: complies with XML syntax • Valid: obeys the structure described in a grammar, such as XML schema document • Two kinds of XML parsers: • SAX • DOM
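
The two parser styles named on this slide can be contrasted with a minimal sketch. Python and its standard library are used here as an assumption for illustration (the presentation does not name a language for this slide), and the sample document is a shortened version of the Employees example. SAX reports events to a handler while reading; DOM loads the whole tree before it is queried.

```python
import xml.sax
from xml.dom import minidom

# Shortened sample document (illustrative, based on the Employees slide).
DOC = ('<Employees><Employee id="123456">'
       '<Name>Homer Simpson</Name></Employee></Employees>')


class TagCollector(xml.sax.ContentHandler):
    """SAX handler: records each start tag as the parser streams past it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


# SAX: event-driven, no tree is kept in memory.
collector = TagCollector()
xml.sax.parseString(DOC.encode("utf-8"), collector)
sax_tags = collector.tags

# DOM: the whole document is parsed into a tree, then navigated.
dom = minidom.parseString(DOC)
dom_names = [n.firstChild.data for n in dom.getElementsByTagName("Name")]
```

The streaming style is the one the compressors later in the deck rely on, since it lets them process a document without holding it all in memory.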

  8. Why Compress XML? XML is verbose: • Each non-empty element tag must end with a matching closing tag -- <tag>data</tag> • Ordering of tags is often repeated in a document (e.g. multiple records) • Tag names are often long

  9. XML Compressors • View XML as a tree • Separate the tree structure and what is stored in leaves • Save the tree structure so that it can be restored • The compressed file may or may not remember the tree structure

  10. XMill: Liefke and Suciu • Tree structure: • Start tags and attribute names are dictionary-encoded (as T1, T2, etc.) • End tags replaced with ‘/’ token • Data values are replaced with their container number
      Example: <Book><Title lang="English">Views</Title> <Author>Miller</Author> <Author>Tai</Author> </Book>
      Dictionary: Book = T1, Title = T2, @lang = T3, Author = T4
      Containers: C3 = English; C4 = Views; C5 = Miller, Tai
      Structure: T1 T2 T3 C3 / C4 / T4 C5 / T4 C5 / /
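
The separation of structure from data described on this slide can be sketched as follows. This is an illustrative Python sketch, not XMill's implementation: the function name `xmill_split` is hypothetical, containers are keyed by tag or attribute name only, container numbering starts at 3 purely so the output matches the slide, and mixed content (text after a child element) is ignored.

```python
import xml.etree.ElementTree as ET

FIRST_CONTAINER = 3  # assumption: chosen only so the output matches the slide


def xmill_split(xml_text):
    """Split a document into a structure token list and data containers."""
    tag_codes = {}      # tag or attribute name -> "T1", "T2", ...
    container_ids = {}  # tag or attribute name -> "C3", "C4", ...
    containers = {}     # container id -> list of data values
    structure = []

    def tag_code(name):
        if name not in tag_codes:
            tag_codes[name] = f"T{len(tag_codes) + 1}"
        return tag_codes[name]

    def store(name, value):
        if name not in container_ids:
            container_ids[name] = f"C{len(container_ids) + FIRST_CONTAINER}"
        cid = container_ids[name]
        containers.setdefault(cid, []).append(value)
        structure.append(cid)  # the value is replaced by its container number

    def walk(elem):
        structure.append(tag_code(elem.tag))
        for attr, value in elem.attrib.items():
            structure.append(tag_code("@" + attr))
            store("@" + attr, value)
            structure.append("/")
        text = (elem.text or "").strip()
        if text:
            store(elem.tag, text)
        for child in elem:
            walk(child)
        structure.append("/")  # every end tag becomes a single token

    walk(ET.fromstring(xml_text))
    return structure, containers


DOC = ('<Book><Title lang="English">Views</Title>'
       '<Author>Miller</Author><Author>Tai</Author></Book>')
structure, containers = xmill_split(DOC)
```

Grouping both author names into one container is what lets a back-end compressor see similar values side by side.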

  11. XMill: Liefke and Suciu • For each ending tag or attribute there is a separate data container • Different semantic compressors can be used for various containers (gzip) • Compressed file does not remember the original structure

  12. XMill: Decompression • Decompressor loads and unzips each container and the decompressed structure container is parsed • Whenever a data value is found in the structure, the next value is pulled from the corresponding data container and the appropriate semantic decompressor (if applicable) is applied to get back the original data value
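
The decompression walk can be sketched in Python under some stated assumptions: the containers are already unzipped into lists, no semantic decompressor is applied, and the token names (T…, C…, ‘/’) and the helper `xmill_rebuild` are illustrative rather than XMill's actual format.

```python
from xml.sax.saxutils import escape, quoteattr


def xmill_rebuild(structure, names, containers):
    """Rebuild XML text from structure tokens and per-container value lists."""
    cursors = {cid: iter(values) for cid, values in containers.items()}
    out = []
    stack = []        # names of open elements and attributes
    tag_open = False  # True while the current start tag can still take attributes

    for tok in structure:
        if tok == "/":
            name = stack.pop()
            if name.startswith("@"):
                continue  # the attribute was already written with its value
            if tag_open:
                out.append(">")
                tag_open = False
            out.append(f"</{name}>")
        elif tok.startswith("T"):
            name = names[tok]
            stack.append(name)
            if name.startswith("@"):
                continue  # wait for the attribute value
            if tag_open:
                out.append(">")
            out.append(f"<{name}")
            tag_open = True
        else:
            # Container reference: pull the next value from that container.
            value = next(cursors[tok])
            if stack[-1].startswith("@"):
                out.append(f" {stack[-1][1:]}={quoteattr(value)}")
            else:
                if tag_open:
                    out.append(">")
                    tag_open = False
                out.append(escape(value))
    return "".join(out)


names = {"T1": "Book", "T2": "Title", "T3": "@lang", "T4": "Author"}
containers = {"C3": ["English"], "C4": ["Views"], "C5": ["Miller", "Tai"]}
structure = "T1 T2 T3 C3 / C4 / T4 C5 / T4 C5 / /".split()
rebuilt = xmill_rebuild(structure, names, containers)
```

Because values are consumed strictly in order, each container needs only a cursor, not an index.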

  13. XGrind: Tolani & Haritsa • The structure of the original XML document is retained by the compression process: compress at the granularity of individual element and attribute values.

  14. XGrind • Operations on the compressed file • Querying the compressed document • Exact and prefix-match require no decompression • Range or partial-match require on-the-fly decompression of element/attribute values that appear in the query • Updates • Testing Validity (against the compressed DTD)

  15. XGrind: Implementation • A context-free compression must be used (the code assigned to a string does not depend on the location of this string). There are several types of compressors: • Tags (as in XMill) • Enumerated values (simple compressor, uses DTD) • Element/attribute value compressor: non-adaptive Huffman coding scheme. A separate Huffman tree is calculated for every non-enumerated data element
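
A non-adaptive (static) Huffman coder of the kind mentioned above can be sketched as follows. This is a generic textbook construction in Python, not XGrind's code; treating single characters as the symbols and the names `build_huffman_code`, `nh`, and `nh_inverse` are assumptions made for illustration.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_code(samples):
    """Return {symbol: bit string} from frequencies gathered in a first pass."""
    freq = Counter("".join(samples))
    tiebreak = count()  # keeps heap entries comparable when counts are equal
    heap = [(n, next(tiebreak), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:  # a one-symbol alphabet still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


def nh(value, code):
    """Encode a value; the result depends only on the value, not its position."""
    return "".join(code[ch] for ch in value)


def nh_inverse(bits, code):
    """Decode a bit string produced by nh with the same code table."""
    inverse = {c: s for s, c in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)


# One table per element, here built from all <Author> values.
author_code = build_huffman_code(["Miller", "Tai"])
encoded_tai = nh("Tai", author_code)
```

Since the table is fixed after the first pass, a query constant such as "Tai" can be encoded once and compared against compressed values directly, which is what makes exact-match queries possible without decompression.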

  16. XGrind • Tree structure: • Start tags and attribute names are dictionary-encoded (as T1, T2, for tags and A1, A2 for attributes) • End tags replaced with ‘/’ token
      Example: <Book><Title lang="English">Views</Title> <Author>Miller</Author> <Author>Tai</Author> </Book>
      Dictionary (as labelled on the slide): Book = T1, Title = T2, Author = T4, @lang = A1
      Compressed: T1 T2 A1 nh(English) / nh(Views) / T4 nh(Miller) / T4 nh(Tai) / /
      [ nh(s): output from Huffman compressor for s ]

  17. XGrind: Querying • The query engine works on the compressed document; it consists of a lexer and a parser • The query (the path and the predicate) is compressed • The parser checks whether the current path matches the query path and the compressed data value satisfies the compressed predicate

  18. XMLPPM • A modification of the prediction-by-partial matching (PPM) text compression scheme. • To process XML data, the encoder chooses the appropriate PPM model from a set of several models depending on the current context supplied by the built-in SAX parser

  19. AXECHOP and TREECHOP • AXECHOP: attempts to achieve highest-possible compression ratio by reordering original document • TREECHOP: willing to sacrifice a bit on compression ratio in order to preserve original XML structure and enable querying to be carried out on compressed document

  20. AXECHOP: Key Features • Uses a grammar-based approach for compressing XML document structure • Outperforms general-purpose text compressors (e.g. gzip) by as much as 30% on XML • Operates offline (decompression can’t start until entire compressed file has been received) • Suited for XML data archiving, not for XML messaging applications (e.g. Web Services)

  21. Grammar-based Compression • Achieves compression by producing a context-free grammar that uniquely derives the input sequence • Define a separate production for each repetition in the input • For second and subsequent occurrences, encode the LHS of the production rather than the pattern on the RHS

  22. Grammar-based Compression: Example Original Input: abcdbcabc Generated Grammar: S → aAdAaA, A → bc
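
That the grammar derives exactly the original input can be checked with a few lines of Python. The dictionary representation and the helper name `expand` are illustrative choices, not part of the presentation.

```python
def expand(grammar, symbol="S"):
    """Expand a symbol by replacing each non-terminal with its right-hand side."""
    return "".join(
        expand(grammar, ch) if ch in grammar else ch
        for ch in grammar[symbol]
    )


# Upper-case letters are non-terminals, lower-case letters are terminals.
grammar = {"S": "aAdAaA", "A": "bc"}
original = expand(grammar)

# Total right-hand-side length (8 symbols) versus input length (9 symbols).
grammar_size = sum(len(rhs) for rhs in grammar.values())
```

The saving is small on a nine-character input; it grows with the number and length of repeated patterns.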

  23. AXECHOP: Compression Strategy • Perform a re-ordering of the XML document during SAX parsing • Use a byte-based encoding scheme to record the structure of the document – the “structure string” • Place data values for each element and attribute in a separate container to localize repetitions

  24. AXECHOP: Compression Strategy 3 • Apply Multilevel Pattern Matching (MPM) algorithm to obtain grammar-based compression of the document structure • Compress the contents of each data value container using the Burrows-Wheeler block-sorting algorithm • Write compressed data to output file

  25. AXECHOP Compression: Example

  26. AXECHOP Compression: Example 3 Original Structure String: 1 132 2 128 130 3 4 133 128 130 4 133 128 130 130 130 MPM-Generated Grammar:

  27. AXECHOP: Decompression Strategy • Decompress the MPM code to obtain the document structure • Perform inverse BWT to get back the contents of each data value container • Perform a single pass through the reconstituted structure string
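
The forward and inverse Burrows-Wheeler transforms can be sketched with the naive textbook construction below. This is not AXECHOP's implementation (which is written in C++); the end-of-text marker `\0` and the function names are assumptions, and the input is assumed not to contain the marker.

```python
def bwt(text, end="\0"):
    """Forward transform: last column of the sorted rotations of text + end."""
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def inverse_bwt(last_column, end="\0"):
    """Inverse transform: rebuild the sorted rotation table column by column."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original text is the rotation that ends with the marker.
    row = next(r for r in table if r.endswith(end))
    return row[:-1]


container = "Miller\nTai\nMiller"
transformed = bwt(container)
restored = inverse_bwt(transformed)
```

The transform itself does not shrink the data; it groups similar characters so that a later coding stage compresses them well. Practical implementations use suffix sorting rather than this quadratic table.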

  28. AXECHOP: Implementation • Written in C++ • Designed to be modular • instead of using MPM as structural compressor, can insert a different compressor • BWT can be swapped with a different container compressor

  29. AXECHOP: Experimental Results

  30. AXECHOP: Conclusions • AXECHOP achieves 2nd best average compression rate over a varied corpus of XML files • Future work: • Speed up compression through code optimization • Use a form of PPM in place of BWT for dictionary compression (PPM often achieves a better compression rate but tends to be slow) • Define an XML-conscious grammar-based compression scheme, instead of using “general-purpose” MPM

  31. TREECHOP: Key Features • Carries out an online compression of the XML document tree • Since original document structure is maintained throughout compression, querying can be carried out without requiring decompression • Intended for XML messaging scenarios, where documents are being transmitted over a network • Encoding and querying strategies are based on the XPath standard

  32. XPath • A W3C standard for identifying particular nodes of an XML document tree • Syntax is similar to that used for pathnames in UNIX • Example tree: <class> has children <students> and <instructor>; <students> has child <student> • In XPath: /class/students/student
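
The path on this slide can be tried with Python's `xml.etree.ElementTree`, which supports a limited XPath subset. The document below is reconstructed from the example tree shown later in the deck and is an assumption, since the original example file is not part of this transcript. ElementTree evaluates paths relative to the root element, so `/class/students/student` is written as `./students/student`.

```python
import xml.etree.ElementTree as ET

# Assumed document, reconstructed from the example tree in the TREECHOP slides.
DOC = """<class name="COMP 5113">
  <instructor>Bob Smith</instructor>
  <students>
    <student id="100000">Pete Wilson</student>
    <student id="100001">Lola Richardson</student>
  </students>
</class>"""

root = ET.fromstring(DOC)  # root is the <class> element
students = [s.text for s in root.findall("./students/student")]
student_ids = [s.get("id") for s in root.findall("./students/student")]
```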

  33. TREECHOP: Compression Strategy • Perform SAX parsing • Generate document tree • Assign a binary code word to each non-leaf tree node • Write tree encoding to compression stream

  34. TREECHOP: Generating the Tree • Encoding has 3 important properties: • Each tree node inherits its parent’s code as a prefix • Two nodes share the same code word iff they have the same XPath location, as traced from tree root downwards • Maintains the structure of the original document throughout the compression process

  35. TREECHOP: Tree Encoding Scheme • Given a non-leaf node N with parent node P, where N is the i-th distinct child node of P:

  36. TREECHOP: Example XML File

  37. TREECHOP: Example Tree [Figure: <class> has children @name (“COMP 5113”), <instructor> (“Bob Smith”) and <students>; <students> has two <student> children, “Pete Wilson” with @id “100000” and “Lola Richardson” with @id “100001”]

  38. TREECHOP: Example Tree [Figure: the same tree with each non-leaf node labelled by its code word: <class> 00; @name 0000; <instructor> 00010; <students> 00011; both <student> nodes 0001100; both @id nodes 000110000. Leaf values are unchanged.]

  39. TREECHOP: Writing the Tree • Encoded tree is written to compression stream in depth-first order • Non-leaf nodes: written as 4-tuple (L, C, T, D) • L is a byte indicating bit length of code word • C is a sequence of bytes containing code word • T is a byte indicating node type (e.g. element) • D is textual data stored in the node (e.g. element name) - reserved byte values are used to signal beginning/end of stream of raw character data

  40. TREECHOP: Writing the Tree 2 • For 2nd and subsequent occurrences of a non-leaf node, only the 2-tuple (L, C) is transmitted – decoder can then infer T and D • Leaf nodes are written in the manner of D, above – as a stream of raw character data
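
The tuple layout from these two slides can be sketched in Python. The presentation does not give the concrete byte values, so the node-type bytes, the reserved start/end bytes (0xFE and 0xFF, which never occur in UTF-8 text), the zero-padding of code words to whole bytes, and the class name `TreeWriter` are all assumptions made for illustration.

```python
ELEMENT, ATTRIBUTE = 1, 2          # hypothetical node-type bytes (T)
DATA_START, DATA_END = 0xFE, 0xFF  # hypothetical reserved bytes around raw data


def pack_code(bits):
    """Return L (bit length, one byte) followed by C (code word, zero-padded)."""
    padded = bits.ljust((len(bits) + 7) // 8 * 8, "0")
    body = int(padded, 2).to_bytes(len(padded) // 8, "big")
    return bytes([len(bits)]) + body


def raw_data(text):
    """Wrap character data in the reserved start/end bytes."""
    return bytes([DATA_START]) + text.encode("utf-8") + bytes([DATA_END])


class TreeWriter:
    """Writes nodes in the order given; callers supply depth-first order."""

    def __init__(self):
        self.seen = set()  # code words already sent with their T and D
        self.stream = bytearray()

    def write_node(self, bits, node_type, name):
        self.stream += pack_code(bits)  # (L, C) is always written
        if bits not in self.seen:       # first occurrence: add T and D
            self.seen.add(bits)
            self.stream += bytes([node_type]) + raw_data(name)

    def write_leaf(self, text):
        self.stream += raw_data(text)


writer = TreeWriter()
writer.write_node("0001100", ELEMENT, "student")
first_len = len(writer.stream)
writer.write_leaf("Pete Wilson")
before_repeat = len(writer.stream)
writer.write_node("0001100", ELEMENT, "student")
repeat_len = len(writer.stream) - before_repeat
```

In this layout the first `student` node costs 12 bytes and each repeat costs 2, which is where the saving on record-like documents comes from.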

  41. TREECHOP: Decompression Strategy • A code table is used to keep track of code words processed thus far • Allows future occurrences of a particular (L, C) pair to be mapped to the proper data type & value

  42. TREECHOP: Decompression Strategy 2 • To maintain proper nesting, a stack is used • When a new tree node is processed, continue popping until the node on top of stack does not share a common code word prefix with current node • Last popped node is the parent of the current node
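
One way to read the stack rule is sketched below in Python: nodes are popped until the node on top of the stack has a code word that is a proper prefix of the current code word, and that node is taken as the parent. This is an interpretation under the prefix property stated earlier, not the authors' code; the function name `parents_from_codes` is hypothetical and the code words are those of the example tree.

```python
def parents_from_codes(codes):
    """For code words in depth-first order, return each node's parent index."""
    stack = []    # indices of nodes that are still open
    parents = []
    for i, code in enumerate(codes):
        # An ancestor's code word is a proper prefix of its descendants' codes,
        # so anything else on top of the stack is already closed.
        while stack and not (
            code.startswith(codes[stack[-1]]) and code != codes[stack[-1]]
        ):
            stack.pop()
        parents.append(stack[-1] if stack else None)
        stack.append(i)
    return parents


# Non-leaf nodes of the example tree, in depth-first order.
codes = [
    "00",         # <class>
    "0000",       # @name
    "00010",      # <instructor>
    "00011",      # <students>
    "0001100",    # first <student>
    "000110000",  # its @id
    "0001100",    # second <student>
    "000110000",  # its @id
]
parents = parents_from_codes(codes)
```

The equality check matters for siblings: the second `<student>` has the same code word as the first, so the first must be popped rather than treated as its parent.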

  43. TREECHOP: Querying • Queries are expressed using XPath • Once the equivalent code word for the query predicate has been determined, query matches can be quickly located by searching through the compression stream for other occurrences of the code word

  44. TREECHOP: Querying Example

  45. TREECHOP: Querying Example 2 Search for all occurrences of ‘/class/students/student’ • Discover code word for ‘/class’ → 00 • Discover code word for ‘/class/students’ → 00011 • Discover code word for ‘/class/students/student’ → 0001100 • Extract data contained in next occurring leaf node → “Pete Wilson” • Scan through remainder of compression stream, looking for occurrences of code word 0001100 – occurs once more and the associated data value (“Lola Richardson”) is extracted
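
The query steps above can be sketched in Python over a simplified model of the compression stream. The list-based stream, the omission of the @id attribute nodes, and the function names are assumptions for illustration; a real stream would be the byte-level encoding described earlier.

```python
# Non-leaf entries are (code word, name); the name is None on repeat
# occurrences, where only the code word is transmitted. Leaves are strings.
# The @id attribute nodes are left out to keep the sketch short.
STREAM = [
    ("00", "class"),
    ("0000", "@name"), "COMP 5113",
    ("00010", "instructor"), "Bob Smith",
    ("00011", "students"),
    ("0001100", "student"), "Pete Wilson",
    ("0001100", None), "Lola Richardson",
]


def code_for_path(stream, path):
    """Find the code word of an absolute path from first occurrences."""
    names = {}  # code word -> name, learned from first occurrences
    stack = []  # code words of the currently open nodes
    for entry in stream:
        if isinstance(entry, str):
            continue
        code, name = entry
        names.setdefault(code, name)
        while stack and not (code.startswith(stack[-1]) and code != stack[-1]):
            stack.pop()
        stack.append(code)
        if "/" + "/".join(names[c] for c in stack) == path:
            return code
    return None


def query(stream, path):
    """Return the path's code word and the leaf following each occurrence."""
    target = code_for_path(stream, path)
    hits = []
    for i, entry in enumerate(stream):
        if not isinstance(entry, str) and entry[0] == target:
            leaf = next((e for e in stream[i + 1:] if isinstance(e, str)), None)
            hits.append(leaf)
    return target, hits


code_word, matches = query(STREAM, "/class/students/student")
```

Once the code word is known, the scan is a plain comparison of code words; no element names need to be decoded.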

  46. TREECHOP: Current State • Java-based implementation is partially completed • Modeled after existing java.net package (e.g. XMLSocket corresponds to Socket)

  47. Future Work • Finish implementation of TREECHOP • Use TREECHOP to validate compressed document using compressed grammars • Applications for XML filtering • Compressed stylesheets

  48. Questions?
