XML Data Compression. Greg Leighton, Jim Diamond, Tomasz Müldner February 18, 2005. Overview. A (brief) introduction to data compression XML lossless data compression New XML Compression Programs AXECHOP TREECHOP. XML Data Compression. A (brief) introduction to XML
XML Data Compression Greg Leighton, Jim Diamond, Tomasz Müldner February 18, 2005
Overview • A (brief) introduction to data compression • XML lossless data compression • New XML Compression Programs • AXECHOP • TREECHOP
XML Data Compression • A (brief) introduction to XML • Techniques for achieving XML compression • Traditional approaches – Huffman, LZ • Specialized approaches • XML Compression Programs • XMill • XGrind • XPRESS
eXtensible Markup Language • separate syntax from semantics • support semi-structured data • support internationalization and platform independence • is self-describing (through labeling of the tree)
eXtensible Markup Language : 2 XML is a framework for defining markup languages: • no fixed collection of markup tags • each XML language is specialized for its own application domain • a common set of generic tools supports processing documents XML: textual convention to represent tagged trees
eXtensible Markup Language : 3
<?xml version="1.0" encoding="UTF-8"?>
<Employees>
  <Employee id="123456">
    <Name>Homer Simpson</Name>
    <Department>Sector 7-G</Department>
  </Employee>
  <Employee id="123457">
    <Name>Frank Grimes</Name>
    <Department>Sector 7-G</Department>
  </Employee>
  …
</Employees>
Figure labels: Element, Attribute, Data Value
eXtensible Markup Language : 4 • Correctness of an XML document: • Well-formed: complies with XML syntax • Valid: obeys the structure described in a grammar, such as XML schema document • Two kinds of XML parsers: • SAX • DOM
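The difference between the two parser kinds can be illustrated with Python's standard library. This is only a sketch: the sample document is adapted from the Employees example, and the names NameCollector, names_via_sax and names_via_dom are made up for illustration. SAX pushes events to a handler as it reads; DOM builds the whole tree in memory first.

```python
import xml.sax
from xml.dom import minidom

DOC = (b'<?xml version="1.0" encoding="UTF-8"?>'
       b'<Employees>'
       b'<Employee id="123456"><Name>Homer Simpson</Name></Employee>'
       b'<Employee id="123457"><Name>Frank Grimes</Name></Employee>'
       b'</Employees>')


class NameCollector(xml.sax.ContentHandler):
    """SAX: react to start/characters/end events; no tree is kept."""

    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def startElement(self, name, attrs):
        if name == "Name":
            self.in_name = True
            self.names.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so append to the current entry.
        if self.in_name:
            self.names[-1] += content

    def endElement(self, name):
        if name == "Name":
            self.in_name = False


def names_via_sax(data):
    handler = NameCollector()
    xml.sax.parseString(data, handler)
    return handler.names


def names_via_dom(data):
    """DOM: parse everything into a tree, then navigate it."""
    dom = minidom.parseString(data)
    return [node.firstChild.data for node in dom.getElementsByTagName("Name")]
```

Both functions return the same list of names; the SAX version never holds more than the current event, which is why streaming compressors tend to be built on SAX.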
Why Compress XML? XML is verbose: • Each non-empty element tag must end with a matching closing tag -- <tag>data</tag> • Ordering of tags is often repeated in a document (e.g. multiple records) • Tag names are often long
XML Compressors • View XML as a tree • Separate the tree structure and what is stored in leaves • Save the tree structure so that it can be restored • The compressed file may or may not remember the tree structure
XMill: Liefke and Suciu
Example document: <Book><Title lang="English">Views</Title> <Author>Miller</Author> <Author>Tai</Author> </Book>
Tree labels: Book = T1, Title = T2, @lang = T3, Author = T4; data containers: English in C3, Views in C4, Miller and Tai in C5
Encoded structure: T1 T2 T3 C3 / C4 / T4 C5 / T4 C5 / /
• Tree structure: • Start tags and attribute names are dictionary-encoded (as T1, T2, etc.) • End tags replaced with ‘/’ token • Data values are replaced with their container number
XMill: Liefke and Suciu • Data values are grouped into separate containers, one per element tag or attribute name • Different semantic compressors can be applied to different containers, and each container is then compressed with gzip • Compressed file does not remember the original structure
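The separation of structure from data can be sketched with a SAX handler. This is an illustration of the idea, not XMill's actual code: the class name XMillSplitter is made up, whitespace-only text is simply dropped, and container numbers start at 3 only so that the output lines up with the Book example.

```python
import xml.sax


class XMillSplitter(xml.sax.ContentHandler):
    """Split a document into a structure token list and data containers."""

    def __init__(self):
        super().__init__()
        self.tags = {}         # tag or "@attr" name -> dictionary index
        self.containers = {}   # owner name -> (container number, values)
        self.structure = []    # T<n>, C<n> and "/" tokens
        self.path = []         # currently open elements
        self.text = []         # pending character data

    def _tag(self, name):
        return "T%d" % self.tags.setdefault(name, len(self.tags) + 1)

    def _store(self, owner, value):
        # Assumption: containers are numbered from 3 in order of first use.
        if owner not in self.containers:
            self.containers[owner] = (len(self.containers) + 3, [])
        number, values = self.containers[owner]
        values.append(value)
        self.structure.append("C%d" % number)

    def _flush(self):
        value = "".join(self.text).strip()
        self.text = []
        if value:
            self._store(self.path[-1], value)

    def startElement(self, name, attrs):
        self._flush()
        self.path.append(name)
        self.structure.append(self._tag(name))
        for attr in attrs.getNames():
            self.structure.append(self._tag("@" + attr))
            self._store("@" + attr, attrs.getValue(attr))
            self.structure.append("/")

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self._flush()
        self.path.pop()
        self.structure.append("/")


def split(xml_bytes):
    handler = XMillSplitter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.structure, handler.containers
```

Calling split on the Book document yields the token list T1 T2 T3 C3 / C4 / T4 C5 / T4 C5 / /, with both author names landing in the same container so that similar values sit next to each other before compression.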
XMill: Decompression • Decompressor loads and unzips each container and the decompressed structure container is parsed • Whenever a data value is found in the structure, the next value is pulled from the corresponding data container and the appropriate semantic decompressor (if applicable) is applied to get back the original data value
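A minimal sketch of the decompression walk, assuming the containers have already been unzipped into Python lists. The function name rebuild and the convention that attribute names carry an "@" prefix are illustrative assumptions; semantic decompressors are left out.

```python
from xml.sax.saxutils import escape


def rebuild(structure, names, containers):
    """Rebuild XML text from structure tokens.

    names maps "T1" -> "Book" (attributes as "@lang");
    containers maps "C3" -> list of values in document order.
    """
    cursors = {key: iter(values) for key, values in containers.items()}
    out, stack = [], []
    tag_open = False   # True while a start tag still lacks its ">"

    for token in structure:
        if token == "/":
            top = stack.pop()
            if top.startswith("@"):
                out.append('"')            # close the attribute value
            else:
                if tag_open:
                    out.append(">")
                    tag_open = False
                out.append("</%s>" % top)
        elif token.startswith("T"):
            name = names[token]
            if name.startswith("@"):
                out.append(' %s="' % name[1:])
            else:
                if tag_open:
                    out.append(">")
                out.append("<" + name)
                tag_open = True
            stack.append(name)
        else:
            # A container reference: pull the next value from that container.
            value = next(cursors[token])
            if stack[-1].startswith("@"):
                out.append(escape(value, {'"': "&quot;"}))
            else:
                if tag_open:
                    out.append(">")
                    tag_open = False
                out.append(escape(value))
    return "".join(out)
```

With the Book example's tokens and containers this produces the original elements and values; inter-element whitespace is not restored in this sketch.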
XGrind: Tolani & Haritsa • The structure of the original XML document is retained by the compression process: compression is done at the granularity of individual element and attribute values.
XGrind • Operations on the compressed file • Querying the compressed document • Exact and prefix-match require no decompression • Range or partial-match require on-the-fly decompression of element/attribute values that appear in the query • Updates • Testing Validity (against the compressed DTD)
XGrind: Implementation • A context-free compression scheme must be used (the code assigned to a string does not depend on the location of this string). There are several types of compressors: • Tags (as in XMill) • Enumerated values (simple compressor, uses DTD) • Element/attribute value compressor: non-adaptive Huffman coding scheme. A separate Huffman tree is calculated for every non-enumerated data element
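The context-free property can be seen in a small non-adaptive Huffman sketch: the code table is built once from a sample of one element's values and then never changes, so a given string always encodes to the same bits. The function names are illustrative, and the table is built per character here; this is not XGrind's implementation.

```python
import heapq
from collections import Counter


def huffman_table(samples):
    """Build a fixed (non-adaptive) code table from sample values."""
    freq = Counter("".join(samples))
    # Each heap entry: (weight, unique tie-breaker, {char: code so far}).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(sorted(freq.items()))]
    if not heap:
        return {}
    if len(heap) == 1:
        return {ch: "0" for ch in heap[0][2]}
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (n1 + n2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(value, table):
    return "".join(table[ch] for ch in value)


def decode(bits, table):
    inverse = {code: ch for ch, code in table.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:      # codes are prefix-free, so this is unambiguous
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)
```

Because the table is fixed, a query value can be encoded once and compared against encoded values in the document without decoding them.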
XGrind
Example document: <Book><Title lang="English">Views</Title> <Author>Miller</Author> <Author>Tai</Author> </Book>
Tree labels: Book = T1, Title = T2, Author = T4, @lang = A1; data values as labeled in the figure: English (C3), Views (C4), Miller (C5), Tai (C5)
Compressed output: T1 T2 A1 nh(English) / nh(Views) / T4 nh(Miller) / T4 nh(Tai) / /
[ nh(s): output from Huffman compressor for s ]
• Tree structure: • Start tags and attribute names are dictionary-encoded (as T1, T2, for tags and A1, A2 for attributes) • End tags replaced with ‘/’ token
XGrind: Querying • The query engine works on the compressed document; it consists of a lexer and a parser • The query (the path and the predicate) is compressed • The parser checks whether the current path matches the query path and whether the compressed data value satisfies the compressed predicate
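An exact-match query over the compressed token stream might look like the sketch below. It reuses the Book example's tokens; nh is only a stand-in that mimics the slide notation (a real encoder would emit Huffman bits), and TAGS, STREAM and match are hypothetical names.

```python
def nh(value):
    # Stand-in for the context-free value compressor.
    return "nh(%s)" % value


# Tag and attribute dictionary taken from the Book example.
TAGS = {"Book": "T1", "Title": "T2", "Author": "T4", "@lang": "A1"}

STREAM = ["T1", "T2", "A1", nh("English"), "/", nh("Views"), "/",
          "T4", nh("Miller"), "/", "T4", nh("Tai"), "/", "/"]


def match(stream, path, value):
    """Return stream positions where the path and value both match.

    The query is compressed first; the stream itself is never decompressed.
    """
    query_path = [TAGS[step] for step in path.strip("/").split("/")]
    query_value = nh(value)
    tag_codes = set(TAGS.values())
    current, hits = [], []
    for position, token in enumerate(stream):
        if token == "/":
            current.pop()
        elif token in tag_codes:
            current.append(token)
        elif current == query_path and token == query_value:
            hits.append(position)
    return hits
```

For example, match(STREAM, "/Book/Author", "Tai") compares only compressed tokens: the path becomes T1 T4 and the predicate becomes nh(Tai).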
XMLPPM • A modification of the prediction-by-partial matching (PPM) text compression scheme. • To process XML data, the encoder chooses the appropriate PPM model from a set of several models depending on the current context supplied by the built-in SAX parser
AXECHOP and TREECHOP • AXECHOP: attempts to achieve highest-possible compression ratio by reordering original document • TREECHOP: willing to sacrifice a bit on compression ratio in order to preserve original XML structure and enable querying to be carried out on compressed document
AXECHOP: Key Features • Uses a grammar-based approach for compressing XML document structure • Outperforms general-purpose text compressors (e.g. gzip) by as much as 30% on XML • Operates offline (decompression can’t start until entire compressed file has been received) • Suited for XML data archiving, not for XML messaging applications (e.g. Web Services)
Grammar-based Compression • Achieves compression by producing a context-free grammar that uniquely derives the input sequence • Define a separate production for each repetition in the input • For second and subsequent occurrences, encode the LHS of the production rather than the pattern on the RHS
Grammar-based Compression: Example Original Input: abcdbcabc Generated Grammar: S → aAdAaA, A → bc
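That the grammar derives exactly the original input can be checked by expanding it. A small sketch, with the grammar stored as a dictionary from non-terminal to right-hand side (the representation and the function name expand are illustrative choices):

```python
def expand(grammar, symbol="S"):
    """Recursively replace non-terminals by their right-hand sides."""
    return "".join(
        expand(grammar, ch) if ch in grammar else ch
        for ch in grammar[symbol]
    )


GRAMMAR = {"S": "aAdAaA", "A": "bc"}
```

Here expand(GRAMMAR) walks S, substituting bc for each A, which is the job of a decompressor for any grammar-based scheme.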
AXECHOP: Compression Strategy • Perform a re-ordering of the XML document during SAX parsing • Use a byte-based encoding scheme to record the structure of the document – the “structure string” • Place data values for each element and attribute in a separate container to localize repetitions
AXECHOP: Compression Strategy 3 • Apply Multilevel Pattern Matching (MPM) algorithm to obtain grammar-based compression of the document structure • Compress the contents of each data value container using the Burrows-Wheeler block-sorting algorithm • Write compressed data to output file
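The block-sorting step applied to each container can be illustrated with a naive Burrows-Wheeler transform and its inverse. This is a textbook sketch with quadratic cost, not AXECHOP's implementation; it assumes the input never contains the sentinel character.

```python
SENTINEL = "\0"   # assumed absent from the input; sorts before all other characters


def bwt(text):
    """Return the last column of the sorted rotations of text + sentinel."""
    s = text + SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column):
    """Rebuild the rotation table column by column, then pick the original row."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(SENTINEL))
    return original[:-1]
```

The transform itself does not shrink anything; it groups equal characters together so that a later coding stage compresses the result better.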
AXECHOP Compression: Example 3 Original Structure String: 1 132 2 128 130 3 4 133 128 130 4 133 128 130 130 130 MPM-Generated Grammar:
AXECHOP: Decompression Strategy • Decompress the MPM code to obtain the document structure • Perform inverse BWT to get back the contents of each data value container • Perform a single pass through the reconstituted structure string
AXECHOP: Implementation • Written in C++ • Designed to be modular • instead of using MPM as structural compressor, can insert a different compressor • BWT can be swapped with a different container compressor
AXECHOP: Conclusions • AXECHOP achieves 2nd best average compression rate over a varied corpus of XML files • Future work: • Speed up compression through code optimization • Use a form of PPM in place of BWT for dictionary compression (PPM often achieves a better compression rate but tends to be slow) • Define an XML-conscious grammar-based compression scheme, instead of using “general-purpose” MPM
TREECHOP: Key Features • Carries out an online compression of the XML document tree • Since original document structure is maintained throughout compression, querying can be carried out without requiring decompression • Intended for XML messaging scenarios, where documents are being transmitted over a network • Encoding and querying strategies are based on the XPath standard
XPath • A W3C standard for identifying particular nodes of an XML document tree • Syntax is similar to that used for pathnames in UNIX • Example tree: <class> contains <instructor> and <students>; <students> contains <student> • In XPath: /class/students/student
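For a concrete feel, the path can be evaluated with Python's xml.etree.ElementTree, which supports a limited XPath subset. The sample document is an assumption modeled on the class example; because the parsed root already is <class>, the absolute path /class/students/student is written relative to it.

```python
import xml.etree.ElementTree as ET

DOC = """<class name="COMP 5113">
  <instructor>Bob Smith</instructor>
  <students>
    <student id="100000">Pete Wilson</student>
    <student id="100001">Lola Richardson</student>
  </students>
</class>"""


def student_names(xml_text):
    root = ET.fromstring(xml_text)          # root element is <class>
    # Equivalent of /class/students/student, relative to the root.
    return [node.text for node in root.findall("./students/student")]
```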
TREECHOP: Compression Strategy • Perform SAX parsing • Generate document tree • Assign a binary code word to each non-leaf tree node • Write tree encoding to compression stream
TREECHOP: Generating the Tree • Encoding has 3 important properties: • Each tree node inherits its parent’s code as a prefix • Two nodes share the same code word iff they have the same XPath location, as traced from tree root downwards • Maintains the structure of the original document throughout the compression process
TREECHOP: Tree Encoding Scheme • Given a non-leaf node N with parent node P, where N is the i-th distinct child node of P:
TREECHOP: Example Tree
<class>
  @name: “COMP 5113”
  <instructor>: “Bob Smith”
  <students>
    <student>: @id “100000”, “Pete Wilson”
    <student>: @id “100001”, “Lola Richardson”
TREECHOP: Example Tree
00 <class>
  0000 @name: “COMP 5113”
  00010 <instructor>: “Bob Smith”
  00011 <students>
    0001100 <student>: 000110000 @id “100000”, “Pete Wilson”
    0001100 <student>: 000110000 @id “100001”, “Lola Richardson”
TREECHOP: Writing the Tree • Encoded tree is written to compression stream in depth-first order • Non-leaf nodes: written as 4-tuple (L, C, T, D) • L is a byte indicating bit length of code word • C is a sequence of bytes containing code word • T is a byte indicating node type (e.g. element) • D is textual data stored in the node (e.g. element name) - reserved byte values are used to signal beginning/end of stream of raw character data
TREECHOP: Writing the Tree 2 • For 2nd and subsequent occurrences of a non-leaf node, only the 2-tuple (L, C) is transmitted – decoder can then infer T and D • Leaf nodes are written in the manner of D, above – as a stream of raw character data
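The (L, C, T, D) and (L, C) tuples could be serialized roughly as follows. The slides do not give the byte values, so the reserved delimiter bytes, the type codes, and the names pack_code and TreeWriter are all assumptions chosen for illustration; 0xFE and 0xFF are used as delimiters because they never occur in UTF-8 text.

```python
START, END = b"\xfe", b"\xff"     # hypothetical reserved bytes
ELEMENT, ATTRIBUTE = 1, 2         # hypothetical node type codes


def pack_code(bits):
    """Return L (bit length, one byte) followed by C (code word, zero-padded)."""
    length = len(bits)
    padded = bits.ljust((length + 7) // 8 * 8, "0")
    body = int(padded, 2).to_bytes(len(padded) // 8, "big")
    return bytes([length]) + body


class TreeWriter:
    def __init__(self):
        self.seen = set()          # code words already written in full
        self.out = bytearray()

    def write_node(self, bits, node_type, name):
        self.out += pack_code(bits)                    # L and C, always
        if bits not in self.seen:                      # first occurrence: add T and D
            self.seen.add(bits)
            self.out += bytes([node_type]) + START + name.encode("utf-8") + END

    def write_leaf(self, text):
        self.out += START + text.encode("utf-8") + END
```

With this layout the first <student> costs its full name, while the second costs two bytes: the length byte 7 and the padded code word.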
TREECHOP: Decompression Strategy • A code table is used to keep track of code words processed thus far • Allows future occurrences of a particular (L, C) pair to be mapped to the proper data type & value
TREECHOP: Decompression Strategy 2 • To maintain proper nesting, a stack is used • When a new tree node is processed, continue popping until the node on top of stack does not share a common code word prefix with current node • Last popped node is the parent of the current node
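One way to read the stack rule is sketched below. The exact stopping test is an assumption: nodes are popped until the code word on top of the stack is a proper prefix of the current node's code word, and that node is then taken as the parent. The code words come from the example tree, with 0000, 00010 and 00011 assumed to belong to @name, <instructor> and <students> respectively.

```python
def assign_parents(codes):
    """For code words in depth-first order, return each node's parent code."""
    stack, parents = [], []
    for code in codes:
        # Pop siblings and finished subtrees: anything that is not a proper
        # prefix of the current code word cannot be an ancestor.
        while stack and not (code.startswith(stack[-1])
                             and len(stack[-1]) < len(code)):
            stack.pop()
        parents.append(stack[-1] if stack else None)
        stack.append(code)
    return parents


# class, @name, instructor, students, student, @id, student, @id
EXAMPLE = ["00", "0000", "00010", "00011",
           "0001100", "000110000", "0001100", "000110000"]
```

The second <student> has the same code word as the first, so the first is popped as a sibling and <students> (00011) is found as the parent of both.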
TREECHOP: Querying • Queries are expressed using XPath • Once the equivalent code word for the query predicate has been determined, query matches can be quickly located by searching through the compression stream for other occurrences of the code word
TREECHOP: Querying Example 2 Search for all occurrences of ‘/class/students/student’ • Discover code word for ‘/class’ → 00 • Discover code word for ‘/class/students’ → 00011 • Discover code word for ‘/class/students/student’ → 0001100 • Extract data contained in next occurring leaf node → “Pete Wilson” • Scan through remainder of compression stream, looking for occurrences of code word 0001100; it occurs once more and the associated data value (“Lola Richardson”) is extracted
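The query walk can be sketched over a simplified, already-tokenized stream. Several things here are assumptions for illustration: the tuple layout, the names STREAM, resolve and query, the pairing of each leaf with its parent's code word, and the code words assigned to @name and <instructor>. Repeated nodes carry no name, mirroring the (L, C)-only form.

```python
# ("node", code, name or None) and ("leaf", parent code, text), depth-first.
STREAM = [
    ("node", "00", "class"),
    ("node", "0000", "@name"), ("leaf", "0000", "COMP 5113"),
    ("node", "00010", "instructor"), ("leaf", "00010", "Bob Smith"),
    ("node", "00011", "students"),
    ("node", "0001100", "student"),
    ("node", "000110000", "@id"), ("leaf", "000110000", "100000"),
    ("leaf", "0001100", "Pete Wilson"),
    ("node", "0001100", None),
    ("node", "000110000", None), ("leaf", "000110000", "100001"),
    ("leaf", "0001100", "Lola Richardson"),
]


def resolve(stream, path):
    """Discover the code word for a path one step at a time."""
    code = ""
    for step in path.strip("/").split("/"):
        code = next(
            c for kind, c, name in stream
            if kind == "node" and name == step
            and c.startswith(code) and len(c) > len(code)
        )
    return code


def query(stream, path):
    """Return the leaf data stored directly under nodes matching the path."""
    code = resolve(stream, path)
    return [text for kind, c, text in stream if kind == "leaf" and c == code]
```

Once the path is resolved to 0001100, finding matches is a plain scan for that code word; no element names need to be decoded again.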
TREECHOP: Current State • Java-based implementation is partially completed • Modeled after existing java.net package (e.g. XMLSocket corresponds to Socket)
Future Work • Finish implementation of TREECHOP • Use TREECHOP to validate compressed document using compressed grammars • Applications for XML filtering • Compressed stylesheets