TREECHOP: A Tree-based Query-able Compressor for XML

Gregory Leighton, Tomasz Müldner, James Diamond. Acadia University. June 6, 2005

Outline • XML • TREECHOP • Compression Strategy • Decompression Strategy • Querying Strategy • Experimental Results • Conclusions

Extensible Markup Language (XML) • What is it? • A standard for semi-structured data representation introduced in 1998 • Data is surrounded by markup tokens (elements and attributes) used to indicate semantic meaning • Characteristics? • Verbose (often 5 – 10 times larger than alternative formats like CSV) • Lots of repetition… plenty of opportunities for data compression

Example XML Document (figure: sample document with callouts marking a comment, an attribute, the root element, and a data value)

TREECHOP: Compression Strategy Parsing splits document into three segments: • Prologue: stores text occurring before document’s root element • Document Tree: contains all document contents between and including root element start and end tags • Epilogue: stores text occurring after document’s root element
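The three-way split can be illustrated with a small Python sketch. It is not the TREECHOP parser: `split_document` is a hypothetical helper that locates the root element with regular expressions, and it assumes that no comment in the prologue contains something that looks like a start tag, that no comment in the epilogue contains the root end tag, and that a self-closing root has no `>` inside an attribute value.

```python
import re

def split_document(xml_text):
    """Split XML text into (prologue, document tree, epilogue)."""
    # The first '<' followed by a name character starts the root element;
    # '<?' (declaration, PIs) and '<!' (comments, DOCTYPE) do not match.
    m = re.search(r"<([A-Za-z_:][^\s/>]*)", xml_text)
    if m is None:
        raise ValueError("no root element found")
    start, name = m.start(), m.group(1)
    # The last matching end tag closes the root element.
    close = re.compile(r"</" + re.escape(name) + r"\s*>")
    ends = [c.end() for c in close.finditer(xml_text, start)]
    if ends:
        end = ends[-1]
    else:
        # No end tag: treat the root as self-closing, e.g. <root/>
        end = xml_text.index(">", start) + 1
    return xml_text[:start], xml_text[start:end], xml_text[end:]

doc = ('<?xml version="1.0"?>\n<!-- inventory -->\n'
       '<books><book id="1">XML</book></books>\n<!-- end -->\n')
prologue, tree, epilogue = split_document(doc)
print(repr(prologue))
print(repr(tree))
print(repr(epilogue))
```

Concatenating the three segments reproduces the input exactly, which is the property a lossless compressor needs from this step.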

Example XML Document (figure: the sample document divided into its Prologue, Document Tree, and Epilogue segments)

Document Tree • Root node corresponds to document’s root element • Character data segments are represented using leaf nodes • XML markup represented using non-leaf nodes; 5 types of non-leaf nodes: • Element, attribute, CDATA, comment, processing instruction

Document Tree Generation (repeated for each token) 1. Get next token from XML parser 2. Construct tree node from token 3. Write tree node to compression stream

Document Tree Nodes Each node in the tree has an associated label value, L • Element node → name of the element • Attribute node → ‘@’ + name of the attribute • Comment, CDATA, processing instruction nodes → all text between delimiting section markers The path for a node v_n is /L_1/L_2/…/L_n, where v_1, v_2, …, v_n are the nodes on the route from the root node v_1 to v_n and L_i is the label of node v_i
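As an illustration of labels and paths, the following Python sketch lists the path of every element and attribute node in document order. It uses the standard `xml.etree.ElementTree` parser rather than TREECHOP's own; `node_paths` is a hypothetical helper, and comment, CDATA, and processing instruction nodes are left out because this parser discards them by default.

```python
import xml.etree.ElementTree as ET

def node_paths(xml_text):
    """Return the path /L_1/L_2/.../L_n of each element and attribute node."""
    paths = []

    def visit(elem, parent_path):
        # Element label: the element name
        path = parent_path + "/" + elem.tag
        paths.append(path)
        # Attribute label: '@' + attribute name
        for attr in elem.attrib:
            paths.append(path + "/@" + attr)
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return paths

sample = '<books><book id="1"><title>XML</title></book><book id="2"/></books>'
for p in node_paths(sample):
    print(p)
```

Both `book` elements produce the same path, `/books/book`, which is why nodes with identical paths can share one codeword.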

Codeword Generation • A binary codeword is assigned to each non-leaf node, based on node path • Multiple nodes with identical path are assigned same codeword • Codeword is used during decompression and querying operations to identify the value and type of each node

Codeword Generation • The codeword C(v) assigned to a non-leaf node v with parent node p is formed by the concatenation of three codes • C(p): the codeword assigned to p • G(v): Golomb code assigned to v based on its ordering relative to p. • T(v): a sequence of 3 bits used to indicate node type
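The concatenation C(v) = C(p) + G(v) + T(v) can be sketched in Python as follows. Several details are assumptions, since the slides do not give them: the Golomb parameter `m` (2 here), the concrete 3-bit values in `TYPE_BITS`, the root having an empty parent codeword, and `index` being the 0-based order in which a distinct child path first appears under its parent. Codewords are kept as strings of '0' and '1' for readability.

```python
# Hypothetical 3-bit type codes; the slides only say that 3 bits are used.
TYPE_BITS = {
    "element": "000",
    "attribute": "001",
    "cdata": "010",
    "comment": "011",
    "pi": "100",
}

def golomb(n, m):
    """Golomb code of n >= 0 with parameter m >= 1, as a bit string."""
    q, r = divmod(n, m)
    bits = "1" * q + "0"          # quotient in unary
    if m == 1:
        return bits               # no remainder bits needed
    b = (m - 1).bit_length()      # ceil(log2(m))
    cutoff = (1 << b) - m
    if r < cutoff:                # remainder in truncated binary
        return bits + format(r, "b").zfill(b - 1)
    return bits + format(r + cutoff, "b").zfill(b)

def codeword(parent_code, index, node_type, m=2):
    """C(v) = C(p) + G(v) + T(v)."""
    return parent_code + golomb(index, m) + TYPE_BITS[node_type]

root = codeword("", 0, "element")          # /books
book = codeword(root, 0, "element")        # /books/book
book_id = codeword(book, 0, "attribute")   # /books/book/@id
title = codeword(book, 1, "element")       # /books/book/title
print(root, book, book_id, title)
```

Because every codeword starts with its parent's codeword, a node's ancestors can be recognized from the codeword alone.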

Example XML Document

Example Document Tree

Codeword Assignment C(p) – portion inherited from parent node G(v) – portion assigned based on Golomb code T(v) – portion used to indicate node type

TREECHOP: Writing the Tree • Encoded tree is written to compression stream in depth-first order; gzip is applied to further compress the encoded tree • Non-leaf nodes: written as 3-tuple (L, C, D) • L is a byte indicating bit length of codeword • C is a sequence of ⌈L / 8⌉ bytes containing the codeword • D is the node’s label (e.g. element/attribute name); reserved byte values are used to signal beginning/end of sequence of raw character data

TREECHOP: Writing the Tree • On second and subsequent occurrences of a particular codeword, only the 2-tuple (L, C) is written (decoder is able to infer associated D) • Leaf nodes are transmitted in same manner as D value for non-leaf nodes • Each node encoding is transmitted immediately after node construction – avoids necessity of building entire tree in memory
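A minimal Python sketch of this node encoding follows. The class `TreeWriter` and the marker bytes `START` and `END` are hypothetical: the slides say only that reserved byte values delimit raw character data, so 0xFE and 0xFF are chosen here because they never occur in UTF-8 text. The sketch also assumes codewords are bit strings shorter than 254 bits, padded with zero bits up to a whole number of bytes.

```python
import gzip

START, END = 0xFE, 0xFF  # hypothetical reserved bytes; never valid in UTF-8

def pack_bits(bits):
    """Pack a bit string into ceil(len / 8) bytes, zero-padded on the right."""
    padded = bits.ljust(-(-len(bits) // 8) * 8, "0")
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

def frame_text(text):
    """Delimit raw character data with the reserved marker bytes."""
    return bytes([START]) + text.encode("utf-8") + bytes([END])

class TreeWriter:
    def __init__(self):
        self.seen = set()        # codewords already written with their label
        self.out = bytearray()   # encoded tree before gzip

    def write_node(self, code, label):
        """Non-leaf node: (L, C, D) on first occurrence, (L, C) afterwards."""
        if len(code) >= START:
            raise ValueError("codeword too long for this sketch")
        self.out.append(len(code))      # L: bit length of the codeword
        self.out += pack_bits(code)     # C: the codeword itself
        if code not in self.seen:
            self.seen.add(code)
            self.out += frame_text(label)   # D: the node's label

    def write_text(self, text):
        """Leaf node: character data, framed the same way as D."""
        self.out += frame_text(text)

    def finish(self):
        """Apply gzip to the encoded tree."""
        return gzip.compress(bytes(self.out))

w = TreeWriter()
w.write_node("00000", "books")
w.write_node("0000000000", "book")
w.write_text("XML")
w.write_node("0000000000", "book")   # repeat: label is omitted
print(bytes(w.out))
```

Each call appends to the output immediately, so nothing beyond the set of codewords seen so far has to be kept in memory.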

TREECHOP: Decompression Strategy • Decoder operates by reading node data from compression stream. For each non-leaf node: • Determine D value • Determine node type • Surround D with XML syntax appropriate to the node type and immediately emit to the decompression stream
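The decoder loop can be sketched in Python under stated assumptions. The byte layout is hypothetical: marker bytes 0xFE and 0xFF frame character data, and the last 3 bits of each codeword select the node type through the made-up table `TYPES`. The slides do not describe how end tags are placed, so `markup` covers only the self-delimiting node types.

```python
START, END = 0xFE, 0xFF  # hypothetical reserved bytes framing character data
TYPES = {"000": "element", "001": "attribute", "010": "cdata",
         "011": "comment", "100": "pi"}   # hypothetical type codes

def read_framed(buf, pos):
    """Read START ... END at buf[pos]; return (text, position after END)."""
    end = buf.index(END, pos)
    return buf[pos + 1:end].decode("utf-8"), end + 1

def decode_nodes(buf):
    """Yield (node type, D) for non-leaf nodes and ("text", data) for leaves."""
    labels = {}   # codeword -> D, learned at the first occurrence
    pos = 0
    while pos < len(buf):
        if buf[pos] == START:                 # leaf node
            text, pos = read_framed(buf, pos)
            yield ("text", text)
            continue
        nbits = buf[pos]                      # L
        nbytes = (nbits + 7) // 8
        raw = buf[pos + 1:pos + 1 + nbytes]   # C
        code = "".join(format(b, "08b") for b in raw)[:nbits]
        pos += 1 + nbytes
        if code not in labels:                # first occurrence carries D
            labels[code], pos = read_framed(buf, pos)
        yield (TYPES[code[-3:]], labels[code])

def markup(node_type, label):
    """Surround D with the XML syntax for self-delimiting node types."""
    wrappers = {"comment": ("<!--", "-->"),
                "cdata": ("<![CDATA[", "]]>"),
                "pi": ("<?", "?>")}
    left, right = wrappers[node_type]
    return left + label + right

stream = (bytes([5, 0x00, 0xFE]) + b"books" + bytes([0xFF])
          + bytes([10, 0x00, 0xC0, 0xFE]) + b" note " + bytes([0xFF])
          + bytes([0xFE]) + b"hi" + bytes([0xFF])
          + bytes([10, 0x00, 0xC0]))
for node in decode_nodes(stream):
    print(node)
```

The second occurrence of the comment codeword carries no label in the stream; the decoder recovers it from the `labels` table.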

TREECHOP: Querying Strategy • An individual query handler is registered with the decoder for each query • Single scan of compression stream is carried out, using a stack to keep track of current path • When query predicate path is matched, the current codeword is recorded and remainder of compression stream is scanned for future occurrences • Each time a query match is encountered, the associated D value is extracted from the compression stream and passed to the query handler for processing
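The single-scan query loop can be sketched in Python over an already-decoded node stream. Everything here is illustrative: `run_queries` and the event tuples are hypothetical, queries are plain absolute paths, and the "D value" handed to a handler is taken to be the character data directly under a matched node. The sketch also assumes that a node's codeword begins with its parent's codeword, which lets the stack be unwound without explicit end markers, and that the document has no mixed content.

```python
def run_queries(events, handlers):
    """Single scan over events; handlers maps a path to a callable.

    events: ("node", codeword, label) for non-leaf nodes,
            ("text", value) for leaf nodes.
    """
    stack = []     # (codeword, label) pairs from the root to the current node
    matched = {}   # codeword -> handler, recorded at the first path match
    for event in events:
        if event[0] == "node":
            _, code, label = event
            # Unwind until the top of the stack is a proper prefix (an ancestor)
            while stack and not (code != stack[-1][0]
                                 and code.startswith(stack[-1][0])):
                stack.pop()
            stack.append((code, label))
            if code not in matched:
                path = "/" + "/".join(lbl for _, lbl in stack)
                if path in handlers:
                    matched[code] = handlers[path]
        elif stack and stack[-1][0] in matched:
            # Later occurrences are recognized by codeword alone
            matched[stack[-1][0]](event[1])

events = [
    ("node", "00000", "books"),
    ("node", "0000000000", "book"),
    ("node", "000000000000001", "@id"), ("text", "1"),
    ("node", "000000000001000", "title"), ("text", "XML"),
    ("node", "0000000000", "book"),
    ("node", "000000000000001", "@id"), ("text", "2"),
    ("node", "000000000001000", "title"), ("text", "Compression"),
]
titles, ids = [], []
run_queries(events, {"/books/book/title": titles.append,
                     "/books/book/@id": ids.append})
print(titles, ids)
```

After the first match, the path string is never rebuilt for that codeword; a set lookup is enough to detect later matches.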

Experimental Results: Compression Rates

Experimental Results: Compression/Decompression Speed Distance between sender/receiver: 20 km / 12 miles

Experimental Results: Querying Distance between sender/receiver: 20 km / 12 miles (chart: Query Execution Time (msec) versus XML Document Size (KB) for GZIP/XSLT, TREECHOP, and Raw XML/XSLT)

Conclusions • TREECHOP compresses at rates comparable to gzip, while also providing query-friendly annotations to the compression stream • Using TREECHOP querying in place of alternative methods like XSLT yields a significant performance advantage on medium- to large-sized XML documents; advantage increases with document size
