XML Compression

XML Compression • Aslam Tajwala • Kalyan Chakravorty

Overview • Motivation for XML Compression • Techniques for achieving XML compression • XMill • XMill Architecture

Why Compress XML? • Structured nature of XML makes it understandable to humans • Downside: XML is verbose • Each non-empty element's start tag must be matched by a closing tag, e.g. <tag>data</tag> • The same sequence of tags is often repeated within a document (e.g. multiple records)

Why Compress XML?: 2 • XML documents are text-based: well-known compression schemes such as Huffman and LZ can be easily applied • Can gain significant savings from compression, due to the highly structured nature of XML • XML is being used more frequently in real-time applications (e.g. web service-based e-commerce applications), which increases interest in finding ways to reduce the overall size of XML documents

Using Huffman/LZ • Usually some degree of repetition in an XML document (multiple occurrences of tags, attribute or data values) • Compression schemes like Huffman and LZ can use this repetition to achieve some degree of compression

Using Huffman/LZ: 2 • Many existing (and efficient) implementations of these algorithms are readily available (e.g. gzip) • Downside is that these techniques aren’t fully capable of exploiting the structure of XML to achieve greater compression
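A minimal sketch of the point above, using Python's zlib module (the library behind gzip). The sample document is invented for illustration: 200 identically shaped records, so the repeated tags give the generic compressor plenty to work with.

```python
import zlib

# Hypothetical sample data: 200 records that repeat the same tags.
record = ('<Employee id="{0}"><Name>Employee {0}</Name>'
          '<Location>Springfield</Location></Employee>')
xml_doc = "<Employees>" + "".join(record.format(i) for i in range(200)) + "</Employees>"

raw = xml_doc.encode("utf-8")
packed = zlib.compress(raw, 9)      # DEFLATE = LZ77 matching + Huffman coding
restored = zlib.decompress(packed)  # lossless round trip

print(len(raw), "bytes before,", len(packed), "bytes after")
```

The compressor treats the document as plain bytes here; it has no notion of tags, attributes, or data types, which is the limitation the slide points out.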

Huffman Encoding Example • ACDABA • Since there are 6 characters, this text is 6 bytes or 48 bits long • A tree is built that replaces the symbols with shorter bit sequences. In this particular case, the algorithm would use the following substitution table: A=0, B=10, C=110, D=111 • 01101110100 (ACDABA = 11 bits)
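The example above can be checked with a short sketch. The function names (`huffman_code`, `encode`, `decode`) are illustrative, not taken from any library. Ties between equal frequencies can be broken in more than one way, so the table this code builds may differ from the one on the slide while still producing the same total of 11 bits.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(text)
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate one-symbol input
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

def encode(text, table):
    return "".join(table[ch] for ch in text)

def decode(bits, table):
    inverse = {code: sym for sym, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:               # prefix-free, so first hit is right
            out.append(inverse[current])
            current = ""
    return "".join(out)

slide_table = {"A": "0", "B": "10", "C": "110", "D": "111"}
slide_bits = encode("ACDABA", slide_table)   # the slide's table

built_table = huffman_code("ACDABA")         # a table built from frequencies
built_bits = encode("ACDABA", built_table)
```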

LZ77 Example (Dictionary-Based Compressors) • Lempel-Ziv 77 algorithm • The dictionary is a portion of the previously encoded sequence • The encoder examines the input sequence through a sliding window • The window consists of two parts: • a search buffer that contains a portion of the recently encoded sequence, and • a look-ahead buffer that contains the next portion of the sequence to be encoded.
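A minimal sketch of the sliding-window idea. The slide does not say what the encoder emits, so this assumes the classic formulation: triples of (offset back into the search buffer, match length, next literal character). The buffer sizes and function names are arbitrary choices for illustration.

```python
def lz77_encode(data, search_size=16, lookahead_size=8):
    """Encode a string as (offset, length, next_char) triples."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - search_size)                   # search buffer start
        # Keep one character in reserve for the literal that ends the triple.
        max_len = min(lookahead_size, len(data) - i - 1)
        for j in range(start, i):
            length = 0
            # A match may run past position i (overlap), as LZ77 allows.
            while length < max_len and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    """Rebuild the string by copying from already-decoded output."""
    out = []
    for off, length, ch in triples:
        start = len(out) - off
        for k in range(length):          # copy one by one so overlaps work
            out.append(out[start + k])
        out.append(ch)
    return "".join(out)

sample = "abracadabra abracadabra"
triples = lz77_encode(sample)
```

Real implementations such as zlib use hash chains to find matches instead of this brute-force scan, and they Huffman-code the output.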

XMill (Liefke and Suciu, 2000) • Relies heavily on zlib, the compression library used in gzip • Also defines a few data type specific compressors; user-defined compressors can be added using SCAPI (Semantic Compressor API) • During compression, each XML tag is examined to see which compression technique(s) should be applied

XML Compression • View XML as a tree • Separate the tree structure and what is stored in leaves • Save the tree structure so that it can be restored • The compressed file may or may not remember the tree structure

XMill: Compression Strategy • XMill applies 3 principles during compression: • Separate structure (element tags and attribute names) from data • Group related data items in a single container; compress each container separately • Apply appropriate semantic compressors to each container

XMill – Separating Structure From Content • Start tags and attribute names are dictionary-encoded (as T1, T2, etc.) • End tags replaced with ‘/’ token • Data values replaced with their container number

XMill – Separating Structure From Content 2
<Employees>
  <Employee id="1">Homer Simpson</Employee>
  <Employee id="2">Frank Grimes</Employee>
</Employees>
• Structure Container: T1 T2 T3 C3 / C4 / T2 T3 C3 / C4 / /
• Dictionary: T1 => Employees, T2 => Employee, T3 => @id
• C3: 1, 2
• C4: Homer Simpson, Frank Grimes
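The separation step can be sketched with Python's built-in SAX parser. This is an illustration of the idea, not XMill's code: the `Separator` class is hypothetical, containers are keyed by element path, and they are numbered in order of first use, so the output shows C1/C2 where the slide shows C3/C4.

```python
import xml.sax

class Separator(xml.sax.ContentHandler):
    """Split a document into a token stream, a tag dictionary, and containers."""

    def __init__(self):
        super().__init__()
        self.dictionary = {}   # tag or @attribute name -> token number
        self.containers = {}   # path -> list of data values
        self.structure = []    # structure token stream
        self.path = []
        self._text = []

    def _tag(self, name):
        number = self.dictionary.setdefault(name, len(self.dictionary) + 1)
        return f"T{number}"

    def _store(self, value):
        key = "/".join(self.path)
        self.containers.setdefault(key, []).append(value)
        index = list(self.containers).index(key) + 1   # order of first use
        self.structure.append(f"C{index}")

    def _flush_text(self):
        text = "".join(self._text).strip()
        self._text = []
        if text:                                       # skip whitespace-only runs
            self._store(text)

    def startElement(self, name, attrs):
        self._flush_text()
        self.path.append(name)
        self.structure.append(self._tag(name))
        for attr, value in attrs.items():
            self.path.append("@" + attr)
            self.structure.append(self._tag("@" + attr))
            self._store(value)
            self.structure.append("/")
            self.path.pop()

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.structure.append("/")
        self.path.pop()

doc = (b'<Employees><Employee id="1">Homer Simpson</Employee>'
       b'<Employee id="2">Frank Grimes</Employee></Employees>')
sep = Separator()
xml.sax.parseString(doc, sep)
structure_line = " ".join(sep.structure)
```

Each container now holds values of one kind (all ids, all names), which is what lets a specialised compressor be applied per container.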

XMill: Container Expressions • Users can override default settings using the container expression language • Specify container membership • Specify which semantic compressor(s) are applied for each container • E.g. to indicate all ‘Name’ and ‘Location’ tags should be grouped in the same container: xmill -p //(Name | Location) employees.xml

XMill: Semantic Compressors

XMill: Semantic Compressors 2

XMill: Semantic Compressors 3 • Text compressor is applied to each element by default • User can add other instructions via command line: xmill -p //price=>i file.xml • Applies integer compressor to each occurrence of ‘price’ element in file.xml

XMill Architecture (1/3)

XMill Architecture (2/3) • SAX Parser • sends tokens to the path processor. • Path Processor • determines how to map data values to containers. • Semantic Compressors • compress the input and copy it to the container in the memory window. • E.g. binary encoding of integers, differential compressors. • When the window is filled, all containers are gzipped, stored on disk, and compression resumes.
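A rough sketch of the last two stages, under stated assumptions. The encodings (4-byte integers, NUL-terminated text), the `ContainerWindow` class, and the window size are all invented for illustration and are not XMill's actual formats. A list stands in for the file on disk.

```python
import struct
import zlib

def text_compressor(values):
    """Default: store each value as UTF-8 followed by a NUL terminator."""
    return b"".join(v.encode("utf-8") + b"\0" for v in values)

def int_compressor(values):
    """Binary-encode decimal strings as 4-byte big-endian unsigned integers."""
    return b"".join(struct.pack(">I", int(v)) for v in values)

def delta_compressor(values):
    """Differential encoding: first value, then successive differences."""
    nums = [int(v) for v in values]
    if not nums:
        return b""
    deltas = [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]
    return b"".join(struct.pack(">i", d) for d in deltas)

class ContainerWindow:
    """Hold containers in memory; compress and emit all when the window fills."""

    def __init__(self, window_bytes):
        self.window_bytes = window_bytes
        self.containers = {}    # container name -> bytearray
        self.used = 0
        self.blocks = []        # stand-in for blocks written to disk

    def append(self, name, encoded):
        self.containers.setdefault(name, bytearray()).extend(encoded)
        self.used += len(encoded)
        if self.used >= self.window_bytes:
            self.flush()

    def flush(self):
        if not self.containers:
            return
        # Each container is compressed separately with zlib.
        self.blocks.append({name: zlib.compress(bytes(buf))
                            for name, buf in self.containers.items()})
        self.containers = {}
        self.used = 0

window = ContainerWindow(window_bytes=16)    # tiny window to force a flush
window.append("price", int_compressor(["10", "20", "30"]))   # 12 bytes
window.append("name", text_compressor(["Homer"]))            # 6 bytes, flushes
```

After the flush the window is empty and compression can continue with the next part of the input.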

Performance Evaluation (1/2)

Performance Evaluation (2/2)

References • XMill: An Efficient Compressor for XML Data • XGrind: A Query-Friendly XML Compressor • www.cs.washington.edu/homes/suciu/COURSES/590DS/19compression.ppt

Questions?
