Explore XML compression benefits, techniques like Huffman/LZ, XMill architecture, and why XML compression is essential for reducing file size. Learn about semantic compressors and XMill's strategy for effective XML data compression.
XML Compression Aslam Tajwala Kalyan Chakravorty
Overview • Motivation for XML Compression • Techniques for achieving XML compression • XMill • XMill Architecture
Why Compress XML? • The structured nature of XML makes it understandable to humans • Downside: XML is verbose • Each non-empty element's start tag must be matched by a closing tag: <tag>data</tag> • The same sequence of tags is often repeated within a document (e.g. multiple records)
Why Compress XML?: 2 • XML documents are text-based: well-known compression schemes such as Huffman and LZ can be easily applied • Compression can yield significant savings because XML is highly structured • XML is being used more frequently in real-time applications (e.g. web service-based e-commerce applications), so there is increasing interest in finding ways to reduce the overall size of XML documents
Using Huffman/LZ • Usually some degree of repetition in an XML document (multiple occurrences of tags, attribute or data values) • Compression schemes like Huffman and LZ can use this repetition to achieve some degree of compression
Using Huffman/LZ: 2 • Many existing (and efficient) implementations of these algorithms are readily available (e.g. gzip) • Downside is that these techniques aren’t fully capable of exploiting the structure of XML to achieve greater compression
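A minimal sketch of the general-purpose approach, using Python's standard gzip module on a made-up, record-oriented document. The sample data and element names are assumptions for illustration only; the point is just that repeated tags compress well without any knowledge of XML.

```python
import gzip

# Hypothetical sample: 200 near-identical records, mimicking a record-oriented XML file.
records = "".join(
    f'<Employee id="{i}"><Name>Person {i}</Name></Employee>' for i in range(200)
)
xml_bytes = f"<Employees>{records}</Employees>".encode("utf-8")

compressed = gzip.compress(xml_bytes)    # LZ77 + Huffman (DEFLATE) under the hood
restored = gzip.decompress(compressed)   # lossless round trip

print(len(xml_bytes), "bytes before,", len(compressed), "bytes after")
```

The compressor sees only a byte stream here; it has no notion of which bytes are tags and which are data.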
Huffman Encoding Example • ACDABA • Since there are 6 characters, this text is 6 bytes or 48 bits long (at 8 bits per character) • A tree is built that replaces the symbols with shorter bit sequences, giving the most frequent symbol the shortest code. In this particular case, the algorithm would use the following substitution table: A=0, B=10, C=110, D=111 • 01101110100 (ACDABA = 11 bits)
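The substitution step above can be checked with a short sketch. It uses the code table from the slide as given rather than building the Huffman tree; the function names are illustrative.

```python
# Substitution table taken from the slide. No code is a prefix of another,
# so the bit string can be decoded without separators.
CODES = {"A": "0", "B": "10", "C": "110", "D": "111"}

def encode(text):
    """Replace each symbol by its bit sequence."""
    return "".join(CODES[ch] for ch in text)

def decode(bits):
    """Read bits left to right, emitting a symbol whenever a full code is seen."""
    inverse = {code: sym for sym, code in CODES.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)

bits = encode("ACDABA")
print(bits, len(bits))  # 01101110100 11
```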
LZ77 Example (Dictionary-Based Compressors) • Lempel-Ziv 77 algorithm • The dictionary is a portion of the previously encoded sequence • The encoder examines the input sequence through a sliding window • The window consists of two parts: • a search buffer that contains a portion of the recently encoded sequence, and • a look-ahead buffer that contains the next portion of the sequence to be encoded.
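The sliding-window idea can be sketched as follows. This is a simplified, unoptimized illustration that emits (offset, length, next symbol) triples; the buffer sizes and function names are assumptions, and real implementations such as zlib use a different output format and much faster matching.

```python
def lz77_encode(data, search_size=16, lookahead_size=8):
    """Encode a string as (offset, length, next_symbol) triples."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - search_size)              # left edge of the search buffer
        # Keep one symbol in reserve so every triple ends with a literal.
        max_len = min(lookahead_size, len(data) - i - 1)
        for j in range(start, i):
            length = 0
            # A match may run past position i into the look-ahead buffer.
            while length < max_len and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    """Rebuild the string by copying from already-decoded output."""
    out = []
    for off, length, symbol in triples:
        for _ in range(length):
            out.append(out[-off])                    # copy one symbol at a time
        out.append(symbol)
    return "".join(out)

sample = "<a>x</a><a>y</a><a>z</a>"
triples = lz77_encode(sample)
```

Repeated tags such as `</a><a>` are found in the search buffer and replaced by a back-reference, which is why LZ-style compressors do reasonably well on XML.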
XMill (Liefke and Suciu, 2000) • Relies heavily on zlib, the compression library used in gzip • Also defines a few data type specific compressors; user-defined compressors can be added using SCAPI (Semantic Compressor API) • During compression, each XML tag is examined to see which compression technique(s) should be applied
XML Compression • View XML as a tree • Separate the tree structure and what is stored in leaves • Save the tree structure so that it can be restored • The compressed file may or may not remember the tree structure
XMill: Compression Strategy • XMill applies 3 principles during compression: • Separate structure (element tags and attribute names) from data • Group related data items in a single container; compress each container separately • Apply appropriate semantic compressors to each container
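A minimal sketch of the second principle, grouping values into containers and compressing each container on its own with zlib. The container names, the NUL separator, and the helper functions are assumptions for illustration and do not reflect XMill's actual file format.

```python
import zlib

SEP = "\0"  # assumed separator; values must not contain it

def compress_containers(containers):
    """Compress every container independently of the others."""
    return {
        name: zlib.compress(SEP.join(values).encode("utf-8"))
        for name, values in containers.items()
    }

def decompress_containers(blobs):
    """Inverse of compress_containers."""
    return {
        name: zlib.decompress(blob).decode("utf-8").split(SEP)
        for name, blob in blobs.items()
    }

# Hypothetical containers: similar values sit next to each other.
containers = {
    "ids": ["1", "2"],
    "names": ["Homer Simpson", "Frank Grimes"],
}
blobs = compress_containers(containers)
```

Keeping similar values together gives the underlying compressor more homogeneous input than the interleaved original document.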
XMill – Separating Structure From Content • Start tags and attribute names are dictionary-encoded (as T1, T2, etc.) • End tags replaced with ‘/’ token • Data values replaced with their container number
XMill – Separating Structure From Content 2
<Employees>
  <Employee id="1">Homer Simpson</Employee>
  <Employee id="2">Frank Grimes</Employee>
</Employees>
Structure Container: T1 T2 T3 C3 / C4 / T2 T3 C3 / C4 / /
Dictionary: T1 => Employees, T2 => Employee, T3 => @id
C3: 1, 2
C4: Homer Simpson, Frank Grimes
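The separation shown on this slide can be sketched with Python's standard SAX parser. This is an illustration of the idea, not XMill's implementation: the class name is hypothetical, containers are chosen by the enclosing element or attribute name, container numbers are assigned in order of first use (so they come out as C1 and C2 rather than the slide's C3 and C4), and whitespace-only text is dropped.

```python
import xml.sax

class StructureSeparator(xml.sax.ContentHandler):
    """Splits a document into a token stream, a tag dictionary, and data containers."""

    def __init__(self):
        super().__init__()
        self.dictionary = {}       # tag or @attribute name -> "T<n>"
        self.structure = []        # structure token stream
        self.containers = {}       # "C<n>" -> list of data values
        self._container_ids = {}   # tag or @attribute name -> "C<n>"
        self._open = []            # stack of open element names
        self._text = []            # text chunks seen since the last tag

    def _tag(self, name):
        return self.dictionary.setdefault(name, f"T{len(self.dictionary) + 1}")

    def _store(self, name, value):
        cid = self._container_ids.setdefault(name, f"C{len(self._container_ids) + 1}")
        self.containers.setdefault(cid, []).append(value)
        self.structure.append(cid)   # the value is replaced by its container number

    def _flush_text(self):
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self._store(self._open[-1], text)

    def startElement(self, name, attrs):
        self._flush_text()
        self._open.append(name)
        self.structure.append(self._tag(name))
        for attr_name, value in attrs.items():
            self.structure.append(self._tag("@" + attr_name))
            self._store("@" + attr_name, value)
            self.structure.append("/")

    def characters(self, content):
        self._text.append(content)   # SAX may deliver text in several chunks

    def endElement(self, name):
        self._flush_text()
        self._open.pop()
        self.structure.append("/")   # every end tag becomes the same token

doc = (
    '<Employees><Employee id="1">Homer Simpson</Employee>'
    '<Employee id="2">Frank Grimes</Employee></Employees>'
)
handler = StructureSeparator()
xml.sax.parseString(doc.encode("utf-8"), handler)
print(" ".join(handler.structure))
```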
XMill: Container Expressions • Users can override default settings using the container expression language • Specify container membership • Specify which semantic compressor(s) are applied for each container • E.g. to indicate all ‘Name’ and ‘Location’ tags should be grouped in the same container: xmill -p //(Name | Location) employees.xml
XMill: Semantic Compressors 3 • Text compressor is applied to each element by default • User can add other instructions via command line: xmill -p //price=>i file.xml • This applies the integer compressor to each occurrence of the ‘price’ element in file.xml
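To show why a type-specific compressor can help, here is a hedged sketch of an integer compressor for a container of 'price' values. It combines fixed-width binary encoding with a differential step, both of which the deck names as examples of semantic compressors. The encoding details (4-byte little-endian integers, the sample prices, the function names) are assumptions and not XMill's actual format.

```python
import struct
import zlib

def delta_encode(values):
    """Store each value as the difference from its predecessor."""
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the differences."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out

def pack_ints(values):
    """Binary-encode integers as 4-byte little-endian signed values."""
    return struct.pack(f"<{len(values)}i", *values)

def unpack_ints(blob):
    return list(struct.unpack(f"<{len(blob) // 4}i", blob))

# Hypothetical contents of a 'price' container, as text taken from the document.
prices = ["1000", "1005", "1010", "1020", "1021"]
ints = [int(p) for p in prices]

blob = zlib.compress(pack_ints(delta_encode(ints)))
restored = delta_decode(unpack_ints(zlib.decompress(blob)))
```

The sketch assumes every value parses as an integer that fits in 32 bits; a real compressor needs a fallback for values that do not.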
XMill Architecture (2/3) • SAX Parser • sends tokens to the path processor. • Path Processor • determines how to map data values to containers. • Semantic Compressors • compress the input and copy it to the container in the memory window. • E.g. binary encoding of integers, differential compressors. • When the window is filled, all containers are gzipped and stored on disk, and compression resumes.
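The memory-window behaviour described above can be sketched as follows. The class, its parameters, and the use of a Python list in place of the output file are all assumptions for illustration; only the overall flow (buffer containers in memory, compress all of them when the window fills, then continue) comes from the slide.

```python
import zlib

class WindowedContainers:
    """Buffers container data in memory and flushes when a size budget is reached."""

    def __init__(self, window_bytes, sink):
        self.window_bytes = window_bytes   # size of the memory window
        self.sink = sink                   # list standing in for the file on disk
        self.containers = {}               # container name -> bytearray
        self.used = 0                      # bytes currently held in the window

    def append(self, name, data):
        self.containers.setdefault(name, bytearray()).extend(data)
        self.used += len(data)
        if self.used >= self.window_bytes:
            self.flush()                   # window is full: compress and start over

    def flush(self):
        if not self.containers:
            return
        block = {
            name: zlib.compress(bytes(buf))
            for name, buf in self.containers.items()
        }
        self.sink.append(block)
        self.containers = {}
        self.used = 0

sink = []
window = WindowedContainers(window_bytes=32, sink=sink)
for _ in range(10):
    window.append("names", b"Homer Simpson\0")   # 14 bytes per value
window.flush()                                    # write out whatever is left
```

With a 32-byte window and 14-byte values, the window fills on every third append, so ten values end up in four blocks.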
References • XMill: An Efficient Compressor for XML Data (Liefke and Suciu, 2000) • XGrind: A Query-Friendly Compressor • www.cs.washington.edu/homes/suciu/COURSES/590DS/19compression.ppt