This paper presents a specialized XML compression technique that achieves high compression ratios and fast performance, specifically designed for web-based XML data transfer. The approach includes techniques such as separate streams for element and attribute names, dictionary coding, pattern encoding, and variable-length contexts. The scheme aims to reduce the verbosity of XML documents and improve the total transfer time over a network.
A Highly Efficient XML Compression Scheme for the Web
Przemysław Skibiński¹, Jakub Swacha², Szymon Grabowski³
¹ Uniwersytet Wrocławski, Instytut Informatyki, ul. Joliot-Curie 15, 50-383 Wrocław, Poland. E-mail: inikep@ii.uni.wroc.pl
² Uniwersytet Szczeciński, Instytut Informatyki w Zarządzaniu, ul. Mickiewicza 64, 71-101 Szczecin, Poland. E-mail: jakubs@sus.univ.szczecin.pl
³ Politechnika Łódzka, Katedra Informatyki Stosowanej, al. Politechniki 11, 90-924 Łódź, Poland. E-mail: sgrabow@kis.p.lodz.pl
<conf_data> <conf_name>SOFSEM</conf_name> <conf_location><town>Nový Smokovec</town><country>Slovakia</country></conf_location> <conf_date><month>January</month><year>2008</year></conf_date></conf_data>
What’s wrong with XML
XML is textual, which is good for many reasons, but it is also verbose (NEED FOR COMPRESSION!).
XML databases can be large: Protein Sequence Database (annotated) – 683 MB; DBLP Computer Science – 127 MB. (Lots of information stored, but also a verbose representation.)
More and more XML documents are exchanged through the Web (the advent of the Open XML format in MS Office 2007 can only accelerate this trend).
XML compression goals
What shall we do: use general-purpose compression (e.g. zip, bzip2, ppmd)?
That is far from optimal (known since 1999, when the first XML compressors, e.g. XMill, appeared). The compression ratio can be improved. Speed can be improved as well (maybe not easily with zip, though...).
Compression ratio and (de)compression speed are typically conflicting criteria; which should we favor?
WE CARE ABOUT THE TOTAL TRANSFER TIME (OVER A NETWORK).
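The total-transfer-time criterion can be made concrete with a small sketch. It assumes the simplest possible model, in which the archive is fully downloaded and then decompressed; the function name and the two candidate compressors are hypothetical and are not measurements from this work.

```python
def total_transfer_time(compressed_bytes, bandwidth_bps, decompress_seconds):
    """Seconds to download an archive and then decompress it.

    Assumes download and decompression run one after the other; a streaming
    decoder could overlap the two and finish sooner.
    """
    return compressed_bytes * 8 / bandwidth_bps + decompress_seconds


# Hypothetical candidates (illustrative numbers only):
# (name, compressed size in bytes, decompression time in seconds)
candidates = [("fast_but_larger", 1_000_000, 0.1),
              ("slow_but_smaller", 700_000, 1.5)]

for bandwidth in (384_000, 10_000_000):  # bits per second
    best = min(candidates,
               key=lambda c: total_transfer_time(c[1], bandwidth, c[2]))
    print(bandwidth, "bps ->", best[0])
```

With these made-up numbers the stronger but slower compressor wins on the slow link, and the faster one wins on the fast link, which is the trade-off the slide refers to.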
Specialized XML compression
• XMill (Liefke & Suciu, 1999, 2000) – separate streams: element and attribute names; actual content (text); XML document structure. Significant gains, especially with gzip as the back-end compressor.
• XMLPPM (Cheney, 2001) – switching between different PPM models. Novel idea: injecting a symbol from the previous model into the current context (so both the “traditional” and the element-related contexts matter).
• SCMPPM (Adiego et al., 2004) – XMLPPM taken to the extreme: a separate model for each element path. Beats XMLPPM on large files, but also needs lots of memory.
Redundancy in XML databases
• Every end tag must match the corresponding start tag → each end tag may be replaced with merely a closing flag.
• Some words appear with high frequency → build a dictionary, not only over tag / attribute names, but also over the textual content.
• Physical layout is often regular → encode trailing spaces in lines at almost zero cost. A similar approach often works for End-of-Line characters.
• The decimal system is verbose → compact integers (use e.g. base 256).
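Two of these redundancies can be illustrated with a short sketch. It is not the authors' implementation: `CLOSE_FLAG` is a hypothetical one-byte marker assumed absent from the input, and the regex-based tag matching is a simplification that ignores comments, CDATA sections, processing instructions and `>` inside attribute values.

```python
import re

CLOSE_FLAG = "\x01"  # hypothetical marker, assumed not to occur in the document

# Matches a start tag (group 1 = name, group 2 = "/" if self-closing)
# or the closing flag itself.
TOKEN = re.compile(r"<([A-Za-z_][^\s<>/]*)[^<>]*?(/?)>|" + re.escape(CLOSE_FLAG))


def strip_end_tags(xml):
    """Replace every end tag with the one-character closing flag."""
    return re.sub(r"</[^<>]+>", CLOSE_FLAG, xml)


def restore_end_tags(text):
    """Rebuild end tags from a stack of the currently open elements."""
    stack = []

    def repl(match):
        if match.group(0) == CLOSE_FLAG:
            return "</" + stack.pop() + ">"
        if not match.group(2):           # not a self-closing tag
            stack.append(match.group(1))
        return match.group(0)

    return TOKEN.sub(repl, text)


def compact_int(digits):
    """Store a decimal string as a base-256 (big-endian) byte string."""
    n = int(digits)
    return n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")


doc = '<a><b x="1">t</b><c/></a>'
assert restore_end_tags(strip_end_tags(doc)) == doc
assert len(compact_int("1234567")) == 3   # 7 decimal digits fit in 3 bytes
```

The stack works because XML requires end tags to close elements in strict nesting order, so the name of each end tag is fully determined by the preceding start tags.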
Our web-compression-oriented transform, a bird's-eye view
• Design assumption: dedicated for PPM compressors (e.g. PPMd).
• Semi-dynamic dictionary: use byte coding for words that appear at least fmin = 64 times in the document. (The dictionary is front-compressed and stored in the archive.)
• The notion of a word also comprises: start XML tags, URL prefixes ( http://domain/ ), emails, &data, =" and "> patterns, runs of spaces.
• Integers and some other patterns are encoded densely.
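The slides say only that the dictionary is front-compressed. As an illustration, here is a minimal sketch of classic front coding, in which each word is stored as the length of the prefix it shares with the previous word plus the remaining suffix. The function names are hypothetical, and the order in which the actual scheme stores its words is not stated here, so the sorted input below is an assumption.

```python
def front_encode(words):
    """Front coding: (shared-prefix length with previous word, suffix)."""
    pairs, prev = [], ""
    for word in words:
        k = 0
        while k < min(len(word), len(prev)) and word[k] == prev[k]:
            k += 1
        pairs.append((k, word[k:]))
        prev = word
    return pairs


def front_decode(pairs):
    """Inverse of front_encode."""
    words, prev = [], ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        words.append(prev)
    return words


words = ["<article", "<author", "<authors"]   # assumed to be sorted
pairs = front_encode(words)
print(pairs)
assert front_decode(pairs) == words
```

Front coding pays off most when neighbouring entries share long prefixes, as start tags and URL prefixes tend to do.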
Dictionary coding
• 1st pass: gather the words of at least lmin = 2 characters with at least fmin occurrences, and sort them according to frequency.
• Variable-length coding is used: from 1 up to 4 bytes.
• The codeword alphabet: the 127-255 range, most of the 0-31 range, plus a few more characters.
• Non-intersecting value ranges, of sizes w, x, y, z, are used for the different codeword bytes. Namely: w 1-byte codewords, x·w 2-byte codewords, y·x·w 3-byte codewords, z·y·x·w 4-byte codewords.
• The parameters w, x, y, z are selected according to the size of the created dictionary, with the principle of maximizing the number of short codewords.
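The two steps above can be sketched as follows. This is one possible reading of the slide rather than the authors' code: it assumes the last byte of every codeword comes from the w-range and the leading byte's range reveals the codeword length, it works with abstract symbol indices instead of the real byte values, and it tokenizes with a plain `\w+` regex rather than the richer notion of a word used by the transform. All function names are hypothetical.

```python
import re
from collections import Counter


def build_dictionary(text, fmin=64, lmin=2):
    """First pass: frequent words, most frequent first (ties broken by word)."""
    counts = Counter(w for w in re.findall(r"\w+", text) if len(w) >= lmin)
    frequent = [(w, c) for w, c in counts.items() if c >= fmin]
    return [w for w, c in sorted(frequent, key=lambda wc: (-wc[1], wc[0]))]


def make_coder(w, x, y, z):
    """Codewords over four disjoint symbol ranges of sizes w, x, y, z."""
    sizes = [w, x, y, z]
    bases = [0, w, w + x, w + x + y]               # start of each range
    counts = [w, x * w, y * x * w, z * y * x * w]  # codewords per length

    def encode(rank):
        for length, count in enumerate(counts, start=1):
            if rank < count:
                break
            rank -= count
        else:
            raise ValueError("rank exceeds dictionary capacity")
        symbols = []
        for i in range(length):        # i = 0 is the final (w-range) byte
            rank, digit = divmod(rank, sizes[i])
            symbols.append(bases[i] + digit)
        return symbols[::-1]           # leading byte first

    def decode(symbols):
        first = symbols[0]
        length = 1 + next(i for i in range(4)
                          if bases[i] <= first < bases[i] + sizes[i])
        rank = 0
        for i, s in zip(range(length - 1, -1, -1), symbols):
            rank = rank * sizes[i] + (s - bases[i])
        return rank + sum(counts[:length - 1])

    return encode, decode


encode, decode = make_coder(2, 2, 2, 2)   # tiny ranges, for illustration only
assert all(decode(encode(r)) == r for r in range(2 + 4 + 8 + 16))
assert len(encode(1)) == 1 and len(encode(2)) == 2
```

Because words are ranked by frequency, the most frequent ones receive the shortest codewords, which is why a larger w is preferable whenever the dictionary is small enough to allow it.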
Pattern encoding
Some patterns, such as integers, dates (in a specified format) and IPs, occur frequently and can be encoded densely in binary. Original idea: XMill. In XWP: automatic detection (no need for a DTD or human assistance).
XWP handles:
• integers from 1900...2155 (years) – 2 bytes (incl. a flag),
• other integers – from 2 to 5 bytes (up to 2^32),
• IP addresses – 5 bytes,
• dates (e.g., 1980-02-31, 01-MAR-1920) – 2 or 3 bytes, differential encoding,
• times (e.g., 11:30pm, 23:20, 23:30:59) – 3 or 4 bytes,
• page ranges – 4 bytes,
• floats x.x (0.0...24.9) and .xx – 2 bytes.
Encoding of time and range patterns
Page ranges x-y are encoded on 4 bytes: flag, number x on 2 bytes, difference y-x on 1 byte. [Example from dblp.xml]
Numbers from 1...12 followed by “am” or “pm” are interpreted as times and encoded on 3 bytes: time pattern flag, the hour (in 24-h convention), the minutes.
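The two layouts described above can be sketched as follows. The flag byte values `PAGE_FLAG` and `TIME_FLAG` are hypothetical, as are the function names, and the sketch only accepts the `h:mmam` / `h:mmpm` form of a 12-hour time without a leading zero, so that decoding reproduces the original text exactly.

```python
import re

PAGE_FLAG = 0xF0  # hypothetical flag values, chosen only for this sketch
TIME_FLAG = 0xF1


def encode_page_range(text):
    """'x-y' -> flag, x on 2 bytes, (y - x) on 1 byte; None if not encodable."""
    m = re.fullmatch(r"(0|[1-9]\d*)-(0|[1-9]\d*)", text)
    if not m:
        return None
    x, y = int(m.group(1)), int(m.group(2))
    if x > 0xFFFF or not 0 <= y - x <= 0xFF:
        return None
    return bytes([PAGE_FLAG]) + x.to_bytes(2, "big") + bytes([y - x])


def decode_page_range(code):
    x = int.from_bytes(code[1:3], "big")
    return f"{x}-{x + code[3]}"


def encode_time(text):
    """'h:mmam' / 'h:mmpm' -> flag, hour in 24-h convention, minutes."""
    m = re.fullmatch(r"(1[0-2]|[1-9]):([0-5]\d)(am|pm)", text)
    if not m:
        return None
    hour = int(m.group(1)) % 12 + (12 if m.group(3) == "pm" else 0)
    return bytes([TIME_FLAG, hour, int(m.group(2))])


def decode_time(code):
    hour, minute = code[1], code[2]
    suffix = "am" if hour < 12 else "pm"
    return f"{hour % 12 or 12}:{minute:02d}{suffix}"


for text in ("123-145", "7-7"):
    assert decode_page_range(encode_page_range(text)) == text
for text in ("11:30pm", "12:15am", "12:00pm", "1:05am"):
    assert decode_time(encode_time(text)) == text
```

Returning `None` for anything that does not fit the layout (for example a page range wider than 255 pages) leaves such text to be coded by the other stages of the transform.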
PPMVC (PPM with variable-length contexts) [Skibiński & Grabowski, 2004]
The main weakness of most PPM algorithms is poor handling of long matching sequences (as opposed to LZ77 algorithms, which excel at it).
Using high orders (16+) is memory-hungry and quite slow, and it is hard to overcome the so-called zero-frequency problem.
A possible solution: coupling PPM with LZ matching. Original idea: PPM* (Cleary et al., 1995). Another implementation: PPMZ (Bloom, 1998).
PPMVC, cont’d
In PPMVC, each maximum-order context holds a pointer to the reference context (the previous occurrence of the context) and the minimum left match length.
The left match length (LML) = the length of the common part of the active context and the reference context. The LML is always at least as large as the maximum PPM order.
The right match length (RML) = the length of the match between the symbols to encode and the symbols that follow the reference context.
If the left match between the current position and the previous maximum-order context occurrence is at least minLML, then the RML (0 or more) is sent to the output. If not, plain PPM coding (Shkarin's PPMd) is used.
In practice it is better to quantize the RML, e.g. round it down to a multiple of 8.
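The two match lengths can be illustrated with a small encoder-side sketch. It is not the PPMVC implementation: the function names, the `limit` cap on the left match and the byte-buffer interface are assumptions made for the example, with `ref` and `pos` being the positions just after the reference context and the active context respectively.

```python
def left_match_length(data, ref, pos, limit=255):
    """Length of the common suffix of data[:ref] and data[:pos] (the LML)."""
    n = 0
    while n < limit and ref - n > 0 and data[ref - 1 - n] == data[pos - 1 - n]:
        n += 1
    return n


def right_match_length(data, ref, pos, step=8):
    """Length of the match between the symbols at pos and those at ref
    (the RML), rounded down to a multiple of `step`."""
    n = 0
    while pos + n < len(data) and data[ref + n] == data[pos + n]:
        n += 1
    return n - n % step


def choose_mode(data, ref, pos, min_lml):
    """Return ('match', rml) when the left match is long enough, else ('ppm', None)."""
    if left_match_length(data, ref, pos) >= min_lml:
        return "match", right_match_length(data, ref, pos)
    return "ppm", None


data = b"abcdefgh12345678XYZ" + b"--" + b"abcdefgh12345678XQ"
ref, pos = 8, 29   # both positions follow an occurrence of b"abcdefgh"
assert left_match_length(data, ref, pos) == 8
assert right_match_length(data, ref, pos, step=1) == 9   # b"12345678X"
assert choose_mode(data, ref, pos, min_lml=8) == ("match", 8)
```

Quantizing the RML shrinks the set of values that has to be coded; the symbols beyond the rounded-down length are then handled by the ordinary PPM path.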
FastPAQ
A relatively fast compressor from the PAQ (Mahoney, 2002-2007) family.
PAQ features:
• working on the bit level,
• mixing predictions from various models run in parallel (PPM-like models, string matching model, word model, tabular data model, etc.),
• mixing predictions with several neural networks,
• adaptive probability maps (APM) mechanism to update the models considering previous experience and the current context,
• extremely high compression, extremely slow.
FastPAQ features:
• models irrelevant for XML removed,
• APM stages simplified,
• much faster than PAQ8 for a reasonable loss in compression.
Experiments: methodology etc.
The test machine: Intel Core 2 Duo E6600 2.40 GHz, 1 GB RAM, two Seagate 250 GB SATA drives in RAID mode 1, Windows XP (64-bit).
Implementation: C++ (Visual C++ 6.0).
XML-WRT v3.1 with sources: http://www.ii.uni.wroc.pl/~inikep/.
Back-end compressors used: gzip 1.2.4, Pavlov’s LZMA (used in 7-zip), Shkarin’s PPMd, PPMVC, FastPAQ.
Decompression and transmission times
XWRT3+PPMVC is the best choice for transmission speeds up to 384 kbps. At 1 Mbps, it is beaten only by XWRT2 (our previous scheme) + LZMA. Still, XWRT3 decompression is streamlined.
Conclusions
XWRT3 (XWP transform + PPMVC) seems to be the best choice for transmitting XML documents over slow / moderate-speed networks.
For high-bandwidth networks, XWP+PPMVC may be slower in retrieving a document than XWRT2+LZMA, but both the transform (XWP) and the coder (PPMVC) components are streamlined, i.e. immediate display of the (beginning of the) document is possible.
Best XML compression ratios presented so far:
• with PPMVC, outperforming SCMPPM by 9% on average,
• with FastPAQ (alas, impractical), an extra 9% average gain.