XML Compression Techniques Gregory Leighton Web Data Management Lab Department of Computer Science University of Calgary June 24, 2008
Outline • XML primer • XML-conscious compression schemes • Non-queryable schemes • Queryable schemes • Future directions
The Extensible Markup Language (XML) • A W3C-endorsed standard • Originally intended as a web document authoring format, has since become a popular method for encoding semi-structured data • Data integration, data exchange applications • Support for native (tree-based) storage of XML in most commercial and open source DBMSs • MySQL, DB2, Oracle, SQL Server • A markup language: text content (PCDATA) can be surrounded by descriptive markup (elements and attributes)
Example: An XML Document
<course>
  <name>CS 501</name>
  <instructor>Ron Charles</instructor>
  <students>
    <student name="Alice">
      <a1>78</a1>
      <a2>86</a2>
      <midterm>91</midterm>
      <project>87</project>
    </student>
    <student name="Bob">
      <a1>69</a1>
      <a2>71</a2>
      <midterm>82</midterm>
      <finalexam>60</finalexam>
    </student>
  </students>
</course>
XML Data Model • XML documents can be modeled as ordered, labeled trees • Preserves ancestor-descendant relationships and sibling ordering • Each node is assigned a unique ordinal value corresponding to its position in a pre-order traversal • Internal nodes represent elements or attributes • Leaf nodes represent text content (PCDATA) or attribute values • Label of incident edge stores the node’s class (for internal nodes) or text content (for leaf nodes)
Example: XML Document Tree
course (o1)
  name (o2): "CS 501" (o3)
  instructor (o4): "Ron Charles" (o5)
  students (o6)
    student (o7)
      @name (o8): "Alice" (o9)
      a1 (o10): "78" (o11)
      a2 (o12): "86" (o13)
      midterm (o14): "91" (o15)
      project (o16): "87" (o17)
    student (o18)
      @name (o19): "Bob" (o20)
      a1 (o21): "69" (o22)
      a2 (o23): "71" (o24)
      midterm (o25): "82" (o26)
      finalexam (o27): "60" (o28)
Path expression: /course • Textual representation: <course></course>
Path expression: /course/name • Textual representation: <course> <name></name> </course>
Path expression: /course/name/text() • Textual representation: <course> <name>CS 501</name> </course>
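The pre-order numbering used by the data model can be sketched in a few lines. This is an illustration only, assuming Python's standard `xml.etree.ElementTree` parser and a hypothetical helper name `preorder_nodes`; attributes are visited before element text, and whitespace-only text is ignored.

```python
import xml.etree.ElementTree as ET

DOC = """<course>
  <name>CS 501</name>
  <instructor>Ron Charles</instructor>
  <students>
    <student name="Alice">
      <a1>78</a1><a2>86</a2><midterm>91</midterm><project>87</project>
    </student>
    <student name="Bob">
      <a1>69</a1><a2>71</a2><midterm>82</midterm><finalexam>60</finalexam>
    </student>
  </students>
</course>"""

def preorder_nodes(elem, out=None):
    """Return (ordinal, kind, label) triples in pre-order (hypothetical helper)."""
    if out is None:
        out = []
    out.append((len(out) + 1, "element", elem.tag))
    # Attributes become an internal node plus a leaf holding the value.
    for attr, value in elem.attrib.items():
        out.append((len(out) + 1, "attribute", "@" + attr))
        out.append((len(out) + 1, "text", value))
    # Non-blank text content becomes a leaf node.
    text = (elem.text or "").strip()
    if text:
        out.append((len(out) + 1, "text", text))
    for child in elem:
        preorder_nodes(child, out)
    return out

nodes = preorder_nodes(ET.fromstring(DOC))
for ordinal, kind, label in nodes:
    print(f"o{ordinal}: {kind} {label}")
```

For the course document this yields 28 nodes, with the two student elements at o7 and o18.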
XML Technologies • Parsing APIs • Document Object Model (DOM): produces an in-memory representation of the document tree; efficient traversal, but large memory consumption • Simple API for XML (SAX): event-based, depth-first parsing of the document; much smaller memory consumption, but some navigation operations become expensive (allows serial access only) • Query Languages • XPath: allows tree nodes to be selected based on their position and/or value • /course/name/text() • //student[@name="Alice"] • XQuery: a higher-level, declarative query language that incorporates XPath
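A small sketch contrasting the two parsing styles, using only Python's standard library. `xml.etree.ElementTree` stands in here for a DOM-style in-memory tree (it is not the W3C DOM API and supports only a limited XPath subset); the handler class `StudentNames` is a hypothetical name.

```python
import xml.sax
import xml.etree.ElementTree as ET

DOC = b"""<course><name>CS 501</name><students>
<student name="Alice"><a1>78</a1></student>
<student name="Bob"><a1>69</a1></student>
</students></course>"""

# Tree-based access: the whole document is materialised in memory,
# after which nodes can be revisited in any order.
root = ET.fromstring(DOC)
course_name = root.findtext("name")                 # ~ /course/name/text()
alice = root.findall(".//student[@name='Alice']")   # ~ //student[@name="Alice"]

class StudentNames(xml.sax.ContentHandler):
    """Event-based access: a single serial pass, no tree is built."""
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, tag, attrs):
        if tag == "student":
            self.names.append(attrs.get("name"))

handler = StudentNames()
xml.sax.parseString(DOC, handler)
print(course_name, len(alice), handler.names)
```

The SAX handler keeps only the state it chooses to keep, which is why memory use stays small; the price is that it cannot go back to an earlier node without re-parsing.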
XML-Conscious Compression Schemes
Taxonomy
compressors
  generic text: bzip2, gzip, PPM variants, etc.
  XML-conscious
    non-queryable: AXECHOP, DTDPPM, SCMPPM, XComp, XMill, XMLPPM, XWRT
    queryable
      sequential access: BPLEX, TREECHOP, XGRIND, XPRESS
      random access: XQueC, XSeq
Other Classification Criteria • Schema-aware or schema-oblivious? • Information in schema documents can allow document structure to be encoded more succinctly • Some schema languages (e.g., XML Schema) specify data types – this knowledge can be used to guide selection of compression schemes • Limited applicability: not all documents have an associated schema document • Online vs. offline operation • Can decompression be carried out incrementally? • Compression paradigm used • Homomorphic or permutation-based?
Schema-aware Compression: Example • <!ELEMENT course (name,instructor,students)> indicates that each <course> element must have <name>, <instructor>, and <students> elements as children (in that order): no unpredictability • If encoder and decoder both possess the DTD, 0 bits are needed to represent the structure of <course> elements! • <!ELEMENT student (a1,a2,midterm,(finalexam|project))>: similarly, only 1 bit is needed to indicate whether <finalexam> or <project> appears as the fourth child of <student>
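The 1-bit claim for <student> can be made concrete with a toy encoder. This is a sketch under the assumption that encoder and decoder share the content model above; the function names and the bit assignment (0 for finalexam, 1 for project) are arbitrary choices for illustration, not taken from any particular compressor.

```python
import xml.etree.ElementTree as ET

# Assumed content model, known to both encoder and decoder:
#   <!ELEMENT student (a1,a2,midterm,(finalexam|project))>
FIXED_CHILDREN = ["a1", "a2", "midterm"]
CHOICE = ["finalexam", "project"]   # bit 0 -> finalexam, bit 1 -> project

def encode_student_structure(student):
    """Return the bits needed to describe the child tags of one <student>."""
    tags = [child.tag for child in student]
    if len(tags) != 4 or tags[:3] != FIXED_CHILDREN or tags[3] not in CHOICE:
        raise ValueError("element does not conform to the assumed DTD")
    return str(CHOICE.index(tags[3]))

def decode_student_structure(bit):
    """Rebuild the child tag sequence from the single choice bit."""
    return FIXED_CHILDREN + [CHOICE[int(bit)]]

alice = ET.fromstring(
    "<student name='Alice'><a1>78</a1><a2>86</a2>"
    "<midterm>91</midterm><project>87</project></student>")
bit = encode_student_structure(alice)
print(bit, decode_student_structure(bit))
```

Only the structure is covered here; the text values and the attribute value would still have to be encoded separately.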
Other Classification Criteria: Intended Application Domains • Archiving: volumes of data are compressed to preserve disk space, not accessed frequently • E.g., Web server logs • Priorities: compression ratio, compression speed • Data exchange: for data transferred over a network, the key goals are to improve throughput and reduce bandwidth consumption • E.g., web services, instant messaging • Priorities: compression/decompression speed, online operation • Database/IR applications: declarative queries are issued over XML documents; compression can improve query performance by reducing number of I/O operations • E.g., scientific databases • Priorities: queryable compression w/ random access, decompression speed
Permutation-based Approaches • Document is rearranged to localize repetitions before passing to back-end compressor(s) • Data segments are grouped into different containers, typically based on the identity of parent element • Tag structure (“skeleton”) and data segments are compressed separately • Pipeline: XML document → shredder → skeleton (to structure compressor) and data containers (to data compressor)
Homomorphic Approaches • Each XML token is compressed individually, “in-place” • Compression process maintains structure of original document • Poorer compression, but easier to query than permutation-based approaches (less fragmentation)
Non-Queryable Compressors
XMill (Liefke & Suciu, 2000) • Introduced idea of separately compressing document structure and data, container grouping of text segments • gzip is used as back-end compressor • Data-centric XML: often beats gzip’s compression of medium- to large-sized XML documents (> 20 KB) by 35%-60% • Document-centric XML: little to no improvement over gzip’s compression rate • Compression/decompression speed comparable to gzip’s
XMill Compression strategy is based on 3 principles: • Separation of structure from data • Start tags, attributes are assigned a binary codeword • Container-based storage of data segments, using path-based partitioning by default • Custom partitioning policies can be defined using a container expression language • Semantic compressors may optionally be applied to each container • E.g., differential encoder for numeric data; specialized compressors for handling specific formats like dates and URLs
XMill Example
<a>
  <b> <c>text 1</c> </b>
  <b> <c>text 2</c> </b>
  <c>text 3</c>
  <d>text 4</d>
</a>
[Figure: four separate streams, each compressed with gzip, are combined into the compressed file]
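The separation of structure from data can be sketched as follows. This is not XMill's actual file format: the function name `shred`, the `T<n>` / `#` / `/` skeleton tokens, and the use of `zlib` (DEFLATE, the algorithm behind gzip) are assumptions made for illustration, and attributes are ignored to keep the sketch short. The input is a small document shaped like the slide's example.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def shred(xml_text):
    """Split a document into a tag skeleton and path-keyed text containers."""
    codes = {}                       # tag name -> small integer codeword
    skeleton = []                    # structure stream
    containers = defaultdict(list)   # root-to-element path -> text segments

    def visit(elem, path):
        path = path + "/" + elem.tag
        skeleton.append("T%d" % codes.setdefault(elem.tag, len(codes)))
        text = (elem.text or "").strip()
        if text:
            skeleton.append("#")     # placeholder: a data item belongs here
            containers[path].append(text)
        for child in elem:
            visit(child, path)
        skeleton.append("/")         # end tag

    visit(ET.fromstring(xml_text), "")
    return skeleton, dict(containers)

doc = ("<a><b><c>text 1</c></b><b><c>text 2</c></b>"
       "<c>text 3</c><d>text 4</d></a>")
skeleton, containers = shred(doc)

# Each stream is compressed independently.
compressed = {"skeleton": zlib.compress(" ".join(skeleton).encode())}
for path, values in containers.items():
    compressed[path] = zlib.compress("\n".join(values).encode())

print(" ".join(skeleton))
print(containers)
```

With path-based partitioning, the two <c> values under <b> share the container /a/b/c, while /a/c and /a/d each get their own, giving four compressed streams in total.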
XMill: Related Approaches • XComp (Li, 2003): groups data values into containers based on <label, level, node_type>, then applies gzip to containers and structural summary; little (<2%) to no improvement over XMill’s compression rate or time • AXECHOP (Leighton, 2005a): applies grammar-based scheme (MPM) to compress structural summary, uses bzlib for container compression; outperforms XMill on most documents
XMLPPM (Cheney, 2001) • Compression process is centered around two concepts: • Encoded SAX parsing (ESAX): each SAX event is replaced with a more succinct encoding • Multiplexed hierarchical modeling (MHM): separate PPM models are maintained for elements, attributes, textual content, and characters • Additional symbols are injected into models to preserve the context formed by original tag hierarchy • A single arithmetic coder is shared between all models • Often compresses 15-35% better than XMill; main drawback is slow operation
XMLPPM: Related Approaches • SCMPPM (Adiego et al, 2004): maintains a separate model for every distinct element/attribute; often achieves better compression than XMLPPM • DTDPPM (Cheney, 2005): consults DTD to increase accuracy of symbol prediction; only effective on small (no more than a few MBs) and highly-structured documents
XWRT (Skibiński et al, 2007) • The “XML Word-Replacing Transform” • Pre-processes the document in hopes of boosting compression performance of the selected back-end scheme (LZ77 or LZMA) • XWRT + LZMA often outperforms XMLPPM and SCMPPM in compression ratio, while offering faster performance! • Pipeline: XML document → XWRT → pre-processed document → LZ77/LZMA → compressed file
XWRT: Pre-processing Techniques • Dynamic dictionary of frequently used words • Grouping of data values into containers, based on name of encapsulating element/attribute • Use of additional containers to encode some types of data with a predictable format • Numbers, dates, times
Non-Queryable Compressors: Summary
Queryable Compressors
XGRIND (Tolani & Haritsa, 2002) • Encodes elements and attributes using XMill’s approach • DTD-conscious: enumerated attributes with k possible values are encoded using a log2(k)-bit scheme • Data values are encoded using non-adaptive Huffman coding • Requires two passes over the input document • Separate statistical model for each element/attribute • Homomorphic compression: compressed document retains original structure
XGRIND
Original fragment:
<student name="Alice">
  <a1>78</a1>
  <a2>86</a2>
  <midterm>91</midterm>
  <project>87</project>
</student>
Compressed fragment:
T0 A0 nahuff(Alice) T1 nahuff(78) / T2 nahuff(86) / T3 nahuff(91) / T4 nahuff(87) / /
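The token-for-token (homomorphic) mapping can be reproduced with a short traversal. This sketch only generates the symbolic form shown on the slide: `nahuff(...)` is left as a textual placeholder rather than an actual non-adaptive Huffman code, and the function name and ID assignment order are assumptions.

```python
import xml.etree.ElementTree as ET

def homomorphic_tokens(elem, tag_ids, attr_ids, out):
    """Emit one output token per XML token, in document order."""
    out.append("T%d" % tag_ids.setdefault(elem.tag, len(tag_ids)))
    for name, value in elem.attrib.items():
        out.append("A%d" % attr_ids.setdefault(name, len(attr_ids)))
        out.append("nahuff(%s)" % value)     # placeholder for the coded value
    text = (elem.text or "").strip()
    if text:
        out.append("nahuff(%s)" % text)      # placeholder for the coded value
    for child in elem:
        homomorphic_tokens(child, tag_ids, attr_ids, out)
    out.append("/")                          # generic end-tag marker
    return out

fragment = ET.fromstring(
    '<student name="Alice"><a1>78</a1><a2>86</a2>'
    '<midterm>91</midterm><project>87</project></student>')
tokens = homomorphic_tokens(fragment, {}, {}, [])
print(" ".join(tokens))
```

Because the output keeps the nesting of the input, a query processor can walk the compressed stream with the same logic it would use on the original document.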
XGRIND • Many queries can be carried out entirely in compressed domain • Exact-match, prefix-match • Some others require only decompression of relevant values • Range, substring • Queryability comes at the expense of achievable compression ratio: typically about 65-75% of that achieved by XMill
XPRESS (Min et al, 2003) • Like XGRIND, features a homomorphic compression process requiring two passes over the document • Defines a process called reverse arithmetic encoding for representing each distinct root-to-node label path as an interval in [0.0, 1.0) • For intervals I1, I2: if I1 contains I2, then path P1 is a suffix of path P2 • This feature facilitates execution of ancestor-descendant queries in compressed domain
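The containment property can be checked on a small example. The sketch below follows one reading of reverse arithmetic encoding: start from the interval of the last tag on the path and narrow it by the interval of the preceding path. The equal-width tag intervals are an assumption made for simplicity (the scheme described in the paper derives them from tag frequencies), and all function names are hypothetical.

```python
from fractions import Fraction

def tag_intervals(tags):
    """Give each tag an equal slice of [0, 1) (assumed; not frequency-based)."""
    width = Fraction(1, len(tags))
    return {t: (i * width, (i + 1) * width) for i, t in enumerate(tags)}

def path_interval(path, intervals):
    """Interval for a label path such as ['course', 'students', 'student']."""
    lo, hi = intervals[path[-1]]
    if len(path) == 1:
        return lo, hi
    # Narrow the last tag's interval by the interval of the preceding path.
    q_lo, q_hi = path_interval(path[:-1], intervals)
    return lo + (hi - lo) * q_lo, lo + (hi - lo) * q_hi

def contains(outer, inner):
    """True if the half-open interval `inner` lies inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

iv = tag_intervals(["course", "name", "students", "student"])
full = path_interval(["course", "students", "student"], iv)
suffix = path_interval(["students", "student"], iv)
other = path_interval(["name", "student"], iv)
print(full, suffix, contains(suffix, full), contains(other, full))
```

A query such as //students/student then reduces to an interval-containment test against the interval of the path students/student, without decoding the labels.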
XPRESS • Attempts to automatically infer types of data values, and then applies a type-specific compression scheme • Numeric data: differential encoding scheme • Textual data: 1-byte dictionary encoding scheme (if < 128 distinct values), Huffman otherwise • Queries supported in compressed domain: • Numeric data: exact match, range queries • Textual data: exact match • Tends to compress better than XGRIND, while also operating 2-3 times faster on typical XPath queries • XMill’s compression rate tends to be ~20-25% better
XSeq (Lin et al, 2005) • First phase of compression groups tags and data segments into structure and data value containers • The contents of each container are then compressed using a grammar-based scheme (SEQUITUR) • Several indices are constructed to enable efficient, random-access querying over compressed containers • Supports exact-match, range queries in compressed domain • Compression rate: comparable to gzip, as much as 30% better than XGRIND’s
XSeq: Indices • File header: consists of • a list of pointers to structure container index and to each data value container index • a table recording mappings b/w tags and substitution codes used in structure container • Structure container index and data value container indices: record information about the SEQUITUR grammar generated from document’s tag sequence (resp., from contents of individual data value container) • number of grammar rules • number of symbols in each rule’s RHS • for each rule, occurrence counts of terminal symbols in the RHS
XSeq: Query Processing • Example: /course/students/student[@name="Alice"] • Determine token sequences relevant for query by consulting mapping table stored in file header • Consult structure index to determine which grammar rules contain relevant tokens • Expand out relevant rules and extract data values from appropriate data value containers
TREECHOP (Leighton et al, 2005b) • A queryable compressor intended mainly for data exchange applications • Recipient receives a stream of XML data, wants to selectively process certain nodes • Avoids necessity of decompressing entire data stream in memory to extract values • Single-pass compression process: tree nodes are encoded adaptively during depth-first traversal of document tree, and passed to a back-end gzip compressor • Achieves compression rates comparable to gzip, while allowing selection of random nodes; compression/decompression speed is slightly lower than gzip’s
TREECHOP • Non-leaf nodes are encoded as a binary code with three constituent parts: • the codeword of the parent node p • a variable-length Golomb codeword recording the relative position of this node w.r.t. p • a fixed-length code indicating the node type (element, attribute, comment, CDATA, or processing instruction) • For first occurrence only, the node text (e.g., element name) is appended immediately after the codeword • Leaf nodes are written as raw text sequences • Queries supported in the compressed domain: exact match and range
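For readers unfamiliar with Golomb codes, a generic encoder is sketched below. It illustrates the general technique only: the parameter `m`, the unary convention (ones terminated by a zero), and the function name are assumptions, and the sketch says nothing about how TREECHOP picks its parameter or lays out the rest of the codeword.

```python
def golomb_encode(n, m):
    """Golomb codeword (as a bit string) for integer n >= 0, parameter m >= 1."""
    q, r = divmod(n, m)
    bits = "1" * q + "0"              # quotient in unary
    b = (m - 1).bit_length()          # ceil(log2(m))
    cutoff = (1 << b) - m             # remainders below this need only b-1 bits
    if r < cutoff:
        width, value = b - 1, r
    else:
        width, value = b, r + cutoff
    if width > 0:                     # truncated binary code for the remainder
        bits += format(value, "0%db" % width)
    return bits

for n in range(6):
    print(n, golomb_encode(n, 3))
```

Small values get short codewords, which suits a relative position that is usually close to zero.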
XQueC (Arion et al, 2004) • Designed to support a large subset of XQuery in the compressed domain • Prototype uses either ALM (order-preserving) or Huffman (unordered) to individually compress data values • ALM: supports equality/inequality matching, doesn’t support prefix-matching • Huffman: supports equality and prefix-matching, doesn’t support inequality matching • A permutation-based strategy is used, in which data values are first assigned to containers according to their parent element’s path • Containers are later grouped together to share the same compression model if they exhibit high similarity and appear frequently together in query predicates
XQueC • Given a workload of typical queries, an attempt is made to determine the optimal container grouping, and the best available compression algorithm to apply to each group, in order to minimize the following costs: • Decompression time • Storage costs for compressed data and source models • In addition to containers, the compression process creates two additional data structures: • Structure tree • Structure summary
XQueC (Arion et al, 2007): Structure Tree & Structure Summary
Structure tree:
course [1, 8]
  students [2, 7]
    student [3, 3]
      @name [4, 1]
      a1 [5, 2]
    student [6, 6]
      @name [7, 4]
      a1 [8, 5]
Structure summary:
course
  students
    student
      @name
      a1
Container: header followed by the values “Alice”, “Bob”
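The bracketed pairs in the structure tree are consistent with (pre-order rank, post-order rank) labels, and the sketch below reproduces them under that assumption. The tree encoding as nested `(name, children)` tuples and the helper names are hypothetical; the ancestor test is the standard pre/post containment check.

```python
def label(tree):
    """Assign (pre, post) ranks to every node of a nested (name, children) tree."""
    labels = []
    counters = {"pre": 0, "post": 0}

    def visit(node):
        name, children = node
        counters["pre"] += 1
        entry = [name, counters["pre"], None]
        labels.append(entry)
        for child in children:
            visit(child)
        counters["post"] += 1          # rank assigned once the subtree is done
        entry[2] = counters["post"]

    visit(tree)
    return [tuple(e) for e in labels]

def is_ancestor(u, v):
    """u is a proper ancestor of v iff it starts earlier and finishes later."""
    return u[1] < v[1] and u[2] > v[2]

def student():
    return ("student", [("@name", []), ("a1", [])])

tree = ("course", [("students", [student(), student()])])
labels = label(tree)
for name, pre, post in labels:
    print(f"{name} [{pre}, {post}]")
```

With such labels, structural relationships between two nodes can be decided from the numbers alone, without walking the tree.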
BPLEX (Busatto et al, 2005) • Focuses on improving the compression of XML structure, by searching bottom-up for repeated patterns in the document’s minimal DAG • A succinct pointer-based representation is used to represent subsequent occurrences of a repeated subgraph • DOM operations can be carried out on the compressed skeleton • Achieves smallest queryable representation of XML structure: averages 68% of XMill’s compressed size with gzip as back-end compressor! (Maneth et al, 2008)
BPLEX • Buneman et al (2003) demonstrate that Core XPath queries can be efficiently evaluated directly on the minimal DAG of a skeleton • [Figure: a skeleton with root a and four b children, each with children c and d, shown next to its DAG] The DAG is 2/3 smaller than the original skeleton • In theory, a DAG can be exponentially smaller than the original skeleton; “real world” XML DAGs are often less than 10% of the original document size
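Subtree sharing can be sketched with a small hash-consing routine: two subtrees receive the same node ID exactly when their labels and their (already shared) children coincide. The skeleton below, a root a with four identical b(c, d) children, is an assumed reading of the slide's figure, and the function names are hypothetical.

```python
def minimal_dag(tree):
    """Share identical subtrees of a nested (label, children) tree."""
    table = {}                      # (label, child ids) -> DAG node id

    def visit(node):
        label, children = node
        key = (label, tuple(visit(c) for c in children))
        # Reuse the existing node if this exact subtree was seen before.
        return table.setdefault(key, len(table))

    root_id = visit(tree)
    return root_id, table

def tree_size(node):
    """Number of nodes in the unshared tree."""
    return 1 + sum(tree_size(c) for c in node[1])

b = ("b", [("c", []), ("d", [])])
skeleton = ("a", [b, b, b, b])
root_id, dag = minimal_dag(skeleton)
print(tree_size(skeleton), len(dag))
```

Here 13 tree nodes collapse to 4 DAG nodes, roughly in line with the reduction quoted on the slide. Patterns that repeat only in the interior of the tree, with differing subtrees below them, are not captured by this kind of sharing.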
BPLEX • DAGs are limited to subtree sharing, meaning they can miss repeated patterns occurring in the interior of the skeleton • [Figure: example skeleton and corresponding SLT grammar] In this example, the DAG is equivalent to the original skeleton: no compression • BPLEX generates a straight-line tree grammar (SLT grammar) that is capable of representing repeated, connected subgraphs • SLT grammars are frequently half the size of the equivalent minimal DAG
Queryable Compressors: Summary
Future Directions
XML Updates • Impetus for incremental update of existing XML data sets is increasing • XML-based office document standards: ODF, OOXML • Increased volume of persistent XML data • W3C has recently proposed an extension to XQuery for expressing node-level updates • So far, most approaches to XML compression have assumed a “read-only” model • How amenable are the existing schemes to updates?
Evaluation of XML-Conscious Techniques • Currently, it is difficult to make definitive statements about the relative effectiveness of different techniques… • lack of available implementations • no consistent benchmark • each approach tends to use its own corpus • many queryable compressors aren’t tested thoroughly on the full set of supported queries • most works provide no theoretical justification, relying instead on empirical results against a limited corpus
Thank you
References
• J. Adiego, P. De la Fuente, and G. Navarro. Combining structural and textual contexts for compressing semistructured databases. ENC, 2005.
• A. Arion, A. Bonifati, I. Manolescu, and A. Pugliese. XQueC: A query-conscious compressed XML database. ACM TOIT 7(2), 2007.
• P. Buneman, M. Grohe, and C. Koch. Path queries on compressed XML. VLDB, 2003.
• G. Busatto, M. Lohrey, and S. Maneth. Efficient memory representation of XML documents. DBPL, 2005.
• J. Cheney. Compressing XML with multiplexed hierarchical PPM models. DCC, 2001.
• J. Cheney. An empirical evaluation of simple DTD-conscious compression techniques. WebDB, 2005.
• J. Cheney. Tradeoffs in XML database compression. DCC, 2006.
• J. Cheng and W. Ng. XQzip: querying compressed XML using structural indexing. EDBT, 2004.
• G. Leighton, J. Diamond, and T. Müldner. AXECHOP: a grammar-based compressor for XML. DCC, 2005.
References (cont.)
• G. Leighton, T. Müldner, and J. Diamond. TREECHOP: a tree-based query-able compressor for XML. CWIT, 2005.
• W. Li. XComp: An XML compression tool. Master's thesis, University of Waterloo, 2003.
• H. Liefke and D. Suciu. XMill: an efficient compressor for XML data. SIGMOD, 2000.
• S. Maneth, N. Mihaylov, and S. Sakr. XML tree structure compression. XANTEC, 2008.
• J. Min, M. Park, and C. Chung. XPRESS: a queriable compression for XML data. SIGMOD, 2003.
• P. Skibiński, Sz. Grabowski, and J. Swacha. Effective asymmetric XML compression. To appear in: Software: Practice and Experience, 2008.
• P. Tolani and J. Haritsa. XGRIND: a query-friendly XML compressor. ICDE, 2002.