XPRESS: A Queriable Compression for XML Data
Jun-Ki Min, Myung-Jae Park, Chin-Wan Chung
Presented by Erhan Durusüt and Burak Çetin
Outline
• Motivation
• Background on Compression Algorithms
• Existing Compressors
• Features of XPRESS
• Compression Techniques in XPRESS
• Experimental Results
• Conclusions and Future Work
Motivation
• XML data is irregular and verbose
• To overcome the verbosity problem, compressors for XML data have been studied
• Some XML compressors do not support querying compressed data
• Others support querying compressed data, but they blindly encode tags and data values using predefined encoding methods
• So direct and efficient evaluation of queries on compressed XML data is required
Background on Compression Algorithms
Compression Techniques
• Purpose of compression
  - Required disk space can be reduced significantly
  - Network bandwidth is saved
  - Overall performance of database systems improves: a buffer can hold more information, and the number of disk I/Os is reduced
Classification of Compression Techniques
• Lossy compression cannot be used, since the data is text; only lossless techniques apply
• Static: fixed statistics or no statistics at all
• Semi-adaptive: two scans, one to gather statistics and one to compress
• Adaptive: statistics are gathered dynamically and updated during compression
Classification of Compression Techniques
• Static Compression
  - Dictionary encoding: assigns an integer value to each new word
    Example: "the classification of the data"
    Encoded: 1 2 3 1 4
  - Binary encoding: special types of data can be encoded in binary
    Example: "8627" as a string
    Encoded: 8627 as a number
  - Differential encoding: replaces a data item with a code value that defines its relationship to a specific data item
    Example: 1500, 1520, 1600, 1550
    Encoded: 1500, 20, 100, 50
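The two worked examples on this slide can be reproduced with a minimal Python sketch. The function names are illustrative only. The differential encoder assumes that the "specific data item" is the first value of the sequence, which is what the slide's numbers imply.

```python
def dictionary_encode(text):
    """Assign the next integer code to each new word, in order of first appearance."""
    codes = {}
    encoded = []
    for word in text.split():
        if word not in codes:
            codes[word] = len(codes) + 1
        encoded.append(codes[word])
    return encoded, codes


def differential_encode(values):
    """Keep the first value and store every later value as its offset from it."""
    base = values[0]
    return [base] + [v - base for v in values[1:]]


def differential_decode(encoded):
    """Invert differential_encode."""
    base = encoded[0]
    return [base] + [d + base for d in encoded[1:]]


words, table = dictionary_encode("the classification of the data")
offsets = differential_encode([1500, 1520, 1600, 1550])
```

Here `words` is `[1, 2, 3, 1, 4]` and `offsets` is `[1500, 20, 100, 50]`, matching the slide.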
Classification of Compression Techniques
• Semi-adaptive Compression
• Huffman encoding
  - Assigns shorter codes to more frequently appearing symbols
  - In the code tree, each left edge is labeled 0 and each right edge 1
  - Does not preserve the order information of the original symbols
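A minimal sketch of Huffman code construction, assuming character-level symbols and an arbitrary tie-breaking rule (neither is specified on the slide). The first pass gathers frequencies, which is the semi-adaptive part; the tree walk assigns 0 to left edges and 1 to right edges.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a prefix code in which frequent symbols get shorter codes."""
    freq = Counter(text)
    # Heap entries are (weight, tie_breaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) tuple (inner node).
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (left, right)))
        counter += 1

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")  # left edge
            walk(tree[1], prefix + "1")  # right edge
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


codes = huffman_codes("aaaabbc")
```

For this input the most frequent symbol `a` receives a one-bit code while `b` and `c` receive two-bit codes. The codes do not sort in the same order as the symbols, which is why range predicates cannot be evaluated on Huffman-encoded values without decoding.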
Classification of Compression Techniques
• Semi-adaptive Compression
• Arithmetic encoding
  - Symbols are assigned disjoint intervals according to their frequencies
  - Successive symbols of a message reduce the length of the interval of the first symbol in accordance with the frequencies of the symbols
  - Example (figure): the interval [0, 1.0) is divided among "a", "b" and "c"; the message "ab" maps to a subinterval of the interval of "a"
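The interval narrowing can be sketched as follows. The frequencies for "a", "b" and "c" are hypothetical, since the slide's figure does not survive in the transcript; exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction


def symbol_intervals(freqs):
    """Partition [0, 1) into one subinterval per symbol, sized by frequency."""
    intervals, low = {}, Fraction(0)
    for sym, f in freqs.items():
        intervals[sym] = (low, low + f)
        low += f
    return intervals


def arithmetic_interval(message, freqs):
    """Narrow [0, 1) symbol by symbol, starting from the first symbol."""
    table = symbol_intervals(freqs)
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        width = high - low
        s_low, s_high = table[sym]
        low, high = low + width * s_low, low + width * s_high
    return low, high


# Hypothetical frequencies, not taken from the slides.
FREQS = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
ab = arithmetic_interval("ab", FREQS)
```

With these assumed frequencies "a" owns [0, 1/2) and "ab" maps to [1/4, 3/8), which lies inside the interval of "a".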
Existing Compressors
• XMill
  - Separates XML tags and attributes from their data values and groups semantically related data values into containers
  - XML tags and attributes are compressed by the dictionary encoding method
  - Choosing the compression algorithm for each container requires human interpretation
  - Finally, the containers are compressed again with the built-in zlib library
Existing Compressors
• XGrind
  - Supports querying compressed XML data
  - Data values are compressed by Huffman or dictionary encoding; tags are compressed by dictionary encoding
  - Uses the DTD to determine the encoder for data values
  - A path expression is evaluated by scanning the compressed file; whenever a new tag is found, the path of the current element is compared with the query path expression to decide whether it matches
  - To evaluate range queries, partial decompression of data values is always required
Features of XPRESS
• Reverse arithmetic encoding
  - Existing XML compressors represent each tag by a unique identifier, which handles path expressions inefficiently
  - XPRESS encodes a label path as a distinct interval in [0.0, 1.0)
  - Path expressions are evaluated using containment relationships among intervals
Features of XPRESS
• Automatic Type Inference
  - Some XML compressors use predefined encodings, e.g. Huffman or dictionary encoding
  - However, the efficiency of an encoding depends on the data type
  - Some compressors require manual interpretation
  - So a type inference engine is required
Features of XPRESS
• Application of diverse encoding methods to different types
• Each inferred type gets a proper encoding method
  - numeric: binary encoding
    Example: '120', '150', '100', '130'
    Encoded as 20, 50, 0, 30 (offsets from the minimum value 100)
  - textual: Huffman encoder
  - enumeration: dictionary encoder
• High compression ratio
• Less frequent partial decompression
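The numeric example above can be sketched as an offset from the minimum value, which is what the slide's numbers imply. The function names are illustrative, and the actual byte layout used by XPRESS is not shown here.

```python
def encode_numeric(values):
    """Parse numeric strings and store each one as its offset from the minimum."""
    numbers = [int(v) for v in values]
    base = min(numbers)
    return base, [n - base for n in numbers]


def decode_numeric(base, offsets):
    """Invert encode_numeric, returning integers."""
    return [base + o for o in offsets]


base, offsets = encode_numeric(["120", "150", "100", "130"])
```

Here `base` is 100 and `offsets` is `[20, 50, 0, 30]`. Small offsets fit in fewer bits than the original digit strings, and their numeric order matches the order of the original values.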
Features of XPRESS
• Semi-adaptive approach
  - A preliminary scan gathers statistics
  - Statistics are not changed during compression
  - Encoding rules are independent of the location in the document
Features of XPRESS
• Homomorphic Compression
  - Preserves the structure of XML data
  - Efficient extraction
Reverse Arithmetic Encoding
• Simple Path: a sequence of one or more dot-separated tags t1.t2…tn
  Example: the simple path of subsection is book.section.subsection
• Label Path: if a1.a2…an is the simple path of e, then ak.ak+1…an is a label path of e, where 1 <= k <= n
  Example: section.subsection is a label path of subsection
• Suffix: for two label paths P = pi…pn and Q = pj…pn of e, if i >= j then P is a suffix of Q
Reverse Arithmetic Encoding
• First, partition the entire interval [0.0, 1.0) into subintervals, one for each distinct element; the size of each subinterval is proportional to the frequency of the element
  Example: the frequencies of the elements {book, author, title, section, subsection, subtitle} are (0.1, 0.1, 0.1, 0.3, 0.3, 0.1)
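A minimal sketch of this partitioning step, using the frequencies from the example. It assumes the subintervals are laid out in the listed element order; exact fractions stand in for the decimal values.

```python
from fractions import Fraction


def partition(freqs):
    """Split [0, 1) into consecutive subintervals proportional to frequency."""
    intervals, low = {}, Fraction(0)
    for tag, f in freqs.items():
        intervals[tag] = (low, low + f)
        low += f
    return intervals


FREQS = {
    "book": Fraction(1, 10),
    "author": Fraction(1, 10),
    "title": Fraction(1, 10),
    "section": Fraction(3, 10),
    "subsection": Fraction(3, 10),
    "subtitle": Fraction(1, 10),
}
INTERVALS = partition(FREQS)
```

Under that assumption, section is assigned [0.3, 0.6) and subsection [0.6, 0.9).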
Reverse Arithmetic Encoding
• Next, encode the simple path P = p1…pn of an element e into a subinterval [min_e, max_e), starting from the interval of pn and narrowing it in turn by pn-1, …, p1
Reverse Arithmetic Encoding
• Property 1: suppose that a simple path P is represented as the interval I; then the intervals of all suffixes of P contain I
• Example:
  - simple path book.section.subsection: interval [0.69, 0.699)
  - label path section.subsection: interval [0.69, 0.78)
  - label path subsection: interval [0.6, 0.9)
• Implication: the query processor selects the elements whose intervals lie within the interval of the query; for //section/subsection it chooses the intervals within [0.69, 0.78)
• Finally, the start tag of an element is replaced by the value of the subinterval
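The example intervals and the containment test can be checked with a short sketch. It assumes the element intervals follow the listed order of the frequency example, and the function names are illustrative rather than taken from XPRESS.

```python
from fractions import Fraction

# Element intervals for frequencies (0.1, 0.1, 0.1, 0.3, 0.3, 0.1), laid out
# in the listed order (an assumption consistent with the example above).
TABLE = {
    "book": (Fraction(0), Fraction(1, 10)),
    "author": (Fraction(1, 10), Fraction(2, 10)),
    "title": (Fraction(2, 10), Fraction(3, 10)),
    "section": (Fraction(3, 10), Fraction(6, 10)),
    "subsection": (Fraction(6, 10), Fraction(9, 10)),
    "subtitle": (Fraction(9, 10), Fraction(1)),
}


def reverse_arithmetic_encode(path, table=TABLE):
    """Start from the last tag's interval and narrow it by each earlier tag."""
    tags = path.split(".")
    low, high = table[tags[-1]]
    for tag in reversed(tags[:-1]):
        width = high - low
        t_low, t_high = table[tag]
        low, high = low + width * t_low, low + width * t_high
    return low, high


def matches(element_path, query_label_path, table=TABLE):
    """An element matches if its interval lies within the query's interval."""
    e_low, e_high = reverse_arithmetic_encode(element_path, table)
    q_low, q_high = reverse_arithmetic_encode(query_label_path, table)
    return q_low <= e_low and e_high <= q_high


full = reverse_arithmetic_encode("book.section.subsection")
```

`full` is [69/100, 699/1000), i.e. [0.69, 0.699), and `matches("book.section.subsection", "section.subsection")` holds because that interval lies inside [0.69, 0.78). Because the last tag is processed first, every suffix of a path yields an enclosing interval.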
Compression Techniques in XPRESS
Architecture of XPRESS (figure)
XML Analyzer
• Parses each token in the XML file while keeping track of the current path
• For a tag: collects statistics
• For a data value: applies type inference
XML Analyzer Algorithm (figure)
Applying Arithmetic Encoder
• Simply counting the occurrences of each distinct element is problematic
  - Higher-level tags appear rarely (e.g. the root)
  - Intervals for long paths shrink too quickly
  - High-precision numbers are required
• Instead use a path tree (weighted frequency)
Weighted Frequency
• Weighted frequency of a node: the number of its subnodes plus the node itself
• Can consume a lot of memory: O(E)
• Not efficient to construct
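One literal reading of this definition is sketched below: build a path tree from the distinct simple paths and give each node a weight of one plus the number of nodes beneath it. The helper names and the sample paths are hypothetical, and the real XPRESS statistics collector may differ in detail.

```python
def build_path_tree(simple_paths):
    """Nested dicts: each key is a tag, each value holds that node's children."""
    root = {}
    for path in simple_paths:
        node = root
        for tag in path.split("."):
            node = node.setdefault(tag, {})
    return root


def weighted_frequencies(tree):
    """Map each path-tree node (by its full path) to 1 + number of descendants."""
    result = {}

    def visit(children, prefix):
        total = 0
        for tag, sub in children.items():
            path = prefix + "." + tag if prefix else tag
            size = 1 + visit(sub, path)
            result[path] = size
            total += size
        return total

    visit(tree, "")
    return result


# Hypothetical document structure, for illustration only.
PATHS = [
    "book",
    "book.title",
    "book.section",
    "book.section.subsection",
    "book.section.subsection.subtitle",
]
WEIGHTS = weighted_frequencies(build_path_tree(PATHS))
```

In this sketch the root `book` gets weight 5 even though it occurs once, so a rarely occurring high-level tag is no longer assigned a tiny interval.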
Adjusted Frequency
• Adds 1 to the ancestors whenever a new node is met
• Requires O(L) space, where L is the max. length of a query
• An efficient heuristic
Statistics Collector (figure)
Type Inferencing
• Determines whether the data is:
  - Integer
  - Floating point
  - Enumeration type
  - String
• The engine keeps track of:
  - inferred_type
  - min, max
  - symhash
  - chars_frequency
• The inferred type can change during the process:
  - from integer to string
  - from dictionary to string
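A rough sketch of such an inference engine for the values of one element. The field names follow the slide; the transition rules and the `enum_limit` threshold are assumptions made for illustration and are not taken from XPRESS. In particular, this sketch lets a numeric column fall back to enumeration before string, which is only one possible policy.

```python
from collections import Counter


def parse_number(value):
    """Return an int or float if the string parses as one, else None."""
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return None


class TypeInferencer:
    def __init__(self, enum_limit=128):
        self.enum_limit = enum_limit      # hypothetical threshold
        self.inferred_type = "integer"    # most specific type until disproved
        self.min = None
        self.max = None
        self.symhash = set()              # distinct values seen (bounded)
        self.chars_frequency = Counter()  # statistics for a Huffman encoder

    def observe(self, value):
        self.chars_frequency.update(value)
        if len(self.symhash) <= self.enum_limit:
            self.symhash.add(value)
        number = parse_number(value)
        if self.inferred_type in ("integer", "float") and number is not None:
            if isinstance(number, float):
                self.inferred_type = "float"
            self.min = number if self.min is None else min(self.min, number)
            self.max = number if self.max is None else max(self.max, number)
        elif self.inferred_type != "string":
            # Non-numeric data: enumeration while few distinct values, else string.
            few = len(self.symhash) <= self.enum_limit
            self.inferred_type = "enumeration" if few else "string"


engine = TypeInferencer()
for v in ["120", "150", "100", "130"]:
    engine.observe(v)
```

After these four values `engine.inferred_type` is still `"integer"` with `min` 100 and `max` 150; a later non-numeric value would demote the type, and the type never moves back to a more specific one.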
XML Encoder
• The MSB of an encoded value is 0 for data and 1 for structure
ARAE
• ARAE: Approximated Reverse Arithmetic Encoder
• Ensures the MSB of the encoded value is 1
• Truncates the last byte of the floating-point value
• Truncation does not change the containment relationship
• May incur inefficiency if too much is truncated
Encoder Algorithm (figure)
Query Processing
• For a long query, the interval becomes too small
• The query is split into intervals with sizes greater than 2^-15
• The processor looks for the sequence of split intervals
• Generally the sequence length is 1
Query Processing
• Exact-match conditions are encoded and compared against the compressed values
• Range queries on numerical values are evaluated directly
• Partial decompression is needed for range queries on strings, because Huffman and dictionary encoding do not preserve order information
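Why numeric range queries need no decompression can be illustrated with the minimum-offset encoding: the query bounds are shifted into the encoded domain once and then compared with the stored codes. This is a sketch under that assumption, with hypothetical function names.

```python
def encode_numeric(values):
    """Store each value as its offset from the minimum (order-preserving)."""
    base = min(values)
    return base, [v - base for v in values]


def range_query(codes, base, low, high):
    """Return positions whose original value lies in [low, high], without decoding."""
    enc_low, enc_high = low - base, high - base
    return [i for i, code in enumerate(codes) if enc_low <= code <= enc_high]


base, codes = encode_numeric([120, 150, 100, 130])
hits = range_query(codes, base, 110, 135)
```

`hits` is `[0, 3]`, the positions of 120 and 130. The same trick does not work for Huffman or dictionary codes, whose ordering is unrelated to the ordering of the original strings.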
Experimental Results
Experiments
• Extensive experiments on real-life data with different characteristics
Compression Ratios (figure)
Sample Queries
• Different types of queries are run (table)
Query Evaluation Time (figure)
Conclusion and Future Work
• The novel approach, reverse arithmetic encoding, is successful
• XPRESS is superior to XGrind
• Future work: support for complex data types, e.g. Uniform Resource Identifier (URI)