XML Compression Techniques: Survey and Comparison. Angela McCarthy CP5080, SP1 2010. Overview. Received: 14 August 2008 Revised: 13 November 2008 Written by Sherif Sakr of University of New South Wales, Australia
XML Compression Techniques: Survey and Comparison Angela McCarthy CP5080, SP1 2010
Overview • Received: 14 August 2008 • Revised: 13 November 2008 • Written by Sherif Sakr of the University of New South Wales, Australia • eXtensible Markup Language (XML) is a standard for data representation over the World Wide Web • XML documents tend to be large, so compression was introduced to deal with the resulting issues • Paper provides a survey of XML compression techniques
Introduction • Author examines XML compression techniques and launches an experimental study • Surveys each of the different compression techniques and compares the advantages and disadvantages of each • Data transmitted online is often large • XML usage is growing, thus a demand for efficient XML compression tools exists
Introduction • Contributions made: • Comprehensive survey of XML compression techniques • A rich XML corpus collected and constructed • Contains wide variety of XML data sources, natures and document sizes • Detailed results examining performance and characteristics • Work repeatable • Webpage of study provides access to test files, examined XML compressors and detailed results of study
Classifications • Each section goes through each of the classifications of compressors • General Text Compressors • Treats XML as plain text, uses traditional text compression techniques • XML Conscious Compressors • Takes advantage of awareness of XML files • Uses document structure to achieve better compression rates
Classifications • Non-Queriable (Archival) XML Compressors • No queries can be processed over the compressed format • Focus is to achieve the highest compression ratio • Queriable XML Compressors • Queries can be processed over the compressed format • Compression ratio is usually worse than that of archival XML compressors • Focus is to avoid full document decompression during query execution
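To make the idea of an XML conscious compressor more concrete, here is a minimal sketch that separates a document's structure from its text values and compresses each group on its own. This container-style separation is in the spirit of tools such as XMill, but it is not the implementation of any compressor in the study. The function names and the sample document are hypothetical, and the sketch ignores attributes and mixed content.

```python
import xml.etree.ElementTree as ET
import zlib
from collections import defaultdict


def split_containers(xml_text):
    """Split a document into a tag stream and per-tag text containers."""
    root = ET.fromstring(xml_text)
    structure = []                  # element names, with "/" marking each close
    containers = defaultdict(list)  # text values grouped by element name

    def walk(element):
        structure.append(element.tag)
        text = (element.text or "").strip()
        if text:
            containers[element.tag].append(text)
        for child in element:
            walk(child)
        structure.append("/")

    walk(root)
    return structure, dict(containers)


def compress_containers(structure, containers):
    """Compress the tag stream and every container separately with zlib."""
    packed = {"#structure": zlib.compress("\n".join(structure).encode("utf-8"))}
    for tag, values in containers.items():
        packed[tag] = zlib.compress("\n".join(values).encode("utf-8"))
    return packed


# Hypothetical sample document, for illustration only.
doc = (
    "<books>"
    "<book><title>XML</title><year>2008</year></book>"
    "<book><title>Data</title><year>2009</year></book>"
    "</books>"
)

structure, containers = split_containers(doc)
packed = compress_containers(structure, containers)
```

Grouping similar values (all years together, all titles together) gives the back-end compressor more regular input than the interleaved plain text, which is the intuition behind using document structure. On a document this small the per-container overhead outweighs any gain.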
XML Testing Corpus • Large variety of data sets (see previous) • From 0.5MB to 1.3GB • Four Categories • Structural Documents • Textual Documents • Regular Documents • Irregular Documents • Testing Environments • To ensure consistency, two different environments were used, high vs. low
Performance Metrics • Three metrics measured and compared • Compression Ratio • Ratio between the sizes of the compressed and uncompressed documents • Compression Ratio = (Compressed Size)/(Uncompressed Size) • Compression Time • Elapsed time during the compression process • Decompression Time • Elapsed time during the decompression process • For all three metrics, the lower the value, the better the compressor
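The three metrics above can be measured with a few lines of code. The following is a rough sketch using Python's standard `gzip` and `bz2` modules as stand-ins for two of the general text compressors. The `measure` helper and the sample document are assumptions made for illustration; this is not the benchmarking harness used in the study.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Return compression ratio, compression time and decompression time."""
    start = time.perf_counter()
    packed = compress(data)
    compression_time = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompression_time = time.perf_counter() - start

    if restored != data:
        raise ValueError(f"{name} did not round-trip the input")

    return {
        "name": name,
        # Compression Ratio = (Compressed Size)/(Uncompressed Size)
        "ratio": len(packed) / len(data),
        "compression_time": compression_time,
        "decompression_time": decompression_time,
    }


# Hypothetical, highly repetitive sample; the study used real XML documents.
sample = (
    b"<library>"
    + b"<book><title>XML</title><year>2008</year></book>" * 2000
    + b"</library>"
)

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]

for r in results:
    print(
        f"{r['name']}: ratio={r['ratio']:.4f} "
        f"compress={r['compression_time']:.4f}s "
        f"decompress={r['decompression_time']:.4f}s"
    )
```

Timings from a single run on a small input are noisy, so a real comparison would repeat each measurement and use documents of realistic size.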
Framework • 11 XML Compressors Evaluated • Three general purpose text compressors • Gzip, bzip2, PPM • Eight XML conscious compressors • XMillGzip, XMillBzip, XMillPPM, XMLPPM, SCMPPM, XWRT, AXECHOP • Compressors evaluated under default settings • Additional experiments run with parameters tuned for the highest level of compression • In total, 16 variant compressors
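As a small illustration of the difference between a default and a tuned variant, the sketch below compresses the same input with `zlib` at its default level and at its highest level (9). This only mirrors the idea; the actual parameters tuned for each compressor in the study are not given in these slides, and the sample document is hypothetical.

```python
import zlib

# Hypothetical sample document, for illustration only.
sample = (
    b"<orders>"
    + b"<order><id>42</id><status>shipped</status></order>" * 1000
    + b"</orders>"
)

default_packed = zlib.compress(sample)    # default compression level
tuned_packed = zlib.compress(sample, 9)   # highest compression level

print(f"uncompressed: {len(sample)} bytes")
print(f"default:      {len(default_packed)} bytes")
print(f"level 9:      {len(tuned_packed)} bytes")
```

Each setting counts as a separate variant, which is how a study of 11 compressors can end up comparing 16 variants.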
Results • Ideally want to provide a global ranking of XML compression tools • Results show there is no clear winner • Ranking is dependent upon the weight given to each metric • Three ranking functions • WF1 = (1/3 * CR) + (1/3 * CT) + (1/3 * DCT) • WF2 = (1/2 * CR) + (1/4 * CT) + (1/4 * DCT) • WF3 = (3/5 * CR) + (1/5 * CT) + (1/5 * DCT) • CR represents the compression ratio metric, CT represents the compression time metric and DCT represents the decompression time metric
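The three ranking functions can be sketched as a weighted sum, with a lower score being better. The sketch assumes the CR, CT and DCT values have already been normalised to a comparable scale, which these slides do not describe. The compressor names and metric values are made up for illustration and are not results from the paper.

```python
from fractions import Fraction

# Weights for (CR, CT, DCT), taken from the three ranking functions.
WEIGHTS = {
    "WF1": (Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)),
    "WF2": (Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)),
    "WF3": (Fraction(3, 5), Fraction(1, 5), Fraction(1, 5)),
}


def score(function_name, cr, ct, dct):
    """Weighted sum of the three metrics; lower is better."""
    w_cr, w_ct, w_dct = WEIGHTS[function_name]
    return float(w_cr * cr + w_ct * ct + w_dct * dct)


def rank(function_name, metrics):
    """Return compressor names ordered from best to worst."""
    return sorted(metrics, key=lambda name: score(function_name, *metrics[name]))


# Hypothetical normalised (CR, CT, DCT) values, not taken from the study.
metrics = {
    "small_tool": (0.10, 0.60, 0.60),  # compresses well, but slowly
    "fast_tool": (0.50, 0.30, 0.30),   # compresses less, but quickly
}

for function_name in WEIGHTS:
    print(function_name, rank(function_name, metrics))
```

With these made-up numbers, WF1 ranks the faster tool first while WF2 and WF3 rank the better-compressing tool first, which illustrates why the ranking depends on the weight given to each metric.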
Conclusions • Paper surveyed state-of-the-art XML compression techniques • Reported the behaviour of various XML compressors using a large corpus of XML documents • Paper could be valuable for • Developers of new XML compression tools • Users making a decision on the most suitable compressor for their requirements • Fig. 7 shows that none of the XML conscious compressors achieved an outstanding compression ratio
Future Work • Planning to continue maintaining and updating webpage of study with further evaluations • Enable visitors to perform online experiments using set of available compressors and own XML documents
Metadata • Large number of references • Due to different compression techniques used • Large amount of data • Thorough in research methods • Large amount of data tested • Tested on different systems • Tested using different techniques • Abbreviations/Acronyms given • Designed for specific audience • Paper seems to be a reference tool • Users can read it to help decide which compression tool to use
Questions? Thanks for listening!