220 likes | 232 Views
Learn about XAUST, a unique tool utilizing DTD for compressing XML documents automatically. Explore XML structure, DTD concepts, arithmetic coding, implementation details, and experimental comparisons with other compression methods like XMill and gzip. Experience the benefits of utilizing DTD for better compression ratios and understand the automatic multiplexing achieved by XAUST.
E N D
Compressing XML Documents with Finite State Automata S. Hariharan Priti Shankar
Organization • An introduction to XAUST • A brief introduction to XML & DTD • Previous work : XMill • Utility of DTD • Implementation • Experimental results • Conclusion
XAUST: Introduction • XML Compression with AUtomata and STack • It is the only tool using DTD for compressing documents automatically • It generates code for an arbitrary DTD and compresses documents conforming to that DTD
root opentag student content attribute id name 1 Hariharan closetag XML • XML describes a class of data objects called XML documents • The presence of tags makes the document verbose • Attributes are treated as tags • XML document is characterized as a tree <root> <student id = “1”> <name> Hariharan </name> </student> </root>
DTD • DTD describes a document • DTD for the above XML fragment is given below <!DOCTYPE StudentInfo [ <!ELEMENT root (student*)> <!ELEMENT student (name | rollNo, comment?)> <!ATTLIST student id ID #REQUIRED> <!ELEMENT name (#PCDATA)> <!ELEMENT rollNo (#PCDATA)> <!ELEMENT comment (#PCDATA)> ]> * means zero or more occurrences + means one or more occurrences ? means zero or one occurrence | means either-or , means and #PCDATA means textual content
Arithmetic coding Arithmetic coding replaces a string of symbols with a single floating number • We wish to encode bacc! • After seeing bthe encoder narrows it down to the interval [0.2, 0.5) • After seeing a the interval is further narrowed down to 1/5th of itself • Hence the new interval becomes [0.2, 0.6) • After seeing the first c the interval becomes [0.23, 0.236) • The final interval is [0.23354, 0.2336)
Consider the XML fragment <book> <title> xyz </title> <isbn> 123 </isbn> </book> Previous work: XMill • XMill separates structure from text • XMill has structure container and text containers book = #1, title = #2, isbn = #3 Structure = #1 #2 C1 / #3 C2 / / There will be many repeated sequences of structure like the one above and are compressed The text is mapped to the container depending on the path from the root
Motivation for XAUST • XMill does not use DTD • We propose a scheme using DTD to achieve better compression ratio • Automatic multiplexing is achieved
A sample fragment of the XML document card addressBook card card name email givenName name familyName email email note Tags need not be encoded Utility of DTD • Consider the Elements addressBook and card <!ELEMENT addressBook ( card* )> <!ELEMENT card ( (name | (givenName, familyName)), email, note? )>
<!ELEMENT card ( (name | (givenName, familyName)), email, note? )> regular expression • The DFA for the corresponding regular expression is given below note 1 4 5 name givenName email 2 3 familyName Regular Expression
note 4 5 name 1 email givenName familyName 2 3 Regular Expression Contd., • We can see that we need not encode familyName and email tags • They are the states with single output transition • But state 4 is a state with multiple transition. There is an implicit transition to the parent ELEMENT addressBook ( <!ELEMENT addressBook (card*)> )
The tree for the DTD A D C B C #PCDATA #PCDATA #PCDATA Encoding text Consider a fragment of DTD <!ELEMENT A (C, B, D) > <!ELEMENT B (C) > <!ELEMENT C (#PCDATA) > <!ELEMENT D (#PCDATA) > • 3 choices for encoding text • A single container for all text • A single container for each element • A single container for each leaf node of the DTD tree
Encoding text contd., Advantages and disadvantages of the 3 approaches • A single container for all the text Advantage: Low memory consumption Disadvantage: Low compression ratio • A single container for each Element Advantage: Medium memory consumption • A single container for each leaf node in the DTD tree Disadvantage: High memory consumption
Encoding text contd., • Not much difference in the compression ratio between the options 2 and 3 • Option 2 i.e., ‘A single container for each Element’ was chosen as it entails less memory consumption • As text is character data we used Arithmetic compression • Order-4 Adaptive Arithmetic compression is used
Implementation • An automaton is generated for each Element • Stack is used for storing the current state of ancestor element
Experiments • We conducted our experiments on XMark, DBLP, Uniprot, Michigan and X007 documents • XAUST is compared with XMill, XMLPPM and gzip • XMLPPM ran out of memory for Uniprot and Michigan documents • XMill, XMLPPM and gzip are better than XAUST for one document each • Compression ratio is defined as the ratio between the size of the compressed document and the size of the original document
Conclusion and Future Work • The results amply justify the utility of DTD • Large documents are compressed without running out of memory • The only restriction placed by XAUST is the presence of DTD • Future work will concentrate on querying the compressed document
References • Hartmut Liefke, Dan Suciu.: XMill: An efficient compressor for XML data, Proceedings of ACM SIGMOD, 2000 • Nelson, M: Arithmetic Coding. Dr. Dobbs Journal http://dogma.net/markn/articles/arith/part1.htm • UniProt : http://www.ebi.uniprot.org • Michigan: http://www.eecs.umich.edu/db/mbench • XOO7: http://www.comp.nus.edu.sg/ebh/XOO7.html • XMark: http://monetdb.cwi.nl/xml/generator.html • DBLP: http://www.informatik.uni-trier.de/~ley/db