XMLTK: An XML Toolkit for Scalable XML Stream Processing I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu XML Toolkit
Motivation
• Lots of data sits in large text files
  • ad hoc data formats
• “Queried” with Unix command-line tools
  • grep, sort, tail, etc.
• It would be nice to XML-ize it...
• ...but then the Unix command-line tools won’t work any more.

Example
• In the old Unix world…
Text file:
    score  decision  paperID  title
           accept    P054     “Theory of XML parsing”
           reject    P021     “Experience with an XML optimizer”
           accept    P069     “Towards a unified theory of data models”
    . . . . . .
• Find the top ten rejected papers (in score order):
    grep "reject" papers.txt | sort | tail -n 10

Example (cont’d)
• In the new XML world…
    <submissions>
      <paper>
        <score> 6 </score>
        <decision> accept </decision>
        <paperID> P054 </paperID>
        <title>Theory of XML parsing </title>
      </paper>
      <paper>
        <score> 3 </score>
        <decision> reject </decision>
        <paperID> P021 </paperID>
        <title> Experience with an XML optimizer </title>
      </paper>
      . . . . .
• … can’t use those tools anymore

Example (cont’d)
Doing it with the XML Toolkit:
    xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml |
    xtail -c /submissions -e paper -n 10
Finds the top ten rejected <paper>s, in <score> order.

Goals of the XML Toolkit
Simple, scalable tools for XML processing
• Provides a service: there are people who need this
• Provides a research platform: for XML stream processing

Outline
• The tools
• The XPath processing engine
• Conclusions

The Tools
Current tools:
• xsort (the only tool covered in this talk)
• xagg
• xnest
• xflatten
• xdelete
• xpair
• xhead
• xtail
• file2xml
• xmill
This may look like plenty, but the set is still incomplete...

XSort: Definition
General form:
    xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
• -c = the context, i.e. where to sort
• -e = the item, i.e. what to sort
• -k = the key, i.e. what to sort on

XSort: Definition
[Figure: XSort applied to context nodes (c) and their item nodes (e1 to e9)]

XSort: Examples
Examples are illustrated on data like this:
    <bib>
      <book>
        <author>Elliotte Rusty Harold</author>
        <author>W. Scott Means</author>
        <title>XML in a Nutshell</title>
        <publisher>O'Reilly</publisher>
        <year>2001</year>
        <isbn>0-596-00058-8</isbn>
      </book>
      <paper>
        <author>Sylvain Devillers</author>
        <title>XML and XSLT Modeling for Multimedia Bitstream Manipulation.</title>
        <year>2001</year>
        <booktitle>WWW Posters</booktitle>
        <ee>http://www10.org/cdrom/posters/1112.pdf</ee>
        <url>db/conf/www/www2001p.html#Devillers01</url>
      </paper>
      . . . . .

XSort: Examples
    xsort -c /bib -e paper -k title/text()
Sorts the <paper>s by <title>. The <book>s are dropped from the output:
    <bib>
      <paper> . . . </paper>
      <paper> . . . </paper>
      . . . . .
    </bib>
Compare to…
    xsort -c /bib -e * -k title/text()
    xsort -c /bib -e paper -k title/text() -e book -k title/text()

XSort: Examples
    xsort -c /bib -e paper/author -k lastName/text() -k firstName/text()
Sorts the <author>s by <lastName>, then <firstName>:
    <bib>
      <author> . . . </author>
      <author> . . . </author>
      . . . . .
    </bib>

XSort: Examples
    xsort -c /bib -e paper -e article -e book -e *
<paper>s first, then <article>s, then <book>s, then all the rest:
    <bib>
      <paper> . . . </paper>
      <paper> . . . </paper>
      . . . . .
      <article> . . . </article>
      . . . . .
      <book> . . . </book>
      . . . . .
    </bib>

XSort: Examples
    xsort -c /bib/* -e author -e title -e year -e *
Normalizes all entries: <author>s first, then <title>s, then <year>s, then all the other elements.
    xsort -c /bib/paper -e author -e * -c /bib/book -e title -e *
In <paper>s, lists the <author>s first; in <book>s, lists the <title> first; leaves other entries unchanged.

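The behaviour shown in the xsort examples can be imitated in memory with a few lines of Python. This is only a rough sketch of the semantics, not the toolkit's streaming implementation: the function name `xsort`, its argument shapes, and the sample titles are assumptions made for illustration. It relies on the limited path syntax of `xml.etree.ElementTree`, so the root context is written `.` rather than `/bib`, keys are written `title` rather than `title/text()`, and keys compare as strings.

```python
import xml.etree.ElementTree as ET

def xsort(root, context, item_specs):
    """Hypothetical in-memory imitation of: xsort -c context (-e item (-k key)*)*

    context    -- ElementTree path to the context elements ("." for the root)
    item_specs -- one (item_path, [key_path, ...]) pair per -e option
    """
    for ctx in root.findall(context):
        keyed, claimed = [], set()
        for rank, (item_path, key_paths) in enumerate(item_specs):
            for pos, item in enumerate(ctx.findall(item_path)):
                if id(item) in claimed:
                    continue  # an earlier -e option already matched this item
                claimed.add(id(item))
                keys = tuple((item.findtext(k) or "").strip() for k in key_paths)
                # Sort key: -e rank first, then the -k values, then document order
                keyed.append(((rank, keys, pos), item))
        keyed.sort(key=lambda pair: pair[0])
        # Items matched by no -e option are dropped, as in the slides
        ctx[:] = [item for _, item in keyed]
    return root

# Made-up sample data in the shape of the <bib> example
bib = ET.fromstring(
    "<bib>"
    "<book><title>XML in a Nutshell</title></book>"
    "<paper><title>XML and XSLT Modeling</title></paper>"
    "<paper><title>A Second Paper</title></paper>"
    "</bib>"
)
xsort(bib, ".", [("paper", ["title"])])  # like: xsort -c /bib -e paper -k title/text()
print(ET.tostring(bib, encoding="unicode"))
```

Passing `[("paper", []), ("*", [])]` instead keeps every entry and lists the <paper>s first, which mirrors the `-e paper ... -e *` form.
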
XSort: Implementation
• Sorts one context at a time, copies the rest
• For each context:
  • Create a “global key” for each item
  • Sort items with a two-pass, multiway merge sort
• Quote from Databases 101 (news from the trenches):
  • With disk blocks of 4 KB and 128 MB of main memory, one can sort files of up to 4 TB in two passes!

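The 4 TB figure can be checked with the usual textbook estimate for a two-pass multiway merge sort. The sketch below assumes binary units and ignores the output buffer; the function name `two_pass_capacity` is made up for this illustration.

```python
KB, MB, TB = 2**10, 2**20, 2**40

def two_pass_capacity(memory_bytes, block_bytes):
    """Largest input a two-pass multiway merge sort can handle (textbook estimate).

    Pass 1 writes sorted runs of about memory_bytes each.
    Pass 2 merges them using one block-sized input buffer per run,
    so at most memory_bytes // block_bytes runs can be merged at once.
    """
    max_runs = memory_bytes // block_bytes
    return max_runs * memory_bytes

capacity = two_pass_capacity(128 * MB, 4 * KB)
print(capacity // TB, "TB")  # prints "4 TB"
```

With 128 MB of memory and 4 KB blocks there is room for 32,768 run buffers, and 32,768 runs of 128 MB each add up to 4 TB.
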
XSort: Performance
    xsort -c /dblp -e * -k title/text()
1 GB: 8 minutes

Outline
• The tools
• The XPath processing engine
• Conclusions

The XPath Processor
Common to all tools is the following problem:
Given:
• A set of correlated XPath expressions
• A stream of SAX events
Decide:
• When the expressions are true
Output:
• Variable events

The XPath Processor
How we did it:
• All XPath expressions → a single Deterministic Finite Automaton (DFA)
• Restriction: no predicates yet (current work...)
• Does this scale to many, many XPath expressions?
  • Yes, if we compute the DFA lazily (upcoming ICDT’2003 paper)
• Evaluation time = parsing time
• Can do even better with a Stream IndeX (next)

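The lazy-DFA idea can be illustrated with a small Python sketch. It is not the toolkit's code: the class `LazyDFA`, the helper names, and the restriction to `/`, `//`, name tests and `*` (no predicates, no namespaces) are assumptions made for this example. Each DFA state is a set of (expression, step) pairs, and a transition is computed only the first time the stream actually needs it; `ElementTree.iterparse` stands in for a SAX parser.

```python
import io
import re
import xml.etree.ElementTree as ET

def parse_path(expr):
    """Split '/a//b/*' into [('/', 'a'), ('//', 'b'), ('/', '*')]."""
    return re.findall(r"(//|/)([^/]+)", expr)

class LazyDFA:
    """Hypothetical sketch: one automaton for many paths, built on demand."""

    def __init__(self, exprs):
        self.paths = [parse_path(e) for e in exprs]
        self.start = frozenset((i, 0) for i in range(len(self.paths)))
        self.table = {}  # (dfa_state, tag) -> dfa_state, filled lazily

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:
            nxt = set()
            for i, pos in state:
                path = self.paths[i]
                if pos == len(path):
                    continue  # fully matched: no outgoing moves
                axis, name = path[pos]
                if axis == "//":
                    nxt.add((i, pos))  # descendant axis may skip this element
                if name in ("*", tag):
                    nxt.add((i, pos + 1))
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def accepted(self, state):
        return sorted(i for i, pos in state if pos == len(self.paths[i]))

def match_stream(exprs, xml_bytes):
    """Return (expression, tag) for every start event where an expression holds."""
    dfa = LazyDFA(exprs)
    stack, hits = [dfa.start], []
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            stack.append(dfa.step(stack[-1], elem.tag))
            hits.extend((exprs[i], elem.tag) for i in dfa.accepted(stack[-1]))
        else:
            stack.pop()
    return hits

sample = b"<bib><book><author/><title/></book><paper><author/><title/></paper></bib>"
for hit in match_stream(["/bib/paper/title", "//author"], sample):
    print(hit)
```

Per SAX event the matcher does one dictionary lookup and one stack push or pop, regardless of how many expressions are loaded, which is the sense in which evaluation cost tracks parsing cost.
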
Stream IndeX (SIX)
News: the parser is the main bottleneck in XPath stream processing!
Solution: “index” the XML stream, parse only partially
Definition: the SIX = a table of (start, end) offsets

Stream IndeX (SIX): Construction
[Figure: the XML document below, shown alongside its SIX]
    <bib>
      <book>
        <publisher> Addison-Wesley </publisher>
        <author> Serge Abiteboul </author>
        <author>
          <first-name> Rick </first-name>
          <last-name> Hull </last-name>
        </author>
        <author> Victor Vianu </author>
        <title> Foundations of Databases </title>
        <year> 1995 </year>
      </book>
      <book>
        <publisher> Freeman </publisher>
        <author> Jeffrey D. Ullman </author>
        <title> Principles of Database and Knowledge Base Systems </title>
        <year> 1998 </year>
      </book>
    </bib>

Stream IndeX (SIX): Skip Parsing
XPath:
    /bib/paper/title . . .
XML:
    <bib>
      <book>
        <publisher> Addison-Wesley </publisher>
        <author> Serge Abiteboul </author>
        <author>
          <first-name> Rick </first-name>
          <last-name> Hull </last-name>
        </author>
        <author> Victor Vianu </author>
        <title> Foundations of Databases </title>
        <year> 1995 </year>
      </book>
      <book>
        <publisher> Freeman </publisher>
        <author> Jeffrey D. Ullman </author>
        <title> Principles of Database and Knowledge Base Systems </title>
        <year> 1998 </year>
      </book>
      <paper>. . . . . .
    </bib>
[Figure callouts: “Skip Parsing”, shown twice]

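A toy version of the SIX and of skip parsing can be sketched in Python. Everything here is an assumption for illustration: the names `build_six` and `select`, the use of character offsets into an in-memory string, and the regex tokenizer, which only copes with plain tags (no comments, CDATA, processing instructions, or `>` inside attribute values). The real SIX is a separate stream, and its format is not described on these slides.

```python
import re

# Simplified tag matcher: groups are (closing slash, name, self-closing slash)
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def build_six(xml):
    """Return {start: end}: offset of each start tag -> offset just past its end tag."""
    six, open_starts = {}, []
    for m in TAG.finditer(xml):
        closing, _, empty = m.groups()
        if closing:
            six[open_starts.pop()] = m.end()
        elif empty:
            six[m.start()] = m.end()
        else:
            open_starts.append(m.start())
    return six

def select(xml, path, six):
    """Evaluate a child-axis-only path such as /bib/paper/title.

    Returns (matching fragments, number of tags actually scanned).
    """
    steps = path.strip("/").split("/")
    matches, scanned, depth, pos = [], 0, 0, 0
    while True:
        m = TAG.search(xml, pos)
        if m is None:
            return matches, scanned
        scanned += 1
        closing, name, empty = m.groups()
        pos = m.end()
        if closing:
            depth -= 1                      # leaving an element on the path
        elif name == steps[depth]:
            if depth == len(steps) - 1:
                matches.append(xml[m.start():six[m.start()]])
                pos = six[m.start()]        # match captured: jump past it
            elif not empty:
                depth += 1                  # still on the path: descend
        else:
            pos = six[m.start()]            # cannot match: skip the subtree

doc = ("<bib><book><author>A</author><title>T1</title></book>"
       "<paper><author>B</author><title>T2</title></paper></bib>")
six = build_six(doc)
matches, scanned = select(doc, "/bib/paper/title", six)
print(matches, scanned, "of", len(TAG.findall(doc)), "tags scanned")
```

The <book> subtree is never tokenized: once its start tag fails to match the `paper` step, the scanner jumps straight to the end offset recorded in the table.
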
Stream IndeX (SIX) in XML Stream Processing
[Figure: a SIX stream accompanying the XML stream <bib> <book> ... </bib> (e.g. DIME)]
• The SIX stream is about 6% of the data stream
• And can be made MUCH smaller

Outline
• The tools
• The XPath processing engine
• Conclusions

Conclusions
• The toolkit is already available:
  • http://www.cs.washington.edu/homes/suciu/XMLTK
  • http://xmltk.sourceforge.net
• What it does so far it does very well:
  • Sorting, aggregation, nest/unnest
• But it doesn’t do too much:
  • Restricted selections, no projections, no restructurings yet
  • Volunteers welcome!
• Can one process XML data without parsing it completely?
  • SIX