
XMLTK: An XML Toolkit for Scalable XML Stream Processing

I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu


Presentation Transcript


  1. XMLTK: An XML Toolkit for Scalable XML Stream Processing I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu XML Toolkit

  2. Motivation
     • Lots of data sits in large text files
        • ad hoc data formats
     • “Queried” with Unix command line tools
        • grep, sort, tail, etc.
     • Would be nice to XML-ize it...
     • ...but then the Unix command line tools won’t work any more.

  3. Example
     Text file: in the old Unix world…
       score decision paperID title
       • accept P054 “Theory of XML parsing”
       • reject P021 “Experience with an XML optimizer”
       • accept P069 “Towards a unified theory of data models”
       • . . . . . .
     Find the top ten rejected papers (in score order):
       grep "reject" papers.txt | sort | tail -n 10

  4. Example (cont’d)
     In the new XML world…
       <submissions>
         <paper>
           <score> 6 </score>
           <decision> accept </decision>
           <paperID> P054 </paperID>
           <title>Theory of XML parsing </title>
         </paper>
         <paper>
           <score> 3 </score>
           <decision> reject </decision>
           <paperID> P021 </paperID>
           <title> Experience with an XML optimizer </title>
         </paper>
         . . . . .
     … can’t use those tools anymore

  5. Example (cont’d)
     Doing it with the XML Toolkit:
       xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml |
         xtail -c /submissions -e paper -n 10
     Finds the top ten rejected <paper>s, in <score> order
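
The effect of the pipeline above can be sketched in a few lines of Python. This is only an illustration of what the two commands are meant to compute, not how the toolkit is implemented: it loads the whole document into memory, it assumes scores compare numerically, and it strips the whitespace around the text values. The function name `top_rejected` and the sample data are made up for the sketch.

```python
import xml.etree.ElementTree as ET


def top_rejected(xml_text, n=10):
    """Return the n highest-scored rejected <paper> elements, lowest score first."""
    root = ET.fromstring(xml_text)  # context: /submissions
    # Item selection: paper[decision/text()="reject"] (whitespace stripped, an assumption).
    rejected = [
        paper for paper in root.findall("paper")
        if (paper.findtext("decision") or "").strip() == "reject"
    ]
    # Key: score/text(), compared as a number (an assumption).
    rejected.sort(key=lambda paper: float(paper.findtext("score")))
    # Like xtail -n 10: keep the last n items of the sorted list.
    return rejected[-n:] if n > 0 else []


sample = """<submissions>
  <paper><score> 6 </score><decision> accept </decision>
         <paperID> P054 </paperID><title>Theory of XML parsing</title></paper>
  <paper><score> 3 </score><decision> reject </decision>
         <paperID> P021 </paperID><title>Experience with an XML optimizer</title></paper>
</submissions>"""

for paper in top_rejected(sample):
    print(paper.findtext("paperID").strip(), paper.findtext("score").strip())
```

Unlike this sketch, the toolkit works on a stream, so it does not need the whole document in memory.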

  6. Goals of the XML Toolkit
     Simple, scalable tools for XML processing
     • Provides a service: there are people who need this
     • Provides a research platform: for XML stream processing

  7. Outline
     • The tools
     • The XPath processing engine
     • Conclusions

  8. The Tools
     Current tools:
     • xsort (will talk only about this one)
     • xagg
     • xnest
     • xflatten
     • xdelete
     • xpair
     • xhead
     • xtail
     • file2xml
     • xmill
     May look like plenty, but the set is still incomplete...

  9. XSort: Definition
     General form:
       xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
     • -c = the context, i.e. where to sort
     • -e = the item, i.e. what to sort
     • -k = the key, i.e. what to sort on

  10. XSort: Definition
      [Figure: XSort applied to context nodes c containing item nodes e1 to e9]

  11. XSort: Examples
      Examples illustrated on data like this:
        <bib>
          <book>
            <author>Elliotte Rusty Harold</author>
            <author>W. Scott Means</author>
            <title>XML in a Nutshell</title>
            <publisher>O'Reilly</publisher>
            <year>2001</year>
            <isbn>0-596-00058-8</isbn>
          </book>
          <paper>
            <author>Sylvain Devillers</author>
            <title>XML and XSLT Modeling for Multimedia Bitstream Manipulation.</title>
            <year>2001</year>
            <booktitle>WWW Posters</booktitle>
            <ee>http://www10.org/cdrom/posters/1112.pdf</ee>
            <url>db/conf/www/www2001p.html#Devillers01</url>
          </paper>
          . . . . .

  12. XSort: Examples
      xsort -c /bib -e paper -k title/text()
      Sorts the <paper>s, by <title>. The <book>s are dropped from the output:
        <bib>
          <paper> . . . </paper>
          <paper> . . . </paper>
          . . . . .
        </bib>
      Compare to…
        xsort -c /bib -e * -k title/text()
        xsort -c /bib -e paper -k title/text() -e book -k title/text()

  13. XSort: Examples
      xsort -c /bib -e paper/author -k lastName/text() -k firstName/text()
      Sorts the <author>s, by <lastName> then <firstName>:
        <bib>
          <author> . . . </author>
          <author> . . . </author>
          . . . . .
        </bib>

  14. XSort: Examples
      xsort -c /bib -e paper -e article -e book -e *
      <paper>s first, then <article>s, then <book>s, then all the rest:
        <bib>
          <paper> . . . </paper>
          <paper> . . . </paper>
          . . . . .
          <article> . . . </article>
          . . . . .
          <book> . . . </book>
          . . . . .
        </bib>
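
The grouping behaviour of several -e options without keys can be sketched as follows. The sketch rests on assumptions read off the examples rather than a specification: each child of the context goes to the first -e rule it matches, `*` matches anything not claimed by an earlier rule, children that match no rule are dropped, and document order is kept within a group. The helper name `sort_by_item_rules` and the `<thesis/>` element are hypothetical.

```python
import xml.etree.ElementTree as ET


def sort_by_item_rules(context, rules):
    """Reorder the children of `context` by an ordered list of tag rules.

    Assumed semantics: first matching rule wins, "*" matches any tag,
    unmatched children are dropped, document order is kept within a group.
    """
    groups = [[] for _ in rules]
    for child in list(context):
        for index, rule in enumerate(rules):
            if rule == "*" or child.tag == rule:
                groups[index].append(child)
                break
    # Empty the context, then re-attach the children group by group.
    for child in list(context):
        context.remove(child)
    for group in groups:
        context.extend(group)
    return context


bib = ET.fromstring("<bib><book/><thesis/><paper/><article/><paper/></bib>")
sort_by_item_rules(bib, ["paper", "article", "book", "*"])
print([child.tag for child in bib])
# ['paper', 'paper', 'article', 'book', 'thesis']
```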

  15. XSort: Examples
      xsort -c /bib/* -e author -e title -e year -e *
      Normalize all entries: <author>s first, then <title>s, then <year>s, then all the other elements
      xsort -c /bib/paper -e author -e * -c /bib/book -e title -e *
      In <paper>s list the <author>s first; in <book>s list the <title> first; leave other entries unchanged

  16. XSort: Implementation
      • Sorts one context at a time, copies the rest
      • For each context:
         • Create a “global key” for each item
         • Sort items, with a two-pass, multiway merge sort
      • Quote from Databases 101 (news from the trenches):
         • with disk blocks of 4KB and 128MB of main memory, one can sort files up to 4TB in two passes!
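
The 4TB figure follows from the usual two-pass argument: pass 1 writes sorted runs of at most one memory-load each, and pass 2 can merge as many runs as there are block-sized buffers in memory, so the limit is roughly memory × (memory / block). The sketch below checks that arithmetic and shows a generic two-pass merge sort over lines of text. It is a textbook illustration, not the toolkit's code: the function names are made up, runs are limited by item count rather than bytes, and the "global key" construction is left out.

```python
import heapq
import tempfile


def max_two_pass_file_size(memory_bytes, block_bytes):
    """Largest input sortable in two passes: run size times number of mergeable runs."""
    return memory_bytes * (memory_bytes // block_bytes)


def two_pass_sort(lines, run_size):
    """Yield newline-free strings in sorted order, buffering at most run_size in pass 1."""
    runs, buffer = [], []

    def flush():
        # Pass 1: sort one memory-load and write it out as a run.
        if not buffer:
            return
        buffer.sort()
        run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        run.writelines(line + "\n" for line in buffer)
        run.seek(0)
        runs.append(run)
        buffer.clear()

    for line in lines:
        buffer.append(line)
        if len(buffer) >= run_size:
            flush()
    flush()

    try:
        # Pass 2: one multiway merge over all runs, reading each run sequentially.
        yield from heapq.merge(*[(line.rstrip("\n") for line in run) for run in runs])
    finally:
        for run in runs:
            run.close()


print(max_two_pass_file_size(128 * 2**20, 4 * 2**10) == 4 * 2**40)  # 128MB, 4KB -> 4TB
print(list(two_pass_sort(["d", "a", "c", "b", "e"], run_size=2)))
```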

  17. XSort: Performance
      xsort -c /dblp -e * -k title/text()
      1 GB: 8 minutes

  18. Outline
      • The tools
      • The XPath processing engine
      • Conclusions

  19. The XPath Processor
      Common to all tools is the following problem:
      Given:
      • Set of correlated XPath expressions
      • Stream of SAX events
      Decide:
      • When are the expressions true → variable events

  20. The XPath Processor
      How we did it:
      • All XPath expressions → Deterministic Finite Automaton
      • Restriction: no predicates yet (current work...)
      • Does this scale to many, many XPath expressions?
         • Yes, if we compute the DFA lazily (upcoming ICDT’2003 paper)
      • Evaluation time = parsing time
      • Can do even better with a Stream IndeX (next)
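
A minimal sketch of the lazy-DFA idea, assuming a small XPath subset (child steps `/`, descendant steps `//`, name tests and `*`, no predicates). Each DFA state is a set of positions in the expressions, and a transition is computed only the first time the stream actually needs it, then cached. The class name `LazyDFA` and its fields are made up for the sketch; this is not the toolkit's implementation.

```python
import re
import xml.sax


def parse_path(expr):
    """Split '/a//b/*' into [('/', 'a'), ('//', 'b'), ('/', '*')]."""
    return re.findall(r"(//?)([^/]+)", expr)


class LazyDFA(xml.sax.ContentHandler):
    """Matches several path expressions against SAX events in one pass."""

    def __init__(self, exprs):
        super().__init__()
        self.exprs = exprs
        self.paths = [parse_path(expr) for expr in exprs]
        # A DFA state is a frozenset of (expression index, steps matched so far).
        self.stack = [frozenset((i, 0) for i in range(len(exprs)))]
        self.table = {}    # (state, tag) -> state, filled in on demand
        self.matches = []  # (expression, tag) pairs in document order

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:  # lazy: build the transition on first use
            nxt = set()
            for i, j in state:
                path = self.paths[i]
                if j == len(path):
                    continue  # expression already fully matched higher up
                axis, name = path[j]
                if name in ("*", tag):
                    nxt.add((i, j + 1))
                if axis == "//":
                    nxt.add((i, j))  # a descendant step may match deeper down
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def startElement(self, name, attrs):
        state = self.step(self.stack[-1], name)
        self.stack.append(state)
        for i, j in sorted(state):
            if j == len(self.paths[i]):
                self.matches.append((self.exprs[i], name))

    def endElement(self, name):
        self.stack.pop()


doc = b"<bib><book><title/><author/></book><paper><title/></paper></bib>"
dfa = LazyDFA(["/bib/paper/title", "//author", "/bib/*/title"])
xml.sax.parseString(doc, dfa)
for expr, tag in dfa.matches:
    print(expr, "matched", tag)
```

Per SAX event the work is one dictionary lookup once a transition is cached, which is what makes evaluation time track parsing time; only the states the data actually reaches are ever built.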

  21. Stream IndeX (SIX)
      News: The parser is the main bottleneck in XPath stream processing!
      Solution: “Index” the XML stream, parse only partially
      Definition: The SIX = a table of (start, end) offsets

  22. Stream IndeX (SIX): Construction
      [Figure: the XML below, shown side by side with its SIX]
        <bib>
          <book>
            <publisher> Addison-Wesley </publisher>
            <author> Serge Abiteboul </author>
            <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
            <author> Victor Vianu </author>
            <title> Foundations of Databases </title>
            <year> 1995 </year>
          </book>
          <book>
            <publisher> Freeman </publisher>
            <author> Jeffrey D. Ullman </author>
            <title> Principles of Database and Knowledge Base Systems </title>
            <year> 1998 </year>
          </book>
        </bib>

  23. Stream IndeX (SIX): Skip Parsing
      XPath: /bib/paper/title . . .
      [Figure: two regions of the XML below are labeled “Skip Parsing”]
        <bib>
          <book>
            <publisher> Addison-Wesley </publisher>
            <author> Serge Abiteboul </author>
            <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
            <author> Victor Vianu </author>
            <title> Foundations of Databases </title>
            <year> 1995 </year>
          </book>
          <book>
            <publisher> Freeman </publisher>
            <author> Jeffrey D. Ullman </author>
            <title> Principles of Database and Knowledge Base Systems </title>
            <year> 1998 </year>
          </book>
          <paper> . . . . . .
        </bib>
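
Skip parsing can be sketched with a table that maps each element's start offset to its end offset: when a start tag cannot extend a match of the path, the reader jumps straight to the recorded end offset instead of tokenizing the subtree. Everything below is an assumption made for illustration: offsets are character positions in a Python string, the tag scanner is a regular expression that ignores comments, CDATA and processing instructions, only child-axis paths are handled, and the names `build_six` and `select` are hypothetical. The real SIX format is not shown on the slides.

```python
import re

# Start tag, end tag or self-closing tag (simplified scanner, see assumptions above).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")


def build_six(xml):
    """Return {start_offset: end_offset} for every element in `xml`."""
    six, open_starts = {}, []
    for match in TAG.finditer(xml):
        closing, _name, self_closing = match.groups()
        if closing:
            six[open_starts.pop()] = match.end()
        elif self_closing:
            six[match.start()] = match.end()
        else:
            open_starts.append(match.start())
    return six


def select(xml, six, path):
    """Return (matching elements, start tags examined) for a child-only path."""
    steps = path.strip("/").split("/")
    results, depth, pos, examined = [], 0, 0, 0
    while True:
        match = TAG.search(xml, pos)
        if match is None:
            return results, examined
        closing, name, self_closing = match.groups()
        if closing:
            depth -= 1
            pos = match.end()
            continue
        examined += 1
        if name != steps[depth]:
            pos = six[match.start()]       # skip parsing: jump past the subtree
        elif depth == len(steps) - 1:
            end = six[match.start()]
            results.append(xml[match.start():end])
            pos = end
        else:
            if not self_closing:
                depth += 1                 # descend into a partially matched element
            pos = match.end()


doc = ("<bib><book><author>A</author><title>T1</title></book>"
       "<paper><title>T2</title></paper></bib>")
six = build_six(doc)
print(select(doc, six, "/bib/paper/title"))
```

In the sample, the `<book>` subtree is never tokenized: four of the six start tags are examined.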

  24. Stream IndeX (SIX) in XML Stream Processing
      [Figure: XML streams (<bib> <book> ... </bib>) accompanied by a SIX (e.g. DIME)]
      • The SIX stream is about 6% of the data stream
      • And can be made MUCH smaller

  27. Outline
      • The tools
      • The XPath processing engine
      • Conclusions

  28. Conclusions
      • The toolkit is already available:
         • http://www.cs.washington.edu/homes/suciu/XMLTK
         • http://xmltk.sourceforge.net
      • What it does so far it does very well:
         • Sorting, aggregation, nest/unnest
      • But doesn’t do too much:
         • Restricted selections, no projections, no restructurings yet
         • Volunteers welcome!
      • Can one process XML data without parsing it completely?
         • SIX
