From Searching Text to Querying XML Streams. Dan Suciu www.cs.washington.edu/homes/suciu. About Me. Born 1957, Romania BS: Bucharest, PhD: University of Pennsylvania Now: University of Washington (Seattle) My work is on semistructured data
From Searching Text to Querying XML Streams Dan Suciu www.cs.washington.edu/homes/suciu XML Toolkit
About Me
• Born 1957, Romania
• BS: Bucharest, PhD: University of Pennsylvania
• Now: University of Washington (Seattle)
My work is on semistructured data
• Book: Data on the Web: From Relations to Semistructured Data and XML
Past/present projects:
• XML-QL = precursor of XQuery
• XMill = the XML compressor
• XML toolkit
Motivation
• Text databases
• Studied over the past 15 years
• Traditional client/server model
• Struggled with lack of standard text syntax
• Recently, new standard: XML
• Traditional client/server: in today’s DBMS
• New applications: stream processing
• This talk: processing streaming XML data
• My motivation: work on the XML Toolkit project
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
Background: Relational Databases
• Structured, stored in tables
• Schema separate from data
• Queries: precise, refer to schema and data (SQL)
Hard to publish, easy to query precisely
Background: Text Databases
• Unstructured, stored in documents
• No schema, only data
• Queries: imprecise, refer to data only (keywords)
Foundations of Databases, Abiteboul (FR), Hull (USA), Vianu (USA), Addison Wesley, 1995
Data on the Web, Abiteboul (FR), Buneman (UK), Suciu (USA), Morgan Kaufmann, 1999
Easy to publish, hard to query precisely
Background: XML Data
• Semistructured
• Schema and data are together: self-describing
• Queries: precise, refer to schema and data (SQL)
<bib>
  <book> <title> Foundations… </title>
    <author> <name> Abiteboul </name>
      <country> FR </country>
    </author>
    <author> <name> Hull </name>
      <country> USA </country>
    </author>
    <author> <name> Vianu </name>
      <country> USA </country>
    </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bib>
XML: Easier to publish, easy to query precisely
Background: XML Data
Data model = tree
[Figure: tree rooted at bib, with paper and book children; element nodes include title, author, journal, publisher, name, country; leaf values include Addison Wesley, Data on the Web, Buneman, UK, Abiteboul, FR]
Background: XML Data
• Querying with XPath (and XQuery)
• This talk: XPath queries restricted to: tag, /, //, *, [ ], path=“constant”
Background: XPath in One Slide
• tag, / : /bib/book/author/name
• //, * : /bib/book//name/*/zip (navigate partially known structure)
• [ ] : /bib/book[author/name=“Abiteboul”] and /bib/book[year=“1995” and author[name=“Abiteboul” and country=“FR”]] (conjunctive queries à la SQL)
This is precisely the “region algebra”, e.g. use proximal nodes [Navarro&Baeza-Yates’97]
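As a rough illustration of these path steps, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The sample document is a whitespace-free copy of the bib example, and every path is written relative to the `<bib>` root (so `/bib/book` becomes `./book`). Predicates that contain a nested path, such as `[author/name=…]`, fall outside that subset and are not shown.

```python
import xml.etree.ElementTree as ET

# Whitespace-free copy of the bib example, so text comparisons are exact.
BIB = (
    "<bib><book><title>Foundations of Databases</title>"
    "<author><name>Abiteboul</name><country>FR</country></author>"
    "<author><name>Hull</name><country>USA</country></author>"
    "<author><name>Vianu</name><country>USA</country></author>"
    "<publisher>Addison Wesley</publisher><year>1995</year></book></bib>"
)
root = ET.fromstring(BIB)  # root is the <bib> element

# tag and / : child steps, as in /bib/book/author/name
names = [n.text for n in root.findall("./book/author/name")]

# // : descendant step, as in /bib//country
countries = [c.text for c in root.findall(".//country")]

# * : any element, as in /bib/book/*/name
names_via_star = [n.text for n in root.findall("./book/*/name")]

# [ ] : predicate on a child's text, as in /bib/book[year="1995"]/title
titles_1995 = [t.text for t in root.findall("./book[year='1995']/title")]
```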
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
Main Application: XML Packet Routing
• Selective Dissemination of Information [Altinel&Franklin’00, Chan et al.’02]
• XML content routing [Snoeren et al.’01]
• SOAP Message routing in Application Servers
XML Packet Routing
[Figure: many XML packets in transit, each of the form <doc> <tag> value </tag> </doc>]
[Figure: an input XML stream (<bib> <book> ... </bib>) is matched against a set of XPath expressions and split into output XML streams]
XPath expressions:
/bib/book/publisher=“MK”
/bib/book[category=“recent”]/title=“Web”
/bib/book//address//*/zip=“123”
/bib/book//address//*="Galaxy"
/bib/book/category=“recent”
/bib/book/address=“123”
/bib/book/address/field=“567”
The XML Stream Processing Problem
• Given:
• A set of XPath expressions
• An incoming stream of XML documents
• Decide:
• For each document, which expressions it matches
Hard:
Large number of XPath expressions, e.g. 10^3 to 10^6
Streaming XML data, high throughput, e.g. 5MB/s
Easy:
Shallow XML data, e.g. depth=20
Short XPath expressions
The Approaches
Basic techniques
• NFA plus optimizations:
• XFilter/YFilter [Altinel&Franklin’00]
• XTrie [Chan et al.’02]
• DFA:
• XML Toolkit
Beyond the obvious
• Stream indexes (XML Toolkit)
• Stream views
Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions
From XPath to NFA
/catalog/product[category="tools"][*/price = 200]/quantity
//price
[Figure: NFA for the two expressions, with transitions labeled catalog, product, category, "tools", *, price, 200, quantity, plus ε and * transitions]
Extra processing needed to combine branches (not in this talk)
Basic NFA Evaluation
[Figure: all XPath expressions are compiled into one NFA; SAX events from the input stream (<bib> <book> ... </bib>) drive it, and the evaluator keeps a set of current states]
XPath expressions shown include /bib/book/publisher=“MK”, /bib/book[category=“recent”]/title, /bib/book//address//*/zip=“123”, /bib/book//address//*="Galaxy", /bib/book/category=“recent”, /bib/book/address=“123”, /bib/book/address/field=“567”, . . .
Basic NFA Evaluation
Properties:
• Space = linear in the size of the XPath expressions
• Throughput = decreases linearly with the number of XPath expressions
Systems:
• XFilter [Altinel&Franklin’00], YFilter
• XTrie [Chan et al.’02]
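The evaluation scheme can be sketched as follows. This is a simplified illustration, not the algorithm of XFilter, YFilter, or XTrie: it handles only tag, /, // and * (no predicates), uses Python's `iterparse` as a stand-in for a SAX parser, and the names `parse_path` and `match_nfa` are hypothetical. Each NFA state is a pair (query index, number of steps matched); a stack of state sets follows the element nesting.

```python
import io
import re
import xml.etree.ElementTree as ET

def parse_path(path):
    """Split '/a//b/*' into steps [('/', 'a'), ('//', 'b'), ('/', '*')]."""
    return re.findall(r"(//|/)([^/]+)", path)

def match_nfa(paths, xml_text):
    """Return the subset of `paths` matched by at least one element."""
    queries = [parse_path(p) for p in paths]
    # One stack entry per open element; each entry is the set of active states.
    stack = [frozenset((q, 0) for q in range(len(queries)))]
    matched = set()
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "end":
            stack.pop()  # leave the element: restore the parent's states
            continue
        nxt = set()
        for q, i in stack[-1]:  # work grows with the number of active states
            steps = queries[q]
            if i == len(steps):
                continue  # accepting state, no outgoing transitions
            axis, name = steps[i]
            if name in ("*", elem.tag):
                nxt.add((q, i + 1))
                if i + 1 == len(steps):
                    matched.add(paths[q])
            if axis == "//":
                nxt.add((q, i))  # self-loop: the step may match deeper down
        stack.append(frozenset(nxt))
    return matched
```

The inner loop over the active states is what makes throughput fall as more expressions are added.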
Basic DFA Evaluation
[Figure: the XPath expressions are compiled into a DFA; SAX events from the input stream (<bib> <book> ... </bib>) drive it, and the evaluator keeps a single current state]
XPath expressions shown include /bib/book/publisher=“MK”, /bib/book[category=“recent”]/title, /bib/book//address//*/zip=“123”, /bib/book//address//*="Galaxy", /bib/book/category=“recent”, /bib/book/address=“123”, /bib/book/address/field=“567”, . . .
Basic DFA Evaluation
Properties:
• Throughput = constant !
• Space = GOOD QUESTION
System:
• XML Toolkit [University of Washington] http://xmltk.sourceforge.net
XMLTK: An XML Toolkit for Scalable XML Stream Processing
I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu
Motivation
• Lots of data sits in large text files
• ad hoc data formats
• “Queried” with Unix command line tools
• grep, sort, tail, etc.
• Would be nice to XML-ize it...
• ...but then the Unix command line tools won’t work any more.
Example Text file
• In the old Unix world…
score decision paperID title
• accept P054 “Theory of XML parsing”
• reject P021 “Experience with an XML optimizer”
• accept P069 “Towards a unified theory of data models”
• . . . . . .
• Find the top ten rejected papers (in score order):
grep "reject" papers.txt | sort | tail -n 10
Example (cont’d)
• In the new XML world…
<submissions>
  <paper>
    <score> 6 </score>
    <decision> accept </decision>
    <paperID> P054 </paperID>
    <title> Theory of XML parsing </title>
  </paper>
  <paper>
    <score> 3 </score>
    <decision> reject </decision>
    <paperID> P021 </paperID>
    <title> Experience with an XML optimizer </title>
  </paper>
  . . . . .
… can’t use those tools anymore
Example (cont’d)
Doing it with the XML Toolkit:
xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml | xtail -c /submissions -e paper -n 10
Finds top ten rejected <paper>s, in <score> order
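To make the intended result of that pipeline concrete, here is a minimal in-memory sketch in Python. It is not how the XML Toolkit works (the toolkit streams and sorts externally); it assumes scores compare numerically, and the function name `top_rejected` and the third paper (P999) are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

SUBMISSIONS = """<submissions>
  <paper><score> 6 </score><decision> accept </decision>
         <paperID> P054 </paperID><title> Theory of XML parsing </title></paper>
  <paper><score> 3 </score><decision> reject </decision>
         <paperID> P021 </paperID><title> Experience with an XML optimizer </title></paper>
  <!-- hypothetical extra entry, not in the slides -->
  <paper><score> 5 </score><decision> reject </decision>
         <paperID> P999 </paperID><title> Placeholder </title></paper>
</submissions>"""

def top_rejected(xml_text, n=10):
    """Return paperIDs of the n highest-scored rejected papers, in score order."""
    root = ET.fromstring(xml_text)
    # Selection: the role played by paper[decision/text()="reject"]
    rejected = [p for p in root.findall("paper")
                if (p.findtext("decision") or "").strip() == "reject"]
    # Sort key: the role played by -k score/text() (numeric order assumed here)
    rejected.sort(key=lambda p: int(p.findtext("score")))
    # Keep the last n items: the role played by xtail -n 10
    return [p.findtext("paperID").strip() for p in rejected[-n:]]
```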
Goals of the XML Toolkit
Simple, scalable tools for XML processing
• Provides service: there are people who need this
• Provides a research platform: for XML stream processing
Outline
• The tools
• The XPath processing engine
• Conclusions
The Tools
Current tools:
• xsort (will talk only about this)
• xagg
• xnest
• xflatten
• xdelete
• xpair
• xhead
• xtail
• file2xml
• xmill
May look plenty, but actually still incomplete...
XSort: Definition
General form:
xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
-c = the context, i.e. where to sort
-e = the item, i.e. what to sort
-k = the key, i.e. what to sort on
XSort: Definition
[Figure: a tree with several contexts c whose items e1 … e9 are shown before and after xsort]
XSort Examples
Examples illustrated on data like this:
<bib>
  <book>
    <author>Elliotte Rusty Harold</author>
    <author>W. Scott Means</author>
    <title>XML in a Nutshell</title>
    <publisher>O'Reilly</publisher>
    <year>2001</year>
    <isbn>0-596-00058-8</isbn>
  </book>
  <paper>
    <author>Sylvain Devillers</author>
    <title>XML and XSLT Modeling for Multimedia Bitstream Manipulation.</title>
    <year>2001</year>
    <booktitle>WWW Posters</booktitle>
    <ee>http://www10.org/cdrom/posters/1112.pdf</ee>
    <url>db/conf/www/www2001p.html#Devillers01</url>
  </paper>
  . . . . .
XSort: Examples
xsort -c /bib -e paper -k title/text()
Sorts the <paper>s, by <title>
The <book>s are dropped from the output
<bib> <paper> . . . </paper> <paper> . . . </paper> . . . . . </bib>
Compare to…
xsort -c /bib -e * -k title/text()
xsort -c /bib -e paper -k title/text() -e book -k title/text()
XSort: Examples
xsort -c /bib -e paper/author -k lastName/text() -k firstName/text()
Sorts the <author>s, by <lastName> then <firstName>
<bib> <author> . . . </author> <author> . . . </author> . . . . . </bib>
XSort: Examples
xsort -c /bib -e paper -e article -e book -e *
<paper>s first, then <article>s, then <book>s, then all the rest
<bib> <paper> . . . </paper> <paper> . . . </paper> . . . . . <article> . . . </article> . . . . . <book> . . . </book> . . . . . </bib>
XSort: Examples
xsort -c /bib/* -e author -e title -e year -e *
Normalize all entries: <author>s first, then <title>s, then <year>s, then all the other elements
xsort -c /bib/paper -e author -e * -c /bib/book -e title -e *
In <paper>s list the <author>s first; in <book>s list the <title> first; leave other entries unchanged
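One way to read these examples is that every item receives a composite key: first the position of the -e pattern it matched, then its -k values. The sketch below implements that reading in memory with Python's `ElementTree`. It is an interpretation of the slides, not the toolkit's code: `xsort_like` is a hypothetical name, item patterns are limited to a tag name or `*`, keys are child paths such as `title` (standing in for `title/text()`), and items without -k are assumed to keep document order.

```python
import xml.etree.ElementTree as ET

def xsort_like(context, items):
    """Reorder the children of `context` in place.

    items: list of (tag_or_star, key_paths), one entry per -e option,
    in command-line order. Children matching no entry are dropped.
    """
    keyed = []
    for child in list(context):
        for rank, (tag, key_paths) in enumerate(items):
            if tag in ("*", child.tag):
                keys = tuple((child.findtext(k) or "").strip() for k in key_paths)
                keyed.append(((rank, keys), child))  # composite sort key
                break  # first matching -e pattern wins
    # Python's sort is stable, so equal keys keep document order.
    keyed.sort(key=lambda pair: pair[0])
    context[:] = [child for _, child in keyed]
    return context

def sample_bib():
    return ET.fromstring(
        "<bib><book><title>B</title></book><paper><title>Z</title></paper>"
        "<article><title>M</title></article><paper><title>A</title></paper></bib>"
    )

# Like: xsort -c /bib -e paper -k title/text()
papers_only = xsort_like(sample_bib(), [("paper", ["title"])])

# Like: xsort -c /bib -e paper -e article -e book -e *
grouped = xsort_like(sample_bib(),
                     [("paper", []), ("article", []), ("book", []), ("*", [])])
```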
XSort: Implementation
• Sorts one context at a time, copies the rest
• For each context:
• Create a “global key” for each item
• Sort items, with a two-pass, multiway merge sort
• Quote from Databases 101 (news from the trenches):
• with disk blocks of 4KB and 128MB of main memory, one can sort files up to 4TB in two passes !
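The 4TB figure follows from the usual back-of-the-envelope bound for a two-pass multiway merge sort. The derivation below assumes one block-sized input buffer per run during the merge and ignores the output buffer; the symbols M (main memory size) and S (disk block size) are introduced here for illustration only. Pass 1 writes sorted runs of size M; pass 2 can merge at most M/S of them.

```latex
\[
\frac{M}{S} = \frac{128\,\mathrm{MB}}{4\,\mathrm{KB}}
            = \frac{2^{27}}{2^{12}} = 2^{15}\ \text{runs},
\qquad
\text{max file size} \approx \frac{M}{S}\cdot M
  = 2^{15}\cdot 2^{27}\ \text{bytes}
  = 2^{42}\ \text{bytes}
  = 4\,\mathrm{TB}.
\]
```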
XSort: Performance
xsort -c /dblp -e * -k title/text()
1GB in 8 minutes
Outline
• The tools
• The XPath processing engine
• Conclusions
The XPath Processor
Common to all tools is the following problem:
Given:
• Set of correlated XPath expressions
• Stream of SAX events
Decide:
• When are the expressions true
[Figure label: variable events]
The XPath Processor
How we did it:
• All XPath expressions → Deterministic Finite Automaton
• Restriction: no predicates yet (current work...)
• Does this scale to many, many XPath expressions ?
• Yes, if we compute the DFA lazily (upcoming ICDT’2003 paper)
• Evaluation time = parsing time
• Can do even better with a Stream IndeX (next)
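The lazy construction can be sketched in a few lines. This is a simplified illustration and not the toolkit's implementation: the class name `LazyDFA` is hypothetical, only tag, /, // and * are handled, and Python's `iterparse` stands in for a SAX parser. A DFA state is a set of NFA states (query index, steps matched); a transition is computed the first time it is needed and then served from a table, so the per-event cost becomes a single lookup.

```python
import io
import re
import xml.etree.ElementTree as ET

class LazyDFA:
    def __init__(self, paths):
        self.paths = paths
        self.queries = [re.findall(r"(//|/)([^/]+)", p) for p in paths]
        self.start = frozenset((q, 0) for q in range(len(paths)))
        self.table = {}  # (dfa_state, tag) -> dfa_state, filled on demand

    def _nfa_moves(self, state, tag):
        for q, i in state:
            steps = self.queries[q]
            if i == len(steps):
                continue  # accepting NFA state, no outgoing transitions
            axis, name = steps[i]
            if name in ("*", tag):
                yield (q, i + 1)
            if axis == "//":
                yield (q, i)  # self-loop for the descendant axis

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:  # compute this transition only once
            self.table[key] = frozenset(self._nfa_moves(state, tag))
        return self.table[key]

    def run(self, xml_text):
        """Return the paths matched by at least one element of the document."""
        stack, matched = [self.start], set()
        source = io.BytesIO(xml_text.encode("utf-8"))
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "end":
                stack.pop()
                continue
            state = self.step(stack[-1], elem.tag)
            matched.update(self.paths[q] for q, i in state
                           if i == len(self.queries[q]))
            stack.append(state)
        return matched
```

Only the states reachable on the documents actually seen are ever materialized, which is the point of building the automaton lazily.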
Stream IndeX (SIX)
News: The parser is the main bottleneck in XPath stream processing !
Solution: “Index” the XML stream, parse only partially
Definition: The SIX = a table of (start, end) offsets
Stream IndeX (SIX): Construction
[Figure: the XML document side by side with its SIX]
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>
Stream IndeX (SIX): Skip Parsing
XPath: /bib/paper/title . . .
[Figure: in the XML below, the regions marked “Skip Parsing” are skipped by the parser]
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
  <paper> . . . </paper>
  . . .
</bib>
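A rough sketch of the idea, under several assumptions: the index is built here with Python's expat bindings, each entry also records the tag and depth (the slides only specify start and end offsets), `end` is taken to be the offset where the end tag begins, and the whole document is in memory rather than streamed. The names `build_six` and `paper_titles` are hypothetical, and the actual SIX format of the XML Toolkit may differ.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

def build_six(data):
    """Return (tag, depth, start, end) per element, in document order.

    start = byte offset of the start tag, end = byte offset of the end tag.
    """
    entries, open_stack = [], []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(tag, attrs):
        open_stack.append(len(entries))
        entries.append([tag, len(open_stack), parser.CurrentByteIndex, None])

    def on_end(tag):
        entries[open_stack.pop()][3] = parser.CurrentByteIndex

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    return [tuple(e) for e in entries]

def paper_titles(data, six):
    """Evaluate /bib/paper/title, parsing only the <paper> regions."""
    titles, skipped = [], 0
    for tag, depth, start, end in six:
        if depth != 2:
            continue  # only children of the root decide what to skip
        if tag != "paper":
            skipped += end - start  # bytes that are never parsed
            continue
        close = data.index(b">", end) + 1  # include the end tag itself
        elem = ET.fromstring(data[start:close])
        titles += [t.text for t in elem.findall("title")]
    return titles, skipped

DATA = b"<bib><book><title>T</title></book><paper><title>P</title></paper></bib>"
SIX = build_six(DATA)
TITLES, SKIPPED = paper_titles(DATA, SIX)
```

In a real stream the index would be computed once by the producer and shipped with the data, so that consumers never pay for parsing the skipped regions.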
Stream IndeX (SIX) in XML Stream Processing
[Figure: the SIX accompanies the XML stream (<bib> <book> ... </bib>), e.g. using DIME]
The SIX stream is about 6% of the data stream
And can be made MUCH smaller
Outline
• The tools
• The XPath processing engine
• Conclusions
Conclusions
• The toolkit is already available:
• http://www.cs.washington.edu/homes/suciu/XMLTK
• http://xmltk.sourceforge.net
• What it does so far it does very well:
• Sorting, aggregation, nest/unnest
• But doesn’t do too much:
• Restricted selections, no projections, no restructurings yet
• Volunteers welcome !
• Can one process XML data without parsing it completely ?
• SIX