From Searching Text to Querying XML Streams

From Searching Text to Querying XML Streams Dan Suciu www.cs.washington.edu/homes/suciu XML Toolkit

About Me
• Born 1957, Romania
• BS: Bucharest, PhD: University of Pennsylvania
• Now: University of Washington (Seattle)
My work is on semistructured data
• Book: Data on the Web: From Relations to Semistructured Data and XML
Past/present projects:
• XML-QL = precursor of XQuery
• XMill = the XML compressor
• XML toolkit

Motivation
• Text databases
• Studied over the past 15 years
• Traditional client/server model
• Struggled with lack of standard text syntax
• Recently, new standard: XML
• Traditional client/server: in today’s DBMS
• New applications: stream processing
• This talk: processing streaming XML data
• My motivation: work on the XML Toolkit project

Outline
• Background
• The XML stream processing problem
• Basic XML processing with automata
• Adapting automata to XML
• Stream indexes
• Conclusions

Background: Relational Databases
• Structured, stored in tables
• Schema separate from data
• Queries: precise, refer to schema and data (SQL)
Hard to publish, easy to query precisely

Background: Text Databases
• Unstructured, stored in documents
• No schema, only data
• Queries: imprecise, refer to data only (keywords)
Example documents:
Foundations of Databases, Abiteboul (FR), Hull (USA), Vianu (USA), Addison Wesley, 1995
Data on the Web, Abiteboul (FR), Buneman (UK), Suciu (USA), Morgan Kaufmann, 1999
Easy to publish, hard to query precisely

Background: XML Data
• Semistructured
• Schema and data are together: self-describing
• Queries: precise, refer to schema and data (XPath, XQuery)
<bib>
  <book> <title> Foundations… </title>
    <author> <name> Abiteboul </name>
      <country> FR </country>
    </author>
    <author> <name> Hull </name>
      <country> USA </country>
    </author>
    <author> <name> Vianu </name>
      <country> USA </country>
    </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
  …
</bib>
XML: Easier to publish, easy to query precisely

Background: XML Data
Data model = tree
[Figure: a tree rooted at bib, with paper and book children. Books have title, author, and publisher children (e.g. Data on the Web; Addison Wesley); authors have name and country children (e.g. Buneman, UK; Abiteboul, FR).]

Background: XML Data
• Querying with XPath (and XQuery)
• This talk: XPath queries restricted to: tag, /, //, *, [ ], path="constant"

Background: XPath in One Slide
• tag, / : /bib/book/author/name
• //, * : navigate partially known structure
  /bib/book//name/*/zip
• [ ] : conjunctive queries à la SQL
  /bib/book[author/name="Abiteboul"]
  /bib/book[year="1995" and author[name="Abiteboul" and country="FR"]]
This is precisely the “region algebra”, e.g. proximal nodes [Navarro&Baeza-Yates’97]
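To make the three XPath fragments above concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The sample document is a trimmed version of the bib example from the earlier slide, and the helper `has_author` is a hypothetical name introduced here; it stands in for the nested predicate, which ElementTree cannot express directly.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the <bib> example (whitespace around values removed).
BIB = """<bib>
  <book>
    <title>Foundations of Databases</title>
    <author><name>Abiteboul</name><country>FR</country></author>
    <author><name>Hull</name><country>USA</country></author>
    <author><name>Vianu</name><country>USA</country></author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bib>"""

root = ET.fromstring(BIB)  # root is the <bib> element

# tag and / : the path /bib/book/author/name, written relative to <bib>
names = [n.text for n in root.findall("./book/author/name")]

# // : any <name> at any depth below <bib>
names_any_depth = [n.text for n in root.findall(".//name")]

# [ ] : books whose <year> child has text 1995
books_1995 = root.findall("./book[year='1995']")

def has_author(book, name, country):
    """Emulates the nested predicate author[name=... and country=...]."""
    return any(a.findtext("name") == name and a.findtext("country") == country
               for a in book.findall("author"))

titles = [b.findtext("title") for b in books_1995
          if has_author(b, "Abiteboul", "FR")]
```

This evaluates the queries on a fully parsed tree, which is the opposite of the streaming setting the talk targets; it is only meant to pin down what each expression selects.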

Main Application: XML Packet Routing
• Selective Dissemination of Information [Altinel&Franklin’00, Chan et al.’02]
• XML content routing [Snoeren et al.’01]
• SOAP message routing in application servers

XML Packet Routing
[Figure: a stream of XML packets, each of the form <doc> <tag> value </tag> </doc>]

XPath expressions:
• /bib/book/publisher="MK"
• /bib/book[category="recent"]/title="Web"
• /bib/book//address//*/zip="123"
• /bib/book//address//*="Galaxy"
• /bib/book/category="recent"
• /bib/book/address="123"
• /bib/book/address/field="567"
Input XML stream: <bib> <book> ... </bib> <bib> <book> ... </bib>
Output XML streams

The XML Stream Processing Problem
• Given:
• A set of XPath expressions
• An incoming stream of XML documents
• Decide:
• For each document, which expressions it matches
Hard:
• Large number of XPath expressions, e.g. 10^3 to 10^6
• Streaming XML data, high throughput, e.g. 5 MB/s
Easy:
• Shallow XML data, e.g. depth = 20
• Short XPath expressions

The Approaches
Basic techniques
• NFA plus optimizations:
• XFilter/YFilter [Altinel&Franklin’00]
• XTrie [Chan et al.’02]
• DFA:
• XML Toolkit
Beyond the obvious
• Stream indexes (XML Toolkit)
• Stream views

From XPath to NFA
• /catalog/product[category="tools"][*/price = 200]/quantity
• //price
[Figure: the NFA for these two expressions, with transitions labeled catalog, product, category, "tools", *, price, 200, quantity, and ε]
Extra processing needed to combine branches (not in this talk)

Basic NFA Evaluation
[Figure: all XPath expressions (e.g. /bib/book/publisher="MK", /bib/book[category="recent"]/title, /bib/book//address//*/zip="123", ...) are compiled into one NFA. SAX events from the input stream <bib> <book> ... </bib> drive the NFA, which tracks a set of current states.]

Basic NFA Evaluation
Properties:
• Space = linear in the size of the XPath expressions
• Throughput = decreases linearly with the number of XPath expressions
Systems:
• XFilter [Altinel&Franklin’00], YFilter
• XTrie [Chan et al.’02]
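The NFA evaluation described on these slides can be sketched as follows. This is an illustrative sketch, not the XFilter, YFilter, or XTrie algorithm: it handles only linear expressions built from tag, /, //, and * (no predicates or value tests), and the function names `parse_xpath` and `match_queries` are made up for this example. An NFA state is a pair (query, number of steps already matched); a stack holds the set of current states for each open element.

```python
import io
import re
import xml.etree.ElementTree as ET

def parse_xpath(xpath):
    """Split '/a//b/*' into steps [('/', 'a'), ('//', 'b'), ('/', '*')]."""
    return re.findall(r"(//|/)([^/]+)", xpath)

def match_queries(queries, xml_bytes):
    """Return the subset of queries matched by at least one element."""
    steps = {q: parse_xpath(q) for q in queries}
    stack = [{(q, 0) for q in queries}]   # current states, one set per open element
    matched = set()
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            nxt = set()
            for q, i in stack[-1]:
                if i == len(steps[q]):
                    continue              # accepting state has no outgoing edges
                axis, name = steps[q][i]
                if axis == "//":
                    nxt.add((q, i))       # self-loop: skip this element
                if name in ("*", elem.tag):
                    nxt.add((q, i + 1))   # advance by one step
                    if i + 1 == len(steps[q]):
                        matched.add(q)
            stack.append(nxt)
        else:
            stack.pop()                   # end tag: restore the parent's states
    return matched
```

Each start tag costs time proportional to the number of active states, which grows with the number of expressions; this is the linear throughput degradation the slide refers to.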

Basic DFA Evaluation
[Figure: the same XPath expressions are compiled into DFAs. SAX events from the input stream <bib> <book> ... </bib> drive the DFA, which tracks a single current state.]

Basic DFA Evaluation
Properties:
• Throughput = constant !
• Space = GOOD QUESTION
System:
• XML Toolkit [University of Washington] http://xmltk.sourceforge.net

XMLTK: An XML Toolkit for Scalable XML Stream Processing
I. Avila-Campillo, T.J. Green, A. Gupta, M. Onizuka, D. Raven, D. Suciu

Motivation
• Lots of data sits in large text files
• ad hoc data formats
• “Queried” with Unix command line tools
• grep, sort, tail, etc.
• Would be nice to XML-ize it...
• ...but then the Unix command line tools won’t work any more.

Example Text File
• In the old Unix world…
score decision paperID title
accept P054 “Theory of XML parsing”
reject P021 “Experience with an XML optimizer”
accept P069 “Towards a unified theory of data models”
. . . . . .
• Find the top ten rejected papers (in score order):
grep "reject" papers.txt | sort | tail -n 10

Example (cont’d)
• In the new XML world…
<submissions>
  <paper>
    <score> 6 </score>
    <decision> accept </decision>
    <paperID> P054 </paperID>
    <title> Theory of XML parsing </title>
  </paper>
  <paper>
    <score> 3 </score>
    <decision> reject </decision>
    <paperID> P021 </paperID>
    <title> Experience with an XML optimizer </title>
  </paper>
  . . . . .
… can’t use those tools anymore

Example (cont’d)
Doing it with the XML Toolkit:
xsort -c /submissions -e paper[decision/text()="reject"] -k score/text() papers.xml | xtail -c /submissions -e paper -n 10
Finds the top ten rejected <paper>s, in <score> order
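For readers without the toolkit at hand, the result of the pipeline above can be reproduced with a short in-memory sketch in Python. This is not how xsort and xtail work internally (they process streams); the function name `top_rejected` is hypothetical, and comparing scores as integers is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

def top_rejected(xml_text, n=10):
    """Return the paperIDs of the n highest-scoring rejected papers,
    in ascending score order (mirroring sort followed by tail)."""
    root = ET.fromstring(xml_text)  # the <submissions> element
    rejected = [p for p in root.findall("paper")
                if (p.findtext("decision") or "").strip() == "reject"]
    # Assumption: scores are integers; int() tolerates surrounding spaces.
    rejected.sort(key=lambda p: int(p.findtext("score")))
    return [p.findtext("paperID").strip() for p in rejected[-n:]]
```

Loading the whole document is fine for small files; the point of the toolkit is to get the same answer on files that do not fit in memory.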

Goals of the XML Toolkit
Simple, scalable tools for XML processing
• Provides service: there are people who need this
• Provides a research platform: for XML stream processing

Outline
• The tools
• The XPath processing engine
• Conclusions

The Tools
Current tools:
• xsort (will talk only about this one)
• xagg
• xnest
• xflatten
• xdelete
• xpair
• xhead
• xtail
• file2xml
• xmill
May look like plenty, but actually still incomplete...

XSort: Definition
General form:
xsort (-c XPathExpr (-e XPathExpr (-k XPathExpr)*)*)*
-c = the context, i.e. where to sort
-e = the item, i.e. what to sort
-k = the key, i.e. what to sort on

XSort: Definition
[Figure: XSort applied to contexts c containing items e1 to e9, shown before and after sorting]

XSort Examples
Examples illustrated on data like this:
<bib>
  <book>
    <author>Elliotte Rusty Harold</author>
    <author>W. Scott Means</author>
    <title>XML in a Nutshell</title>
    <publisher>O'Reilly</publisher>
    <year>2001</year>
    <isbn>0-596-00058-8</isbn>
  </book>
  <paper>
    <author>Sylvain Devillers</author>
    <title>XML and XSLT Modeling for Multimedia Bitstream Manipulation.</title>
    <year>2001</year>
    <booktitle>WWW Posters</booktitle>
    <ee>http://www10.org/cdrom/posters/1112.pdf</ee>
    <url>db/conf/www/www2001p.html#Devillers01</url>
  </paper>
  . . . . .

XSort: Examples
xsort -c /bib -e paper -k title/text()
Sorts the <paper>s by <title>. The <book>s are dropped from the output:
<bib> <paper> . . . </paper> <paper> . . . </paper> . . . . . </bib>
Compare to…
xsort -c /bib -e * -k title/text()
xsort -c /bib -e paper -k title/text() -e book -k title/text()

XSort: Examples
xsort -c /bib -e paper/author -k lastName/text() -k firstName/text()
Sorts the <author>s by <lastName>, then <firstName>:
<bib> <author> . . . </author> <author> . . . </author> . . . . . </bib>

XSort: Examples
xsort -c /bib -e paper -e article -e book -e *
<paper>s first, then <article>s, then <book>s, then all the rest:
<bib> <paper> . . . </paper> <paper> . . . </paper> . . . . . <article> . . . </article> . . . . . <book> . . . </book> . . . . . </bib>
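The semantics illustrated by these examples can be summarized in a small in-memory sketch. It is an interpretation of the slides, not the toolkit's code: the function `xsort` below is a hypothetical Python stand-in that takes one context element and a list of (item tag, key child) pairs, one per -e option. It assumes that each child goes to the first -e it matches, that children matching no -e are dropped, and that items without a -k keep their document order.

```python
import xml.etree.ElementTree as ET

def xsort(context, items):
    """Reorder the children of `context` in place.

    items: list of (tag, key) pairs, one per -e option; tag may be "*",
    key is the name of a child element to sort on, or None for no -k.
    """
    children = list(context)
    buckets = [[] for _ in items]
    for child in children:
        for rank, (tag, _key) in enumerate(items):
            if tag in ("*", child.tag):
                buckets[rank].append(child)   # first matching -e wins
                break
    for child in children:
        context.remove(child)                 # unmatched children stay removed
    for (_tag, key), bucket in zip(items, buckets):
        if key is not None:
            bucket.sort(key=lambda el: el.findtext(key) or "")
        context.extend(bucket)
    return context
```

For example, `xsort(bib, [("paper", "title")])` corresponds to `xsort -c /bib -e paper -k title/text()`, and `xsort(bib, [("paper", None), ("article", None), ("book", None), ("*", None)])` corresponds to the grouping example above.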

XSort: Examples
xsort -c /bib/* -e author -e title -e year -e *
Normalize all entries: <author>s first, then <title>s, then <year>s, then all the other elements
xsort -c /bib/paper -e author -e * -c /bib/book -e title -e *
In <paper>s list the <author>s first; in <book>s list the <title> first; leave other entries unchanged

XSort: Implementation
• Sorts one context at a time, copies the rest
• For each context:
• Create a “global key” for each item
• Sort items, with a two-pass, multiway merge sort
• Quote from Databases 101 (news from the trenches):
• With disk blocks of 4KB and 128MB of main memory, one can sort files up to 4TB in two passes !
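The 4TB figure follows from simple arithmetic: pass one produces sorted runs as large as main memory, and pass two merges as many runs as there are block-sized buffers in memory (roughly memory / block size, ignoring the output buffer). The sketch below checks the numbers and shows the two passes on an in-memory list; `two_pass_sort` is a hypothetical name, and a real implementation would write the runs to disk.

```python
import heapq

def two_pass_sort(items, run_size):
    """Pass 1: cut the input into sorted runs of at most run_size items.
    Pass 2: merge all runs in a single multiway merge."""
    runs = [sorted(items[i:i + run_size])
            for i in range(0, len(items), run_size)]
    return list(heapq.merge(*runs))

BLOCK = 4 * 1024            # 4KB disk block
MEMORY = 128 * 1024 ** 2    # 128MB of main memory

fan_in = MEMORY // BLOCK    # runs that can be merged at once, one block each
max_file = fan_in * MEMORY  # each run can be as large as main memory
```

Here `fan_in` is 2^15 = 32768 and `max_file` is 2^15 × 2^27 = 2^42 bytes, which is the 4TB quoted on the slide.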

XSort: Performance
xsort -c /dblp -e * -k title/text()
1 GB: 8 minutes

The XPath Processor
Common to all tools is the following problem:
Given:
• Set of correlated XPath expressions
• Stream of SAX events
Decide:
• When are the expressions true → variable events

The XPath Processor
How we did it:
• All XPath expressions → Deterministic Finite Automaton
• Restriction: no predicates yet (current work...)
• Does this scale to many, many XPath expressions ?
• Yes, if we compute the DFA lazily (upcoming ICDT’2003 paper)
• Evaluation time = parsing time
• Can do even better with a Stream IndeX (next)
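The idea of computing the DFA lazily can be sketched in a few lines: a DFA state is a set of NFA states, and a transition is computed by the subset construction only the first time a (state, tag) pair is actually seen in the data, then cached. This is an illustrative sketch under the same restriction as the slide (no predicates), not the toolkit's implementation; the class name `LazyDFA` and its methods are made up for this example.

```python
import io
import re
import xml.etree.ElementTree as ET

def parse_xpath(xpath):
    """Split '/a//b/*' into steps [('/', 'a'), ('//', 'b'), ('/', '*')]."""
    return re.findall(r"(//|/)([^/]+)", xpath)

class LazyDFA:
    def __init__(self, queries):
        self.queries = list(queries)
        self.steps = [parse_xpath(q) for q in self.queries]
        # NFA state = (query index, steps matched); DFA state = frozenset of them.
        self.start = frozenset((q, 0) for q in range(len(self.queries)))
        self.table = {}  # (dfa_state, tag) -> (next dfa_state, accepted queries)

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:          # subset construction, on demand
            nxt, accepted = set(), set()
            for q, i in state:
                if i == len(self.steps[q]):
                    continue
                axis, name = self.steps[q][i]
                if axis == "//":
                    nxt.add((q, i))
                if name in ("*", tag):
                    nxt.add((q, i + 1))
                    if i + 1 == len(self.steps[q]):
                        accepted.add(self.queries[q])
            self.table[key] = (frozenset(nxt), frozenset(accepted))
        return self.table[key]

    def run(self, xml_bytes):
        """Return the queries matched by the document."""
        stack, matched = [self.start], set()
        events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
        for event, elem in events:
            if event == "start":
                state, accepted = self.step(stack[-1], elem.tag)
                stack.append(state)
                matched |= accepted
            else:
                stack.pop()
        return matched
```

Once the table is warm, each start tag costs one dictionary lookup regardless of how many expressions were registered, which is the constant-throughput property claimed for the DFA; how large the table can grow is the space question raised earlier.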

Stream IndeX (SIX)
News: The parser is the main bottleneck in XPath stream processing !
Solution: “Index” the XML stream, parse only partially
Definition: The SIX = a table of (start, end) offsets

Stream IndeX (SIX): Construction
[Figure: the XML document below, shown side by side with its SIX]
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>

Stream IndeX (SIX): Skip Parsing
XPath: /bib/paper/title . . .
[Figure: in the XML document below, both <book> elements are marked “Skip Parsing”]
<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book>
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
  <paper> . . . </paper>
  . . .
</bib>
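A toy version of the SIX and of skip parsing might look like the sketch below. It is written under strong assumptions that are not from the slides: the index is built with a simple regular expression rather than a real parser (so no comments, CDATA sections, or `>` inside attribute values), offsets are byte offsets into one in-memory document, and the names `build_six` and `select` are hypothetical. The point it illustrates is that an element whose tag cannot lead to a match is skipped by jumping to its end offset, without tokenizing its content.

```python
import re

# Matches an opening, closing, or self-closing tag; group 1 is "/" for closing tags.
TAG = re.compile(rb"<(/?)([A-Za-z_][\w.-]*)[^>]*>")

def build_six(data):
    """Return one (start, end) byte-offset pair per element, in document order."""
    six, open_elems = [], []
    for m in TAG.finditer(data):
        if m.group(1):                       # closing tag: fill in the end offset
            idx = open_elems.pop()
            six[idx] = (six[idx][0], m.end())
        elif m.group(0).endswith(b"/>"):     # self-closing element
            six.append((m.start(), m.end()))
        else:                                # opening tag: end not yet known
            open_elems.append(len(six))
            six.append((m.start(), None))
    return six

def select(data, six, path):
    """Evaluate a child-only path such as ["bib", "paper", "title"].

    Returns (matching elements as text, number of start tags examined)."""
    ends = dict(six)                         # start offset -> end offset
    out, visited = [], 0

    def walk(pos, limit, depth):
        nonlocal visited
        while True:
            m = TAG.search(data, pos, limit)
            if m is None or m.group(1):      # no more children of this element
                return
            visited += 1
            start, end = m.start(), ends[m.start()]
            if m.group(2).decode() == path[depth]:
                if depth == len(path) - 1:
                    out.append(data[start:end].decode())
                else:
                    walk(m.end(), end, depth + 1)
            pos = end                        # skip the rest of this element

    walk(0, len(data), 0)
    return out, visited
```

On a document shaped like the slide's example, a query for /bib/paper/title examines only the start tags of bib, each book, each paper, and the children of each paper; the contents of the books are never scanned.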

Stream IndeX (SIX) in XML Stream Processing
[Figure: the SIX travels alongside the XML stream <bib> <book> ... </bib> <bib> <book> ... </bib> ... (e.g. packaged with DIME)]
The SIX stream is about 6% of the data stream
And can be made MUCH smaller


Conclusions
• The toolkit is already available:
• http://www.cs.washington.edu/homes/suciu/XMLTK
• http://xmltk.sourceforge.net
• What it does so far it does very well:
• Sorting, aggregation, nest/unnest
• But doesn’t do too much:
• Restricted selections, no projections, no restructurings yet
• Volunteers welcome !
• Can one process XML data without parsing it completely ?
• SIX
