Processing XML Streams with Deterministic Automata. Denis Mindolin, Gaurav Chandalia.
Introduction • [Figure: an XML data stream enters an XML Stream Router, which holds XPath queries 1, 2 and 3 and serves Consumers 1, 2 and 3]
Related Work • The problem was introduced in [Altinel and Franklin 2000] for the XFilter system. • [Chan et al. 2002] describes a trie-based technique for the problem (XTrie). • [Diao et al. 2003] discusses a method based on optimized NFAs (YFilter). • [Green et al. 2003] shows how to solve the problem using a lazy DFA.
DFA approach in general • Convert the set of XPath expressions into a set of NFAs • Combine the set of NFAs into a single NFA • Convert the single NFA into a DFA • Process the XML data stream with the DFA (using the SAX model)
DFA approach in general (cont)
• Linear XPath expression:
P ::= /N | //N | PP
N ::= E | A | * | text() | text() = S
where
E – element label
A – attribute label
/ – child axis
// – descendant axis
* – wild card
S – constant string
• What about predicates? They are decomposed into linear XPath expressions.
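The grammar above can be read as a sequence of (axis, node test) steps. The following is a minimal Python sketch of such a reader, not the authors' parser. It assumes attribute labels are written with a leading @ and string constants are enclosed in straight double quotes; the names Step and parse_linear_xpath are illustrative only.

```python
import re
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    axis: str                    # "/" (child) or "//" (descendant)
    test: str                    # element label, "@attr", "*", or "text()"
    value: Optional[str] = None  # the constant S in text() = S, if present


# One step of the grammar: an axis, a node test, and an optional = "S".
_STEP = re.compile(r'''
    (?P<axis>//|/)
    (?P<test>text\(\)|\*|@?[A-Za-z_][\w.-]*)
    (?:\s*=\s*"(?P<value>[^"]*)")?
''', re.VERBOSE)


def parse_linear_xpath(expr: str) -> List[Step]:
    """Split a linear XPath expression into its steps (P ::= /N | //N | PP)."""
    steps: List[Step] = []
    pos = 0
    while pos < len(expr):
        m = _STEP.match(expr, pos)
        if m is None:
            raise ValueError(f"not a linear XPath expression at offset {pos}: {expr!r}")
        if m.group("value") is not None and m.group("test") != "text()":
            raise ValueError("only text() may be compared with a constant")
        steps.append(Step(m.group("axis"), m.group("test"), m.group("value")))
        pos = m.end()
    if not steps:
        raise ValueError("empty expression")
    return steps
```

For example, parse_linear_xpath('/datasets/dataset//tableHead//*/text()="Galaxy"') yields five steps, the last one being the text() test with the constant Galaxy.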
DFA approach in general (cont)
• Consider two XPath expressions:
/datasets/dataset[//tableHead//*/text()="Galaxy"]/title
/datasets/dataset[/history]/tableHead[/field]
• Corresponding query tree:
$D IN $R/datasets/dataset
$H IN $D/history
$T IN $D/title (sax f = true)
$TH IN $D/tableHead (sax f = true)
$N IN $D//tableHead//*
$F IN $TH/field
$V IN $N/text()="Galaxy"
Conversion of XPath expressions into NFA and DFA
• Query tree:
$X IN $R/a
$Y IN $X//*/b
$Z IN $X/b/*
$U IN $Z/d
• [Figures: query NFA and query DFA for this query tree]
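The conversion can be sketched in Python as follows. This is an illustration under stated assumptions, not the construction drawn on the slide: it uses the two root-to-leaf paths /a//*/b and /a/b/*/d of the query tree, gives each query its own chain of NFA states instead of sharing prefixes, and introduces a hypothetical symbol #other for any element label not named in a query. The names build_nfa, eager_dfa and run are illustrative.

```python
import re

ANY = "*"          # wildcard label
OTHER = "#other"   # hypothetical symbol: any label not named in a query


def build_nfa(queries):
    """One chain of NFA states per query; '//' adds a self-loop on any label."""
    trans, starts, accept = [], set(), {}
    for qid, expr in enumerate(queries):
        trans.append([])
        cur = len(trans) - 1
        starts.add(cur)
        for axis, label in re.findall(r"(//|/)([^/]+)", expr):
            if axis == "//":
                trans[cur].append((ANY, cur))   # skip any number of levels
            trans.append([])
            nxt = len(trans) - 1
            trans[cur].append((label, nxt))
            cur = nxt
        accept[cur] = qid
    return trans, frozenset(starts), accept


def eager_dfa(trans, start, alphabet):
    """Standard subset construction: compute every reachable state up front."""
    table, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in table:
            continue
        table[state] = {}
        for sym in alphabet:
            target = frozenset(
                t for q in state for lab, t in trans[q] if lab in (ANY, sym)
            )
            table[state][sym] = target
            todo.append(target)
    return table


queries = ["/a//*/b", "/a/b/*/d"]
trans, start, accept = build_nfa(queries)
alphabet = sorted({lab for row in trans for lab, _ in row if lab != ANY} | {OTHER})
dfa = eager_dfa(trans, start, alphabet)


def run(path):
    """Follow a root-to-node path of element labels; return matching query ids."""
    state = start
    for label in path:
        state = dfa[state][label if label in alphabet else OTHER]
    return sorted(accept[q] for q in state if q in accept)
```

Here run(["a", "b", "b"]) reports query 0 and run(["a", "b", "b", "d"]) reports query 1. The DFA state is a set of NFA states, so len(dfa) gives the number of eager states for this pair of queries.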
Eager DFA vs. Lazy DFA • A DFA is eager if it is obtained by the standard algorithm for converting an NFA to a DFA [Hopcroft and Ullman 1979]. • A DFA is lazy if it is constructed at run time, on demand. Initially it has a single state; whenever we attempt a transition into a missing state, we compute that state and update the transition.
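A minimal Python sketch of the lazy variant, driven by SAX events, is shown below. It is an illustration and not the authors' implementation: it handles only element labels and *, keeps one NFA chain per query, and the names build_nfa, LazyDFA, table and matches are assumptions made for the example.

```python
import re
import xml.sax

ANY = "*"  # wildcard label


def build_nfa(queries):
    """One chain of NFA states per query; '//' adds a self-loop on any label."""
    trans, starts, accept = [], set(), {}
    for qid, expr in enumerate(queries):
        trans.append([])
        cur = len(trans) - 1
        starts.add(cur)
        for axis, label in re.findall(r"(//|/)([^/]+)", expr):
            if axis == "//":
                trans[cur].append((ANY, cur))
            trans.append([])
            nxt = len(trans) - 1
            trans[cur].append((label, nxt))
            cur = nxt
        accept[cur] = qid
    return trans, frozenset(starts), accept


class LazyDFA(xml.sax.ContentHandler):
    """DFA states (sets of NFA states) are created only when first entered."""

    def __init__(self, queries):
        super().__init__()
        self.trans, start, self.accept = build_nfa(queries)
        self.table = {start: {}}   # initially a single DFA state
        self.stack = [start]       # one DFA state per open element
        self.path = []
        self.matches = []          # (element path, query index)

    def _next(self, state, symbol):
        row = self.table[state]
        if symbol not in row:      # missing transition: compute and store it
            target = frozenset(
                t for q in state for lab, t in self.trans[q] if lab in (ANY, symbol)
            )
            self.table.setdefault(target, {})
            row[symbol] = target
        return row[symbol]

    def startElement(self, name, attrs):
        state = self._next(self.stack[-1], name)
        self.stack.append(state)
        self.path.append(name)
        for q in sorted(state):
            if q in self.accept:
                self.matches.append(("/" + "/".join(self.path), self.accept[q]))

    def endElement(self, name):
        self.stack.pop()           # return to the parent's DFA state
        self.path.pop()


handler = LazyDFA(["/a//*/b", "/a/b/*/d"])
xml.sax.parseString(b"<a><b><b><d/></b></b></a>", handler)
```

After parsing, handler.matches lists the matched element paths with the index of the query they satisfy, and len(handler.table) is the number of DFA states that this particular document forced into existence.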
Eager DFA
• P = p_0 // p_1 // … // p_k
• p_i = N_1 / N_2 / … / N_{n_i}
• k = number of //'s
• n_i = length of p_i, i = 0, …, k
• m = maximum number of *'s in each p_i
• n = length (or depth) of P, i.e. n = n_0 + n_1 + … + n_k
• s = alphabet size |Σ|
• Theorem. Given a linear XPath expression P, define prefix(P) = n_0, and body(P) = [formula missing] when k > 0, and body(P) = 1 when k = 0. Then the eager DFA for P has at most prefix(P) + body(P) states. In particular, if m = 0 and k ≤ 1, then the DFA has at most (n+1) states.
Lazy DFA. Example
• Queries: /a//*/b and /a/b/*/d
• Sample XML document: <a> <b> <b> <d/> </b> </b> </a>
• [Figure: DFA with states 1 to 8 and transitions labeled a, b, d and *]
Lazy DFA • Graph schema (based on DTD) • d – the maximum number of simple cycles that a simple path can intersect • D – the total number of nonempty, simple paths starting at the root • [Figure: example graph schema with d = 2, D = 13]
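To make the definition of D concrete, here is a small Python sketch that counts the nonempty simple paths starting at the root of a graph schema. The graph used is hypothetical and is not the schema from the slide's figure; the function name count_simple_paths is illustrative, and the sketch does not compute d.

```python
def count_simple_paths(graph, root):
    """Count nonempty simple paths (no repeated node) that start at root."""
    def dfs(node, visited):
        total = 0
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                # the path ending at nxt, plus all of its simple extensions
                total += 1 + dfs(nxt, visited | {nxt})
        return total
    return dfs(root, frozenset([root]))


# Hypothetical schema: element a may contain b and c, b may contain a (a cycle) and d.
schema = {"root": ["a"], "a": ["b", "c"], "b": ["a", "d"], "c": [], "d": []}
D = count_simple_paths(schema, "root")
```

For this hypothetical schema the simple paths are root/a, root/a/b, root/a/b/d and root/a/c, so D is 4; the edge from b back to a closes a cycle and therefore adds no simple path.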
Lazy DFA (cont)
• Theorem. Consider a graph schema with d, D, and let Q be a set of XPath expressions of maximum depth n. Then on any XML input satisfying the schema, the lazy DFA has at most 1 + D(1+n)^d states.
• Corollary. The number of states of the lazy DFA does not depend on the number of XPath expressions, only on their depth.
• Example: n = 10 and 100,000 XPath expressions, over the graph schema of the previous slide (d = 2, D = 13).
• The eager DFA may have 2^100,000 states
• The lazy DFA will have at most 1 + 13·(1+10)^2 = 1574 states
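The arithmetic behind the 1574 figure can be checked with a few lines of Python. The function name lazy_dfa_state_bound is illustrative; the formula is the bound from the theorem, evaluated with d = 2 and D = 13 from the graph-schema slide.

```python
def lazy_dfa_state_bound(D: int, d: int, n: int) -> int:
    """Upper bound 1 + D * (1 + n)^d on the number of lazy DFA states."""
    return 1 + D * (1 + n) ** d


bound = lazy_dfa_state_bound(D=13, d=2, n=10)   # 1 + 13 * 11**2
```

The number of XPath expressions does not appear among the arguments, which is exactly the point of the corollary.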
Lazy DFA. Implementation • The XML stream is processed using the SAX model • The implementation supports a subset of XPath: • No text() or attribute-value tests • Only child and descendant axes • All predicates of a query must fire before the target element
Restrictions of the implementation
• Sample XML document: <courses> <course>367-203</course> <title>MEDIA WORKSHOP</title> <level>U</level><section> <section>Se 101</section> <days>T</days> <hours> <start>1:30pm</start> <end>5:20pm</end> </hours></section> <credits>1-3</credits> </courses>
• XPath queries:
1. All predicates fire before the target element: //courses[level]/section
2. Predicates fire between the starting and closing tags of the target element: //courses[days]/section
3. Predicates fire after the target element: //courses[credits]/section
Processing attributes
• When processing a stream, all attributes are converted into elements
• Before: <section_listing> <section name="Se 101" description=""/> <hours start="1:30pm" end="5:20pm"/></section_listing>
• After: <section_listing> <section> <@name>Se 101</@name> <@description/> </section> <hours> <@start>1:30pm</@start> <@end>5:20pm</@end> </hours> </section_listing>
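One way to picture this conversion is as a rewrite of the SAX event stream. The Python sketch below is an assumption about how such a rewrite could look, not the authors' code: each attribute becomes a start event for @name, an optional text event, and an end event, emitted right after the owning element's start event. The class name AttributesAsElements and the event tuples are illustrative.

```python
import xml.sax


class AttributesAsElements(xml.sax.ContentHandler):
    """Record SAX events, turning every attribute into a child element '@name'."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))
        # Attribute order is whatever the underlying parser reports.
        for attr_name, value in attrs.items():
            self.events.append(("start", "@" + attr_name))
            if value:                      # empty value: element with no text
                self.events.append(("text", value))
            self.events.append(("end", "@" + attr_name))

    def characters(self, content):
        if content.strip():                # ignore whitespace-only text
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


def convert(xml_bytes):
    """Parse a document and return the rewritten event list."""
    handler = AttributesAsElements()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


events = convert(b'<section name="Se 101" description=""/>')
```

With this rewrite in place, a path such as /section/@name can be matched by the same automaton transitions as ordinary element labels.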
Testing • Reference implementation: Galax 1.0.3.5 • Test XML stream: World geographic database http://www.cs.washington.edu/research/xmldatasets/data/mondial/mondial-3.0.xml (1MB) • Maximum XML depth of the stream: 6 • Number of queries: 14 • Query depth ranged from 1 to 5 • Number of predicates per query ranged from 0 to 3 • Predicate depth ranged from 1 to 4
References
• Green, T. J. et al. 2004. Processing XML streams with deterministic automata and stream indexes. ACM Transactions on Database Systems, 12/2004.
• Altinel, M. and Franklin, M. 2000. Efficient filtering of XML documents for selective dissemination. In Proceedings of VLDB. Cairo.
• Chen, J. et al. 2000. NiagaraCQ: a scalable continuous query system for internet databases. In Proceedings of the ACM/SIGMOD Conference on Management of Data.
• Diao, Y. and Franklin, M. 2003. Query processing for high-volume XML message brokering. In Proceedings of VLDB. Berlin, Germany.
• Hopcroft, J. E. and Ullman, J. D. 1979. Introduction to Automata Theory, Languages, and Computation.