On the Memory Requirements of XPath Evaluation over XML Streams
Ziv Bar-Yossef, Marcus Fontoura, Vanja Josifovski
IBM Almaden Research Center
Preliminaries: XML
<conference>
  <name> PODS </name>
  <speaker>
    <name> Josifovski </name>
    <paper_cnt> 1 </paper_cnt>
  </speaker>
  <speaker>
    <name> Fagin </name>
    <paper_cnt> 3 </paper_cnt>
  </speaker>
</conference>
Document tree (figure): root x0; conference x1; name x2 (PODS); speaker x3 with name x4 (Josifovski) and paper_cnt x5 (1); speaker x6 with name x7 (Fagin) and paper_cnt x8 (3).
Preliminaries: XPath 1.0
Query: /conference[name = PODS]/speaker[paper_cnt > 1]/name
Query tree (figure): root; conference with children name (= PODS) and speaker; speaker with children paper_cnt (> 1) and name.
Document (figure): the conference document with nodes x0 to x8, where x4 and x7 are the speaker names Josifovski and Fagin, and x5 and x8 are their paper counts 1 and 3.
Result: { x7 }
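To make the selection concrete, here is a minimal Python sketch that evaluates the query by hand over the sample document. It uses the standard-library ElementTree module; the names `DOC` and `evaluate` are made up for this illustration, and the whole document is loaded into memory, so this is not a streaming evaluation.

```python
import xml.etree.ElementTree as ET

DOC = """<conference>
  <name>PODS</name>
  <speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>
  <speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>
</conference>"""


def evaluate(xml_text):
    """Hand-coded /conference[name=PODS]/speaker[paper_cnt>1]/name."""
    root = ET.fromstring(xml_text)
    if root.tag != "conference":
        return []
    # Predicate [name = PODS] on the conference node.
    if not any((n.text or "").strip() == "PODS" for n in root.findall("name")):
        return []
    selected = []
    for speaker in root.findall("speaker"):
        # Predicate [paper_cnt > 1] on each speaker node.
        counts = [int(p.text) for p in speaker.findall("paper_cnt")]
        if any(c > 1 for c in counts):
            selected.extend((n.text or "").strip() for n in speaker.findall("name"))
    return selected


print(evaluate(DOC))  # ['Fagin']
```

Only the second speaker passes the paper_cnt predicate, which corresponds to node x7 in the slide.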
XML Streams
XML stream: an XML document arriving as a one-way stream
• Why XML streams?
  • For transferring XML between systems
  • For efficient access to large XML documents
• Critical resources:
  • Memory
  • Processing time
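As a rough illustration of the one-way stream model, the sketch below feeds a document to a parser in arbitrary chunks and only sees start and end events in document order. It relies on `xml.etree.ElementTree.XMLPullParser` from the Python standard library; the function name `stream_events` and the chunk boundaries are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET


def stream_events(chunks):
    """Yield (event, tag) pairs as the document arrives chunk by chunk."""
    parser = ET.XMLPullParser(events=("start", "end"))
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            yield event, elem.tag
    # Closing the parser flushes any events it was still holding back.
    parser.close()
    for event, elem in parser.read_events():
        yield event, elem.tag


chunks = ["<conference><name>PO", "DS</name></confer", "ence>"]
events = list(stream_events(chunks))
print(events)
```

A streaming algorithm in the sense of these slides has to answer the query from such an event sequence without going back to earlier parts of the document.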
Streaming XML Algorithms • XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02] • X-scan [Ives, Levy, and Weld 00] • XMLTK [Avila-Campillo et al 02] • XTrie [Chan et al 02] • SPEX [Olteanu, Kiesling, and Bry 03] • Lazy DFAs [Green et al 03] • The XPush Machine [Gupta and Suciu 03] • XSQ [Peng and Chawathe 03] • TurboXPath [Josifovski, Fontoura, and Barta 04] • …
Our Results
• Space lower bounds for evaluating XPath on XML streams
• A streaming XML algorithm
  • Matches the lower bounds on a large fragment of the language
  • Uses space linear in the query size rather than exponential in the query size
Related Work • Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03] • Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03] • Space complexity of select-project-join queries over relational data streams [Arasu et al 02]
Data Complexity [Vardi 82]
• Evaluation as a function of the pair (Q, D): a query Q evaluated on a document D.
• Q(D): evaluation function of a fixed query Q on a document D.
• Data complexity of Q: complexity of the best algorithm for Q on the worst-case D.
• Worst-case data complexity: max over Q of (data complexity of Q).
We characterize the data complexity of Q separately for each Q (not just the worst-case one).
XPath Fragment
1. Queries are subsumption-free
Figure: two example queries, each of the form root, conference, with name predicates. Subsumption-free: name != SIGMOD. Not subsumption-free: name = PODS together with name != SIGMOD.
XPath Fragment (cont.)
2. Queries are univariate
Figure: two example queries, each of the form root, conference, with children author_cnt and paper_cnt. Univariate: author_cnt > 30 and paper_cnt < 30. Not univariate: author_cnt < paper_cnt.
XPath Fragment (cont.) 3. Queries consist of conjunctions only 4. Queries are “star-restricted”
Query Frontier Size
Definitions:
• Frontier at u: u, its siblings, and the siblings of its ancestors.
• FrontierSize(Q): size of the largest frontier.
Example query (figure): root; conference with children name (= PODS) and speaker; speaker with children paper_cnt (> 1) and name.
Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)).
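The frontier definition can be sketched directly as a recursion over the query tree. The code below is an illustration under assumptions: the `QNode` class and `frontier_size` function are hypothetical names, and value predicates such as "= PODS" are folded into the node label rather than modeled as separate nodes.

```python
from dataclasses import dataclass, field


@dataclass
class QNode:
    label: str
    children: list = field(default_factory=list)


def frontier_size(node, siblings=0, ancestor_siblings=0):
    """Largest frontier in the subtree rooted at `node`.

    The frontier at a node consists of the node itself, its siblings,
    and the siblings of its ancestors.
    """
    best = 1 + siblings + ancestor_siblings
    for child in node.children:
        best = max(
            best,
            frontier_size(
                child,
                siblings=len(node.children) - 1,
                ancestor_siblings=ancestor_siblings + siblings,
            ),
        )
    return best


# Query tree for /conference[name = PODS]/speaker[paper_cnt > 1]/name
query = QNode("root", [
    QNode("conference", [
        QNode("name = PODS"),
        QNode("speaker", [QNode("paper_cnt > 1"), QNode("name")]),
    ]),
])

print(frontier_size(query))  # 3
```

The largest frontier is reached at either child of speaker: the node itself, its one sibling, and the name sibling of speaker.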
Document Recursion Depth
Definition: recDepth_Q(D): maximum number of nodes in D that lie on one root-to-leaf path and "path match" the same node in Q.
Example (figure): query Q contains a //part node with name and number children; document D (root x0) contains the part elements x1, x3, and x5, with name values Refrigerator and Compressor and number values 12 and 456.
Theorem 2: For all queries Q in the fragment that have at least one "//" node, stream-space(Q) = Ω(recDepth_Q(D)).
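A simplified way to see recursion depth in code: for a query node of the form //tag placed directly under the query root, every element with that tag path-matches it, so the recursion depth is the largest number of such elements on one root-to-leaf path. The sketch below covers only this special case; `rec_depth` is a hypothetical helper, and the document is a made-up one loosely modeled on the slide's part example.

```python
import xml.etree.ElementTree as ET


def rec_depth(elem, tag, seen=0):
    """Max number of `tag` elements on any root-to-leaf path below `elem`."""
    here = seen + (1 if elem.tag == tag else 0)
    return max([here] + [rec_depth(child, tag, here) for child in elem])


doc = ET.fromstring(
    "<part><name>Refrigerator</name>"
    "<part><name>Compressor</name>"
    "<part><number>456</number></part>"
    "<number>12</number></part></part>"
)

print(rec_depth(doc, "part"))  # 3
```

General path matching against an arbitrary query node needs the full query context and is not attempted here.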
Document Depth
Definition: depth(D): length of the longest root-to-leaf path.
Example (figure): document D (root x0) contains the part elements x1, x3, and x5, with name values Refrigerator and Compressor and number values 12 and 456.
Theorem 3: For all queries Q in the fragment that have at least one "/" node, stream-space(Q) = Ω(log depth(D)).
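One informal way to read the log depth(D) term: an algorithm that tracks the current level of the stream keeps a counter, and a counter that can reach depth(D) needs about log2 depth(D) bits. The sketch below is only an illustration of that counter, not the lower-bound argument; `stream_depth` is a hypothetical name, and it counts element nesting only (text nodes and the document root are not counted).

```python
import io
import xml.etree.ElementTree as ET


def stream_depth(xml_bytes):
    """Track element nesting in one pass and return the maximum level seen."""
    depth = 0
    max_depth = 0
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            depth += 1
            max_depth = max(max_depth, depth)
        else:
            depth -= 1
            elem.clear()  # drop text and children that were already processed
    return max_depth


doc = (
    b"<part><name>Refrigerator</name>"
    b"<part><name>Compressor</name>"
    b"<part><number>456</number></part></part></part>"
)

d = stream_depth(doc)
print(d, d.bit_length())  # maximum level and the bits needed to store it
```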
New algorithm
Theorem 4(a): For all queries Q in "Univariate XPath":
• Space: O(|Q| · recDepth(D) · log depth(D))
• Time: O(|D| · |Q| · recDepth(D))
Theorem 4(b): For all queries Q in a subset of our fragment and for non-recursive documents D:
• Space: O(FrontierSize(Q) · log depth(D))
• Time: O(|D| · FrontierSize(Q))
Proof of Theorem 1
Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)).
• Fragment:
  • "subsumption-free"
  • "univariate"
  • Conjunctions only
  • "star-restricted"
Example query (figure): root; conference with children name (= PODS) and speaker; speaker with children paper_cnt (> 1) and name.
Critical Document
Definition: A document D is critical for a query Q if:
(1) D matches Q.
(2) Removing any node from D yields a document that no longer matches Q.
Figure: the conference document D (nodes x0 to x8) shown next to the query Q: root; conference with children name (= PODS) and speaker; speaker with children paper_cnt (> 1) and name.
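The definition can be checked by brute force on small documents. The sketch below is an illustration under assumptions: `matches` is a hand-coded Boolean test for the example query /conference[name = PODS]/speaker[paper_cnt > 1]/name, removing a node is taken to mean deleting that element together with its subtree, and only element nodes are considered. Both function names are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def matches(root):
    """True if the document has at least one result for the example query."""
    if root.tag != "conference":
        return False
    if not any((n.text or "").strip() == "PODS" for n in root.findall("name")):
        return False
    for speaker in root.findall("speaker"):
        has_name = speaker.find("name") is not None
        many = any(int(p.text) > 1 for p in speaker.findall("paper_cnt"))
        if has_name and many:
            return True
    return False


def is_critical(root):
    """D matches Q, and deleting any single non-root element breaks the match."""
    if not matches(root):
        return False
    count = len(list(root.iter()))
    for i in range(1, count):  # index 0 is the root element itself
        clone = copy.deepcopy(root)
        nodes = list(clone.iter())
        parents = {child: parent for parent in clone.iter() for child in parent}
        target = nodes[i]
        parents[target].remove(target)
        if matches(clone):
            return False
    return True


full = ET.fromstring(
    "<conference><name>PODS</name>"
    "<speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>"
    "<speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker></conference>"
)
small = ET.fromstring(
    "<conference><name>PODS</name>"
    "<speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker></conference>"
)

print(is_critical(full), is_critical(small))  # False True
```

The two-speaker document matches the query but is not critical, because the Josifovski speaker can be removed without losing the match.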
Main Lemmas
Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)).
Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Q) = Ω(FrontierSize(D)).
Lemma 2: For all queries Q in the fragment, there is a critical document D so that FrontierSize(D) = FrontierSize(Q).
One-way Communication Complexity
f: X × Y → Z
• Alice holds x; Bob holds y.
• Alice sends a single message m to Bob, who outputs f(x, y).
CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.
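Reduction
A: streaming algorithm for Q using space S.
• The document is split as D = αβ; Alice holds the prefix α and Bob holds the suffix β.
• Alice runs A on α and sends state_A(α) to Bob.
• Bob resumes A from state_A(α) on β and outputs Q(D).
Theorem: stream-space(Q) ≥ CC(Q)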
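The reduction can be mimicked with a toy streaming algorithm: Alice runs it on her part of the input, ships only its internal state, and Bob resumes from that state. In the sketch below the class `BothSeen`, the function `one_way_protocol`, and the token-stream input are all assumptions made for illustration; the toy query just asks whether the stream contains both a "b" and a "c" token.

```python
import json


class BothSeen:
    """Toy streaming algorithm: does the stream contain both 'b' and 'c'?"""

    def __init__(self, state=None):
        self.seen = set(state or [])

    def feed(self, tokens):
        for token in tokens:
            if token in ("b", "c"):
                self.seen.add(token)

    def state(self):
        return sorted(self.seen)

    def answer(self):
        return self.seen == {"b", "c"}


def one_way_protocol(prefix, suffix):
    """Alice processes the prefix, sends her state, and Bob finishes the run."""
    alice = BothSeen()
    alice.feed(prefix)
    message = json.dumps(alice.state())  # the only information Alice sends
    bob = BothSeen(json.loads(message))
    bob.feed(suffix)
    return bob.answer(), len(message)


def run_directly(tokens):
    algo = BothSeen()
    algo.feed(tokens)
    return algo.answer()


result, message_length = one_way_protocol(["a", "c"], ["b", "/a"])
print(result, run_directly(["a", "c", "b", "/a"]))  # True True
```

Because the message is just the algorithm's state, any lower bound on the communication needed carries over to the space the streaming algorithm must use.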
Fooling Set Technique
Partitioned document: D_{α,β}, where α is the document prefix and β is the document suffix.
Definition: A set T of partitioned documents is a fooling set for Q if:
• All documents in T match Q.
• For any two distinct documents D_{α,β}, D_{α′,β′} in T, either D_{α,β′} does not match Q or D_{α′,β} does not match Q.
Theorem: For any fooling set T, CC(Q) = Ω(log |T|).
Proof of Lemma 1
Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Q) = Ω(FS(D)).
Example (figure): document D with root x0, conference x1, name x2 (PODS), speaker x3, name x4 (Fagin), and paper_cnt x5 (3); query Q: root; conference with children name (= PODS) and speaker; speaker with children paper_cnt (> 1) and name.
Proof of Lemma 1
For each subset S of Frontier(D), define a partitioned document D_S.
Example: S = { x2, x5 }
Figure: document D_S with root x0, conference x1, name x2 (PODS), speaker x3, name x4 (Fagin), and paper_cnt x5 (3), shown next to the query Q.
Proof of Lemma 1 (cont.)
Claim: { D_S : S ⊆ Frontier(D) } is a fooling set.
Consequently, stream-space(Q) ≥ log(2^FS(D)) = FS(D).
Proof of Claim:
1. For all S, D_S matches Q.
2. If S ≠ T, need: either D_{S,T} or D_{T,S} does not match Q.
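The counting behind the claim can be replayed on a set-based abstraction. The sketch below rests on assumptions read off the slide example rather than stated in it: the prefix of D_S contains exactly the frontier nodes in S, the suffix contains the remaining frontier nodes, and a mixed document matches Q only if every frontier node is present (which is what criticality gives). Tree structure and document order are ignored, and all names are hypothetical.

```python
from itertools import combinations
from math import log2

FRONTIER = frozenset({"x2", "x4", "x5"})


def subsets(items):
    """All subsets of `items` as frozensets."""
    ordered = sorted(items)
    return [
        frozenset(combo)
        for size in range(len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]


def mixed_matches(s, t):
    """D_{S,T}: prefix of D_S (holds S) followed by suffix of D_T (holds FRONTIER - T)."""
    present = s | (FRONTIER - t)
    return present == FRONTIER


def is_fooling_set(family):
    if not all(mixed_matches(s, s) for s in family):
        return False
    return all(
        not mixed_matches(s, t) or not mixed_matches(t, s)
        for s, t in combinations(family, 2)
    )


family = subsets(FRONTIER)
print(is_fooling_set(family), log2(len(family)))  # True 3.0
```

With T = { x4, x5 } and S = { x2, x5 }, the mixed document D_{T,S} lacks x2, which is the missing conference name in the slide example.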
Proof of Claim (example)
• T = { x4, x5 }: document D_T
• S = { x2, x5 }: document D_S
• Document D_{T,S}: root x0, conference x1, speaker x3 containing name x4 (Fagin) twice and paper_cnt x5 (3). The conference name is missing!
Algorithm
• Works directly on the query tree (does not use finite-state automata)
• Based on three global data structures
  • Pointer array
  • Validation array
  • Level array
• Matches the lower bounds for a fragment of XPath.
Algorithm Example Run
Query: /a[b and c]
Query nodes: u0 ($), u1 (/a), u2 and u3 (/b and /c)
Input XML: <a> <c>c1</c> <b>b1</b> </a> ...
Data structures: pointer array (with one entry), validation array, level array
Validation array entries after each event:
• $: a F 1
• a: a F 1 (index 0); b F 2 and c F 2 (index 1)
• c: b F 2, c F 2
• /c: b F 2, c T 2
• b: b F 2, c T 2
• /b: b T 2, c T 2
• /a: a T 1; return TRUE
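The run above can be imitated with a few flags and a level counter. The sketch below is not the algorithm from the slides (there is no pointer, validation, or level array here); it is a simplified, hand-written evaluator for the single query /a[b and c], using Python's `xml.etree.ElementTree.XMLPullParser`. The function name and the chunked input are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def matches_a_b_and_c(chunks):
    """One-pass check of /a[b and c] over a stream of XML text chunks."""
    parser = ET.XMLPullParser(events=("start", "end"))
    state = {"level": 0, "in_a": False, "b": False, "c": False, "result": False}

    def consume():
        for event, elem in parser.read_events():
            if event == "start":
                state["level"] += 1
                if state["level"] == 1 and elem.tag == "a":
                    state["in_a"] = True
            else:
                # A closing b or c directly under the root a validates that predicate.
                if state["in_a"] and state["level"] == 2 and elem.tag in ("b", "c"):
                    state[elem.tag] = True
                # Closing the root a: the query holds if both flags are set.
                if state["in_a"] and state["level"] == 1:
                    state["result"] = state["b"] and state["c"]
                state["level"] -= 1
                elem.clear()

    for chunk in chunks:
        parser.feed(chunk)
        consume()
    parser.close()
    consume()
    return state["result"]


print(matches_a_b_and_c(["<a><c>c1</c>", "<b>b1</b></a>"]))  # True
```

As in the example run, the c flag flips at /c, the b flag at /b, and the answer is produced when /a arrives.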
Conclusion: Our Contributions
• Space lower bounds on the instance data complexity of XPath on XML streams:
  • In terms of Query Frontier Size
  • In terms of Document Recursion Depth
  • In terms of Document Depth
• A streaming XML algorithm
  • Matches the lower bounds on a fragment of the language
  • Does not use finite-state automata
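XPath 1.0
Query Q: /conference/name (query nodes u0 $, u1 /C, u2 /N)
Document D (figure): the conference document with nodes x0 to x8, where C = conference, N = name, S = speaker, P = paper_cnt; x2 is the conference name PODS, x4 and x7 are the speaker names Josifovski and Fagin.
Result: { x2 }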
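XPath 1.0
Query Q: /conference//name (query nodes u0 $, u1 /C, u2 //N)
Document D (figure): the conference document with nodes x0 to x8, where C = conference, N = name, S = speaker, P = paper_cnt; x2 is the conference name PODS, x4 and x7 are the speaker names Josifovski and Fagin.
Result: { x2, x4, x7 }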
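The difference between the child axis (/) and the descendant axis (//) in these two queries can be reproduced with the Python standard library. In this sketch `findall("name")` looks only at direct children of the conference element, while `iter("name")` walks all descendants in document order; the variable names are chosen for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<conference>
  <name>PODS</name>
  <speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>
  <speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>
</conference>"""

conference = ET.fromstring(DOC)

# /conference/name: name elements that are direct children of conference
child_names = [n.text for n in conference.findall("name")]

# /conference//name: name elements anywhere below conference
descendant_names = [n.text for n in conference.iter("name")]

print(child_names)       # ['PODS']
print(descendant_names)  # ['PODS', 'Josifovski', 'Fagin']
```

The three descendant matches correspond to nodes x2, x4, and x7 in the slides.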
Reduction
A: S-space streaming algorithm for Q. r ≥ 1: integer.
Figure (r = 6): Alice and Bob pass the states s1, ..., s6 of A back and forth, starting from s0, while processing the segments of D; both obtain Q(D).
Theorem: S ≥ CC(Qr) / r