On the Memory Requirements of XPath Evaluation over XML Streams

Ziv Bar-Yossef, Marcus Fontoura, Vanja Josifovski. IBM Almaden Research Center.

Presentation Transcript


  1. On the Memory Requirements of XPath Evaluation over XML Streams. Ziv Bar-Yossef, Marcus Fontoura, Vanja Josifovski (IBM Almaden Research Center).

  2. Preliminaries: XML. Example document: <conference> <name> PODS </name> <speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker> <speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker> </conference> [Figure: the document drawn as a tree with nodes x0 (root) through x8.]

  3. Preliminaries: XPath 1.0. Query: /conference[name = PODS]/speaker[paper_cnt > 1]/name [Figure: the query tree next to the document tree from the previous slide.] Result: { x7 }, the name node of the speaker Fagin.
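The example query can be checked against the example document with a few lines of Python. This is a minimal, non-streaming sketch: the helper names `text` and `evaluate` are made up for illustration, and the predicates are hand-coded rather than parsed from the XPath string, because the standard library supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

DOC = """<conference> <name> PODS </name>
<speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker>
<speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker>
</conference>"""

def text(elem):
    """Element text with the padding whitespace of the example removed."""
    return (elem.text or "").strip()

def evaluate(root):
    """Hand-coded /conference[name = PODS]/speaker[paper_cnt > 1]/name."""
    if root.tag != "conference":
        return []
    # Predicate on the conference node: some child name equals PODS.
    if not any(text(n) == "PODS" for n in root.findall("name")):
        return []
    selected = []
    for speaker in root.findall("speaker"):
        # Predicate on the speaker node: some child paper_cnt exceeds 1.
        if any(float(text(c)) > 1 for c in speaker.findall("paper_cnt")):
            selected.extend(text(n) for n in speaker.findall("name"))
    return selected

print(evaluate(ET.fromstring(DOC)))
```

Only the second speaker satisfies paper_cnt > 1, so the single selected name is the one the slide labels x7.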

  4. XML Streams XML stream: XML document arriving as a one-way stream • Why XML streams? • For transferring XML between systems • For efficient access to large XML documents • Critical resources: • Memory • Processing time

  5. Streaming XML Algorithms • XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02] • X-scan [Ives, Levy, and Weld 00] • XMLTK [Avila-Campillo et al 02] • XTrie [Chan et al 02] • SPEX [Olteanu, Kiesling, and Bry 03] • Lazy DFAs [Green et al 03] • The XPush Machine [Gupta and Suciu 03] • XSQ [Peng and Chawathe 03] • TurboXPath [Josifovski, Fontoura, and Barta 04] • …

  6. Our Results • Space lower bounds for evaluating XPath on XML streams • A streaming XML algorithm • Matches the lower bounds on a large fragment of the language • Uses space linear in the query size rather than exponential in the query size

  7. Related Work • Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03] • Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03] • Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

  8. Data Complexity [Vardi 82]. Combined evaluation function (Q, D): evaluate a query Q on a document D, with both given as input. Fixed-query evaluation function Q(D): evaluate a fixed query Q on an input document D. Data complexity of Q: complexity of the best algorithm for Q on the worst-case D. Worst-case data complexity: max over Q of (data complexity of Q). We characterize the data complexity of Q separately for each Q (not just for the worst-case one).

  9. XPath Fragment. 1. Queries are subsumption-free. [Figure: two query trees rooted at conference. Subsumption-free: a single child name != SIGMOD. Not subsumption-free: children name = PODS and name != SIGMOD.]

  10. XPath Fragment (cont.). 2. Queries are univariate. [Figure: two query trees rooted at conference. Univariate: author_cnt > 30 and paper_cnt < 30. Not univariate: author_cnt < paper_cnt.]

  11. XPath Fragment (cont.) 3. Queries consist of conjunctions only 4. Queries are “star-restricted”

  12. Query Frontier Size. Definitions: • Frontier at u: u, its siblings, and the siblings of its ancestors. • FrontierSize(Q): size of the largest frontier. [Figure: the query tree root, conference, name = PODS, speaker, paper_cnt > 1, name.] Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)).
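The frontier definition can be made concrete with a short sketch. The `QNode` class and the helper names below are hypothetical, and the query tree is the running example from the slides; the code follows the definition literally (u, its siblings, and the siblings of its ancestors).

```python
class QNode:
    """A node of a query tree (illustrative structure, not from the paper)."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def frontier(u):
    """u, the siblings of u, and the siblings of every ancestor of u."""
    nodes = [u]
    node = u
    while node.parent is not None:
        nodes.extend(s for s in node.parent.children if s is not node)
        node = node.parent
    return nodes

def all_nodes(root):
    yield root
    for child in root.children:
        yield from all_nodes(child)

def frontier_size(root):
    return max(len(frontier(u)) for u in all_nodes(root))

# /conference[name = PODS]/speaker[paper_cnt > 1]/name as a tree.
QUERY = QNode("root", [
    QNode("conference", [
        QNode("name = PODS"),
        QNode("speaker", [QNode("paper_cnt > 1"), QNode("name")]),
    ]),
])

print(frontier_size(QUERY))
```

Under this reading, the largest frontier is reached at either child of speaker: the node itself, its sibling, and the sibling of speaker (name = PODS).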

  13. Document Recursion Depth. Definition: recDepth_Q(D): max number of nodes in D that lie on one root-to-leaf path and "path match" the same node in Q. [Figure: query Q = //part[number]/name? shown as root, //part, number, name, next to a document D with nodes x0 to x7 in which part elements nest three deep (Refrigerator, Compressor, and an innermost part with number 456).] Theorem 2: For all queries Q in the fragment that have at least one "//" node, stream-space(Q) = Ω(recDepth_Q(D)).
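For a single descendant-axis query node such as //part, every part element path-matches that node, so the recursion depth reduces to the deepest nesting of part elements. The sketch below computes that simplified quantity in one pass over the stream; the function name `rec_depth` is made up, the sample document only approximates the figure, and the general definition for multi-node queries is not covered.

```python
import io
import xml.etree.ElementTree as ET

DOC = ("<part><name>Refrigerator</name>"
       "<part><name>Compressor</name>"
       "<part><number>456</number></part>"
       "<number>12</number></part></part>")

def rec_depth(xml_text, tag):
    """Deepest nesting of `tag` elements on any root-to-leaf path."""
    open_count = 0   # number of currently open elements named `tag`
    deepest = 0
    events = ET.iterparse(io.BytesIO(xml_text.encode()),
                          events=("start", "end"))
    for event, elem in events:
        if elem.tag != tag:
            continue
        if event == "start":
            open_count += 1
            deepest = max(deepest, open_count)
        else:
            open_count -= 1
    return deepest

print(rec_depth(DOC, "part"))
```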

  14. Document Depth. Definition: depth(D): length of the longest root-to-leaf path. [Figure: the document D from the previous slide, with nodes x0 to x7.] Theorem 3: For all queries Q in the fragment that have at least one "/" node, stream-space(Q) = Ω(log depth(D)).
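The log depth(D) term has a simple intuition: a streaming evaluator that needs to know the current level must hold a counter that can grow as large as the document depth. A minimal sketch, assuming depth is counted in element nesting levels (the slides may count the root x0 or text nodes differently) and using the made-up name `stream_depth`:

```python
import io
import xml.etree.ElementTree as ET

DOC = ("<part><name>Refrigerator</name>"
       "<part><name>Compressor</name>"
       "<part><number>456</number></part>"
       "<number>12</number></part></part>")

def stream_depth(xml_text):
    """Track the current level with one counter while reading the stream."""
    level = 0
    deepest = 0
    events = ET.iterparse(io.BytesIO(xml_text.encode()),
                          events=("start", "end"))
    for event, elem in events:
        if event == "start":
            level += 1
            deepest = max(deepest, level)
        else:
            level -= 1
            elem.clear()  # drop content that has already been consumed
    return deepest

depth = stream_depth(DOC)
# Bits needed to store a level counter that can reach `depth`.
print(depth, depth.bit_length())
```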

  15. New algorithm. Theorem 4(a): For all queries Q in "Univariate XPath": Space: O(|Q| · recDepth(D) · log depth(D)). Time: O(|D| · |Q| · recDepth(D)). Theorem 4(b): For all queries Q in a subset of our fragment and for non-recursive documents D: Space: O(FrontierSize(Q) · log depth(D)). Time: O(|D| · FrontierSize(Q)).

  16. Proof of Theorem 1. Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)). • Fragment: • "subsumption-free" • "univariate" • Conjunctions only • "star-restricted" [Figure: the query tree root, conference, name = PODS, speaker, paper_cnt > 1, name.]

  17. Critical Document. Definition: Document D is critical for query Q if: (1) D matches Q. (2) If we remove any node from D, it no longer matches Q. [Figure: the query Q (root, conference, name = PODS, speaker, paper_cnt > 1, name) next to the example document D with nodes x0 to x8.]

  18. Main Lemmas. Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)). Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Q) = Ω(FrontierSize(D)). Lemma 2: For all queries Q in the fragment, there is a critical document D so that FrontierSize(D) = FrontierSize(Q).

  19. One-way Communication Complexity. f: X × Y → Z. Alice holds x and Bob holds y; Alice sends a single message m to Bob, who outputs f(x, y). CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.

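  20. Reduction. A: streaming algorithm for Q using space S. The document is split as D = (α, β): Alice holds the prefix α and Bob the suffix β. Alice runs A on α and sends state_A(α) to Bob, who resumes A on β and outputs Q(D). Theorem: stream-space(Q) ≥ CC(Q).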

  21. Fooling Set Technique. Partitioned document: D_{α,β}, where α is the document prefix and β is the document suffix. Definition: A set T of partitioned documents is a fooling set for Q if: • All documents in T match Q. • For any two distinct documents D_{α,β}, D_{α',β'} in T, either D_{α,β'} does not match Q or D_{α',β} does not match Q. Theorem: For any fooling set T, CC(Q) = Ω(log |T|).

  22. Proof of Lemma 1. Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Q) = Ω(FS(D)). [Figure: the query Q (root, conference, name = PODS, speaker, paper_cnt > 1, name) next to a critical document D with nodes x0 to x5: root, conference, name PODS, speaker, name Fagin, paper_cnt 3.]

  23. Proof of Lemma 1. For each subset S of Frontier(D), define a partitioned document D_S. Example: S = { x2, x5 }. [Figure: the partitioned document D_S next to the query Q.]

  24. Proof of Lemma 1 (cont.). Claim: { D_S : S ⊆ Frontier(D) } is a fooling set. Hence stream-space(Q) ≥ log(2^FS(D)) = FS(D). Proof of Claim: 1. For all S, D_S matches Q. 2. If S ≠ T, need: either D_ST or D_TS does not match Q.
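The counting argument can be checked on a toy model. The sketch below is an abstraction, not the construction from the talk: it assumes that the prefix of D_S carries the frontier nodes in S, that the suffix carries the remaining ones, and that a mixed document matches exactly when every frontier node shows up in its prefix or its suffix (which is what criticality suggests). The node names and all function names are illustrative.

```python
from itertools import combinations
from math import log2

FRONTIER = frozenset({"x2", "x4", "x5"})  # assumed frontier of a critical document

def subsets(items):
    items = sorted(items)
    return [frozenset(c)
            for k in range(len(items) + 1)
            for c in combinations(items, k)]

def partitioned(s):
    """Toy D_S: (frontier nodes in the prefix, frontier nodes in the suffix)."""
    return (s, FRONTIER - s)

def matches(prefix, suffix):
    """Toy criticality: the document matches only if no frontier node is missing."""
    return prefix | suffix == FRONTIER

def is_fooling_set(docs):
    if not all(matches(p, s) for p, s in docs):
        return False
    # For distinct documents, at least one of the two mixed documents must fail.
    for (p1, s1), (p2, s2) in combinations(docs, 2):
        if matches(p1, s2) and matches(p2, s1):
            return False
    return True

DOCS = [partitioned(s) for s in subsets(FRONTIER)]
print(is_fooling_set(DOCS), len(DOCS), log2(len(DOCS)))
```

With three frontier nodes there are 2^3 = 8 partitioned documents, and the logarithm of the fooling-set size equals the frontier size.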

  25. Proof of Claim (example). S = { x2, x5 }, T = { x4, x5 }. [Figure: the partitioned documents D_S and D_T, and the mixed document D_TS, in which the conference name node is missing, so D_TS does not match Q.]

  26. Algorithm • Uses the query as an NFA • Based on three global data structures • Pointer array • Validation array • Level array • Matches the lower bounds for a fragment of XPath.

  27. Algorithm Example Run. Query: /a[b and c]. Input XML: <a> <c>c1</c> <b>b1</b> </a> ... Query nodes: u0 = $, u1 = /a, u2 = /b, u3 = /c. Initial state: the pointer array has one entry; with the validation array and the level array, the entry reads (a, F, 1).

  28. Algorithm Example Run (cont.). Events read so far: $, a. Array entries (pointer, validation, level): (a, F, 1), (b, F, 2), (c, F, 2).

  29. Algorithm Example Run (cont.). Events read so far: $, a, c. Array entries: (a, F, 1), (b, F, 2), (c, F, 2).

  30. Algorithm Example Run (cont.). Events read so far: $, a, c, /c. Array entries: (a, F, 1), (b, F, 2), (c, T, 2).

  31. Algorithm Example Run (cont.). Events read so far: $, a, c, /c, b. Array entries: (a, F, 1), (b, F, 2), (c, T, 2).

  32. Algorithm Example Run (cont.). Events read so far: $, a, c, /c, b, /b. Array entries: (a, F, 1), (b, T, 2), (c, T, 2).

  33. Algorithm Example Run (cont.). Events read so far: $, a, c, /c, b, /b, /a. Array entries: (a, T, 1), (b, T, 2), (c, T, 2). Return TRUE.
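The run above can be imitated with a much simpler evaluator. The sketch below is not the algorithm from the talk: it handles only Boolean matching of conjunctive queries with child axes and no value predicates, and it keeps a stack of candidate query nodes with their validated children instead of the pointer, validation, and level arrays. `QNode` and `stream_matches` are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

class QNode:
    """Query node; all children must be matched (conjunction, child axis)."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def stream_matches(xml_text, query):
    dollar = QNode("$", [query])
    # One frame per open element: candidate query nodes for that element,
    # each paired with the ids of its children validated so far.
    stack = [[(dollar, set())]]
    events = ET.iterparse(io.BytesIO(xml_text.encode()),
                          events=("start", "end"))
    for event, elem in events:
        if event == "start":
            frame = [(child, set())
                     for q, _ in stack[-1]
                     for child in q.children
                     if child.label == elem.tag]
            stack.append(frame)
        else:
            frame = stack.pop()
            for q, validated in frame:
                if len(validated) == len(q.children):
                    # q is fully matched: flip its flag in the parent frame.
                    for parent, parent_validated in stack[-1]:
                        if q in parent.children:
                            parent_validated.add(id(q))
            elem.clear()
    _, validated = stack[0][0]
    return len(validated) == 1

QUERY = QNode("a", [QNode("b"), QNode("c")])   # /a[b and c]
print(stream_matches("<a> <c>c1</c> <b>b1</b> </a>", QUERY))
```

On the example input the flags flip in the same order as in the slides: c at </c>, b at </b>, and a at </a>.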

  34. Conclusion: our Contributions • Space lower bounds on the instance data complexity of XPath on XML streams: • In terms of Query Frontier Size • In terms of Document Recursion Depth • In terms of Document Depth • A streaming XML algorithm • Matches the lower bounds on a fragment of the language • Does not use finite-state automata

  35. XPath 1.0. Query Q: /conference/name, with query nodes u0 = $, u1 = /C, u2 = /N. [Figure: Q next to the example document D with nodes x0 to x8, labels abbreviated to C (conference), N (name), S (speaker), P (paper_cnt).] Result: { x2 }

  36. XPath 1.0. Query Q: /conference//name, with query nodes u0 = $, u1 = /C, u2 = //N. [Figure: Q next to the same document D.] Result: { x2, x4, x7 }
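The difference between the child axis and the descendant axis is easy to reproduce with the XPath subset in Python's standard library. In this sketch the paths are written relative to the conference element, which `ET.fromstring` returns as the root.

```python
import xml.etree.ElementTree as ET

DOC = """<conference> <name> PODS </name>
<speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker>
<speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker>
</conference>"""

conference = ET.fromstring(DOC)

# /conference/name: only name elements that are children of conference.
child_axis = [n.text.strip() for n in conference.findall("name")]

# /conference//name: name elements at any depth below conference.
descendant_axis = [n.text.strip() for n in conference.findall(".//name")]

print(child_axis)
print(descendant_axis)
```

The child axis selects one node and the descendant axis selects three, in line with the results { x2 } and { x2, x4, x7 } on the slides.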

  37. Reduction. A: S-space streaming algorithm for Q. r ≥ 1: integer. [Figure: the document D divided into segments s0 through s6 (r = 6) between Alice and Bob, both of whom output Q(D).] Theorem: S ≥ CC(Qr) / r