Buffering in Query Evaluation over XML Streams

Ziv Bar-Yossef (Technion); Marcus Fontoura and Vanja Josifovski (IBM Almaden Research Center)

XML Document
<department>
  <name>Software Testing</name>
  <employee id="1">
    <name>Alice</name>
    <position>engineer</position>
  </employee>
  <employee id="2">
    <name>Bob</name>
    <position>engineer</position>
  </employee>
  <employee id="3">
    <name>Carole</name>
    <position>assistant</position>
  </employee>
  <manager id="4">
    <name>John</name>
  </manager>
</department>

XML Document Tree
root
  department
    name: Software Testing
    employee
      @id: 1
      name: Alice
      position: engineer
    employee
      @id: 2
      name: Bob
      position: engineer
    employee
      @id: 3
      name: Carole
      position: assistant
    manager
      @id: 4
      name: John

XPath Queries
/department[manager/name = "John"]/employee[position = "engineer"]/name
[Figure: the query steps aligned with the XML document tree]

XPath Queries
/department[employee/name = manager/name]/name
[Figure: the query steps aligned with the XML document tree]

XPath
• XPath 2.0
• Forward axes only
• Eval(Q,D): the nodes in D that match Q
• Two modes of XPath evaluation:
  • Full-fledged evaluation: given Q and D, output Eval(Q,D)
  • Filtering: given Q and D, determine whether Eval(Q,D) is nonempty

XML Streams
• XML stream: a sequence of SAX events
  • startDocument(), endDocument(), startElement(name), endElement(name), text(str), …
• Critical resources
  • Memory
  • Processing time
• Why XML streams?
  • For transferring XML between systems
  • For efficient access to large XML documents
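To make the event model concrete, here is a minimal sketch that turns a small document into the event sequence named on the slide. It assumes Python's standard `xml.sax` module as the parser; `EventLogger` and `sax_events` are hypothetical names used only for this illustration, and the slide's `text(str)` event corresponds to the `characters` callback in that API.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records SAX callbacks using the event names from the slide."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._text = []  # character chunks seen since the last tag

    def _flush_text(self):
        # The parser may deliver text in several chunks, so join them first.
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.events.append(f"text({text})")

    def startDocument(self):
        self.events.append("startDocument()")

    def endDocument(self):
        self.events.append("endDocument()")

    def startElement(self, name, attrs):
        self._flush_text()
        self.events.append(f"startElement({name})")

    def endElement(self, name):
        self._flush_text()
        self.events.append(f"endElement({name})")

    def characters(self, content):
        self._text.append(content)


def sax_events(xml_text):
    """Parse xml_text and return the list of events in stream order."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


if __name__ == "__main__":
    for event in sax_events("<department><name>Software Testing</name></department>"):
        print(event)
```

A streaming algorithm sees only this sequence, one event at a time, and anything it wants to consult later has to be kept in its own memory.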

Streaming XML Algorithms
• XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
• X-scan [Ives, Levy, and Weld 00]
• XMLTK [Avila-Campillo et al 02]
• XTrie [Chan et al 02]
• SPEX [Olteanu, Kiesling, and Bry 03]
• Lazy DFAs [Green et al 03]
• The XPush Machine [Gupta and Suciu 03]
• XSQ [Peng and Chawathe 03]
• FluX [Koch et al 04]
• TurboXPath [Josifovski, Fontoura, and Barta 05]
• …
All of them use large amounts of memory on certain queries and documents

Memory Bottleneck I: Storage of Large Transition Tables
• Framework of most algorithms:
  • Q → NFA
  • Simulate the NFA by a DFA
• Caveat: exponential blowup
• However: exponential blowup is not necessary [Bar-Yossef, Fontoura, Josifovski 04]
  • Algorithm for filtering XML streams whose space is linear in the query size
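The blowup can be seen on a textbook automaton. The sketch below is not taken from the talk: it assumes the standard NFA for "the k-th symbol from the end is a" over the alphabet {a, b}, which is loosely analogous to a path query with a descendant step followed by wildcards, and it counts the DFA states reached by the subset construction. The names `nfa_step` and `reachable_dfa_states` are hypothetical.

```python
def nfa_step(states, symbol, k):
    """One NFA move. State 0 loops on every symbol and may guess that the
    current 'a' is the k-th symbol from the end; states 1..k-1 just count."""
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)
            if symbol == "a":
                nxt.add(1)
        elif s < k:
            nxt.add(s + 1)
    return frozenset(nxt)


def reachable_dfa_states(k):
    """Number of subsets reached by the subset construction."""
    start = frozenset({0})
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for symbol in "ab":
            nxt = nfa_step(current, symbol, k)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)


if __name__ == "__main__":
    for k in range(1, 9):
        # NFA has k + 1 states; the determinized automaton has 2**k.
        print(k, k + 1, reachable_dfa_states(k))
```

The NFA grows linearly in k while the transition table of the determinized automaton doubles with each step, which is the storage cost this slide refers to.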

Memory Bottleneck II: Buffering of Document Fragments
• Scenario 1: buffering nodes that may or may not be part of the output.
  /department[manager/name = "John"]/employee[position = "engineer"]/name
  [Figure: XML document tree for the department document]
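Scenario 1 can be traced with a small hand-written evaluator for this one query. This is a sketch under stated assumptions, not the algorithm from the talk: it uses Python's `xml.sax`, hard-codes the query's paths and constants, and the names `Scenario1Evaluator`, `evaluate`, and `DOC` are hypothetical. Candidate names are held until both predicates are resolved, and the peak number of held names is recorded.

```python
import xml.sax

DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""


class Scenario1Evaluator(xml.sax.ContentHandler):
    """/department[manager/name = "John"]/employee[position = "engineer"]/name"""

    def __init__(self):
        super().__init__()
        self.path, self.text = [], []
        self.manager_ok = False    # has manager/name = "John" been seen?
        self.position_ok = False   # does the open employee have position "engineer"?
        self.current_names = []    # names of the employee that is still open
        self.buffer = []           # names waiting only for the manager predicate
        self.output = []
        self.peak = 0              # largest number of names held at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        self.text = []
        path = "/".join(self.path)
        if path == "department/employee/name":
            self.current_names.append(value)
        elif path == "department/employee/position":
            self.position_ok = self.position_ok or value == "engineer"
        elif path == "department/employee":
            if self.position_ok:
                target = self.output if self.manager_ok else self.buffer
                target.extend(self.current_names)
            self.current_names, self.position_ok = [], False
        elif path == "department/manager/name" and value == "John":
            self.manager_ok = True
            self.output.extend(self.buffer)  # pending candidates become output
            self.buffer = []
        self.peak = max(self.peak, len(self.buffer) + len(self.current_names))
        self.path.pop()


def evaluate(xml_text):
    handler = Scenario1Evaluator()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.output, handler.peak


if __name__ == "__main__":
    print(evaluate(DOC))
```

On the sample document the manager arrives last, so Alice and Bob stay buffered until the very end, and Carole is held briefly until her position is known.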

Memory Bottleneck II: Buffering of Document Fragments
• Scenario 2: buffering nodes needed for evaluating pending predicates.
  /department[employee/name = manager/name]/name
  [Figure: XML document tree for the department document]
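For Scenario 2, the values on both sides of the comparison have to be remembered until a match is found. The following sketch is an illustration for this single query rather than the talk's algorithm; it assumes Python's `xml.sax`, and `Scenario2Evaluator`, `evaluate`, and `DOC` are hypothetical names. It stops buffering as soon as the predicate is known to hold.

```python
import xml.sax

DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""


class Scenario2Evaluator(xml.sax.ContentHandler):
    """/department[employee/name = manager/name]/name"""

    def __init__(self):
        super().__init__()
        self.path, self.text = [], []
        self.employee_names, self.manager_names = set(), set()
        self.matched = False   # has some employee name equalled some manager name?
        self.candidates = []   # department/name values, output only on a match
        self.result = []
        self.peak = 0          # largest number of predicate values held at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        self.text = []
        path = "/".join(self.path)
        if path == "department/name":
            self.candidates.append(value)
        elif path == "department/employee/name":
            self._observe(value, self.employee_names, self.manager_names)
        elif path == "department/manager/name":
            self._observe(value, self.manager_names, self.employee_names)
        elif path == "department":
            self.result = self.candidates if self.matched else []
        self.path.pop()

    def _observe(self, value, own, other):
        if self.matched:
            return  # predicate already satisfied: nothing more to remember
        if value in other:
            self.matched = True
            own.clear()
            other.clear()
        else:
            own.add(value)
        self.peak = max(self.peak, len(self.employee_names) + len(self.manager_names))


def evaluate(xml_text):
    handler = Scenario2Evaluator()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.result, handler.peak


if __name__ == "__main__":
    print(evaluate(DOC))                          # no employee is named John
    print(evaluate(DOC.replace("John", "Bob")))   # hypothetical variant with a match
```

In the sample document no employee shares the manager's name, so every name stays buffered to the end and the result is empty; the same buffering is needed even in filtering mode, where only a yes/no answer is produced.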

Memory Bottleneck II: Buffering of Document Fragments
• Scenario 3: buffering multiple candidate matches that are nested within each other.
  • Relevant only when the document is “recursive”
  • Space required: Ω(doc-recursion-depth) [Bar-Yossef, Fontoura, Josifovski 04]

Our Results
• Quantitative space lower bounds for:
  • Full-fledged evaluation of queries with predicates (Scenario 1)
  • Filtering/full-fledged evaluation of queries with “multivariate” predicates (Scenario 2)
• Matching upper bound
  • Eager evaluation of predicates
• In all other scenarios: no buffering required
  • Filtering non-recursive documents using queries with “univariate” predicates is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]

Related Work
• Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
• Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
• Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

Document Concurrency
• Q: query
• D = σ_1,…,σ_n: document
  • Each σ_i is a SAX event
• α_t = (σ_1,…,σ_t): the first t events of D
• Definition: x ∈ D is alive at step t if x ∈ α_t and there exist suffixes β, β′ s.t.
  • x ∈ Eval(Q, α_t β)
  • x ∉ Eval(Q, α_t β′)
• t-concurrency(D,Q): number of distinct nodes that are alive at step t
• concurrency(D,Q): max_t t-concurrency(D,Q)

Lower Bound Notions
• A “normal” lower bound: for every algorithm A, there exist Q and D s.t. A uses Ω(concurrency(D,Q)) bits of space on Q and D.
  • Q and D may be “pathological”
  • Doesn’t say much about real-world queries/documents
• An “ideal” lower bound: for every A, every Q, and every D, A uses Ω(concurrency(D,Q)) bits of space on Q and D.
  • Too good to be true
  • A can have D and Q “hard-coded”, and then know the result a priori
  • Space of A on D and Q = minimum description length (MDL) of Q and D

Our Lower Bound
• Theorem: for every A, every Q, and every D, there exists an almost-isomorphic document D′ s.t. A uses Ω(concurrency(D,Q)) bits of space on Q and D′.
  • D′ is the same as D, except for a few extra empty nodes with auxiliary names.
• Theorem holds only if:
  • Q is “star-free”
  • D is non-recursive

Why isn’t this Obvious?
• Reason 1: we want the theorem to work for every Q and D, not only ones with high MDL.
• Reason 2:
  • Obvious: if x is alive at step t ⇒ A has to buffer x
    • Because: A may or may not need to output x
  • Not obvious: if x and y are alive at step t ⇒ A has to buffer both
    • If x and y are not “independent”, maybe it’s enough to buffer just x (or just y)

Proof of Lower Bound
• C = t-concurrency(D,Q)
• x_1,…,x_C = distinct nodes alive at step t
• Recall: for every x_i there exist suffixes β_i and β′_i s.t. (with α_t the first t events of D)
  • x_i ∈ Eval(Q, α_t β_i)
  • x_i ∉ Eval(Q, α_t β′_i)
• Lemma: there exist a single β and a single β′ s.t. for all i,
  • x_i ∈ Eval(Q, α_t β)
  • x_i ∉ Eval(Q, α_t β′)

Proof of Lower Bound (cont.)
• For every S ⊆ {1,…,C} define a document D_S:
  • D_S is the same as D, except
  • For every i ∈ S, we “mark” x_i
  • Marking: an extra empty child with an auxiliary name
• Note: D_S is almost-isomorphic to D
• α_t^S = first t events in D_S
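The marking construction can be sketched on the sample department document. This is only an illustration under assumptions: it uses Python's `xml.etree.ElementTree` on an in-memory tree, takes the three employee name nodes as x_1,…,x_C, and uses `aux` as the auxiliary element name; `mark`, `DOC`, and `all_subsets` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""


def mark(xml_text, candidates_path, subset, aux_name="aux"):
    """Return D_S: the document with an empty child named aux_name added
    to x_i for every i in subset (i is 1-based, in document order)."""
    root = ET.fromstring(xml_text)
    candidates = root.findall(candidates_path)  # x_1, ..., x_C
    for i, node in enumerate(candidates, start=1):
        if i in subset:
            ET.SubElement(node, aux_name)
    return ET.tostring(root, encoding="unicode")


def all_subsets(c):
    """All subsets of {1, ..., c}."""
    items = range(1, c + 1)
    return [set(combo) for r in range(c + 1) for combo in combinations(items, r)]


if __name__ == "__main__":
    path = "./employee/name"
    print(mark(DOC, path, {1, 3}))
    # Each subset S yields a different document D_S.
    print(len({mark(DOC, path, s) for s in all_subsets(3)}))
```

Every D_S differs from D only by the added empty elements, so the documents are almost-isomorphic, yet the 2^C choices of S are pairwise distinguishable.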

Proof of Lower Bound (cont.)
• A = any algorithm
• Consider the state of A after processing the prefix α_t^S (the first t events of D_S):
  • If suffix = β′, none of the x_i’s should be output
    • ⇒ A could not have output any x_i by step t
  • If suffix = β, there is no information in the suffix about S, but S can be reconstructed from the output
    • ⇒ state of A at step t must have all information about S
• Conclusion: space ≥ Ω(C)
• Actual proof: by one-way communication complexity
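The final step is a counting argument, which can be written out as follows. This is a simplified version for a deterministic algorithm A; as the slide notes, the actual proof goes through one-way communication complexity. Here "state" means the memory contents of A after the first t events.

```latex
\[
\#\{\text{states of } A \text{ after step } t\}
\;\ge\; \#\{\, S : S \subseteq \{1,\dots,C\} \,\} \;=\; 2^{C}
\quad\Longrightarrow\quad
\text{space}(A) \;\ge\; \log_2 2^{C} \;=\; C \ \text{bits}.
\]
```

If two different sets S and S′ led to the same state, A would produce the same output on the common suffix, and S could not be reconstructed from that output.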

Conclusions
• Our contributions:
  • Quantitative space lower bounds
    • Full-fledged evaluation of queries with predicates
    • Filtering/full-fledged evaluation of queries with “multivariate” predicates
  • Matching upper bound
• Open problems:
  • Quantitative lower bounds for XQuery evaluation over streams
  • Address larger fragments of XPath

Memory Bottleneck II: Buffering of Document Fragments
• Scenario 3: buffering multiple candidate matches that are nested within each other.
  Example query: //a[b and c]
  [Figure: document tree with nodes root, a, c, a, c, b, a, b]
  • Relevant only when the document is “recursive”
  • Space required: Ω(doc-recursion-depth) [Bar-Yossef, Fontoura, Josifovski 04]
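Nested candidates can be illustrated with a stack-based filter for //a[b and c]. This is a sketch, not the talk's algorithm: it assumes Python's `xml.sax`, the test document is a hypothetical one (not necessarily the tree on the slide), and `NestedCandidates` and `run` are names made up for the example. Each open a element keeps a record of whether a b child and a c child have been seen.

```python
import xml.sax

# Hypothetical recursive document: three a elements nested inside each other.
RECURSIVE_DOC = "<a><c/><a><c/><a><b/></a><b/></a><b/></a>"


class NestedCandidates(xml.sax.ContentHandler):
    """Counts matches of //a[b and c] and the peak number of undecided a's."""

    def __init__(self):
        super().__init__()
        self.open = []     # one entry per open element: flags for an a, else None
        self.matches = 0
        self.peak = 0

    def startElement(self, name, attrs):
        parent = self.open[-1] if self.open else None
        if parent is not None and name in ("b", "c"):
            parent[name] = True  # a direct b or c child of an open a
        self.open.append({"b": False, "c": False} if name == "a" else None)
        pending = sum(
            1 for rec in self.open
            if rec is not None and not (rec["b"] and rec["c"])
        )
        self.peak = max(self.peak, pending)

    def endElement(self, name):
        rec = self.open.pop()
        if rec is not None and rec["b"] and rec["c"]:
            self.matches += 1


def run(xml_text):
    handler = NestedCandidates()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matches, handler.peak


if __name__ == "__main__":
    print(run(RECURSIVE_DOC))
```

While the innermost a is open, all three a elements are still undecided, so the number of pending records equals the nesting depth of a in this document.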

Concurrency: Example
/department[manager/name = "John"]/employee[position = "engineer"]/name
<department>
  <name>Software Testing</name>
  <employee id="1">
    <name>Alice</name>
    <position>engineer</position>
  </employee>
  <employee id="2">
    <name>Bob</name>
    <position>engineer</position>
  </employee>
  <employee id="3">
    <name>Carole</name>
    <position>assistant</position>
  </employee>
  <manager id="4">
    <name>John</name>
  </manager>
</department>
[Figure annotations on the document, in slide order: dead, alive, alive]
