Michael Schmidt Stefanie Scherzinger Christoph Koch Saarland University Database Group

Combined Static and Dynamic Analysis for Effective Buffer Minimization in Streaming XQuery Evaluation
Michael Schmidt, Stefanie Scherzinger, Christoph Koch
Saarland University Database Group, Saarbrücken, Germany
2007 IEEE 23rd International Conference on Data Engineering, April 17, 2007

Outline
• I. Streaming XQuery Evaluation
  • Motivation and Requirements
  • Desiderata for streaming and in-memory XQuery Engines
  • Existing Approaches
• II. Combining Static and Dynamic Buffer Minimization
  • Query Normalization
  • The Concept of Roles
  • Active Garbage Collection
  • System Architecture
  • Optimizations
• III. The GCX XQuery Engine
  • Prototype Implementation
  • Benchmark Results
• IV. Summary

I. Motivation and Requirements
• The growing importance of streaming XML processing comes with the proliferation of the WWW
• Streams may arrive at very high rates, so storing incoming data to disk is often infeasible
• A main-memory DOM tree representation of XML documents is very space-consuming, so buffer management becomes the key prerequisite for performance
• The problem becomes even more urgent when evaluating (powerful fragments of) XQuery rather than simple filters on data streams
• Streaming techniques are very useful for in-memory XQuery engines

I. Desiderata for in-memory XQuery Engines
• Only buffer data that is relevant for query evaluation
• Avoid multiple copies of the data in main memory
• Do not keep data buffered longer than necessary
Claim: A combination of static and dynamic analysis is required to satisfy all desiderata

I. Existing Approaches (1)
• Only buffer data that is relevant for query evaluation
Document Projection
• Static query analysis
• Detect the parts of the document that are relevant to query evaluation
• Project away the parts of the document that are not relevant to query evaluation
A. Marian and J. Siméon “Projecting XML Documents” In Proc. VLDB’03, pages 213–224, 2003.
S. Bréssan, B. Catania, Z. Lacroix, Y. G. Li and A. Maddalena “Accelerating Queries by Pruning XML Documents” TKDE, 54(2):211–240, 2005.
V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyen “Type-Based XML Projection” In Proc. VLDB’06, 2006.

I. Existing Approaches (2): Document Projection
XQuery:
    <q> {
      for $b in /bib/book
      where ($b/author="A. Turing" and fn:exists($b/price))
      return $b/title
    } </q>
Projection paths (dos := descendant-or-self):
    { /bib/book,
      /bib/book/author/dos::node(),
      /bib/book/price,
      /bib/book/title/dos::node() }
[Figure: XML document tree rooted at bib, with book (author, price, title), article, and isbn nodes]
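The projection step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of the cited papers or of GCX: it works on a parsed tree rather than on a stream, supports only child steps plus a trailing `dos::node()`, and the names `parse`, `keep`, `project`, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Projection paths from the slide; dos::node() abbreviates descendant-or-self::node().
PATHS = [
    "/bib/book",
    "/bib/book/author/dos::node()",
    "/bib/book/price",
    "/bib/book/title/dos::node()",
]

def parse(path):
    """Split a path into its element steps and a flag for a trailing dos::node()."""
    steps = path.strip("/").split("/")
    subtree = steps[-1] == "dos::node()"
    return (steps[:-1] if subtree else steps), subtree

PARSED = [parse(p) for p in PATHS]

def keep(tags):
    """Classify a node by its root-to-node tag list.

    Returns 'full' (keep element and text), 'bare' (keep the element only),
    or None (project the node away)."""
    result = None
    for steps, subtree in PARSED:
        if subtree and tags[:len(steps)] == steps:
            return "full"            # inside a subtree selected by dos::node()
        if steps[:len(tags)] == tags:
            result = "bare"          # exact match, or ancestor of a possible match
    return result

def project(elem, tags=()):
    """Return a projected copy of elem, or None if elem is irrelevant."""
    tags = list(tags) + [elem.tag]
    mode = keep(tags)
    if mode is None:
        return None
    copy = ET.Element(elem.tag)
    if mode == "full":
        copy.text = elem.text
    for child in elem:
        kept = project(child, tags)
        if kept is not None:
            copy.append(kept)
    return copy

# Hypothetical sample document (not from the slides).
DOC = ("<bib><book><title>Computability</title><author>A. Turing</author>"
       "<price>10</price><isbn>1</isbn></book>"
       "<article><title>T</title></article></bib>")
projected = project(ET.fromstring(DOC))
print(ET.tostring(projected, encoding="unicode"))
```

In this sketch the `price` element survives without its text, because the query only tests its existence, while `isbn` and the whole `article` subtree are dropped.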

I. Existing Approaches (3)
• Avoid multiple copies of the data in main memory
• Do not keep data buffered longer than necessary
Hard to satisfy both desiderata in combination
Two approaches: (1) Single DOM tree (2) Buffers for variables
XQuery:
    <q> {
      for $x1 in //book return
        for $x2 in //* return
          for $x3 in //article return
            <node/>
    } </q>

II. The Big Picture
[Figure: system overview. Components: XQuery, Normalized XQuery, Projection Tree, Roles, Rewritten XQuery (role updates), input stream, Buffer (nodes annotated with roles), Evaluator, output stream. Edge labels: variable bindings; role removals, active garbage collection. Legend: transformation/extraction; input/output communication.]

II. Query Normalization
• Rewriting where-expressions to if-statements
• Pushing down if-statements
Original query:
    <r> {
      for $b in /bib
      where (fn:exists($b/book))
      return <books>{ $b/book }</books>
    } </r>
Normalized query:
    <r> {
      for $b in /bib return (
        if (fn:exists($b/book)) then <books> else (),
        if (fn:exists($b/book)) then $b/book else (),
        if (fn:exists($b/book)) then </books> else ()
      )
    } </r>
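The two normalization rules can be sketched on a toy abstract syntax tree. This is an illustrative Python sketch under simplifying assumptions: conditions and leaf expressions are plain strings, an element constructor is modelled as a sequence of opening tag, content, and closing tag, and the else-branch is always the empty sequence. The classes `For`, `If`, `Seq` and the functions `normalize` and `push_if` are hypothetical names, not part of GCX.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class For:
    var: str
    path: str
    body: object
    where: object = None   # optional where-condition (a string in this sketch)

@dataclass(frozen=True)
class If:
    cond: str
    then: object           # the else-branch is implicitly the empty sequence ()

@dataclass(frozen=True)
class Seq:
    items: tuple

def push_if(cond, expr):
    """Push 'if cond then ... else ()' down to the members of a sequence."""
    if isinstance(expr, Seq):
        return Seq(tuple(push_if(cond, item) for item in expr.items))
    return If(cond, expr)

def normalize(expr):
    """Rewrite where-clauses to if-expressions and push the ifs down."""
    if isinstance(expr, For):
        body = normalize(expr.body)
        if expr.where is not None:
            body = push_if(expr.where, body)
        return For(expr.var, expr.path, body)
    if isinstance(expr, Seq):
        return Seq(tuple(normalize(item) for item in expr.items))
    if isinstance(expr, If):
        return push_if(expr.cond, normalize(expr.then))
    return expr            # leaf expression, left unchanged

# The slide's example: <books>{ $b/book }</books> guarded by a where-clause.
query = For("$b", "/bib",
            Seq(("<books>", "$b/book", "</books>")),
            where="fn:exists($b/book)")
normalized = normalize(query)
print(normalized)
```

After normalization every output item carries its own condition, which mirrors the three if-expressions in the normalized query on the slide.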

II. Deriving Roles
XQuery:
    <r> {
      for $bib in /bib return (
        for $x in $bib/* return
          if (not(fn:exists($x/price))) then $x else (),
        for $b in $bib/book return
          $b/title
      )
    } </r>
Projection tree derived from the query:
    /
      /bib
        /*
          /price[1]
          dos::node()
        /book
          /title/dos::node()

II. Assigning Roles
• Matching document nodes are assigned roles when they are projected into the buffer
• Roles are assigned on the fly while reading the input
• Nodes without roles and role-carrying ancestors need not be buffered (projection)
Roles:
    r1  /
    r2  /bib
    r3  /bib/*
    r4  /bib/*/price[1]
    r5  /bib/*/dos::node()
    r6  /bib/book
    r7  /bib/book/title/dos::node()
XML document (nodes annotated with their roles):
    bib       { r2 }
      book    { r3, r5, r6 }
        title   { r5, r7 }
        author  { r5 }
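Role assignment for the example can be reproduced with a small Python sketch. It assumes a simplified path matcher (child steps, `*`, a `[1]` position test, and a trailing `dos::node()`), walks a parsed tree instead of a stream, and does not use the lazily constructed DFA that GCX builds. The function names `step_matches`, `roles_for`, and `assign` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Role table from the slide.
ROLES = {
    "r1": "/",
    "r2": "/bib",
    "r3": "/bib/*",
    "r4": "/bib/*/price[1]",
    "r5": "/bib/*/dos::node()",
    "r6": "/bib/book",
    "r7": "/bib/book/title/dos::node()",
}

def step_matches(step, tag, first):
    """Match one path step against a tag; 'first' says whether the node is
    the first child with that tag (needed for the [1] predicate)."""
    if step.endswith("[1]"):
        return first and step_matches(step[:-3], tag, first)
    return step == "*" or step == tag

def roles_for(ancestry):
    """ancestry: list of (tag, is_first_of_its_tag) from the root element down."""
    found = set()
    for role, path in ROLES.items():
        steps = [s for s in path.split("/") if s]
        subtree = bool(steps) and steps[-1] == "dos::node()"
        if subtree:
            steps = steps[:-1]
        if not steps:
            continue   # "/" is the document root, which is not an element here
        n = len(steps)
        depth_ok = len(ancestry) >= n if subtree else len(ancestry) == n
        if depth_ok and all(step_matches(s, t, f)
                            for s, (t, f) in zip(steps, ancestry)):
            found.add(role)
    return found

def assign(elem, ancestry=(), first=True, out=None):
    """Collect the roles of every role-carrying element, keyed by its tag path.
    Keys are unique only while sibling tags are unique, as in the example."""
    out = {} if out is None else out
    ancestry = list(ancestry) + [(elem.tag, first)]
    roles = roles_for(ancestry)
    if roles:
        out["/" + "/".join(t for t, _ in ancestry)] = sorted(roles)
    seen = set()
    for child in elem:
        assign(child, ancestry, child.tag not in seen, out)
        seen.add(child.tag)
    return out

assigned = assign(ET.fromstring("<bib><book><title/><author/></book></bib>"))
for path, roles in assigned.items():
    print(path, roles)
```

For the input `<bib><book><title/><author/></book></bib>` the sketch yields the same role sets as the annotated document on the slide.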

II. Inserting Role Updates
Original query:
    <r> {
      for $bib in /bib return (
        for $x in $bib/* return
          if (not(fn:exists($x/price))) then $x else (),
        for $b in $bib/book return
          $b/title
      )
    } </r>
Rewritten query with role updates:
    <r> {
      for $bib in /bib return (
        for $x in $bib/* return (
          if (not(exists($x/price))) then $x else (),
          signOff($x,r3),
          signOff($x/price[1],r4),
          signOff($x/dos::node(),r5)
        ),
        for $b in $bib/book return (
          $b/title,
          signOff($b,r6),
          signOff($b/title/dos::node(),r7)
        ),
        signOff($bib,r2)
      )
    } </r>
Roles and the variables they are attached to:
    Role  Path                          Variable
    r1    /
    r2    /bib                          $bib
    r3    /bib/*                        $x
    r4    /bib/*/price[1]               $x/price
    r5    /bib/*/dos::node()            $x
    r6    /bib/book                     $b
    r7    /bib/book/title/dos::node()   $b/title

II. Active Garbage Collection
Rewritten query:
    <r> {
      for $bib in /bib return (
        for $x in $bib/* return (
          if (not(exists($x/price))) then $x else (),
          signOff($x,r3),
          signOff($x/price[1],r4),
          signOff($x/dos::node(),r5)
        ),
        for $b in $bib/book return (
          $b/title,
          signOff($b,r6),
          signOff($b/title/dos::node(),r7)
        ),
        signOff($bib,r2)
      )
    } </r>
Input stream:
    <bib> <book> <title/> <author/> </book> …
Buffer (role sets shrink as signOff-statements execute; nodes that lose their last role are garbage collected):
    bib       {r2}
      book    {r3, r5, r6} → {r5, r6} → {r6}
        title   {r5, r7} → {r7}
        author  {r5} → {}
Output stream:
    <r> <book> <title/> <author/> </book>
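The interplay of signOff-statements and garbage collection on this example can be simulated with a short Python sketch. It rests on assumptions that the slides do not spell out: roles are kept as a multiset per node, a node is deleted once it holds no roles and has no remaining children, and deletion cascades to ancestors that have become role-free. The names `Node`, `collect`, `sign_off`, and `snapshot` are hypothetical, and the signOff calls are issued by hand in the order the rewritten query would issue them.

```python
from collections import Counter

class Node:
    def __init__(self, tag, roles=(), parent=None):
        self.tag = tag
        self.roles = Counter(roles)   # multiset of roles held by this node
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def descendants_or_self(self):
        yield self
        for child in list(self.children):
            yield from child.descendants_or_self()

def collect(node):
    """Delete node if it has no roles and no children, then retry its parent.
    Assumes every buffered leaf carries at least one role (projection)."""
    while node is not None and not node.roles and not node.children:
        parent = node.parent
        if parent is not None:
            parent.children.remove(node)
        node.parent = None
        node = parent

def sign_off(nodes, role):
    """Remove one instance of role from each node, then try to collect it."""
    for node in list(nodes):
        if node.roles[role] > 0:
            node.roles[role] -= 1
            if node.roles[role] == 0:
                del node.roles[role]
            collect(node)

def snapshot(node):
    return (node.tag, sorted(node.roles), [snapshot(c) for c in node.children])

# Buffer contents after reading <bib><book><title/><author/></book>.
root = Node("/", ["r1"])
bib = Node("bib", ["r2"], root)
book = Node("book", ["r3", "r5", "r6"], bib)
title = Node("title", ["r5", "r7"], book)
author = Node("author", ["r5"], book)

trace = []
# First inner loop, $x bound to book.
sign_off([book], "r3")
sign_off([], "r4")                            # $x/price[1] selects nothing here
sign_off(book.descendants_or_self(), "r5")    # author loses its last role
trace.append(snapshot(root))
# Second inner loop, $b bound to book.
sign_off([book], "r6")
sign_off(title.descendants_or_self(), "r7")   # title and then book are collected
trace.append(snapshot(root))
# End of the outer loop body.
sign_off([bib], "r2")
trace.append(snapshot(root))

for state in trace:
    print(state)
```

The first snapshot matches the intermediate buffer state on the slide (book holding only r6, title only r7, author already removed); after the last signOff the buffer is empty apart from the document root.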

II. Optimizations
• Rewrite path steps to for-expressions
• Use aggregated roles
• Remove redundant roles
Query:
    <r> { for $bib in /bib return $bib/book } </r>
Query with role updates:
    <r> {
      for $bib in /bib return
        ($bib/book,
         signOff($bib,r1),
         signOff($bib/book/dos::node(),r2))
    } </r>
Query after rewriting the path step to a for-expression:
    <r> {
      for $bib in /bib return
        for $_1 in $bib/book return $_1
    } </r>
Rewritten query with role updates:
    <r> {
      for $bib in /bib return
        (for $_1 in $bib/book return
           ($_1, signOff($_1/dos::node(),r2)),
         signOff($bib,r1))
    } </r>

III. The GCX XQuery Engine
• Garbage Collected XQuery
• Implemented in C++ for a fragment of composition-free XQuery
  • Arbitrarily nested single-step for-loops
  • FWR-expressions
  • Child and descendant axes
  • Node tests for tags, wildcards, node(), text()
  • If-expressions with and, or, not, fn:exists
• Let/some-expressions and aggregations not yet supported
• No support for attributes (no restriction)
• Open Source (Berkeley Software Distribution Licence)
• GCX project page: http://www.infosys.uni-sb.de/projects/streams/gcx/index.php
• GCX download page: http://www.infosys.uni-sb.de/software/gcx/

III. Benchmark Results (1)
• Time and memory consumption
• Queries and documents from the XMark Benchmark
• Queries and documents modified to match the supported fragment
• 3GHz Intel Pentium IV CPU with 2GB RAM
• SuSE Linux 10.0, J2RE v1.4.2 for Java-based systems
• Time limit: 1 hour
• Benchmarks against the following systems
  • FluX: Java in-memory engine for streaming XQuery evaluation.
  • MonetDB v4.12.0/XQuery v0.12.0: a secondary storage engine written in C++. Loading of the document is included in time measurements.
  • QizX/open v1.1: free in-memory XQuery engine written in Java.
  • Saxon v8.7.1: free in-memory XQuery engine written in Java.

III. Benchmark Results (2)
XMark Q1:
    <query1> {
      for $s in /site return
        for $p in $s/people return
          for $pe in $p/person return
            if ($pe/person_id="person0")
            then <result>{ $pe/name }</result>
            else ()
    } </query1>
[Figure: running time (s) for XMark Q1]

III. Benchmark Results (3)
XMark Q1:
    <query1> {
      for $s in /site return
        for $p in $s/people return
          for $pe in $p/person return
            if ($pe/person_id="person0")
            then <result>{ $pe/name }</result>
            else ()
    } </query1>
[Figure: memory consumption (MB) for XMark Q1]

III. Benchmark Results (4)
XMark Q8:
    <query8> {
      for $root in (/) return
        for $site in $root/site return
          for $people in $site/people return
            for $person in $people/person return
              <item> { (
                <person>{ $person/name }</person>,
                <items_bought> {
                  for $site2 in $root/site return
                    for $cas in $site2/closed_auctions return
                      for $ca in $cas/closed_auction return
                        for $buyer in $ca/buyer return
                          if ($buyer/buyer_person=$person/person_id)
                          then <result> { $ca } </result>
                          else ()
                } </items_bought>
              ) } </item>
    } </query8>

III. Benchmark Results (5)
[Figures: memory consumption (MB) and running time (s) for XMark Q8]
• Failure for 100MB: MonetDB
• Failure for 200MB: GCX, FluxQuery, MonetDB

IV. Summary
• Combination of static and dynamic buffer minimization
• Roles are derived from the XQuery and assigned to matching document nodes in the preprojection phase
• The XQuery expression is statically rewritten: at runtime, signOff-statements cause buffered nodes to lose roles
• An active garbage collection mechanism removes nodes that have lost their last role from the buffer
• Document projection is integrated into the role concept
• The technique behaves very well for composition-free XQuery w.r.t. execution time and memory consumption
• Applicable in streaming contexts, but also useful for common in-memory XQuery engines

Thank you for your attention!

Z. Bar-Yossef, M. Fontoura, and V. Josifovski “On the Memory Requirements of XPath Evaluation over XML Streams” In Proc. PODS’04, pages 177–188, 2004
M. Benedikt, W. Fan, and F. Geerts “XPath Satisfiability in the Presence of DTDs” In Proc. PODS, pages 25–36, 2005
V. Benzaken, G. Castagna, D. Colazzo, and K. Nguyen “Type-Based XML Projection” In Proc. VLDB’06, 2006
S. Bréssan, B. Catania, Z. Lacroix, Y. G. Li and A. Maddalena “Accelerating Queries by Pruning XML Documents” TKDE, 54(2):211–240, 2005
L. Fegaras, R. Dash, and Y. Wang “A Fully Pipelined XQuery Processor” In XIME-P, 2006
L. Fegaras, D. Levine, S. Bose, and V. Chaluvadi “Query Processing of Streamed XML Data” In Proc. CIKM 2002, pages 126–133, 2002
T. J. Green, G. Miklau, M. Onizuka, and D. Suciu “Processing XML Streams with Deterministic Automata” In Proc. ICDT’03, pages 173–189, 2003
C. Koch “On the complexity of nonrecursive XQuery and functional query languages on complex values” ACM Transactions on Database Systems, 31(4), 2006
C. Koch, S. Scherzinger, N. Schweikardt, and B. Stegmaier “Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams” In Proc. VLDB’04, pages 228–239, 2004
X. Li and G. Agrawal “Efficient evaluation of XQuery over streaming data” In Proc. VLDB’05, pages 265–276, 2005
A. Marian and J. Siméon “Projecting XML Documents” In Proc. VLDB’03, pages 213–224, 2003
D. Olteanu, H. Meuss, T. Furche, and F. Bry “XPath: Looking Forward” In EDBT 02: Proceedings of the Workshops XMLDM, MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers, pages 109–127, 2002
D. Olteanu, T. Kiesling, and F. Bry “An Evaluation of Regular Path Expressions with Qualifiers against XML Streams” In Proc. ICDE’03, page 702, 2003
H. Su, E. A. Rundensteiner, and M. Mani “Semantic Query Optimization for XQuery over XML Streams” In Proc. VLDB, pages 277–288, 2005
P. R. Wilson “Uniprocessor Garbage Collection Techniques” In Proc. IWMM’92, pages 1–42, 1992

Additional Resources

Full Benchmark Results

Benchmark Queries (1)
    <query1> {
      for $s in /site return
        for $p in $s/people return
          for $pe in $p/person return
            if ($pe/person_id="person0")
            then <result>{ $pe/name }</result>
            else ()
    } </query1>

    <query6> {
      for $site in //site return
        for $regions in $site/regions return
          $regions//item
    } </query6>

Benchmark Queries (2)
    <query8> {
      for $root in (/) return
        for $site in $root/site return
          for $people in $site/people return
            for $person in $people/person return
              <item> { (
                <person>{ $person/name }</person>,
                <items_bought> {
                  for $site2 in $root/site return
                    for $cas in $site2/closed_auctions return
                      for $ca in $cas/closed_auction return
                        for $buyer in $ca/buyer return
                          if ($buyer/buyer_person=$person/person_id)
                          then <result> { $ca } </result>
                          else ()
                } </items_bought>
              ) } </item>
    } </query8>

Benchmark Queries (3)
    <query13> {
      for $site in /site return
        for $regions in $site/regions return
          for $australia in $regions/australia return
            for $item in $australia/item return
              <item> { (
                <name> { $item/name } </name>,
                <desc> { $item/description } </desc>
              ) } </item>
    } </query13>

Benchmark Queries (4)
    <query20> {
      for $site in /site return
        for $people in $site/people return
          for $person in $people/person return
            if (fn:not(fn:exists($person/person_income)))
            then $person
            else ()
    } </query20>

Buffer Plot (1)
    <query6> {
      for $site in //site return
        for $regions in $site/regions return
          $regions//item
    } </query6>
[Figure: buffer plot for XMark Q6 on 10MB input document]
According to the DTD, all regions occur at the beginning of the document

Buffer Plot (2)
    <query8> {
      for $root in (/) return
        for $site in $root/site return
          for $people in $site/people return
            for $person in $people/person return
              <item> { (
                <person>{ $person/name }</person>,
                <items_bought> {
                  for $site2 in $root/site return
                    for $cas in $site2/closed_auctions return
                      for $ca in $cas/closed_auction return
                        for $buyer in $ca/buyer return
                          if ($buyer/buyer_person=$person/person_id)
                          then <result> { $ca } </result>
                          else ()
                } </items_bought>
              ) } </item>
    } </query8>
[Figure: buffer plot for XMark Q8 on 10MB input document. Annotations: first partition of join partners (persons); second partition of join partners (buyers)]

Buffer Plot (3)
XQuery:
    <r> {
      for $bib in /bib return (
        for $x in $bib/* return
          if (not(exists($x/price))) then $x else (),
        for $b in $bib/book return
          $b/title
      )
    } </r>
[Figure: DTD sketch: bib, (book|article)*, with children author, title, price]
[Figures: buffer plots for two input documents, labelled "9 x article + 1 x book" and "9 x book + 1 x article"]

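The GCX Runtime Engine
[Figure: runtime architecture. Components: input stream, Stream Preprojector, Buffer Manager, Buffer, Evaluator, XQuery, output stream. Between Buffer Manager and Stream Preprojector: nextNode(), answered with node/eos. Between Evaluator and Buffer Manager: getNext($x/π), answered with node/NULL; signOff($x/π,r), answered with OK. Between Buffer Manager and Buffer: node lookup, garbage collection, nodes/roles.]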
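The pull-based interaction named in the diagram (nextNode, getNext, signOff) can be sketched in Python. This is a heavily simplified illustration, not the GCX implementation: the input is a flat list of tags instead of an XML tree, projection is a plain membership test, a node is freed by a single signOff instead of by losing its last role, and all class and method names are hypothetical.

```python
class StreamPreprojector:
    """Hands out one relevant input node per call; None signals end of stream."""
    def __init__(self, tags, relevant):
        self._tags = iter(tags)
        self._relevant = relevant

    def next_node(self):
        for tag in self._tags:
            if tag in self._relevant:   # irrelevant nodes are projected away
                return tag
        return None                     # eos


class BufferManager:
    """Answers evaluator requests from the buffer, reading input only on demand."""
    def __init__(self, preprojector):
        self.preprojector = preprojector
        self.buffer = []                # (sequence number, tag) pairs in memory
        self.read = 0                   # number of nodes pulled so far

    def get_next(self, tag, after=-1):
        """Return the first node with this tag and sequence number > after,
        or None if the stream ends without one."""
        for node in self.buffer:        # 1. node lookup in the buffer
            if node[0] > after and node[1] == tag:
                return node
        while True:                     # 2. otherwise pull more input
            item = self.preprojector.next_node()
            if item is None:
                return None
            node = (self.read, item)
            self.read += 1
            self.buffer.append(node)
            if item == tag:
                return node

    def sign_off(self, node):
        self.buffer.remove(node)        # node is no longer needed: free it


def evaluate(manager, tag):
    """Output every node with the given tag; track the peak buffer size."""
    out, last, peak = [], -1, 0
    while True:
        node = manager.get_next(tag, last)
        if node is None:
            break
        peak = max(peak, len(manager.buffer))
        out.append(node[1])
        last = node[0]
        manager.sign_off(node)
    return out, peak


stream = ["book", "article", "book", "article", "book"]
manager = BufferManager(StreamPreprojector(stream, relevant={"book"}))
output, peak = evaluate(manager, "book")
print(output, peak)
```

Because input is read only when the evaluator asks for it and each node is released right after use, the buffer in this sketch never holds more than one node at a time.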

System Architecture
[Figure: system architecture. Components: input XQuery, Normalized XQuery, Projection Paths, Roles, Projection DFA (constructed lazily, assigns roles), Rewritten XQuery (role updates), input stream, Stream Preprojector, Buffer (nodes & roles), Evaluator, output stream. Edge label: role updates.]
