A Unified Model for XQuery Evaluation over XML Data Streams
Jinhui Jian, Hong Su, Elke A. Rundensteiner
Worcester Polytechnic Institute
ER 2003
Need for Stream Processing
• New environment
  • Data sources are everywhere
  • Data requests are everywhere
• New applications
  • Sensor networks
  • Analysis of XML web logs
  • Selective dissemination of XML information (e.g., news)
Specific Challenges for XML Streams
• Token-by-token access manner
• Queries need pattern retrieval plus filtering/restructuring
• A token is not a direct counterpart of a tuple

Input stream (tokens arrive one by one along the timeline):
<biditems>
  <book year="2001">
    <title>Dream Catcher</title>
    <author><last>King</last><first>S.</first></author>
    <publisher>Bt Bound</publisher>
    <price>20</price>
  </book>
  …

Query:
for $b in doc("biditems.xml")//book
let $p := $b/price/text(),
    $t := $b/title
where $p < 30
return <Inexpensive>{ $t }</Inexpensive>
Two Computation Paradigms
• Automata-based [yfilter02, x-scan01, xsm02, xsq03, xpush03, …]
• Algebraic [niagara00, …]
This project intends to integrate both paradigms into one.
Automata Paradigm
Query:
for $b in stream("biditems.xml")//book
let $p := $b/price/text(),
    $t := $b/title
where $p < 30
return <Inexpensive>{ $t }</Inexpensive>
[Figure: automaton for the query, with states 1 to 5, transitions labeled *, book, title, price, and text(), and accepting states annotated //book, //book/title, and //book/price/text()]
• Auxiliary structures are needed for:
  • Buffering data
  • Evaluating predicates
  • Restructuring buffered data
  • …
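The pattern-retrieval part of the automata paradigm can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token encoding, the transition table (one plausible reading of the figure's states and labels), and the names `TOKENS`, `TRANSITIONS`, `step`, and `run` are all assumptions. A stack of active state sets follows the nesting of open elements; the auxiliary buffering and predicate structures listed on the slide are left out.

```python
# Hypothetical token encoding: ("start", name), ("text", value), ("end", name).
TOKENS = [
    ("start", "biditems"),
    ("start", "book"),
    ("start", "title"), ("text", "Dream Catcher"), ("end", "title"),
    ("start", "price"), ("text", "20"), ("end", "price"),
    ("end", "book"),
    ("end", "biditems"),
]

# Assumed transition table: state -> {label: next state}.
# "*" matches any element name, which models the "//" axis.
TRANSITIONS = {
    1: {"*": 1, "book": 2},
    2: {"title": 4, "price": 3},
    3: {"text()": 5},
}
ACCEPTING = {2: "//book", 4: "//book/title", 5: "//book/price/text()"}


def step(states, label):
    """Return the set of states reachable from `states` on `label`."""
    reached = set()
    for state in states:
        edges = TRANSITIONS.get(state, {})
        if label in edges:
            reached.add(edges[label])
        if "*" in edges and label != "text()":  # "*" matches elements only
            reached.add(edges["*"])
    return reached


def run(tokens):
    """Consume tokens one by one and report (pattern, token value) matches."""
    stack = [{1}]  # one set of active states per open element
    matches = []
    for kind, value in tokens:
        if kind == "start":
            active = step(stack[-1], value)
            stack.append(active)
        elif kind == "text":
            active = step(stack[-1], "text()")
        else:  # "end": go back to the states of the parent element
            stack.pop()
            continue
        matches.extend((ACCEPTING[s], value) for s in sorted(active) if s in ACCEPTING)
    return matches
```

Calling `run(TOKENS)` reports where each pattern fires; filtering on `$p < 30` and building the `<Inexpensive>` result would need the extra structures the slide mentions.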
Algebraic Computation
Query:
for $b in doc("biditems.xml")//book
let $p := $b/price/text(),
    $t := $b/title
where $p < 30
return <Inexpensive>{ $t }</Inexpensive>
[Figure: XML tree with three book nodes; one book is expanded into title, author (last, first), publisher, and price, each ending in Text leaves; Navigate //book, /title is marked on the tree]
Plan without push-down (top operator first):
  Tagger
  Select price < 30
  Navigate //book, title
  Navigate //book, price
Plan with selection push-down enabled (top operator first):
  Tagger
  Navigate //book, title
  Select price < 30
  Navigate //book, price
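The effect of selection push-down can be sketched with a few tuple-at-a-time operators. This is only an illustration under stated assumptions: books are already materialized as dictionaries, the second book is invented test data, and the operator functions `navigate`, `select`, and `tagger` are hypothetical stand-ins for the plan operators named on the slide.

```python
# Assumed input: each book is already available as a dict of serialized children.
BOOKS = [
    {"title": "<title>Dream Catcher</title>", "price": "20"},
    {"title": "<title>Costly Volume</title>", "price": "45"},  # invented second book
]


def navigate(tuples, child, var, stats):
    """Bind `var` to child `child` of $b; count how often the operator runs."""
    for t in tuples:
        stats[child] = stats.get(child, 0) + 1
        yield {**t, var: t["$b"][child]}


def select(tuples, predicate):
    """Keep only the tuples that satisfy the predicate."""
    return (t for t in tuples if predicate(t))


def tagger(tuples, tag, var):
    """Wrap the value bound to `var` in a new element."""
    return [f"<{tag}>{t[var]}</{tag}>" for t in tuples]


def cheap(t):
    return float(t["$p"]) < 30


def plan_without_pushdown(books, stats):
    tuples = ({"$b": b} for b in books)
    tuples = navigate(tuples, "price", "$p", stats)
    tuples = navigate(tuples, "title", "$t", stats)
    tuples = select(tuples, cheap)
    return tagger(tuples, "Inexpensive", "$t")


def plan_with_pushdown(books, stats):
    tuples = ({"$b": b} for b in books)
    tuples = navigate(tuples, "price", "$p", stats)
    tuples = select(tuples, cheap)          # filter before navigating to title
    tuples = navigate(tuples, "title", "$t", stats)
    return tagger(tuples, "Inexpensive", "$t")
```

Both plans return the same answer; the `stats` dictionary shows that the pushed-down plan navigates to `title` only for books that pass the price filter.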
Observations
• Automata paradigm
  • Well suited to, and long studied for, pattern retrieval on tokens
  • Needs patches for complex filtering and restructuring
• Algebraic paradigm
  • Well suited to, and long studied for, expressing and optimizing query plans on sets of tuples
  • Does not yet accommodate tokenized inputs
Each paradigm has deficiencies; the two complement each other.
Research Challenges
• How to integrate the two models?
• How to optimize a query within the integrated query model?
Raindrop Approach: Uniform Modeling in an Algebraic Framework
Uniform Algebraic Plan
[Figure: XML data stream → Algebraic Plan → Query answer]
Uniform Algebraic Plan
[Figure: XML data stream → Token-based plan (automata plan) → Tuple stream → Tuple-based plan → Query answer]
Modeling the Automata in Algebraic Plan: Black Box [xscan] vs. White Box
Query:
for $b in stream("biditems.xml")//book
let $p := $b/price/text(),
    $t := $b/title
where $p < 30
return <Inexpensive>{ $t }</Inexpensive>
• Black box: a single Xscan operator computes $b := //book, $p := $b/price, $t := $b/title
• White box: SJoin //book over Extract //book/price and Extract //book/title
A Unified Process at the Logical View
Query:
for $b in doc("biditems.xml")//book
let $p := $b/price/text(),
    $t := $b/title
where $p < 30
return <Inexpensive>{ $t }</Inexpensive>
[Figure: the query plan is divided into a tuple-based plan on top of a token-based plan (automata plan)]
A Unified Process at the Logical View
Query:
for $b in doc("biditems.xml")//book
let $p := $b/price/text(),
    $t := $b/title
where $p < 30
return <Inexpensive>{ $t }</Inexpensive>
[Figure: the token-based part is filled in as SJoin //book over Extract $p, //book/price and Extract $t, //book/title; the tuple-based plan sits above it]
A Unified Process at the Logical View
Query:
for $b in doc("biditems.xml")//book
let $p := $b/price/text(),
    $t := $b/title
where $p < 30
return <Inexpensive>{ $t }</Inexpensive>
[Figure: complete plan, top operator first: Navigate //book, //book/title; Select //book/price < 30; SJoin //book over Extract //book/price and Extract //book/title]
The Algebra Core
[Figure: operators grouped as Relational-like and XML-Specific (e.g., SJ)]
Extract Operator
Extract //book/title
[Figure: automaton with transitions labeled *, book, and title]
Input:
<bib>
  <book>
    <title>Dream Catcher</title>
    …
  </book>
  …
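A rough sketch of what an Extract operator does with the token stream is given below. It assumes a simplified token encoding and a path written as a list of element names that must match the innermost open elements (so `["book", "title"]` stands for //book/title); the function name `extract` and its signature are hypothetical, and recursive data (a matching element nested inside another match) is not handled.

```python
def extract(tokens, path):
    """Yield the buffered tokens of every element whose enclosing names end with `path`."""
    open_names = []        # names of the currently open elements
    buffer = None          # tokens of the element being collected, if any
    match_depth = None     # nesting depth at which the current match started
    for kind, value in tokens:
        if kind == "start":
            open_names.append(value)
            if buffer is None and open_names[-len(path):] == path:
                buffer = []
                match_depth = len(open_names)
        if buffer is not None:
            buffer.append((kind, value))
        if kind == "end":
            if buffer is not None and len(open_names) == match_depth:
                yield buffer           # the matched element is complete
                buffer = None
            open_names.pop()


# Hypothetical tokenization of the slide's input fragment.
BIB_TOKENS = [
    ("start", "bib"),
    ("start", "book"),
    ("start", "title"), ("text", "Dream Catcher"), ("end", "title"),
    ("end", "book"),
    ("end", "bib"),
]
```

Iterating over `extract(BIB_TOKENS, ["book", "title"])` produces one buffered token list per matching title element, which downstream operators can treat as a unit.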
Structural Join Operator
Query:
for $b in doc("biditems.xml")//book
let $p := $b/price/text(),
    $t := $b/title
where $p < 30
return <Inexpensive>{ $t }</Inexpensive>
Plan: SJoin //book over Extract //book/title and Extract //book/price
[Figure: automaton with states 1 to 4 and transitions labeled *, book, title, and price]
Input:
<biditems>
  <book>
    <title>Dream Catcher</title>
    …
  </book>
  …
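One way to picture the structural join is as an operator that, at each closing book tag, combines whatever its Extract inputs collected inside that book. The sketch below makes several assumptions: a simplified token encoding, text values instead of full elements, a second book invented for illustration, and a hypothetical function `structural_join`. Books nested inside books are not handled.

```python
def structural_join(tokens, anchor="book", branches=("title", "price")):
    """Emit one tuple per `anchor` element, joining the branch values found inside it."""
    collected = {name: [] for name in branches}   # one buffer per Extract branch
    path = []                                     # names of the currently open elements
    for kind, value in tokens:
        if kind == "start":
            path.append(value)
        elif kind == "text":
            # Text directly inside a branch element that is a child of the anchor.
            if len(path) >= 2 and path[-2] == anchor and path[-1] in collected:
                collected[path[-1]].append(value.strip())
        else:  # "end"
            path.pop()
            if value == anchor:
                # The anchor closed: everything buffered belongs to this anchor.
                yield {name: list(values) for name, values in collected.items()}
                for values in collected.values():
                    values.clear()


# Hypothetical token stream; the second book (without a price) is invented.
BID_TOKENS = [
    ("start", "biditems"),
    ("start", "book"),
    ("start", "title"), ("text", " Dream Catcher "), ("end", "title"),
    ("start", "price"), ("text", " 20 "), ("end", "price"),
    ("end", "book"),
    ("start", "book"),
    ("start", "title"), ("text", "Second Book"), ("end", "title"),
    ("end", "book"),
    ("end", "biditems"),
]
```

The join needs no value comparison: the closing tag of the anchor element tells it which buffered titles and prices belong together.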
In or Out?
[Figure: XML data stream → Token-based plan (automata plan) → Tuple stream → Tuple-based plan → Query answer; pattern retrieval can be placed in either part]
Plan Alternatives
The pull-out plan (top operator first):
  Tagger
  Navigate book/title
  Select price < 30
  Navigate /price
  Extract //book
The push-in plan (top operator first):
  Tagger
  Select price < 30
  SJoin //book
  Extract //book/title, Extract //book/price
Pattern Retrieval Alternatives
• In automata (/title, /price)
  [Figure: automaton with states 1 to 4 and transitions labeled *, book, title, and price, with SJ on top; buffers hold <book>…</book>, <title>…</title>, and <price>…</price> fragments]
• Out of automata (/title, /price)
  [Figure: automaton with states 1 and 2 and transitions labeled * and book; buffers hold <book>…</book> and <title>…</title> fragments]
Sample input (the figure marks tokens t2 and t10 on the stream):
<book year="2001">
  <title>Dream Catcher</title>
  <author><last>King</last><first>S.</first></author>
  <publisher>Bt Bound</publisher>
  <price>20</price>
</book>
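The logical difference between the two alternatives can be sketched with Python's standard `xml.etree.ElementTree.iterparse`. This is an assumption-laden illustration, not the Raindrop implementation: `iterparse` builds element trees in both cases, so it shows only where title and price are recognized, not the performance trade-off; the second book and the function names are invented.

```python
import io
import xml.etree.ElementTree as ET

# Sample input; the second book (without a price) is invented for illustration.
XML = b"""<biditems>
  <book year="2001"><title>Dream Catcher</title><price>20</price></book>
  <book><title>No Price Listed</title></book>
</biditems>"""


def pull_out_plan(xml_bytes):
    """Stream side only extracts whole books; title and price are navigated afterwards."""
    rows = []
    for _, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "book":                   # Extract //book
            title = elem.findtext("title")       # Navigate book/title
            price = elem.findtext("price")       # Navigate /price
            rows.append((title, price))
    return rows


def push_in_plan(xml_bytes):
    """Stream side recognizes title and price itself and joins them per book."""
    rows, path, current = [], [], {}
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            continue
        if path[-2:] in (["book", "title"], ["book", "price"]):
            current[elem.tag] = elem.text        # Extract //book/title, //book/price
        elif elem.tag == "book":                 # SJoin //book
            rows.append((current.get("title"), current.get("price")))
            current = {}
        path.pop()
    return rows
```

Both functions return the same (title, price) pairs; which placement is cheaper depends on factors such as selectivity and the number of patterns, which is what the experiments on the following slides vary.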
Experiment
[Figure: two plots, one for selectivity = 5% and one for selectivity = 90%]
Camp 1: Complete Automata Model [XSQ, XSM, XPush]
Query:
for $x in $R/a return
  for $y in $x/b return <res>{ $y, $x }</res>
[Figure: automaton for this query. States are labeled (0,0,0), (1,0,0), (2,1,0), (2,2,1), (2,2,2), (2,1,3), (1,1,3), (1,2,2), (1,1,0), and (1,2,1); transitions carry low-level guard|action labels such as *r=<b> | w(x,<b>), r++]
Camp 1: Complete Automata Model [XSQ, XSM, XPush]
• All details are presented at the same (low) level
• Hard to understand
• Not suitable for optimizing at different levels
• Little work has studied automata as a query processing paradigm
Camp 2: Automata-Algebra Loosely Coupled Model [Tukwila, YFilter]
• Fixed interface for automata computation (all pattern retrieval pushed down)
• No opportunity to push computation into, or pull it out of, the automata
• Bloated, black-box operator
• Algebraic rewriting cannot be applied to optimize the operator's internals
[Figure: a single Automata Plan operator computing $b := //book, $p := //book/price, $t := //book/title and outputting $b, $p, $t]
Contributions
• Combining automata and algebra leads to a powerful query processing model
• Modeling:
  • Uniform, simple logical view: better understandability
• Optimization:
  • Uniform rewriting: more optimization opportunities (e.g., push-in/pull-out)
  • The need for optimization is verified by experiments
Project website: http://davis.wpi.edu/dsrg/raindrop/
Email: suhong@cs.wpi.edu
Experiment 2
[Figure: two plots, one for number of patterns = 2 and one for number of patterns = 20]