Querying Streaming XML Data
Layout of the presentation • Introduction • Common Problems faced • Solution proposed • Basic Building blocks of the solution • How to build up a solution to a given query • Features of the system
Streaming XML • XML – standard for information exchange. • Some XML documents only available in streaming format. • Streaming is like reading data from a tape drive. • Used in Stock Market, News, Network Statistics. • Predecessor systems used to filter documents.
Structure of an XPath Query • Consists of a Location path and an Output Expression (name). • Location path consists of closure axis(//), node test (book) and predicate (year>2000). • e.g. //book[year>2000]/name
Features of our Approach • Efficient • Easy to understand design. • Design of BPDT is tricky
Common Problems faced
<root>
  <pub>
    <book id="1">
      <price> 12.00 </price>
      <name> First </name>
      <author> A </author>
      <price type="discount"> 10.00 </price>
    </book>
    <book id="2">
      <price> 14.00 </price>
      <name> Second </name>
      <author> A </author>
      <author> B </author>
      <price type="discount"> 12.00 </price>
    </book>
    <year> 2002 </year>
  </pub>
</root>
Query: /pub[year=2002]/book[price<11]/author
Callouts on the slide, in the order they are revealed:
• Element satisfies the path
• Failure??
• Test passed. But year=2002?
• Buffer both A & B
• Failed price<11. Remove
• Test passed. Output
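The buffering problem above can be reproduced with a small hand-coded SAX handler. This is only a sketch under stated assumptions: the class name `AuthorQuery` and its fields are hypothetical, the element paths are hard-coded for this one query (including the `<root>` wrapper), and it is not the transducer-based design presented in the talk. For simplicity it holds candidate authors until `</pub>` instead of releasing them as soon as `year` is seen.

```python
import xml.sax

XML = b"""<root><pub>
  <book id="1"><price>12.00</price><name>First</name><author>A</author>
    <price type="discount">10.00</price></book>
  <book id="2"><price>14.00</price><name>Second</name><author>A</author>
    <author>B</author><price type="discount">12.00</price></book>
  <year>2002</year>
</pub></root>"""


class AuthorQuery(xml.sax.ContentHandler):
    """Hypothetical evaluator for /pub[year=2002]/book[price<11]/author."""

    def __init__(self):
        super().__init__()
        self.path = []         # names of the currently open elements
        self.text = []         # character data of the innermost element
        self.book_buffer = []  # authors seen in the current book
        self.book_ok = False   # has some price < 11 been seen in this book?
        self.pub_buffer = []   # authors of books that passed price < 11
        self.year_ok = False   # has year = 2002 been seen in this pub?
        self.results = []

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if self.path == ["root", "pub"]:
            self.pub_buffer, self.year_ok = [], False
        elif self.path == ["root", "pub", "book"]:
            self.book_buffer, self.book_ok = [], False

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["root", "pub", "book", "author"]:
            self.book_buffer.append(value)      # predicates unknown: buffer
        elif self.path == ["root", "pub", "book", "price"]:
            if float(value) < 11:
                self.book_ok = True
        elif self.path == ["root", "pub", "year"]:
            if float(value) == 2002:
                self.year_ok = True
        elif self.path == ["root", "pub", "book"]:
            if self.book_ok:                    # book passed: hand authors up
                self.pub_buffer.extend(self.book_buffer)
            self.book_buffer = []               # otherwise they are dropped
        elif self.path == ["root", "pub"]:
            if self.year_ok:                    # pub passed: output
                self.results.extend(self.pub_buffer)
            self.pub_buffer = []
        self.path.pop()
        self.text = []


handler = AuthorQuery()
xml.sax.parseString(XML, handler)
print(handler.results)  # ['A']
```

Author A of book 1 is buffered twice over: first until the discount price arrives, then until `year` arrives. Authors A and B of book 2 are buffered and then dropped when the book ends without a price below 11.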
Problems caused by closure axis
<root>
  <pub>
    <book>
      <name> X </name>
      <author> A </author>
    </book>
    <book>
      <name> Y </name>
      <pub>
        <book>
          <name> Z </name>
          <author> B </author>
        </book>
        <year> 1999 </year>
      </pub>
    </book>
    <year> 2002 </year>
  </pub>
</root>
Query: //pub[year=2002]//book[author]//name
Callouts on the slide:
• Fails year=2002 (inner pub, year 1999)
• Passes year=2002 (outer pub)
Problems caused by closure axis
<root>
  <pub>
    <book>
      <name> X </name>
      <author> A </author>
    </book>
    <book>
      <name> Y </name>
      <author> B </author>
      <pub>
        <book>
          <name> Z </name>
          <author> B </author>
        </book>
        <year> 1999 </year>
      </pub>
    </book>
    <year> 2002 </year>
  </pub>
</root>
Query: //pub[year=2002]//book[author]//name
Let's add an author to the second book. Result?
Callouts on the slide:
• Fails year=2002 (inner pub, year 1999)
• Passes year=2002 (outer pub)
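To check what the closure-axis query should return on this document, a non-streaming reference evaluation is useful. The sketch below loads the whole tree with `xml.etree.ElementTree` and walks it directly, so it sidesteps the streaming problem entirely. The function name `evaluate` is hypothetical, and the traversal is written specifically for this one query.

```python
import xml.etree.ElementTree as ET

DOC = """<root><pub>
  <book><name>X</name><author>A</author></book>
  <book><name>Y</name><author>B</author>
    <pub>
      <book><name>Z</name><author>B</author></book>
      <year>1999</year>
    </pub>
  </book>
  <year>2002</year>
</pub></root>"""


def evaluate(root):
    """Names selected by //pub[year=2002]//book[author]//name, in document order."""
    selected = set()
    for pub in root.iter("pub"):
        # Predicate [year=2002]: some child <year> of this pub equals 2002.
        if not any(y.text.strip() == "2002" for y in pub.findall("year")):
            continue
        for book in pub.iter("book"):          # descendant books of this pub
            if book.find("author") is None:    # predicate [author]
                continue
            # Every descendant <name> of a qualifying book is a result node.
            selected.update(id(n) for n in book.iter("name"))
    # Report each selected node once, in document order.
    return [n.text.strip() for n in root.iter("name") if id(n) in selected]


print(evaluate(ET.fromstring(DOC)))  # ['X', 'Y', 'Z']
```

Z is selected even though its nearest enclosing pub has year 1999, because the outer pub also matches `//pub` and passes the predicate. Without the added author in the second book, the same function returns X and Z only.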
Handling XML Stream • Input – well-formed XML stream. • Use SAX API to parse XML. • Events belong to • Begin = {(a, attrs, d)} • End = {(/a, d)} • Text = {(a, text(), d)} • XML Stream: {e1, e2, …, ei, …} | ei ∈ Begin ∪ End ∪ Text
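A minimal sketch of how the three event kinds could be produced with Python's standard `xml.sax` module. The class name `EventCollector` is hypothetical, and the depth convention (outermost element at depth 1) is an assumption, since the slide does not state where counting starts.

```python
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Turns SAX callbacks into Begin / Text / End tuples tagged with depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # depth of the innermost open element (assumed 1-based)
        self.open = []    # names of the currently open elements
        self.events = []

    def startElement(self, name, attrs):
        self.depth += 1
        self.open.append(name)
        # Begin event: (a, attrs, d)
        self.events.append((name, dict(attrs.items()), self.depth))

    def characters(self, content):
        # Text event: (a, text, d). Whitespace-only chunks are skipped.
        # A parser may deliver long text in several chunks; each non-blank
        # chunk becomes its own event in this sketch.
        if content.strip():
            self.events.append((self.open[-1], content.strip(), self.depth))

    def endElement(self, name):
        # End event: (/a, d)
        self.events.append(("/" + name, self.depth))
        self.open.pop()
        self.depth -= 1


collector = EventCollector()
xml.sax.parseString(b'<pub><book id="1"><name>First</name></book></pub>', collector)
for event in collector.events:
    print(event)
```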
Grammar for XPath Queries • Q → N+ [/O] • N → [/ | //] tag [F] • F → [FO [OP constant]] • FO → @attribute | tag [@attribute] | text() • O → @attribute | text() • OP → > | ≥ | = | < | ≤ | ≠ | contains • XPath query of the form N1N2…Nn/O • Cannot handle reverse axes or positional functions.
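As an illustration of the N1N2…Nn/O shape, the sketch below splits a query string into location steps and an optional output expression using regular expressions. The names `parse_query`, `STEP`, and `OUTPUT` are hypothetical. The predicate is kept as an uninterpreted string, and predicates are assumed not to contain nested square brackets.

```python
import re

# One location step N: axis, tag, optional predicate in square brackets.
STEP = re.compile(r"(//|/)([A-Za-z_][\w.-]*)(?:\[([^\]]*)\])?")
# Optional trailing output expression O: /@attribute or /text().
OUTPUT = re.compile(r"/(@[A-Za-z_][\w.-]*|text\(\))$")


def parse_query(query):
    """Return (steps, output) for a query of the form N1 N2 ... Nn [/O]."""
    output = None
    m = OUTPUT.search(query)
    if m:
        output = m.group(1)
        query = query[:m.start()]       # strip the output expression
    steps, pos = [], 0
    while pos < len(query):
        m = STEP.match(query, pos)
        if not m:
            raise ValueError(f"cannot parse location step at offset {pos}")
        steps.append({"axis": m.group(1), "tag": m.group(2), "predicate": m.group(3)})
        pos = m.end()
    if not steps:
        raise ValueError("a query needs at least one location step")
    return steps, output


steps, output = parse_query("//pub[year>2000]//book[author]//name/text()")
for step in steps:
    print(step)
print(output)
```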
Solution to Query Query: /pub[year=2002]/book[price<11]/author PDA PDT
Basic PushDown Transducer (BPDT) • Similar to PushDown Automata • Actions defined on Transition Arcs • Finite set of states • A Start state • A set of final states • Set of input symbols • Set of Stack symbols
Building a BPDT Query: /pub[year>2000]/book[author]/name/text() Consider location step: /book[author] • Book – Author: Buffer for future: Begin event of Author. • Book – Author: Remove from Buffer: End event of Book. • Book – Author: Output result if predicates true: Begin event of Author.
Basic Building Blocks XPath Expression: /tag[child]
Buffer Operations needed • Enqueue(x): Add x to the end of the queue. • Clear(): Removes all items from the queue. • Flush(): Outputs all items in the queue in FIFO order. • Upload(): Moves all items to the end of the queue of a parent BPDT. • No Dequeue operation needed.
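The four operations map naturally onto a FIFO queue. A minimal sketch, assuming each buffer knows its parent buffer and an output sink; the class name `BpdtBuffer` and the `parent` and `sink` parameters are hypothetical, not taken from the talk.

```python
from collections import deque


class BpdtBuffer:
    """FIFO buffer supporting Enqueue, Clear, Flush, and Upload only."""

    def __init__(self, parent=None, sink=None):
        self.items = deque()
        self.parent = parent                          # buffer of the parent BPDT
        self.sink = sink if sink is not None else []  # where flushed items go

    def enqueue(self, x):
        """Add x to the end of the queue."""
        self.items.append(x)

    def clear(self):
        """Remove all items: the candidates turned out not to be results."""
        self.items.clear()

    def flush(self):
        """Output all items in FIFO order and empty the queue."""
        self.sink.extend(self.items)
        self.items.clear()

    def upload(self):
        """Move all items to the end of the parent's queue."""
        if self.parent is None:
            raise ValueError("upload needs a parent buffer")
        self.parent.items.extend(self.items)
        self.items.clear()


output = []
parent = BpdtBuffer(sink=output)
child = BpdtBuffer(parent=parent)
child.enqueue("A")
child.enqueue("B")
child.upload()    # child predicates held; the parent's are still undecided
parent.flush()    # parent predicates held as well
print(output)     # ['A', 'B']
```

Items are only ever removed in bulk, which is why no single-item dequeue appears in the interface.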
Basic Building Blocks XPath Expression: /tag[@attr=val]
Basic Building Blocks XPath Expression: /tag[text()=val]
Basic Building Blocks XPath Expression: /tag[child@attr=val]
Basic Building Blocks XPath Expression: /tag[child=val]
A sample BPDT Query: /pub[year>2000]
Building a solution HPDT for Query: //pub[year>2000]//book[author]//name/text()
HPDT Structure • Each BPDT in HPDT has: • Position • Buffer: potential results • Stack: stack of elements (SAX events) • Depth vector • BPDT position (l, k): l = depth of the BPDT in the HPDT, k = sequence number from right to left • BPDT at position (i-1, k) has a right child BPDT at position (i, 2k), connected to the NA state • BPDT at position (i-1, k) has a left child BPDT at position (i, 2k+1), connected to the True state • BPDT at position (i, 2^i - 1): all predicates in higher-level BPDTs evaluate to true
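The numbering scheme can be checked with a few lines of arithmetic. This sketch assumes the root BPDT sits at position (0, 0), which the slide does not state; the function names are hypothetical.

```python
def right_child(level, k):
    """Child attached to the NA state: position (level + 1, 2k)."""
    return (level + 1, 2 * k)


def left_child(level, k):
    """Child attached to the True state: position (level + 1, 2k + 1)."""
    return (level + 1, 2 * k + 1)


# Follow only True-state (left) children from the assumed root (0, 0).
position = (0, 0)
left_path = []
for _ in range(3):
    position = left_child(*position)
    left_path.append(position)

print(left_path)  # [(1, 1), (2, 3), (3, 7)]
```

Each position on this path has k = 2^i - 1, matching the rule that such a BPDT is reached only when every predicate above it is already known to be true.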
Example Query
<root>
  <pub>
    <book>
      <name> X </name>
      <author> A </author>
    </book>
    <book>
      <name> Y </name>
      <pub>
        <book>
          <name> Z </name>
          <author> B </author>
        </book>
        <year> 1999 </year>
      </pub>
    </book>
    <year> 2002 </year>
  </pub>
</root>
Query: //pub[year=2002]//book[author]//name
3 paths from $1 to $14
Reference • Feng Peng and Sudarshan S. Chawathe. XPath Queries on Streaming Data. In SIGMOD 2003.
Thank You • Questions?