The Joy of SAX
Leonidas Fegaras, University of Texas at Arlington, fegaras@cse.uta.edu, http://lambda.uta.edu/
Design Goals
• Want to build an XQuery engine based entirely on SAX handlers
  • all the way from the point where the input documents are read by the SAX parser to the point where the query results are printed
• This engine should consist of operators that
  • naturally reflect the syntactic structures of XQuery and
  • can be composed into pipelines in the same way the corresponding XQuery structures are composed to form complex queries
• The XQuery translation should be concise, clean, and completely compositional
• Even though it cannot compete with transducers for simple XPaths, it should not sacrifice much performance in terms of memory and computational overhead
• But ... it should be able to beat transducers for complex predicates and deeply nested queries
Pull-Based Approach
• Based on iterators:

    abstract class Iterator {
        Tuple current;                          // set by subclasses as they advance
        Tuple current () { return current; }    // current tuple from stream
        abstract void open ();                  // open the stream iterator
        abstract Tuple next ();                 // get the next tuple from stream
        abstract boolean eos ();                // is this the end of stream?
    }

• An iterator reads data from the input stream(s) and delivers data to the output stream
• Connected through pipelines
  • an iterator (the producer) delivers a stream element to the output only when requested by the next operator in the pipeline (the consumer)
  • to deliver one stream element to the output, the producer becomes a consumer by requesting from the previous iterator as many elements as necessary to produce a single element, and so on, until the end of stream
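The pull protocol above can be sketched as a small runnable pipeline. This is only an illustration under assumptions: tuples are simplified to strings, open() is assumed to position an iterator on its first element, and the names PullIterator, ListSource, Filter, and PullDemo are hypothetical, not part of the engine described in the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Assumed contract: open() positions on the first element, next() advances. */
interface PullIterator<T> {
    void open();
    T current();
    T next();        // advance; returns the new current element, or null at end of stream
    boolean eos();
}

/** Hypothetical leaf producer over an in-memory list. */
class ListSource<T> implements PullIterator<T> {
    private final List<T> items;
    private int pos;
    ListSource(List<T> items) { this.items = items; }
    public void open() { pos = 0; }
    public T current() { return eos() ? null : items.get(pos); }
    public T next() { pos++; return current(); }
    public boolean eos() { return pos >= items.size(); }
}

/** Hypothetical operator: a producer that becomes a consumer of its input. */
class Filter<T> implements PullIterator<T> {
    private final PullIterator<T> input;
    private final Predicate<T> keep;
    Filter(PullIterator<T> input, Predicate<T> keep) { this.input = input; this.keep = keep; }
    // Request as many input elements as needed to reach one that qualifies.
    private void skip() { while (!input.eos() && !keep.test(input.current())) input.next(); }
    public void open() { input.open(); skip(); }
    public T current() { return input.current(); }
    public T next() { input.next(); skip(); return current(); }
    public boolean eos() { return input.eos(); }
}

class PullDemo {
    /** The final consumer: pulls until the end of stream. */
    static <T> List<T> drain(PullIterator<T> it) {
        List<T> out = new ArrayList<>();
        for (it.open(); !it.eos(); it.next()) out.add(it.current());
        return out;
    }
    public static void main(String[] args) {
        PullIterator<String> src = new ListSource<>(Arrays.asList("A", "B", "A"));
        System.out.println(drain(new Filter<String>(src, "A"::equals)));
    }
}
```

Nothing is computed until drain asks for an element, which is the defining property of the pull-based style.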
What is a Tuple?
• A vector of components:
  • one component for each scoped for-variable
  • has a fixed size at each point in a pipeline (known at compile time)
  • doesn't need to include the variable names
• A tuple component is the unit of communication between iterators
• Passing fully constructed XML elements through iterators is a bad idea for a compositional translation
  • initially, we would have to pass the entire document as a tree!
• The unit of communication should be
  • a single event or
  • a fragment (a reference to an XML element in a document)
    • this requires a structural index for fragments
• A proposal for a pull parser: XML Pull Parser 3 www.xmlpull.org
• BEA/XQRL token stream & token iterators
Event-Oriented Approach
• A tuple in an event-oriented approach consists of a sequence of events, ending with an End-Of-Tuple (EOT) event
• Single-node event sequence
  • depth-first unfolding of a single XML node
• A tuple with 3 components:

    <start "A"> <start "B"> <text "x"> <end "B"> <start "B"> <text "y"> <end "B"> <end "A">
    <text "z">
    <start "A"> <start "B"> <text "w"> <end "B"> <end "A">
    <EOT>
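As a rough illustration of the depth-first unfolding, the sketch below uses the JDK's SAX parser to turn each XML component into start/text/end events and appends an EOT marker. The class EventUnfold, the string rendering of events, and the rule that a component not starting with "<" is a bare text node are all assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EventUnfold {
    /** Depth-first unfolding of one XML node into start/text/end events. */
    static List<String> unfold(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // SAX may deliver one text run in several chunks, so merge them first.
            private final StringBuilder text = new StringBuilder();
            private void flushText() {
                if (text.length() > 0) {
                    events.add("<text \"" + text + "\">");
                    text.setLength(0);
                }
            }
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                flushText();
                events.add("<start \"" + qName + "\">");
            }
            @Override public void endElement(String uri, String local, String qName) {
                flushText();
                events.add("<end \"" + qName + "\">");
            }
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML: " + xml, e);
        }
        return events;
    }

    /** A tuple: the events of each component in order, closed by EOT. */
    static List<String> tuple(String... components) {
        List<String> events = new ArrayList<>();
        for (String c : components) {
            if (c.startsWith("<")) events.addAll(unfold(c));   // element component
            else events.add("<text \"" + c + "\">");           // assumed: bare text component
        }
        events.add("<EOT>");
        return events;
    }

    public static void main(String[] args) {
        for (String e : tuple("<A><B>x</B><B>y</B></A>", "z", "<A><B>w</B></A>"))
            System.out.println(e);
    }
}
```

The main method rebuilds the three-component tuple shown on the slide.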
Element vs Event Granularity

Stream unit is a single event:

    abstract class Event {}
    class Start extends Event { String tag; }
    class End extends Event { String tag; }
    class Text extends Event { String text; }
    class EOT extends Event {}

    // Here Iterator is taken to stream Events rather than Tuples.
    class Child extends Iterator {
        Iterator input;
        String tagname;
        boolean keep = false;   // are we inside a matching child?
        int nest = 0;           // current element depth
        Event next () {
            while (!input.eos()) {
                current = input.current();
                boolean emit = keep;
                if (current instanceof Start) {
                    if (nest++ == 1)    // a child of the top-level element opens
                        keep = ((Start) current).tag.equals(tagname);
                    emit = keep;
                } else if (current instanceof End)
                    if (--nest == 1)    // the child closes: emit its end tag, then stop
                        keep = false;   // keeping, so text between children is dropped
                input.next();
                if (emit) return current;
            }
            return null;    // end of stream
        }
    }

Stream unit is a DOM-like element:

    abstract class Element {}
    class Node extends Element { String tag; Element[] sequence; }
    class Text extends Element { String text; }
    class Tuple {
        Element[] components;
        Tuple ( Element... components ) { this.components = components; }
        Element get ( int i ) { return components[i]; }
    }

    class Child extends Iterator {
        Iterator input;
        String tagname;
        int index = 0;    // next child of the current input node to examine
        Tuple next () {
            while (!input.eos()) {
                if (input.current().get(0) instanceof Node) {
                    Node ce = (Node) input.current().get(0);
                    if (index < ce.sequence.length) {
                        if (ce.sequence[index] instanceof Node
                            && ((Node) ce.sequence[index]).tag.equals(tagname)) {
                            current = new Tuple(ce.sequence[index++]);
                            return current;
                        } else index++;
                    } else { index = 0; input.next(); }
                } else { index = 0; input.next(); }
            }
            return null;    // end of stream
        }
    }
For-Loop using Iterators

Need a stepper for a for-loop:

    class Step extends Iterator {
        boolean first;
        Tuple tuple;
        void open () { first = true; current = tuple; }
        Tuple next () { first = false; return current; }
        void set ( Tuple t ) { tuple = t; }
        boolean eos () { return !first; }
    }

    class Loop extends Iterator {
        Iterator left;
        Step right_step;
        Iterator right;
        Tuple next () {
            if (!left.eos()) {
                while (right.eos()) {
                    left.next();
                    if (left.eos()) return null;      // no more outer tuples
                    right_step.set(left.current());   // bind the loop variable
                    right.open();                     // restart the inner pipeline
                }
                current = left.current().append(right.current());
                right.next();
                return current;
            }
            return null;    // end of stream
        }
    }

[Figure: Loop with its left input and its right pipeline; right_step (a Step) is given each left tuple through set]

Not a good idea if right reads a document!
Let-Bindings using Iterators
Let-bindings are harder to implement:
• the let-value may be a sequence
• one producer, many consumers
• we do not want to materialize the let-value in memory

[Figure: a queue (tail, head) shared by the consumers; the backlog lies between the fastest consumer and the slowest consumer]

Some cases are hopeless: let $v:=e return ($v,$v)
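One way to picture the one-producer, many-consumers queue is the sketch below. It is an assumption about how such a queue could work, not the implementation behind the slides: SharedStream is a hypothetical class that buffers only the elements that the slowest consumer has not yet read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical shared queue: the let-value is produced once and read by several consumers. */
class SharedStream<T> {
    private final Iterator<T> producer;                 // evaluates the let-value lazily
    private final List<T> buffer = new ArrayList<>();   // the backlog
    private long base = 0;                              // stream position of buffer.get(0)
    private final long[] pos;                           // next position to read, per consumer

    SharedStream(Iterator<T> producer, int consumers) {
        this.producer = producer;
        this.pos = new long[consumers];
    }

    boolean hasNext(int consumer) {
        return pos[consumer] < base + buffer.size() || producer.hasNext();
    }

    T next(int consumer) {
        // Only the fastest consumer ever pulls from the producer.
        if (pos[consumer] == base + buffer.size()) buffer.add(producer.next());
        T item = buffer.get((int) (pos[consumer] - base));
        pos[consumer]++;
        trim();
        return item;
    }

    /** Drop the elements that every consumer has already read. */
    private void trim() {
        long slowest = pos[0];
        for (long p : pos) slowest = Math.min(slowest, p);
        while (base < slowest) { buffer.remove(0); base++; }
    }

    int backlog() { return buffer.size(); }

    public static void main(String[] args) {
        SharedStream<Integer> v = new SharedStream<>(Arrays.asList(1, 2, 3).iterator(), 2);
        while (v.hasNext(0)) v.next(0);     // consumer 0 drains the stream first
        System.out.println("backlog = " + v.backlog());
    }
}
```

The main method mimics ($v,$v): the first consumer reads the whole sequence before the second one starts, so the backlog grows to the full let-value, which is why that case is called hopeless.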
Push-based Pipelines
• Unit of communication between pipelines: messages rather than events
• Pipeline components are SAX-like event handlers
  • they are instances of Operator subclasses:

    abstract class Operator {
        abstract void suspend ();
        abstract void release ();
        abstract void startDocument ( int node );
        abstract void endDocument ( int node );
        abstract Status endTuple ( int node );
        abstract Status startElement ( int node, String tag );
        abstract Status endElement ( int node, String tag );
        abstract Status characters ( int node, String text );
    }

('node' identifies a for-variable)
The Child Operator

    class Child extends Operator {
        Operator next;
        String tagname;
        int nest = 0;
        boolean keep = false;
        Status startElement ( int node, String tag ) {
            if (nest++ == 1)
                keep = tagname.equals(tag);
            if (keep)
                return next.startElement(node,tag);
            else return invalid;
        }
        // the other handlers are not shown on the slide
    }

• Example: document("...")/A/*//B

    Kick → Document → Child "A" → Any → Descendant "B" → Print
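A self-contained sketch of a push-based child step follows. It simplifies the Operator interface (the handlers return void instead of Status, and suspend/release are left out), and the endElement and characters handlers are assumptions, since the slide shows only startElement. The names PushOp, Collect, PushChild, and PushDemo are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified push operator: a SAX-like handler ('node' identifies a for-variable). */
abstract class PushOp {
    abstract void startElement(int node, String tag);
    abstract void endElement(int node, String tag);
    abstract void characters(int node, String text);
}

/** Hypothetical sink that records what reaches the end of the pipeline. */
class Collect extends PushOp {
    final List<String> out = new ArrayList<>();
    void startElement(int node, String tag) { out.add("<" + tag + ">"); }
    void endElement(int node, String tag) { out.add("</" + tag + ">"); }
    void characters(int node, String text) { out.add(text); }
}

/** Forwards only the children of the top-level element whose tag matches. */
class PushChild extends PushOp {
    private final PushOp next;
    private final String tagname;
    private int nest = 0;
    private boolean keep = false;
    PushChild(String tagname, PushOp next) { this.tagname = tagname; this.next = next; }

    void startElement(int node, String tag) {
        if (nest++ == 1) keep = tagname.equals(tag);   // a direct child opens
        if (keep) next.startElement(node, tag);
    }
    void endElement(int node, String tag) {
        if (keep) next.endElement(node, tag);
        if (--nest == 1) keep = false;                 // the direct child has closed
    }
    void characters(int node, String text) {
        if (keep) next.characters(node, text);
    }
}

class PushDemo {
    /** Pushes the events of <R><A>x</A>t<B>y</B><A><A>z</A></A></R> through Child "A". */
    static String run() {
        Collect sink = new Collect();
        PushOp p = new PushChild("A", sink);
        p.startElement(0, "R");
        p.startElement(0, "A"); p.characters(0, "x"); p.endElement(0, "A");
        p.characters(0, "t");
        p.startElement(0, "B"); p.characters(0, "y"); p.endElement(0, "B");
        p.startElement(0, "A");
        p.startElement(0, "A"); p.characters(0, "z"); p.endElement(0, "A");
        p.endElement(0, "A");
        p.endElement(0, "R");
        return String.join("", sink.out);
    }
    public static void main(String[] args) { System.out.println(run()); }
}
```

Each operator simply forwards or drops the event it receives, so a pipeline such as Child "A" followed by other steps is built by nesting constructors.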
For-Loops
• One thread per document reader
• Need to queue one tuple from the outer stream each time

    for $x in E1, $y in E2 return ...

    startElement, endElement, ...:
        if node=$x, insert the event into Queue
        else emit the event to the output (next)
    endTuple:
        if node=$x, suspend outer stream; send all events in Queue to E2
        else emit all events in Queue to the output (next)
    endDocument:
        if node=$y, clear Queue & release outer stream

[Figure: Loop $x with its Queue; outer stream E1 (For $x), inner stream E2 (For $y), output to next]

• Not a good idea if E2 reads a document
  • the document is read as many times as there are tuples in E1
  • but we can cache the output of E2 and push the cached data instead
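The caching remark can be illustrated with a deliberately simplified sketch. It assumes that E2 does not depend on $x (for instance, it only reads a document), models tuples as strings, and leaves out the event queue and thread suspension entirely. CachedLoop and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

class CachedLoop {
    /**
     * for $x in E1, $y in E2 return ($x,$y), where evaluating e2 is expensive
     * (e.g. it reads a document). Assumes e2 does not depend on $x.
     */
    static List<String> loop(List<String> e1, Supplier<List<String>> e2) {
        List<String> out = new ArrayList<>();
        List<String> cache = null;                 // cached output of E2
        for (String x : e1) {
            if (cache == null) cache = e2.get();   // E2 is evaluated once only
            for (String y : cache) out.add("(" + x + "," + y + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        final int[] reads = {0};
        List<String> result = loop(Arrays.asList("a", "b"), () -> {
            reads[0]++;                            // stands in for reading the document
            return Arrays.asList("1", "2");
        });
        System.out.println(result + ", E2 evaluated " + reads[0] + " time(s)");
    }
}
```

Without the cache, the supplier would run once per outer tuple, which corresponds to re-reading the document for every tuple of E1.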
Other Issues
• Let-bindings can be easily done using splitters (repeaters)
  • no caching is necessary
• But ... binary concatenation needs to cache the second stream
  • so, let $v:=e return ($v,$v) is still hopeless
• We don't need to cache path/FLWOR conditionals
  • the returned status of the condition events determines the predicate outcome (existential semantics)
  • initially, Predicate sends a suspend() event to the next stream and then the input events are propagated as is (to both pred and next)
  • if and when the predicate becomes true, the output is released

[Figure: the Predicate operator, with labels condition, pred, next, and Sink]
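The suspend/release protocol for predicates might look roughly like the sketch below. Several details are assumptions made for illustration: events are plain strings, the suspended output holds its pending events in a list, a tuple whose condition never becomes true is discarded at the end of the tuple, and the condition is modelled as a test on single events. BufferedOutput and PredicateSketch are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical 'next' stream that can be suspended and released. */
class BufferedOutput {
    private final List<String> delivered = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private boolean suspended = false;

    void suspend() { suspended = true; }
    void release() { suspended = false; delivered.addAll(pending); pending.clear(); }
    void discard() { suspended = false; pending.clear(); }   // assumed: condition never held
    void event(String e) { (suspended ? pending : delivered).add(e); }
    List<String> delivered() { return delivered; }
}

class PredicateSketch {
    /** Keeps a tuple if at least one of its events satisfies the condition. */
    static List<String> filter(List<List<String>> tuples, Predicate<String> condition) {
        BufferedOutput next = new BufferedOutput();
        for (List<String> tuple : tuples) {
            next.suspend();                     // hold the output until the outcome is known
            boolean satisfied = false;
            for (String e : tuple) {
                next.event(e);                  // events are propagated as is
                if (!satisfied && condition.test(e)) {
                    satisfied = true;           // existential semantics: one match is enough
                    next.release();
                }
            }
            if (!satisfied) next.discard();
        }
        return next.delivered();
    }

    public static void main(String[] args) {
        List<List<String>> tuples = Arrays.asList(
            Arrays.asList("a", "hit", "b"), Arrays.asList("c", "d"), Arrays.asList("hit"));
        System.out.println(filter(tuples, "hit"::equals));
    }
}
```

Once the condition holds, the rest of the tuple flows straight through, so only the events seen before the first match are ever held back.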
So, to Pull or to Push?
• For event streams, it doesn't really make a difference in terms of efficiency/storage requirements
  • a matter of programming style
  • push-based is a bit more difficult to program and harder to debug (threads)
• But ... if you want to use indexes, pulling is better
• For indexing, fragments are a better alternative to events
  • fragment = a reference to an element in a document
  • a fragment corresponds to a tree node, and you need an index to access descendants
  • need to guarantee that indexes deliver fragments sorted, so that all stream operators can be implemented using merge joins
  • examples:
    • structural indexes based on region encoding or on preorder/postorder ranks
    • IR-style content-based inverted indexes
• see my recent work on XQuery processing with relevance ranking http://lambda.uta.edu/XQueryRank.pdf
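To make region encoding concrete, the sketch below numbers the opening and closing tags of a document in order and tests the ancestor/descendant relationship by interval containment. The token format ("A" opens an element, "/A" closes it) and the names Region and RegionIndex are assumptions for this example only; real structural indexes store more than this.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** A fragment under region encoding: the positions of its start and end tags. */
class Region {
    final String tag;
    final int start, end;
    Region(String tag, int start, int end) { this.tag = tag; this.start = start; this.end = end; }
    /** True if 'other' is a proper descendant of this element. */
    boolean contains(Region other) { return start < other.start && other.end < end; }
    @Override public String toString() { return tag + "[" + start + "," + end + "]"; }
}

class RegionIndex {
    /** Builds regions from tag tokens; the result is sorted by start position. */
    static List<Region> build(String... tokens) {
        List<Region> regions = new ArrayList<>();
        Deque<int[]> open = new ArrayDeque<>();    // {start position, slot in regions}
        int counter = 0;
        for (String t : tokens) {
            if (!t.startsWith("/")) {
                open.push(new int[] { counter++, regions.size() });
                regions.add(null);                 // reserve the slot to keep document order
            } else {
                int[] o = open.pop();
                regions.set(o[1], new Region(t.substring(1), o[0], counter++));
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        // <A><B/><B/></A>
        List<Region> r = build("A", "B", "/B", "B", "/B", "/A");
        System.out.println(r + " A contains first B: " + r.get(0).contains(r.get(1)));
    }
}
```

Because the regions come out sorted by start position, two such lists can be scanned in step, which is what makes merge-join style operators possible.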
Related Work
• Joost: XSLT transformation based on SAX
• BEA/XQRL: pull-based XQuery processing
• Apache Cocoon: user-constructed pipelines made out of SAX handlers
• Many XQuery processors: Galax, Xalan, Qizx, Saxon, ...
• Lots of work on XPath/XQuery processing based on transducers