
The Joy of SAX

Leonidas Fegaras University of Texas at Arlington fegaras@cse.uta.edu http://lambda.uta.edu/. The Joy of SAX. Design Goals. Want to build an XQuery engine based entirely on SAX handlers


Presentation Transcript


  1. Leonidas Fegaras University of Texas at Arlington fegaras@cse.uta.edu http://lambda.uta.edu/ The Joy of SAX

  2. Design Goals
     • Want to build an XQuery engine based entirely on SAX handlers
       • all the way from the point where the input documents are read by the SAX parser to the point where the query results are printed
     • This engine should consist of operators that
       • naturally reflect the syntactic structures of XQuery and
       • can be composed into pipelines in the same way the corresponding XQuery structures are composed to form complex queries
     • The XQuery translation should be concise, clean, and completely compositional
     • Even though it cannot compete with transducers for simple XPaths, it should not sacrifice much performance in terms of memory and computational overhead
     • But ... it should be able to beat transducers for complex predicates and deeply nested queries

  3. Pull-Based Approach
     • Based on iterators:

         class Iterator {
             Tuple current ();   // current tuple from stream
             void open ();       // open the stream iterator
             Tuple next ();      // get the next tuple from stream
             boolean eos ();     // is this the end of stream?
         }

     • An iterator reads data from the input stream(s) and delivers data to the output stream
     • Connected through pipelines
       • an iterator (the producer) delivers a stream element to the output only when requested by the next operator in the pipeline (the consumer)
       • to deliver one stream element to the output, the producer becomes a consumer by requesting from the previous iterator as many elements as necessary to produce a single element, etc., until the end of stream
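The producer/consumer chaining described on this slide can be sketched as follows. This is only an illustration under simplifying assumptions: tuples are plain strings, `next()` advances without returning a value, and the names `PullIterator`, `ListSource`, `Select`, and `PullDemo` are hypothetical, not taken from the presentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical simplification of the slide's Iterator: tuples are plain strings.
interface PullIterator {
    void open();        // open the stream
    String current();   // current tuple (valid only while !eos())
    void next();        // advance to the next tuple
    boolean eos();      // is this the end of stream?
}

// Leaf producer: iterates over an in-memory list.
class ListSource implements PullIterator {
    private final List<String> data;
    private int pos;
    ListSource(List<String> data) { this.data = data; }
    public void open() { pos = 0; }
    public String current() { return data.get(pos); }
    public void next() { pos++; }
    public boolean eos() { return pos >= data.size(); }
}

// Consumer of `input` and producer for the next operator: it pulls as many
// input tuples as necessary to deliver a single output tuple.
class Select implements PullIterator {
    private final PullIterator input;
    private final Predicate<String> test;
    Select(PullIterator input, Predicate<String> test) {
        this.input = input;
        this.test = test;
    }
    // Skip input tuples until one satisfies the test or the input ends.
    private void skip() {
        while (!input.eos() && !test.test(input.current())) input.next();
    }
    public void open() { input.open(); skip(); }
    public String current() { return input.current(); }
    public void next() { input.next(); skip(); }
    public boolean eos() { return input.eos(); }
}

class PullDemo {
    // The final consumer drives the whole pipeline by pulling.
    static List<String> drain(PullIterator it) {
        List<String> out = new ArrayList<>();
        for (it.open(); !it.eos(); it.next()) out.add(it.current());
        return out;
    }

    public static void main(String[] args) {
        PullIterator pipeline =
            new Select(new ListSource(List.of("a", "bb", "c", "dd")), s -> s.length() == 2);
        System.out.println(drain(pipeline));
    }
}
```

Nothing is read from the source until `drain` asks for it, which is the defining property of the pull model.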

  4. What is a Tuple?
     • A vector of components:
       • one component for each scoped for-variable
       • has a fixed size at each point in a pipeline (known at compile time)
       • doesn't need to include the variable names
     • A tuple component is the unit of communication between iterators
     • Passing fully constructed XML elements through iterators is a bad idea for a compositional translation
       • initially, we would have to pass the entire document as a tree!
     • The unit of communication should be
       • a single event or
       • a fragment (a reference to an XML element in a document)
         • this requires a structural index for fragments
     • A proposal for a pull parser: XML Pull Parser 3 (www.xmlpull.org)
     • BEA/XQRL token stream & token iterators

  5. Event-Oriented Approach
     • A tuple in an event-oriented approach consists of a sequence of events, ending with an End-Of-Tuple (EOT) event
     • Single-node event sequence
       • depth-first unfolding of a single XML node
     • A tuple with 3 components:

         <start "A"> <start "B"> <text "x"> <end "B"> <start "B"> <text "y"> <end "B"> <end "A">
         <text "z">
         <start "A"> <start "B"> <text "w"> <end "B"> <end "A">
         <EOT>
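The depth-first unfolding can be sketched as a small recursive routine. The tree type `XNode` and the class `Unfold` are hypothetical helpers introduced only for this illustration, and events are rendered as strings in the slide's notation rather than as event objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tree node: an element has a tag and children,
// a text node has tag == null and carries its text.
class XNode {
    final String tag;
    final String text;
    final List<XNode> children = new ArrayList<>();

    XNode(String tag, String text) { this.tag = tag; this.text = text; }

    static XNode elem(String tag, XNode... kids) {
        XNode n = new XNode(tag, null);
        for (XNode k : kids) n.children.add(k);
        return n;
    }

    static XNode text(String s) { return new XNode(null, s); }
}

class Unfold {
    // Appends the depth-first event sequence of one node to `out`.
    static void unfold(XNode n, List<String> out) {
        if (n.tag == null) {
            out.add("<text \"" + n.text + "\">");
            return;
        }
        out.add("<start \"" + n.tag + "\">");
        for (XNode child : n.children) unfold(child, out);
        out.add("<end \"" + n.tag + "\">");
    }

    // A tuple is the concatenation of its components' events, closed by EOT.
    static List<String> tuple(XNode... components) {
        List<String> out = new ArrayList<>();
        for (XNode c : components) unfold(c, out);
        out.add("<EOT>");
        return out;
    }

    public static void main(String[] args) {
        XNode first = XNode.elem("A",
            XNode.elem("B", XNode.text("x")),
            XNode.elem("B", XNode.text("y")));
        XNode third = XNode.elem("A", XNode.elem("B", XNode.text("w")));
        System.out.println(String.join(" ", tuple(first, XNode.text("z"), third)));
    }
}
```

Running `main` rebuilds the three-component tuple from the slide: an `A` element, the text `z`, and a second `A` element, followed by `<EOT>`.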

  6. Element vs Event Granularity

     Stream unit is a single event:

         abstract class Event {}
         class Start extends Event { String tag; }
         class End extends Event { String tag; }
         class Text extends Event { String text; }
         class EOT extends Event {}

         class Child extends Iterator {
             Iterator input;
             String tagname;
             boolean keep = false;   // are we inside a matching child?
             int nest = 0;           // current nesting depth
             Event next () {
                 while (!input.eos()) {
                     current = input.current();
                     if (current instanceof Start) {
                         if (nest++ == 1)
                             keep = ((Start) current).tag.equals(tagname);
                     } else if (current instanceof End)
                         if (nest-- == 1)
                             keep = false;
                     input.next();
                     if (keep)
                         return current;
                 }
             }
         }

     Stream unit is a DOM-like element:

         abstract class Element {}
         class Node extends Element { String tag; Element[] sequence; }
         class Text extends Element { String text; }
         class Tuple { Element[] components; }

         class Child extends Iterator {
             Iterator input;
             String tagname;
             int index = 0;          // position within the current node's children
             Tuple next () {
                 while (!input.eos()) {
                     if (input.current().get(0) instanceof Node) {
                         Node ce = (Node) input.current().get(0);
                         if (index < ce.sequence.length)
                             if (ce.sequence[index] instanceof Node
                                 && ((Node) ce.sequence[index]).tag.equals(tagname)) {
                                 current = new Tuple(ce.sequence[index++]);
                                 return current;
                             } else index++;
                         else { index = 0; input.next(); }
                     } else { index = 0; input.next(); }
                 }
             }
         }

  7. For-Loop using Iterators

     Need a stepper for a for-loop:

         class Step extends Iterator {
             boolean first;
             Tuple tuple;
             void open () { first = true; current = tuple; }
             Tuple next () { first = false; return current; }
             void set ( Tuple t ) { tuple = t; }
             boolean eos () { return !first; }
         }

         class Loop extends Iterator {
             Iterator left;
             Step right_step;
             Iterator right;
         }

         Tuple Loop.next () {
             if (!left.eos()) {
                 while (right.eos()) {
                     left.next();
                     right_step.set(left.current());
                     right.open();
                 };
                 current = left.current().append(right.current());
                 right.next();
                 return current;
             }
         }

     [Diagram labels: Loop, left, right, right_step, right pipeline, Step, set]

     Not a good idea if right reads a document!

  8. Let-Bindings using Iterators

     Let-bindings are harder to implement:
     • the let-value may be a sequence
     • one producer, many consumers
     • we do not want to materialize the let-value in memory

     [Diagram labels: queue, tail, head, fastest consumer, slowest consumer, backlog]

     Some cases are hopeless: let $v:=e return ($v,$v)
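One way to picture the backlog between the fastest and the slowest consumer is a shared buffer with one read cursor per consumer. The sketch below is an assumption about how such a queue could work, not the presentation's implementation; `Backlog` and `BacklogDemo` are hypothetical names, and stream items are assumed to be non-null.

```java
import java.util.ArrayList;
import java.util.List;

// One producer, several consumers reading the same stream at different speeds.
class Backlog<T> {
    private final List<T> queue = new ArrayList<>(); // items not yet read by every consumer
    private long base = 0;                           // absolute position of the queue head
    private final long[] cursor;                     // absolute read position per consumer

    Backlog(int consumers) { cursor = new long[consumers]; }

    void produce(T item) { queue.add(item); }

    // Next item for consumer c, or null if it has caught up with the producer.
    T consume(int c) {
        int offset = (int) (cursor[c] - base);
        if (offset >= queue.size()) return null;
        T item = queue.get(offset);
        cursor[c]++;
        trim();
        return item;
    }

    // Number of items currently held in memory.
    int size() { return queue.size(); }

    // Drop everything that the slowest consumer has already read.
    private void trim() {
        long slowest = Long.MAX_VALUE;
        for (long pos : cursor) slowest = Math.min(slowest, pos);
        queue.subList(0, (int) (slowest - base)).clear();
        base = slowest;
    }
}

class BacklogDemo {
    // Mimics ($v,$v): consumer 0 reads the whole value before consumer 1 starts.
    static int[] run() {
        Backlog<String> b = new Backlog<>(2);
        for (String s : new String[] {"e1", "e2", "e3"}) b.produce(s);
        while (b.consume(0) != null) { }
        int afterFirst = b.size();
        while (b.consume(1) != null) { }
        int afterSecond = b.size();
        return new int[] {afterFirst, afterSecond};
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1]);
    }
}
```

In `run`, the second consumer does not start until the first has finished, so the whole value sits in the queue at that point. This is why the slide calls `($v,$v)` hopeless: the backlog grows to the full size of the let-value.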

  9. Push-based Pipelines
     • Unit of communication between pipelines: messages rather than events
     • Pipeline components are SAX-like event handlers
       • they are instances of Operator subclasses:

         abstract class Operator {
             void suspend ();
             void release ();
             void startDocument ( int node );
             void endDocument ( int node );
             Status endTuple ( int node );
             Status startElement ( int node, String tag );
             Status endElement ( int node, String tag );
             Status characters ( int node, String text );
         }

       ('node' identifies a for-variable)

  10. The Child Operator

         class Child extends Operator {
             Operator next;
             String tagname;
             int nest = 0;
             boolean keep = false;
             Status startElement ( int node, String tag ) {
                 if (nest++ == 1)
                     keep = tagname.equals(tag);
                 if (keep)
                     return next.startElement(node,tag);
                 else return invalid;
             }
         }

     • Example: document("...")/A/*//B

       [Pipeline diagram: Kick, Document, Child "A", Any, Descendant "B", Print]
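A self-contained sketch of a push-based child step is given below. It simplifies the slide's `Operator` interface by dropping the `Status` results and the `node` argument, and it tracks depth explicitly in all three handlers; `PushOp`, `Collect`, `PushChild`, and `PushDemo` are hypothetical names used only for this example.

```java
// Simplified push operator: no Status results and no for-variable id (assumptions).
abstract class PushOp {
    abstract void startElement(String tag);
    abstract void endElement(String tag);
    abstract void characters(String text);
}

// Terminal stage that records what it receives as serialized XML.
class Collect extends PushOp {
    final StringBuilder out = new StringBuilder();
    void startElement(String tag) { out.append('<').append(tag).append('>'); }
    void endElement(String tag) { out.append("</").append(tag).append('>'); }
    void characters(String text) { out.append(text); }
}

// Forwards only the children of the root element whose tag equals `tagname`.
class PushChild extends PushOp {
    private final PushOp next;
    private final String tagname;
    private int depth = 0;        // number of currently open elements
    private boolean keep = false; // inside a matching child?

    PushChild(String tagname, PushOp next) {
        this.tagname = tagname;
        this.next = next;
    }

    void startElement(String tag) {
        if (depth == 1) keep = tagname.equals(tag); // a direct child of the root opens
        depth++;
        if (keep) next.startElement(tag);
    }

    void endElement(String tag) {
        depth--;
        if (keep) next.endElement(tag);
        if (depth == 1) keep = false;               // the child has just been closed
    }

    void characters(String text) {
        if (keep) next.characters(text);
    }
}

class PushDemo {
    // Pushes the events of <A><B>x</B><C>y</C><B>z</B></A> through Child "B".
    static String run() {
        Collect sink = new Collect();
        PushOp child = new PushChild("B", sink);
        child.startElement("A");
        child.startElement("B"); child.characters("x"); child.endElement("B");
        child.startElement("C"); child.characters("y"); child.endElement("C");
        child.startElement("B"); child.characters("z"); child.endElement("B");
        child.endElement("A");
        return sink.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Each operator holds a reference to its successor and calls it directly, so the data source drives the pipeline rather than the final consumer.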

  11. For-Loops
     • One thread per document reader
     • Need to queue one tuple from the outer stream each time

         for $x in E1, $y in E2 return ...

     • Event handling:

         startElement, endElement, ...:
             if node=$x, insert the event into Queue
             else emit the event to the output (next)
         endTuple:
             if node=$x, suspend outer stream; send all events in Queue to E2
             else emit all events in Queue to the output (next)
         endDocument:
             if node=$y, clear Queue & release outer stream

       [Diagram labels: E1, For $x (outer); E2, For $y (inner); Queue; Loop $x; next]

     • Not a good idea if E2 reads a document
       • the document is read as many times as there are tuples in E1
       • but we can cache the output of E2 and push the cached data instead

  12. Other Issues
     • Let-bindings can be easily done using splitters (repeaters)
       • no caching is necessary
     • But ... binary concatenation needs to cache the second stream
       • so, let $v:=e return ($v,$v) is still hopeless
     • We don’t need to cache path/FLWOR conditionals
       • the returned status of the condition events determines the predicate outcome (existential semantics)
       • initially, Predicate sends a suspend() event to the next stream and then the input events are propagated as is (to both pred and next)
       • if and when the predicate becomes true, the output is released

       [Diagram labels: Predicate, condition, pred, next, Sink]
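The suspend/release protocol for predicates can be illustrated roughly as follows. This sketch makes several assumptions that are not in the slides: events are plain strings, the condition is reduced to "the tuple contains a given event", suspended events are held in a buffer at the sink, and a `discard()` call drops them when the tuple ends without the predicate becoming true. All class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Output stage that can be suspended: events are held back until released.
class BufferingSink {
    private final List<String> pending = new ArrayList<>(); // held while suspended
    final List<String> output = new ArrayList<>();          // events actually emitted
    private boolean suspended = false;

    void suspend() { suspended = true; }

    void release() {
        suspended = false;
        output.addAll(pending);
        pending.clear();
    }

    // Assumption: if the predicate never becomes true, the held events are dropped.
    void discard() {
        suspended = false;
        pending.clear();
    }

    void event(String e) {
        if (suspended) pending.add(e);
        else output.add(e);
    }
}

// Stand-in for the Predicate operator: the condition holds for a tuple as soon
// as the event `wanted` is seen (existential semantics).
class PredicateOp {
    private final BufferingSink next;
    private final String wanted;
    private boolean satisfied;

    PredicateOp(BufferingSink next, String wanted) {
        this.next = next;
        this.wanted = wanted;
    }

    void startTuple() { satisfied = false; next.suspend(); }

    void event(String e) {
        next.event(e);                    // propagate the event as is
        if (!satisfied && e.equals(wanted)) {
            satisfied = true;
            next.release();               // the predicate became true
        }
    }

    void endTuple() { if (!satisfied) next.discard(); }
}

class PredicateDemo {
    static List<String> run() {
        BufferingSink sink = new BufferingSink();
        PredicateOp p = new PredicateOp(sink, "hit");
        String[][] tuples = { {"a", "b", "hit", "c"}, {"d", "e"} };
        for (String[] t : tuples) {
            p.startTuple();
            for (String e : t) p.event(e);
            p.endTuple();
        }
        return sink.output;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Once the sink is released, later events of the same tuple pass straight through, so nothing beyond the events seen before the predicate fired is ever buffered.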

  13. So, to Pull or to Push?
     • For event streams, it doesn't really make a difference in terms of efficiency/storage requirements
       • a matter of programming style
       • push-based is a bit more difficult to program and harder to debug (threads)
     • But ... if you want to use indexes, pulling is better
     • For indexing, fragments are a better alternative to events
       • fragment = a reference to an element in a document
       • a fragment corresponds to a tree node, and you need an index to access descendants
       • need to guarantee that indexes deliver fragments sorted, so that all stream operators can be implemented using merge joins
       • examples:
         • structural indexes based on region encoding or on preorder/postorder ranks
         • IR-style content-based inverted indexes
       • see my recent work on XQuery processing with relevance ranking http://lambda.uta.edu/XQueryRank.pdf
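To make the remark about sorted fragments and merge joins concrete, here is a sketch of a descendant step over region-encoded fragments. It assumes each fragment is a (start, end) pair of document positions, that both inputs arrive sorted by start, and that regions come from a single tree (so any two are either nested or disjoint). The names `Region` and `RegionJoin` are hypothetical, and this is a generic one-pass merge, not the algorithm from the cited work.

```java
import java.util.ArrayList;
import java.util.List;

// A fragment under region encoding: positions of the start and end tags.
final class Region {
    final int start;
    final int end;
    Region(int start, int end) { this.start = start; this.end = end; }
    @Override public String toString() { return "[" + start + "," + end + "]"; }
}

class RegionJoin {
    // Returns the regions of `desc` that lie inside at least one region of `anc`.
    // Both lists must be sorted by start; the output is sorted by start as well.
    static List<Region> descendants(List<Region> anc, List<Region> desc) {
        List<Region> out = new ArrayList<>();
        int i = 0;
        int maxEnd = Integer.MIN_VALUE; // largest end among ancestors opened so far
        for (Region d : desc) {
            // Advance over every ancestor candidate that starts before d.
            while (i < anc.size() && anc.get(i).start < d.start) {
                maxEnd = Math.max(maxEnd, anc.get(i).end);
                i++;
            }
            // Regions never partially overlap, so an earlier start combined
            // with a later end means containment.
            if (maxEnd > d.end) out.add(d);
        }
        return out;
    }

    public static void main(String[] args) {
        // Positions for <A><B/><C><B/></C></A><B/>:
        // A=[1,8], B=[2,3], C=[4,7], B=[5,6], B=[9,10]
        List<Region> b = List.of(new Region(2, 3), new Region(5, 6), new Region(9, 10));
        System.out.println(descendants(List.of(new Region(4, 7)), b)); // //C//B
        System.out.println(descendants(List.of(new Region(1, 8)), b)); // //A//B
    }
}
```

Both inputs are scanned once, and the output stays sorted by start position, so it can feed the next merge-based operator directly.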

  14. Related Work
     • Joost: XSLT transformation based on SAX
     • BEA/XQRL: pull-based XQuery processing
     • Apache Cocoon: user-constructed pipelines made out of SAX handlers
     • Many XQuery processors: Galax, Xalan, Qizx, Saxon, ...
     • Lots of work on XPath/XQuery processing based on transducers
