RAINDROP: XML Stream Processing Engine. Murali Mani, WPI @UPenn, DB seminar June 08, 2006. Partially Supported by NSF grant IIS 0414567. Acknowledgements. NSF for the financial support Joint work with several others Prof. Elke A. Rundensteiner
RAINDROP: XML Stream Processing Engine Murali Mani, WPI @UPenn, DB seminar June 08, 2006 Partially Supported by NSF grant IIS 0414567
Acknowledgements
• NSF for the financial support
• Joint work with several others
  • Prof. Elke A. Rundensteiner
  • Graduate students – Hong Su, Ming Li, Mingzhu Wei, Shoushen Wang, Jinhui Jian
  • Undergraduate students – Drew Ditto, Bogomil Tselkov
  • …
DSRG, WPI
Applications
• Need for efficient stream data processing
  • Monitor patient data in real time
  • Sensor networks – fire detection; battlefield deployment; traffic congestion
  • Others – news delivery, monitor network traffic, …
XML Stream Processing
• Token-by-token access manner
• Pattern retrieval + Filtering + Restructuring
Query:
  for $a in open_auctions/auction[privacy]
  let $e := $a/description/emph
  where $e = "French Impressionism"
  return <InterestedAuction> $a, $e </InterestedAuction>
Input stream:
  <open_auctions>
    <auction>
      <privacy>No</privacy>
      <description> Calendar of <emph>French Impressionism</emph>by<emph>Monet </emph> </description>
      <initial> $20 </initial>
    </auction>
    …
[Figure: timeline of tokens arriving one at a time: <open_auctions> <auction> <privacy> No]
Option 1: Automata-Based Pattern Retrieval
Query:
  for $a in open_auctions/auction[privacy]
  let $e := $a/description/emph
  where $e = "French Impressionism"
  return <Auction>$a, $e </Auction>
[Figure: automaton with states 0–5; transitions labeled auctions, auction, privacy, description, emph]
• Additional data structures for
  • Buffering
  • Filtering
  • Restructuring
  • …
When patterns are retrieved depends on the data
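The automaton on this slide can be sketched in Python as a transition table driven by a stack of states: one push per start tag, one pop per end tag. This is a minimal sketch under stated assumptions, not Raindrop's implementation. The token tuples, the state numbering, the use of open_auctions (taken from the sample stream) as the first label, the variable $p for the privacy pattern, and the names TRANSITIONS, FINAL, SAMPLE and run_automaton are all hypothetical.

```python
# Hypothetical transition table; the state numbers are illustrative only.
TRANSITIONS = {
    (0, "open_auctions"): 1,
    (1, "auction"): 2,
    (2, "privacy"): 3,
    (2, "description"): 4,
    (4, "emph"): 5,
}
# States whose entry/exit marks the start/end of a query pattern (assumed mapping).
FINAL = {2: "$a", 3: "$p", 5: "$e"}
DEAD = -1  # state for elements that match no query path


def run_automaton(tokens):
    """Yield (variable, 'start' | 'end') as patterns open and close in the stream."""
    stack = [0]
    for kind, value in tokens:
        if kind == "start":
            state = TRANSITIONS.get((stack[-1], value), DEAD)
            stack.append(state)
            if state in FINAL:
                yield FINAL[state], "start"
        elif kind == "end":
            state = stack.pop()
            if state in FINAL:
                yield FINAL[state], "end"
        # text tokens do not change the automaton state


# The sample auction from the stream, written as (kind, value) tokens.
SAMPLE = [
    ("start", "open_auctions"), ("start", "auction"),
    ("start", "privacy"), ("text", "No"), ("end", "privacy"),
    ("start", "description"), ("text", "Calendar of"),
    ("start", "emph"), ("text", "French Impressionism"), ("end", "emph"),
    ("text", "by"),
    ("start", "emph"), ("text", "Monet"), ("end", "emph"),
    ("end", "description"),
    ("start", "initial"), ("text", "$20"), ("end", "initial"),
    ("end", "auction"), ("end", "open_auctions"),
]

events = list(run_automaton(SAMPLE))
```

The order of the emitted events follows the order of tokens in the stream, which is the sense in which "when patterns are retrieved depends on the data". Buffering, filtering and restructuring would need the additional data structures the slide lists; they are left out here.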
Option 2: "DOM" Based Pattern Retrieval
Query:
  for $a in open_auctions/auction[privacy]
  let $e := $a/description/emph
  where $e = "French Impressionism"
  return <Auction> $a, $e </Auction>
Logic Plan (operators top to bottom):
  Tagger
  Select $e = "French Impressionism"
  Navigate $a, /privacy -> $p
  Navigate $a, /description/emph -> $e
Rewritten Logic Plan (rewrite by "pushing down selection"):
  Tagger
  Navigate $a, /privacy -> $p
  Select $e = "French Impressionism"
  Navigate $a, /description/emph -> $e
Physical Plan (choose low-level implementation alternatives):
  Tagger
  Navigate-Scan $a, /privacy -> $p
  Select $e = "French Impressionism"
  Navigate-Index $a, /description/emph -> $e
When patterns are retrieved depends on other patterns
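The "pushing down selection" rewrite can be illustrated with a small Python sketch that moves each Select to sit directly after the Navigate that binds its variable. This is only an illustration under assumptions: plans are modeled as flat lists in execution order (the bottom operator of the slide's plan first), and the classes Navigate, Select, Tagger and the function push_down_selections are hypothetical names, not part of Raindrop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Navigate:
    source: str   # variable navigated from, e.g. "$a"
    path: str     # relative path, e.g. "/description/emph"
    binds: str    # variable bound to the result, e.g. "$e"


@dataclass(frozen=True)
class Select:
    var: str       # variable the predicate tests
    constant: str  # constant it is compared with


@dataclass(frozen=True)
class Tagger:
    tag: str


def push_down_selections(plan):
    """Return a plan where each Select runs right after its variable is bound.

    `plan` is a list of operators in execution order (an assumed representation).
    """
    result = [op for op in plan if not isinstance(op, Select)]
    for sel in (op for op in plan if isinstance(op, Select)):
        # Position of the Navigate that produces the selected variable.
        idx = next(i for i, op in enumerate(result)
                   if isinstance(op, Navigate) and op.binds == sel.var)
        result.insert(idx + 1, sel)
    return result


nav_e = Navigate("$a", "/description/emph", "$e")
nav_p = Navigate("$a", "/privacy", "$p")
sel_e = Select("$e", "French Impressionism")
tagger = Tagger("Auction")

logic_plan = [nav_e, nav_p, sel_e, tagger]
rewritten = push_down_selections(logic_plan)
```

In the rewritten plan the selection on $e is evaluated before the navigation to /privacy, so auctions that fail the predicate never trigger that second navigation.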
Which paradigm is better?
Minimal pushdown plans win over maximal pushdown when selectivity < 50%
Problem
• How to provide the framework to choose between these paradigms?
  • Model both paradigms uniformly as algebraic operators.
  • Use a cost model to choose the optimal plan given data statistics.
Automaton as TokenNav
Query:
  for $a in open_auctions/auction[privacy]
  let $e := $a/description/emph
  where $e = "French Impressionism"
  return <Auction> $a, $e </Auction>
[Figure: automaton with states 0–5; transitions labeled auctions, auction, privacy, description, emph]
Plan operators (tree layout lost in extraction):
  StructuralJoin $a
  Select $e="French …"
  Select non-empty($b)
  Extract $a
  Extract $b
  Extract $e
  TokenNav $a, /privacy->$b
  TokenNav $a,/desc/emph->$e
  TokenNav $s, /auctions/auction->$a
DOM Navigation as NodeNav
Query:
  for $a in open_auctions/auction[privacy]
  let $e := $a/description/emph
  where $e = "French Impressionism"
  return <Auction> $a, $e </Auction>
[Figure: automaton with states 0–2; transitions labeled auctions, auction]
Plan operators (tree layout lost in extraction):
  Select $e="French …"
  Select non-empty($b)
  NodeNav $a, /privacy->$b
  NodeNav $a,/desc/emph->$e
  Extract $a
  TokenNav $s, /auctions/auction->$a
Exploring the Search Space
• A pattern can be retrieved inside the automaton or outside the automaton
• However, there are dependencies
  for $a in …/a, $b in $a/…, $c in $b/…
  NodeNav for $b => NodeNav for $c
  TokenNav for $b => TokenNav/NodeNav for $c
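The dependency rules above shrink the search space: once a variable is retrieved by NodeNav, every variable navigated from it must also use NodeNav. A minimal Python sketch of the enumeration follows. It assumes the variables form a single chain ($b navigated from $a, $c from $b) and that $a itself is retrieved inside the automaton; the function name valid_assignments is hypothetical.

```python
from itertools import product


def valid_assignments(chain):
    """Enumerate operator choices for a chain of dependent variables.

    `chain` lists variables so that each one is navigated from the previous,
    e.g. ["$b", "$c"]. A choice is valid unless a NodeNav parent has a
    TokenNav child.
    """
    plans = []
    for choice in product(("TokenNav", "NodeNav"), repeat=len(chain)):
        ok = all(not (parent == "NodeNav" and child == "TokenNav")
                 for parent, child in zip(choice, choice[1:]))
        if ok:
            plans.append(dict(zip(chain, choice)))
    return plans


plans = valid_assignments(["$b", "$c"])
for plan in plans:
    print(plan)
```

For a chain of n variables this leaves n + 1 valid assignments out of 2^n, since a valid choice is fully described by the point at which the plan switches from TokenNav to NodeNav.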
Run-time Optimization
• Statistics unknown before data arrives
• Statistics could change over time
• We need techniques for efficient statistics monitoring, search space exploration and plan migration (safe points for migration)
Run-time Optimization
• Create an initial plan
• Run initial plan and collect statistics at same time
• Generate new plan using statistics collected
• Pause receiving stream
• Migrate to new plan
• Resume receiving stream
[Figure: components Stream, Query plan executor, Query Optimizer, Plan Migrator; edge labels Initial Query plan, statistics, New Query plan]
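The six steps above form a loop that can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the stream is modeled as batches that each end at a safe point, the only statistic is the fraction of items that satisfied the predicate, and the decision rule reuses the 50% threshold from the "Which paradigm is better?" slide as if it were the whole cost model. The names choose_plan and run are hypothetical.

```python
def choose_plan(selectivity):
    # Threshold taken from the "Which paradigm is better?" slide; using it as
    # the optimizer's only criterion is an assumption of this sketch.
    return "minimal-pushdown" if selectivity < 0.5 else "maximal-pushdown"


def run(stream_batches, initial_plan="maximal-pushdown"):
    """Process batches, re-optimizing at each batch boundary (assumed safe point)."""
    plan = initial_plan                     # 1. create an initial plan
    seen = matched = 0
    migrations = []
    for batch in stream_batches:
        for item_matches in batch:          # 2. run the plan, collect statistics
            seen += 1
            matched += bool(item_matches)
        if seen == 0:
            continue                        # no statistics yet
        new_plan = choose_plan(matched / seen)   # 3. generate a new plan
        if new_plan != plan:
            # 4-6. pause receiving, migrate, resume: reduced here to a plan swap
            migrations.append((plan, new_plan))
            plan = new_plan
    return plan, migrations


final_plan, migrations = run([[True, True], [False, False, False, False]])
```

A real migrator would also have to carry over operator state (buffers, partially matched patterns) at the safe point, which this sketch does not model.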
Executing a Raindrop Plan
Key Ideas
• Minimum memory requirements
  • Discard data early
  • Output data early
In-Time Structural Join
Query:
  for $a in open_auctions/auction[privacy]
  let $e := $a/description/emph
  where $e = "French Impressionism"
  return <Auction> $a, $e </Auction>
[Figure: automaton with states 0–5; transitions labeled auctions, auction, privacy, description, emph]
Plan operators (tree layout lost in extraction):
  StructuralJoin $a
  Select $e="French …"
  Select non-empty($b)
  Extract $a
  Extract $b
  Extract $e
  TokenNav $a, /privacy->$b
  TokenNav $a,/desc/emph->$e
  TokenNav $s, /auctions/auction->$a
Better than In-Time Structural Join
Query:
  for $r in /root
  return <root> <a>$r/a</a> <b>$r/b</b> </root>
[Figure: automaton with states 0–3; transitions labeled root, a, b]
Plan operators (tree layout lost in extraction):
  StructuralJoin $r
  Extract $a
  Extract $b
  TokenNav $r, /a->$a
  TokenNav $r, /b->$b
  TokenNav $s, /root->$r
"a" tokens need not be stored
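Why the "a" tokens need not be stored can be seen in a small Python sketch: the result lists all a elements before any b element, so each a can be written out the moment it arrives, while b elements wait in a buffer until the root element ends. This is a simplified illustration, not Raindrop's operator code. The children of root are modeled as (name, text) pairs, and the function name stream_root is hypothetical.

```python
def stream_root(children):
    """Yield output fragments for: return <root> <a>$r/a</a> <b>$r/b</b> </root>.

    `children` is an iterable of (name, text) pairs in arrival order
    (an assumed simplification of the token stream).
    """
    buffered_b = []                 # only "b" elements are ever stored
    yield "<root><a>"
    for name, text in children:
        if name == "a":
            yield f"<a>{text}</a>"  # output early: never buffered
        elif name == "b":
            buffered_b.append(f"<b>{text}</b>")
    yield "</a><b>"
    yield from buffered_b           # flushed once no more "a" can arrive
    yield "</b></root>"


output = "".join(stream_root([("b", "1"), ("a", "2"), ("b", "3"), ("a", "4")]))
```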
Evaluating Predicates
Query:
  for $r in /root
  where $r/a = "value"
  return <root> <b>$r/b</b> </root>
[Figure: automaton with states 0–3; transitions labeled root, a, b]
Plan operators (tree layout lost in extraction):
  StructuralJoin $r
  Extract $b
  Select $a="value"
  Extract $a
  TokenNav $r, /b->$b
  TokenNav $r, /a->$a
  TokenNav $s, /root->$r
Once $a="value" is satisfied, "b" tokens need not be stored
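The effect of the predicate on buffering can be sketched as follows: b elements are held back only while the outcome of $a = "value" is still unknown, and flow straight to the output once it is satisfied. This is a minimal sketch under assumptions. Children of root are modeled as (name, text) pairs, the function returns the peak buffer size purely to make the saving visible, and the name filter_root is hypothetical.

```python
def filter_root(children, wanted="value"):
    """Return (b values to output, peak number of buffered b values)."""
    satisfied = False
    buffered_b = []
    out = []
    peak = 0
    for name, text in children:
        if name == "a" and text == wanted and not satisfied:
            satisfied = True
            out.extend(buffered_b)      # predicate now known: flush the buffer
            buffered_b.clear()
        elif name == "b":
            if satisfied:
                out.append(text)        # no need to store: output directly
            else:
                buffered_b.append(text)
                peak = max(peak, len(buffered_b))
    # If no "a" ever matched, the root element produces no output.
    return (out if satisfied else []), peak


matched = filter_root([("b", "1"), ("a", "value"), ("b", "2"), ("b", "3")])
unmatched = filter_root([("b", "1"), ("a", "other")])
```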
Using Schema Knowledge
Schema: root -> (a*, b*)
Query:
  for $r in /root
  return <root> <a>$r/a</a> <b>$r/b</b> </root>
[Figure: automaton with states 0–3; transitions labeled root, a, b]
Plan operators (tree layout lost in extraction):
  StructuralJoin $r
  Extract $b
  Extract $a
  TokenNav $r, /a->$a
  TokenNav $r, /b->$b
  TokenNav $s, /root->$r
"a", "b" tokens need not be stored
Using Schema Knowledge for Predicates
Schema: root -> (b*, a*, c)
Query:
  for $r in /root
  where $r/a = "value"
  return <root> <b>$r/b</b> </root>
[Figure: automaton with states 0–3; transitions labeled root, a, b]
Plan operators (tree layout lost in extraction):
  StructuralJoin $r
  Extract $b
  Select $a="value"
  Extract $a
  TokenNav $r, /b->$b
  TokenNav $r, /a->$a
  TokenNav $s, /root->$r
Once "c" is seen and $a="value" is not yet satisfied, "b" tokens can be discarded
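With the schema root -> (b*, a*, c), the arrival of c proves that no further a element can follow, so an unsatisfied predicate can never become true and the buffered b elements can be dropped before the root element closes. A minimal Python sketch of this rule follows; the (name, text) representation of children and the function name filter_with_schema are assumptions made for illustration.

```python
def filter_with_schema(children, wanted="value"):
    """Return (b values to output, b values still buffered at the end).

    Assumes the input obeys the schema root -> (b*, a*, c).
    """
    satisfied = False
    buffered_b = []
    out = []
    for name, text in children:
        if name == "b":
            (out if satisfied else buffered_b).append(text)
        elif name == "a" and text == wanted and not satisfied:
            satisfied = True
            out.extend(buffered_b)   # predicate holds: release the buffer
            buffered_b.clear()
        elif name == "c" and not satisfied:
            buffered_b.clear()       # schema: no "a" can follow "c", discard early
    return out, buffered_b


discarded = filter_with_schema([("b", "1"), ("b", "2"), ("a", "other"), ("c", "")])
released = filter_with_schema([("b", "1"), ("a", "value"), ("c", "")])
```

Without the schema, the same buffer would have to be kept until the end tag of root, which matters when c is large.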
Conclusions
• Raindrop integrates automaton and "DOM" navigation into one algebraic framework.
• Cost-based optimization possible.
• Execution minimizes memory requirements.
Ongoing Work
• Load shedding in XML stream processing.
• Utilizing dynamic schema changes for optimization.
Fragment of XQuery supported
• FLWR expressions (no conditionals/user-defined functions)
• Path expressions use only forward axes (child, descendant, descendant-or-self, attribute)
• Predicates supported are of the form: pathExpr relOp constant
Issues with correlated queries
Query:
  for $r in /root
  return <root>
    for $a in $r/a
    return <a>$r/b</a>
  </root>
Visit our XQuery engine over XML stream project (RAINDROP) website http://davis.wpi.edu/dsrg/raindrop/