Using Partial Evaluation in Distributed Query Evaluation
Peter Buneman, Gao Cong, Wenfei Fan, Anastasios (Tasos) Kementsietsidis
Cutting Down Trees…
Tell me when GOOG stock sells for 376: [//stock[code = “GOOG” ∧ sell = 376]]
[Figure: XML tree rooted at portofolio and spread over peers P0, P1 and P2. It contains broker nodes (names MerillLynch, Bache), market nodes (names NASDAQ, NYSE) and stock nodes with code, buy and sell children. Stocks shown (code, buy, sell): GOOG $370 $372; AAPL $71 $65; YHOO $33 $35; GOOG $374 $373; IBM $80 $78.]
Not
Let’s do a depth-first traversal. We visit: P0 P1 P2 P1 P0 P2 P0
Let’s stream!
VLDB 2006
Status report…
• We have XML trees arbitrarily fragmented and distributed.
• We want to execute Boolean XPath queries Q = [q] over the fragmented trees:
  q := p | p/text() = str | label() = A | ¬q | q ∧ q | q ∨ q
  p := ε | A | * | p//p | p/p | p[q]
Lessons learned:
• We want to visit each peer only once, irrespective of the number of (tree) fragments it stores.
• We want to minimize communication costs. Ideally, no fragment data should be sent while evaluating a query.
Our motto: Send processing to data, NOT data to processing
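To make the query grammar on this slide concrete, here is a minimal sketch of it as a Python syntax tree. The class names (Empty, Label, Step, Filter, TextEq, LabelEq, Not, BoolOp) and the show helper are illustrative assumptions, not names from the talk; the operators ∧ and ∨ are spelled "and" and "or" in the code.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Empty:            # p := ε (the empty path)
    pass


@dataclass(frozen=True)
class Label:            # p := A, or "*" for the wildcard
    name: str


@dataclass(frozen=True)
class Step:             # p := p/p (sep "/") or p//p (sep "//")
    left: "Path"
    sep: str
    right: "Path"


@dataclass(frozen=True)
class Filter:           # p := p[q]
    path: "Path"
    cond: "Qual"


@dataclass(frozen=True)
class TextEq:           # q := p/text() = str
    path: "Path"
    value: str


@dataclass(frozen=True)
class LabelEq:          # q := label() = A
    name: str


@dataclass(frozen=True)
class Not:              # q := ¬q
    operand: "Qual"


@dataclass(frozen=True)
class BoolOp:           # q := q ∧ q (op "and") or q ∨ q (op "or")
    op: str
    left: "Qual"
    right: "Qual"


Path = Union[Empty, Label, Step, Filter]
Qual = Union[Empty, Label, Step, Filter, TextEq, LabelEq, Not, BoolOp]


def show(x) -> str:
    """Render a syntax tree back to text (hypothetical helper)."""
    if isinstance(x, Empty):
        return ""
    if isinstance(x, Label):
        return x.name
    if isinstance(x, Step):
        return f"{show(x.left)}{x.sep}{show(x.right)}"
    if isinstance(x, Filter):
        return f"{show(x.path)}[{show(x.cond)}]"
    if isinstance(x, TextEq):
        return f'{show(x.path)}/text() = "{x.value}"'
    if isinstance(x, LabelEq):
        return f"label() = {x.name}"
    if isinstance(x, Not):
        return f"not({show(x.operand)})"
    if isinstance(x, BoolOp):
        return f"({show(x.left)} {x.op} {show(x.right)})"
    raise TypeError(f"not a query node: {x!r}")


# The running example: //stock[code = "GOOG" and sell = 376]
Q = Step(Empty(), "//", Filter(Label("stock"), BoolOp(
    "and", TextEq(Label("code"), "GOOG"), TextEq(Label("sell"), "376"))))
```

Frozen dataclasses are used so that two structurally equal queries compare equal and can be shared between peers as plain values.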
Partial Evaluation
Consider a function f(s, d) and part of its input, say s. Partial evaluation specializes f(s, d) with respect to s, i.e., it performs the part of f’s computation that depends only on s. This generates a residual function g(d) that depends only on d.
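As a small illustration of this definition (not taken from the talk), the sketch below specializes a power function f(s, d) = d^s on a known exponent s. The names f, residual_source and specialize are assumptions made for the example. The loop over s runs at specialization time, so the residual function g(d) contains only multiplications of d.

```python
from typing import Callable


def f(s: int, d: int) -> int:
    """General function: d raised to the power s, by repeated multiplication."""
    result = 1
    for _ in range(s):
        result *= d
    return result


def residual_source(s: int) -> str:
    """Do the part of f that depends only on s: unroll the loop into program text."""
    factors = ["d"] * s
    return "lambda d: " + (" * ".join(factors) if factors else "1")


def specialize(s: int) -> Callable[[int], int]:
    """Return the residual function g(d), which no longer mentions s."""
    return eval(residual_source(s))


g = specialize(3)          # residual program: lambda d: d * d * d
assert g(5) == f(3, 5)     # g(d) agrees with f(s, d) for the fixed s
```

In the distributed setting of the talk, the role of s is played by the query together with the fragment a peer holds locally, and d by the parts of the tree stored elsewhere.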
Tree Fragments
[Figure: the portofolio tree cut into four fragments, F0, F1, F2 and F3, shown next to the fragment tree that relates them. Where a subtree has been cut out, a node labelled with the missing fragment (F1, F2 or F3) stands in its place. Stocks shown across the fragments (code, buy, sell): GOOG $370 $372; AAPL $71 $65; YHOO $33 $35; GOOG $374 $373; IBM $80 $78.]
Partial Evaluation in Distributed Query Evaluation
Main idea: Given a query Q, send Q to every peer holding a fragment.
[Figure: fragments of the portofolio tree at peers P0, P1 and P2, with the query [//stock[code = “GOOG” ∧ sell = 376]] sent to each peer. P2 has two fragments but is only visited once.]
• Compute partial answers (Boolean formulas):
  • Q is evaluated bottom-up.
  • We use Boolean variables for the evaluation of fragment nodes.
Answer of Q: computed by solving a linear system of Boolean equations.
Query Evaluation
Query representation: Q = [//stock[code = “GOOG” ∧ sell = 376]]
  q0: code = “GOOG”
  q1: sell = 376
  q2: */q0 ∧ */q1
  q3: stock[q2]
  q4: //q3
  Q = <q0, q1, q2, q3, q4>
Example 1 (vector <q0, q1, q2, q3, q4> computed at each node):
  market: <0, 0, 0, 0, 1>
  stock: <0, 0, 1, 1, 1>
  code (GOOG): <1, 0, 0, 0, 0>
  buy ($370)
  sell ($376): <0, 1, 0, 0, 0>
Example 2 (one child of stock lies in another fragment F):
  market: <0, 0, 0, 0, x1>
  stock: <0, 0, x1, x1, x1>
  code (GOOG): <1, 0, 0, 0, 0>
  buy ($370)
  F: <x0, x1, x2, x3, x4>
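The two examples can be reproduced with a short bottom-up evaluator. This is a minimal sketch hard-wired to the five sub-queries q0 to q4 of the running example, not the general procedure from the talk. The Node class, the AND/OR helpers and the variable naming scheme ("F.x1" for variable x1 of missing fragment F) are assumptions made for illustration. Formulas are Python booleans, variable names, or ("and" | "or", left, right) tuples.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def AND(a, b):
    """Conjunction with constant folding, so known parts are evaluated now."""
    if a is False or b is False:
        return False
    if a is True:
        return b
    if b is True:
        return a
    return a if a == b else ("and", a, b)


def OR(a, b):
    """Disjunction with constant folding."""
    if a is True or b is True:
        return True
    if a is False:
        return b
    if b is False:
        return a
    return a if a == b else ("or", a, b)


@dataclass
class Node:
    label: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    virtual: Optional[str] = None   # set when the node stands for a missing fragment


def evaluate(node: Node) -> list:
    """Return the vector [q0, q1, q2, q3, q4] at node, computed bottom-up."""
    if node.virtual is not None:
        # Unknown subtree: one fresh Boolean variable per sub-query.
        return [f"{node.virtual}.x{i}" for i in range(5)]
    kids = [evaluate(child) for child in node.children]

    def some_child(i):
        result = False
        for vector in kids:
            result = OR(result, vector[i])
        return result

    q0 = node.label == "code" and node.text == "GOOG"
    q1 = node.label == "sell" and (node.text or "").lstrip("$") == "376"
    q2 = AND(some_child(0), some_child(1))      # */q0 and */q1
    q3 = AND(node.label == "stock", q2)         # stock[q2]
    q4 = OR(q3, some_child(4))                  # //q3
    return [q0, q1, q2, q3, q4]


# Example 1: the whole stock subtree is local.
local = Node("stock", children=[
    Node("code", "GOOG"), Node("buy", "$370"), Node("sell", "$376")])
# Example 2: the third child lives in another fragment F.
cut = Node("stock", children=[
    Node("code", "GOOG"), Node("buy", "$370"), Node("F", virtual="F")])
```

For Example 1 this yields the vectors on the slide. For Example 2 the sketch returns x1 for q2 and q3, and x1 ∨ x4 for q4, because its descendant step also allows a match inside the missing fragment; the slide lists x1 in that position.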
The ParBoX Algorithm
Three stages:
• Stage 1: The querying peer PQ sends query Q to all peers holding a fragment (the fragment tree is used to identify all such peers).
• Stage 2: Evaluate Q, in parallel, over each fragment Fi at peer Pj.
• Stage 3: Collect the partial answers at PQ and compute the answer to Q.
Key considerations/concerns:
• (Total/parallel) computation costs.
• Communication costs.
• Level of fragmentation.
ParBoX comes in flavors:
• HybridParBoX
• FullDistParBoX
• LazyParBoX
[Figure: fragment tree with peer assignment: F0 (P0), F1 (P1), F3 (P2), F2 (P2).]
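Stage 3 can be pictured as follows. This is a sketch of one simple way to combine partial answers, under stated assumptions, and not necessarily the authors' procedure. Each fragment reports a vector of five formulas, where variable "Fk.xi" stands for sub-query qi at the root of child fragment Fk. The fragment-tree shape and the partial answers below are hypothetical values chosen for illustration.

```python
def substitute(formula, env):
    """Evaluate a formula once all of its variables have known truth values."""
    if isinstance(formula, bool):
        return formula
    if isinstance(formula, str):            # a variable such as "F1.x4"
        return env[formula]
    op, left, right = formula               # ("and" | "or", left, right)
    a, b = substitute(left, env), substitute(right, env)
    return (a and b) if op == "and" else (a or b)


def answer(fragment_tree, partial, root):
    """Resolve fragments children-first; the last vector entry at the root is Q."""
    env = {}

    def resolve(fragment):
        for child in fragment_tree.get(fragment, []):
            resolve(child)
        values = [substitute(formula, env) for formula in partial[fragment]]
        for i, value in enumerate(values):
            env[f"{fragment}.x{i}"] = value
        return values

    return resolve(root)[-1]


# Assumed fragment tree: F0 refers to F1 and F2, and F1 refers to F3.
fragment_tree = {"F0": ["F1", "F2"], "F1": ["F3"]}
# Hypothetical partial answers, one vector [q0..q4] per fragment.
partial = {
    "F0": [False, False, False, False, ("or", "F1.x4", "F2.x4")],
    "F1": [False, False, False, False, "F3.x4"],
    "F2": [False, False, False, False, False],
    "F3": [False, False, False, False, False],
}
result = answer(fragment_tree, partial, "F0")   # no fragment reports a match
```

Only these small vectors travel to the querying peer, which is why no fragment data needs to be shipped.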
Analysis of Algorithms
• Communication costs are LOW and independent of T (the data).
• Computation costs are comparable to the best-known centralized algorithm.
Notation:
• card(Si) = number of fragments in peer Pi
• card(T) = number of fragments of tree T. Note that card(T) ≤ |T|.
• |FSj| = sum of the sizes of the fragments in peer Pj
The Experimental Study
The setting:
• Ten (10) Linux machines (peers) distributed over a LAN.
• XMark “sites” are fragmented and distributed over the network. Their sizes vary between 5MB and 150MB.
The parameters:
• Number of machines participating in each experiment
• Size of query Q
• Size of tree T
• The “shape” of the fragment tree
• Number of fragments in the tree
• Nesting level (deep vs. shallow fragment trees)
• Number of fragments per machine
NaiveCentralized vs. ParBoX
|T| = 50MB, |Q| = 8, # fragments/peer = 1
[Figure: experimental plot comparing NaiveCentralized and ParBoX, annotated “Parallelism works!” and “Shipping data costs!”]
With |T| fixed, as we increase the number of machines, the difference (between iterations) in the size of the fragment allocated to each machine decreases.
Varying Query and Data Size
# peers = 8, # fragments/peer = 1
[Figure: fragment tree with fragments F0, F1, F2, F3, F4, F5, F6 and F7.]
Summary
• We (practically) proved that partial evaluation is effective in XML query processing over fragmented XML document trees.
• We presented the ParBoX family of algorithms to evaluate Boolean XPath queries. Our algorithms guarantee that:
  • Computation costs are optimal.
  • Each peer is visited only once.
  • Communication depends only on the query size (and not on the tree).
The question on everybody’s mind… Can we extend this idea to non-Boolean XPath queries?
The answer is YES… but you have to wait a bit to read about it!