470 likes | 1.43k Views
Distributed Query Optimization Algorithms. System R and R* Hill Climbing and SDD-1. System R (Centralized) Algorithm. Simple (one relation) queries are executed according to the best access path. Execute joins Determine the possible ordering of joins Determine the cost of each ordering
E N D
Distributed Query Optimization Algorithms • System R and R* • Hill Climbing and SDD-1
System R (Centralized) Algorithm • Simple (one relation) queries are executed according to the best access path. • Execute joins • Determine the possible ordering of joins • Determine the cost of each ordering • Choose the join ordering with the minimal cost • For joins, two join methods are considered: • Nested loops • Merge join
System R Algorithm -- Example Names of employees working on the CAD/CAM project • Assume • EMP has an index on ENO, • ASG has an index on PNO, • PROJ has an index on PNO and an index on PNAME
System R Algorithm -- Example • Choose the best access paths to each relation • EMP: sequential scan (no selection on EMP) • ASG: sequential scan (no selection on ASG) • PROJ: index on PNAME (there is a selection on PROJ based on PNAME) • Determine the best join ordering • EMP ASG PROJ • ASG PROJ EMP • PROJ ASG EMP • ASG EMP PROJ • EMP PROJ ASG • PROJ EMP ASG • Select the best ordering based on the join costs evaluated according to the two methods
EMP ASG ASG PROJ PROJ ASG (PROJ ASG) EMP (PROJ ASG) EMP (ASG EMP) PROJ (ASG EMP) PROJ System R Example (cont'd) • Best total join order is one of PROJ EMP ASG EMP × PROJ ASG EMP PROJ × EMP
System R Algorithm • (PROJ ASG) EMP has a useful index on the select attribute and direct access to the join attributes of ASG and EMP. • Final plan: • select PROJ using index on PNAME • then join with ASG using index on PNO • then join with EMP using index on ENO
System R* Distributed Query Optimization • Total-cost minimization. Cost function includes local processing as well as transmission. • Algorithm • For each relation in query tree find the best access path • For the join of n relations find the optimal join order strategy • each local site optimizes the local query processing
Data Transfer Strategies • Ship-whole. entire relation is shipped and stored as temporary relation. If merge join algorithm is used, no need for temporary storage, and can be done in pipeline mode • Fetch-as-needed. this method is equivalent to semijoin of the inner relation with the outer relation tuple
Join Strategy 1 • External relation R with internal relation S, let LC be local processing cost, CC be data transfer cost, let average number of tuples of S that match one tuple of R be s • Strategy 1. Ship the entire outer relation to the site of internal relation TC = LC(get R) + CC(size(R)) + LC(get s tuples from S)*card(R)
Join Strategy 2 • Ship the entire inner relation to the site of the outer relation TC = LC(get S) + CC(size(S)) + LC(store S) + LC(get R) + LC(get s tuples from S)*card(R)
Join Strategy 3 • Fetch tuples of the inner relation for each tuple of the outer relation TC = LC(get R) + CC(len(A)) * card(R) + LC(get s tuples from S) * card(R) + CC(s*len(S))*card(R)
Join Strategy 4 • Move both relations to 3rd site and join there TC = LC(get R) + LC(get S) + CC(size(S)) + LC(store S) + CC(size(R)) + LC(get s tuples from S)*card(R) • Conceptually, the algorithm does an exhaustive search among all alternatives and selects one that minimizes total cost
Hill Climbing Algorithm - Algorithm Inputs query graph, locations of relations, and relation statistics Initial solution the least costly among all when the relations are sent to a candidate result site denoted by ES0, and the site as chosen site Splits ES0 into ES1: ship one relation of join to the site of other relation ES2: these two relations are joined locally and the result is transmitted to the chosen site If cost(ES1) + cost(ES2) + LC > cost (ES0) select ES0, else select ES1 and ES2. The process can be recursively applied to ES1 and ES2 till no more benefit occurs
ES0 Cost = 13 Site1 EMP(8) Site2 PAY(4) 8 4 ENO Site3 PROJ(1) Site4 ASG(10) 1 PNO Hill Climbing Algorithm - Example SAL TITLE PAY EMP PNAME=“CAD/CAM” ASG Ignore the local processing cost Length of tuples is 1 for all relation PROJ
ES0 Cost = 13 Site1 EMP(8) Site2 PAY(4) 8 4 Site3 PROJ(1) Site4 ASG(10) 1 Site1 EMP(8) ES1 HCA - Example ? Site2 PAY(4) ES2 Solution 1 Cost = TITLE ? Site4 ASG(10) Site3 PROJ(1) ? ES3 Site1 EMP(8) ES1 Site2 PAY(4) ESo is the “BEST” ES2 Solution 2 Cost = Site4 ASG(10) ES3 Site3 PROJ(1)
Hill Climbing Algorithm - Comments • Greedy algorithm: determines an initial feasible solution and iteratively tries to improve it. • If there are local minimas, it may not find the global minima • If the optimal solution has a high initial cost, it won’t be found since it won’t be chosen as the initial feasible solution. Site1 EMP(8) Site2 PAY(4) Site4 ASG(10) Site3 PROJ(1) COST =
SDD-1 Algorithm • SDD-1 algorithm generalized the hill-climbing algorithm to determine ordering of beneficial semijoins; and uses statistics on the database, called database profiles. • Cost of semijoin: Cost (R SJA S) = CMSG + CTR*size(A(S)) • Benefit is the cost of transferring irrelevant tuple Benefit(R SJA S) = (1-SFSJ(S.A)) * size(R) * CTR • A semijoin is beneficial if cost < benefit.
SDD-1: The Algorithm • initialization phase generates all beneficial semijoins, and an execution strategy that includes only local processing • most beneficial semijoin is selected; statistics are modified and new beneficial semijoins are selected • the above step is done until no more beneficial joins are left • assembly site selection to perform local operations • postoptimization removes unnecessary semijoins
SDD1 - Example SELECT * FROM EMP, ASG, PROJ WHERE EMP.ENO = ASG.ENO AND ASG.PNO = PROJ.PNO Site 2 ASG ENO PNO Site 1 EMP Site 3 PROJ
SDD1 - First Iteration • SJ1: ASG SJ EMP benefit = (1-0.3)*3000 = 2100; cost = 120 • SJ2: ASG SJ PROJ benefit = (1-0.4)*3000 = 1800 cost = 200 • SJ3: EMP SJ ASG benefit = (1-0.8)*1500 = 300; cost = 400 • SJ4: PROJ SJ ASG benefit = 0; cost = 400 • SJ1 is selected • ASG size is reduced to 3000*0.3=900 ASG’ = ASG SJ EMP • Semijoin selectivity factor is reduced; it is approximated by SFSJ(G’.ENO)= 0.8*0.3 = 0.24, SFSJ(G’PNO)=1.0*0.3 =0.3, size(G’.ENO)= 400*0.3=120, size(G’.PNO) = 120
SDD-1 - Second & Third Iterations Second iteration • SJ2: ASG’ SJ PROJ benefit=(1-0.4)*900=540 cost=200; • SJ3: EMP SJ ASG’; benefit=(1-0.24)*1500=1140 cost=120 • SJ4: PROJ SJ ASG’, benefit=(1-0.3)*2000=1400cost=120 • SJ4 is selected PROJ’ = PROJ SJ ASG’ size(PROJ’) = 2000*0.3 = 600SFSJ(J’)=0.4*0.3=0.12size(J’.PNO)=200*0.3=60 Third Iteration • SJ2: ASG’ SJ PROJ benefit=(1-0.12)*900=792 cost=60; • SJ3: EMP SJ ASG’; benefit=(1-0.24)*1500=1140 cost=120 • SJ3 is selected reduces size of E to 1500*0.24=360 • Finally SJ2 is selected, with size of G as 108
Local Optimization • Each site optimizes the plan to be executed at the site • A centralized query optimization problem
SDD-1 - Assembly Site Selection • After reduction EMP is at site 1 with size 360 ASG is at site 2 with size 108 PROJ is at site 3 with size 600 • Site 3 is chosen as assembly site • SJ4 is removed in post optimization. Site1 EMP Site2 ASG Site3 PROJ (ASG SJ EMP) SJ PROJ site 3 (EMP SJ ASG) site 3 join at site 3