430 likes | 589 Views
Distributed Database Management Systems. Lecture 35. In the previous lecture. Query Optimization Centralized QO Best access path Join Processing QO in Distributed Environment. In this lecture. Query Optimization Fragmented Queries Joins replaced by Semijoins Three major QO algorithms.
E N D
Distributed Database Management Systems Lecture 35
In the previous lecture • Query Optimization • Centralized QO • Best access path • Join Processing • QO in Distributed Environment.
In this lecture • Query Optimization • Fragmented Queries • Joins replaced by Semijoins • Three major QO algorithms.
Reduces cost of join queries • Semijoin is ……. • Join of two relations can be replaced SJ of one or both relations.
So R⋈A S can be replaced: • (R ⋉A S) ⋈A S • R ⋈A (S ⋉A R) • (R ⋉A S) ⋈A (S ⋉A R) • Which one? • Need to estimate costs.
Same Assumptions: • R at site 1, S at site 2 • Size (R) < Size (S), so • A (S) site 1 • Site1 computes R’ = R⋉A S’ • R’ site 2 • Site2 computes R’ ⋈A S
Ignoring Tmsg semijoin is better if • Size(A(S)) + size(R ⋉A S) < size(R) • Join is better if …..- • Semijoin is better if…..-.
SJ with more than two tables Will be more complex • Semijoin approach can be applied to each individual join, consider EMP ⋈ ASG ⋈ PROJ
EMP ⋈ ASG ⋈ PROJ = • EMP’ ⋈ ASG’ ⋈ PROJ where • EMP’ = EMP ⋉ ASG and • ASG’ = ASG ⋉ PROJ rather • EMP” = EMP ⋉ (ASG ⋉ PROJ)
Many SJ expressions possible for a relation • “Full reducer” a SJ expression that reduces R the maximum • Not exists for cyclic queries.
Select eName From EMP, ASG, PROJ Where EMP.eNo = ASG.eNo and ASG.eNo = PROJ.eNo and EMP.city = PROJ.city
ASG pNo eNo PROJ EMP city Cyclic Query ASG eNo, city pNo, city PROJ EMP Tree Query
Full Reducer may be hard to find. • Easy for a chained query • Most systems use single SJs to reduce relation size.
Three main representative algos are • Distributed INGRES Algorithm • R* Algorithm • SDD-1 Algorithm.
R* Algorithm • Static, exhaustive • Algorithm supports fragmentation, actual implementation doesn’t • Master, execution and apprentice sites.
Optimizer of Master site makes inter-site decisions • Apprentice sites make local decisions • Optimizes local processing time & communication time.
Optimizer, based on stats of DB and size of iterm results, decides about • Join Ordering • Join Algo (nested/mergeJoin) • Access path (indexed/seq.).
Inter-site transfers • Ship-whole • Entire relation transferred • Stored in a temp relation • In case of merge-join approach, tuples can be processed as they arrive • Fetch-as-needed • nExternal relation is sequentially scanned • Join attribute value is sent to other relation • Relevant tuples scanned at other site and sent to first site.
Inter-site transfers: comparison • Ship-whole • larger data transfer • smaller number of messages • better if relations are small • Fetch-as-needed • number of messages = O(cardinality of external relation) • data transfer per message is minimal • better if relations are large and the join selectivity is good.
Example, join of an external relation R with an internal relation S, there are four strategies.
1-Move outer relation tuples to the site of the inner relation • Can be joined as they arrive • Total Cost = LT (retrieve card(R) tuples from R) + CT (size(R)) + LT (retrieve s tuples from S) * card (R)
2- Move inner relation to the site of outer relation • cannot join as they arrive; they need to be stored • Total Cost = LT (retrieve card(S) tuples from S) + CT (size (S)) + LT (store card(S) tuples as T) + LT (retrieve card(R) tuples from R) + LT (retrieve s tuples from T) * card (R).
3- Fetch inner tuples as needed • For each tuple in R, send join attribute value to site of S • Retrieve matching inner tuples at site S • Send the matching S tuples to site of R • Join as they arrive
Total Cost = LT (retrieve card(R) tuples from R)+ CT (length(A) * card (R)) + LT(retrieve s tuples from S) * card(R) + CT (s * length(S)) * card(R)
4- Move both inner and outer relations to another site • Example: A query consisting join of PROJ (ext) and ASG (int) on pNo • Four strategies
1- Ship PROJ to site of ASG 2- Ship ASG to site of PROJ 3- Fetch ASG tuples as needed for each tuple of PROJ 4- Move both to a third site Optimization involves costing for each possibility.
That is it regarding R* algorithm for distributed query optimization • Lets review it.
SDD-1 Algorithm • System for Distributed Databases • A non-commercial database.
Based on the Hill Climbing Algorithm • No semijoins, No rep/frag • Cost of transferring the result to the user site from the final result site is not considered • Can minimize either total time or response time
Input include • Query Graph • Locations of relations • Relations’ statistics.
1- Do the initial local processing 2- Select the initial best plan (ES0) • Calculate cost of moving all relations to a single site • Plan with the least cost is ES0
3- Split ES0 into ES1 and ES2 • ES1: Sending one of the relation to other site, relations joined there • ES2:Sending the result back to site in ES0.
4- Replace ES0 with ES1 and ES2 when we should have cost(ES1) + cost(local join) + cost (ES2) < cost (ES0) 5- Recursively apply step 3 and 4 on ES1 and ES2, until no improvement
Example • “Find the salaries of engineers working on CAD/CAM project” • Involves EMP, PAY, PROJ and ASG • sal(PAY ⋈title(EMP ⋈eNo(ASG ⋈pNo(pName = ‘CAD/CAM’(PROJ)))))
Assume Tmsg = 0 and TTR = 1 Length of a tuple is 1 So size(R) = card(R)
Considering only transfers costs • Site 1 • PAY site 1 = 4 • PROJ site 1 = 1 • ASG site 1 = 10 • Total = 15
Assume Tmsg = 0 and TTR = 1 Length of a tuple is 1 So size(R) = card(R)
Considering only transfers costs • Site 1 • PAY site 1 = 4 • PROJ site 1 = 1 • ASG site 1 = 10 • Total = 15
Cost for site 2 = 19 • Cost for site 3 = 22 • Cost for site 4 = 13 • So site 4 is our ES0 • Move all relations to site 4.