1 / 42

Distributed Database Management Systems

Distributed Database Management Systems. Lecture 35. In the previous lecture. Query Optimization Centralized QO Best access path Join Processing QO in Distributed Environment. In this lecture. Query Optimization Fragmented Queries Joins replaced by Semijoins Three major QO algorithms.

essien
Download Presentation

Distributed Database Management Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Distributed Database Management Systems Lecture 35

  2. In the previous lecture • Query Optimization • Centralized QO • Best access path • Join Processing • QO in Distributed Environment.

  3. In this lecture • Query Optimization • Fragmented Queries • Joins replaced by Semijoins • Three major QO algorithms.

  4. Semijoin based Algorithms

  5. Reduces cost of join queries • Semijoin is ……. • Join of two relations can be replaced SJ of one or both relations.

  6. So R⋈A S can be replaced: • (R ⋉A S) ⋈A S • R ⋈A (S ⋉A R) • (R ⋉A S) ⋈A (S ⋉A R) • Which one? • Need to estimate costs.

  7. Same Assumptions: • R at site 1, S at site 2 • Size (R) < Size (S), so • A (S)  site 1 • Site1 computes R’ = R⋉A S’ • R’  site 2 • Site2 computes R’ ⋈A S

  8. Ignoring Tmsg semijoin is better if • Size(A(S)) + size(R ⋉A S) < size(R) • Join is better if …..- • Semijoin is better if…..-.

  9. SJ with more than two tables Will be more complex • Semijoin approach can be applied to each individual join, consider EMP ⋈ ASG ⋈ PROJ

  10. EMP ⋈ ASG ⋈ PROJ = • EMP’ ⋈ ASG’ ⋈ PROJ where • EMP’ = EMP ⋉ ASG and • ASG’ = ASG ⋉ PROJ rather • EMP” = EMP ⋉ (ASG ⋉ PROJ)

  11. Many SJ expressions possible for a relation • “Full reducer” a SJ expression that reduces R the maximum • Not exists for cyclic queries.

  12. Select eName From EMP, ASG, PROJ Where EMP.eNo = ASG.eNo and ASG.eNo = PROJ.eNo and EMP.city = PROJ.city

  13. ASG pNo eNo PROJ EMP city Cyclic Query ASG eNo, city pNo, city PROJ EMP Tree Query

  14. Full Reducer may be hard to find. • Easy for a chained query • Most systems use single SJs to reduce relation size.

  15. Distributed Query Processing Algorithms

  16. Three main representative algos are • Distributed INGRES Algorithm • R* Algorithm • SDD-1 Algorithm.

  17. R* Algorithm • Static, exhaustive • Algorithm supports fragmentation, actual implementation doesn’t • Master, execution and apprentice sites.

  18. Optimizer of Master site makes inter-site decisions • Apprentice sites make local decisions • Optimizes local processing time & communication time.

  19. Optimizer, based on stats of DB and size of iterm results, decides about • Join Ordering • Join Algo (nested/mergeJoin) • Access path (indexed/seq.).

  20. Inter-site transfers • Ship-whole • Entire relation transferred • Stored in a temp relation • In case of merge-join approach, tuples can be processed as they arrive • Fetch-as-needed • nExternal relation is sequentially scanned • Join attribute value is sent to other relation • Relevant tuples scanned at other site and sent to first site.

  21. Inter-site transfers: comparison • Ship-whole • larger data transfer • smaller number of messages • better if relations are small • Fetch-as-needed • number of messages = O(cardinality of external relation) • data transfer per message is minimal • better if relations are large and the join selectivity is good.

  22. Example, join of an external relation R with an internal relation S, there are four strategies.

  23. 1-Move outer relation tuples to the site of the inner relation • Can be joined as they arrive • Total Cost = LT (retrieve card(R) tuples from R) + CT (size(R)) + LT (retrieve s tuples from S) * card (R)

  24. 2- Move inner relation to the site of outer relation • cannot join as they arrive; they need to be stored • Total Cost = LT (retrieve card(S) tuples from S) + CT (size (S)) + LT (store card(S) tuples as T) + LT (retrieve card(R) tuples from R) + LT (retrieve s tuples from T) * card (R).

  25. 3- Fetch inner tuples as needed • For each tuple in R, send join attribute value to site of S • Retrieve matching inner tuples at site S • Send the matching S tuples to site of R • Join as they arrive

  26. Total Cost = LT (retrieve card(R) tuples from R)+ CT (length(A) * card (R)) + LT(retrieve s tuples from S) * card(R) + CT (s * length(S)) * card(R)

  27. 4- Move both inner and outer relations to another site • Example: A query consisting join of PROJ (ext) and ASG (int) on pNo • Four strategies

  28. 1- Ship PROJ to site of ASG 2- Ship ASG to site of PROJ 3- Fetch ASG tuples as needed for each tuple of PROJ 4- Move both to a third site Optimization involves costing for each possibility.

  29. That is it regarding R* algorithm for distributed query optimization • Lets review it.

  30. SDD-1 Algorithm • System for Distributed Databases • A non-commercial database.

  31. Based on the Hill Climbing Algorithm • No semijoins, No rep/frag • Cost of transferring the result to the user site from the final result site is not considered • Can minimize either total time or response time

  32. Input include • Query Graph • Locations of relations • Relations’ statistics.

  33. 1- Do the initial local processing 2- Select the initial best plan (ES0) • Calculate cost of moving all relations to a single site • Plan with the least cost is ES0

  34. 3- Split ES0 into ES1 and ES2 • ES1: Sending one of the relation to other site, relations joined there • ES2:Sending the result back to site in ES0.

  35. 4- Replace ES0 with ES1 and ES2 when we should have cost(ES1) + cost(local join) + cost (ES2) < cost (ES0) 5- Recursively apply step 3 and 4 on ES1 and ES2, until no improvement

  36. Example • “Find the salaries of engineers working on CAD/CAM project” • Involves EMP, PAY, PROJ and ASG • sal(PAY ⋈title(EMP ⋈eNo(ASG ⋈pNo(pName = ‘CAD/CAM’(PROJ)))))

  37. Assume Tmsg = 0 and TTR = 1 Length of a tuple is 1 So size(R) = card(R)

  38. Considering only transfers costs • Site 1 • PAY  site 1 = 4 • PROJ  site 1 = 1 • ASG  site 1 = 10 • Total = 15

  39. Assume Tmsg = 0 and TTR = 1 Length of a tuple is 1 So size(R) = card(R)

  40. Considering only transfers costs • Site 1 • PAY  site 1 = 4 • PROJ  site 1 = 1 • ASG  site 1 = 10 • Total = 15

  41. Cost for site 2 = 19 • Cost for site 3 = 22 • Cost for site 4 = 13 • So site 4 is our ES0 • Move all relations to site 4.

  42. Thanks

More Related