360 likes | 492 Views
Distributed Database Management Systems. Lecture 33. In the previous lecture. Final phase of QD Data Localization: for HF, VF and DF. In today’s Lecture. Data Localization for Hybrid Fragmentation Query Optimization. Reduction for HyF. HyF contains both types of Fragmentations
E N D
Distributed Database Management Systems Lecture 33
In the previous lecture • Final phase of QD • Data Localization: for HF, VF and DF.
In today’s Lecture • Data Localization for Hybrid Fragmentation • Query Optimization.
Reduction for HyF • HyF contains both types of Fragmentations • EMP1=eNo ≤ E4 (eNo, eName (EMP)) • EMP2=eNo > E4 (eNo, eName (EMP)) • EMP3=eNo, title (EMP).
eName eName eNo = E5 eNo = E5 ⋈eNo U EMP2 Reduced Query- EMP3 EMP1 EMP2 • Select eName from EMP where eNo = E5
Query Decomposition: generates an efficient query in relational algebra • Normalization, Analysis, Simplification, Rewriting • Data Localization: applies global query to fragments; increases optimization level-
So, next is the cost-based optimization • Mainly concentrates on the order of performing joins • Characteristics of relations like cardinalities are considered
First QO in general • QO refers to producing a Query Execution plan (QEP) that represents execution strategy.
Components of Optimizer • Search Space: set of eq. alternative exec plans • Cost Model: predicts cost of a execution plan • Search Strategy: produces best plan
Search Space • Search space consists of eq. Query Trees produced using Tr Rules • Optimizer concentrates on join trees, since join cost is the most effective
Example: • Select eName, resp From EMP, ASG, PROJ where EMP.eNo = ASG. eNo and ASG.pNo = PROJ.pNo.
⋈pNo ⋈eNo ⋈eNo ⋈pNo PROJ EMP EMP ASG PROJ ASG ⋈pNo, eNo x ASG EMP PROJ
Alternatives with N relations are O(N!) based on properties of relations • So, restrictions are applied
1- Heuristics - Selection and projection on base relations - Avoid Cartesian product
2- Shape of Tree • Linear Tree: At least one node for each operand is a base relation • Bushy tree: May have operators with interm tables only; allows parallel execution
Search Strategy • Most popular is Dynamic Programming • That starts with base relations and keeps on adding relations calculating cost
DP is almost exhaustive so produces best plan • Too expensive with more than 5 relations • Other option is Randomized strategy • Do not guarantee best
Cost Model • Cost of operators, statistics of base data to predict size of intermediate tables • Cost considered as Total Time and Response Time.
Total time = CPU time + I/O time + tr time • In WAN, major cost is tr time • Initially ratios were 20:1 for tr and I/O, for LAN it is 1:1.6
Response time = CPU time + I/O time + tr time • Difference.?
TCPU = time for a CPU inst • TI/O = a disk I/O • TMSG = fixed time for initiating and recv a msg • TTR = transmit a data unit from one site to another.
X units Site 1 Site 3 Site 2 Y units • TT = 2TMSG + TTR*(x+y) • RT = max{TMSG + TTR*X, TMSG + TTR*Y}
Database Statistics • Major factor is interm tabs • If the interm results are to be transmitted, then estimation about size is a must • More precise statistics cost more
For each relation R[A1, A2, …, An] fragmented as R1, …, Rr • length of each attribute: length(Ai) • the number of distinct values for each attribute in each fragment: card(Ai(Rj)) • maximum and minimum values in the domain of each attribute: min(Ai), max(Ai).
4.The cardinalities of each domain: card(dom[Ai]) and the cardinalities of each fragment: card(Rj) 5.Join selectivity factor for some of the relationsSFJ (R,S) = card(R ⋈ S)/ (card(R) ∗ card(S))-
Selection Operation • Card(F(R))=SFS(F) * card(R) • SFS(A = value) = 1/card(A(R)) • SFS(A > value) = max(A) – value /(max(A) – min(A)) • SFS(A < value) = value - min(A) /(max(A) – min(A))
SFS(A < value) = max(A) – value /(max(A) – min(A)) • SFS(p(Ai) ^ p(Aj)) = SFS(p(Ai)) * (SFSp(Aj)) • SFS(p(Ai) v p(Aj)) = SFS(p(Ai)) + SFS(p(Aj))–(SFS(p(Ai))* SFS(p(Ai))).
Cardinality of Projection • Hard to determine precisely • Two cases when it is trivial 1- When a single attribute A, card(A(R)) = card (A) 2- When PK is included card(A(R)) = card (R)
Cartesian Product • card(RxS) = card (R) * card(S). • Cardinality of Join • No general way to test without additional information • In case of PK/FK combination Card(R ⋈ S) = card (S)
Semi Join: SFSJ(R ⋉AS)= card(A(S))/ card(dom[A]) card(R ⋉AS) = SFSJ(S.A) * card(R).
Union: Hard to estimate • Limits possible which are card(R) + card(S) and max{card (R) + card (S)) • Difference: Like Union, card (R) for (R-S), and 0
Why to Study • Distributed Query is transformed into local ones • Issues are related and more complex in DD • Easier to understand • Two famous ones.
INGRES • Dynamic • Recursively breaks into smaller ones • System R • static • exhaustive search