110 likes | 123 Views
Notions of clustering. Clustered file : e.g. store movie tuples together with the corresponding studio tuple. Clustered relation : tuples are stored in blocks mostly devoted to that relation. Clustering index : tuples (of the relation) with same search key are stored together.
E N D
Notions of clustering • Clustered file: e.g. store movie tuples together with the corresponding studio tuple. • Clustered relation: tuples are stored in blocks mostly devoted to that relation. • Clustering index: tuples (of the relation) with same search key are stored together.
Index-based algorithms: selection • To evaluate a=(R) use index on a, if it exists • Cost: cost of index lookup (negligible) plus • If index is clustering: B(R)/V(R,a) I/O’s (the fraction of the relation with some value for a) • Otherwise, an approximation is that each tuple we retrieve is in a different block, so we get: T(R)/V(R,a) I/O’s
Example of index-based selection • a=(R), and B(R) = 1000, T(R) =20,000 • R is clustered, and no index on attribute a 1000 disk I/O’s • R is unclustered, and no index on attribute a 20,000 I/O’s • R has a clustering index on a, V(R,a) = 100 10 I/O’s • R has a non-clustering index on a, V(R,a) = 100 20,000/100 = 200 disk I/O’s • V(R,a) = 20,000 (i.e. attribute a is key) just 1 I/O
Index joins • We want to compute R(X,Y) S(Y,Z) • Suppose there is a Y-index on S. • For each tuple t of R, lookup all tuples in S with key-value t[Y] and output the join of t. • Cost: B(R) to read R (clustered case) --- We ignore this cost • Each tuple of R joins with T(S)/V(S,Y) tuples of S, on average. • S has a non-clustered index on Y: T(R)T(S)/V(S,Y) • S has a clustered index on Y: T(R)B(S)/V(S,Y)
Example of index-join • T(R) = 10,000, B(R) = 1000 • T(S) = 5000, B(S) = 500, V(S,Y) = 100 • To compute R(X,Y) S(Y,Z) using a clustered Y-index on S: • 1000 + 10,000*(500/100) = 51,000 I/O’s Bad!!
However, things are not so bad in practice • Suppose we have the relations: • StarsIn(title, year, starName) • MovieStar(name, address, gender, birthdate) • And there is an index on MovieStar.name • Consider the SQL query: • SELECT birthdate • FROM StarsIn, MovieStar • WHERE title = 'King Kong' AND starName = name;
Practice (Cont’d) • We can first do the selection of those tuples in StarsIn relation with title=‘King Kong’. Suppose they are 10 such tuples. • Now, we know that stars take care to not have the same name with some other star. So, name is a key for the relation MovieStar. (V(MovieStar, name) = ?) • Hence, V(MovieStar, name) = T(MovieStar) • Finally the number of I/O’s is: • B(StarsIn) + T(name=‘King Kong’(StarsIn)) I/O’s for R clustered and • T(StarsIn) + T(name=‘King Kong’(StarsIn)) I/O’s for R non-clustered.
Joins using sorted indexes We want to compute R(X,Y) S(Y,Z) • If S has a B-tree index on Y, • Create sorted sublists of R only, and • Do a sort join, extracting the S-tuples in order through the index • Of both have B-tree index on Y, do a zigzag-join.
Example (B-Tree index on S[Y]) • T(R) = 10,000, B(R) = 1000, • T(S) = 5000, B(S) = 500, V(S,Y) = 100, S has a B-Tree index on Y • Assume that both relations and the indexes are clustered. • M = 101 buffers • Create 10 sorted sublists of R. Cost: 2B(R) • 10 buffers for sublists of R, 1 buffer for S (retrieved via index) • Join tuples from input buffers • Total cost: 2B(R) + B(R) + B(S) + index lookup = 2000 + 1000 + 500 + index lookup = 3500 + index lookup
Zigzag Join • Suppose we have B-Tree indexes on both S[Y] and R[Y]. • We can jump back and forth between the indexes finding Y-values that they share in common. • Tuples from R with Y-value that doesn’t appear in S need never be retrieved, and similarly tuples of S whose Y-value doesn’t appear in R need never be retrieved. Example. • Let the Y-values for R be: 1,3,4,4,4,5,6 • Let the Y-values for S be: 2,2,4,4,6,7 • Start with the 1 and 2. • Since 1<2 skip 1 in R. • Since 2<3 skip the 2’s in S. • Since 3<4 skip 3 in R. • Join 4’s. • …
Example (Zigzag Join) • T(R) = 10,000, B(R) = 1000, • T(S) = 5000, B(S) = 500, • S and R both have clustered B-Tree indexes on Y • There is no need to store either relation. • We use just 1000+500 disk I/O’s to read the blocks of R and S through their indexes. • We can determine from the indexes alone that a large fracion of R or S cannot match tuples of the other relation, so the cost might be considerably less than 1500 I/O’s.