120 likes | 237 Views
Query Execution 15.5 Two-pass Algorithms based on Hashing. By Swathi Vegesna. At a glimpse. Introduction Partitioning Relations by Hashing Algorithm for Duplicate Elimination Grouping and Aggregation Union, Intersection, and Difference Hash-Join Algorithm Sort based Vs Hash based
E N D
Query Execution15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna
At a glimpse • Introduction • Partitioning Relations byHashing • Algorithm for Duplicate Elimination • Grouping and Aggregation • Union, Intersection, and Difference • Hash-Join Algorithm • Sort based Vs Hash based • Summary
Introduction Hashing is done if the data is too big to store in main memory buffers. • Hash all the tuples of the argument(s) using an appropriate hash key. • For all the common operations, there is a way to select the hash key so all the tuples that need to be considered together when we perform the operation have the same hash value. • This reduces the size of the operand(s) by a factor equal to the number of buckets.
Partitioning Relations byHashing Algorithm: initialize M-1 buckets using M-1empty buffers; FOR each block b of relation RDO BEGIN read block b into the Mth buffer; FOR each tuple t in b DO BEGIN IF the buffer for bucket h(t) has no room for t THEN BEGIN copy the buffer t o disk; initialize a new empty block in that buffer; END; copy t to the buffer for bucket h(t); END ; END ; FOR each bucket DO IF the buffer for this bucket is not empty THEN write the buffer to disk;
Duplicate Elimination • For the operation δ(R) hash R to M-1 Buckets. (Note that two copies of the same tuple t will hash to the same bucket) • Do duplicate elimination on each bucket Riindependently, using one-pass algorithm • The result is the union of δ(Ri), where Ri is the portion of R that hashes to the ith bucket
Requirements • Number of disk I/O's: 3*B(R) • B(R) < M(M-1), only then the two-pass, hash-based algorithm will work • In order for this to work, we need: • hash function h evenly distributes the tuples among the buckets • each bucket Rifits in main memory (to allow the one-pass algorithm) • i.e., B(R) ≤ M2
Grouping and Aggregation • Hash all the tuples of relation R to M-1 buckets, using a hash function that depends only on the grouping attributes (Note: all tuples in the same group end up in the same bucket) • Use the one-pass algorithm to process each bucket independently • Uses 3*B(R) disk I/O's, requires B(R) ≤ M2
Union, Intersection, and Difference • For binary operation we use the same hash function to hash tuples of both arguments. • R U S we hash both R and S to M-1 • R ∩ S we hash both R and S to 2(M-1) • R-S we hash both R and S to 2(M-1) • Requires 3(B(R)+B(S)) disk I/O’s. • Two pass hash based algorithm requires min(B(R)+B(S))≤ M2
Hash-Join Algorithm • Use same hash function for both relations; hash function should depend only on the join attributes • Hash R to M-1 buckets R1, R2, …, RM-1 • Hash S to M-1 buckets S1, S2, …, SM-1 • Do one-pass join of Riand Si, for all i • 3*(B(R) + B(S)) disk I/O's; min(B(R),B(S)) ≤ M2
Sort based Vs Hash based • For binary operations, hash-based only limits size to min of arguments, not sum • Sort-based can produce output in sorted order, which can be helpful • Hash-based depends on buckets being of equal size • Sort-based algorithms can experience reduced rotational latency or seek time
Summary • Partitioning Relations byHashing • Algorithm for Duplicate Elimination • Grouping and Aggregation • Union, Intersection, and Difference • Hash-Join Algorithm • Sort based Vs Hash based