140 likes | 265 Views
15.8 Algorithms using more than two passes. Presented By: Seungbeom Ma (ID 125) Professor: Dr. T. Y. Lin Computer Science Department San Jose State University. Multipass Algorithms . Previously , most of algorithms are required two passes. There is a case that we need more than two passes.
E N D
15.8 Algorithms using more than two passes Presented By: Seungbeom Ma (ID 125) Professor: Dr. T. Y. Lin Computer Science Department San Jose State University
Multipass Algorithms • Previously , most of algorithms are required two passes. • There is a case that we need more than two passes. • Case : Data is too big to store in main memory. • We have to hash or sort the relation with multipass algorithms.
Agenda • 1. Multipass Sort-Based Algorithm • 2.Multipass Hash-Based Algorithm
Multipass sort-based algorithm. • M: Number of Memory Buffers • R: Relation • B(R) : Number of blocks for holding relation. • BASIS: • 1. If R fits in M block (B (R) <= M). • 2. Reading R into main memory. • 3. Sorting R in the main memory with any sorting algorithm. • 4. Write the sorted relation to disk.
Multipass sort-based algorithm. • INDUCTION: (B(R)> M) • 1. If R does not fit into main memory then partitioning the blocks hold R into M groups, which call R1, R2, …, RM • 2.Recursively sorting Ri from i =1 to M • 3.Once sorting is done, the algorithm merges the M sorted sub-lists.
Performance: Multipass Sort-Based Algorithms 1) Each pass of a sorting algorithm: 1.Reading data from the disk. 2. Sorting data with any sorting algorithms 3. Writing data back to the disk. 2-1) (k)-pass sorting algorithm needs 2k B(R) disk I/O’s 2-2)To calculate (Multipass)-pass sorting algorithm needs = > A+ B A: 2(K-1 ) (B(R) + B(S) ) [ disk I/O operation to sort the sublists] B: B(R) + B(S)[ disk I/O operation to read the sorted the sublists in the final pass] Total: (2k-1)(B(R)+B(S)) disk I/O’s
Multipass Hash-Based Algorithms • 1. Hashing the relations into M-1 buckets, where M is number of memory buffers. • 2. Unary case: • It applies the operation to each bucket individually. • 1.Duplicate elimination (δ) and grouping (γ). • 1) Grouping:Min, Max, Count , Sum , AVG , which can group the data in the table • 2) Duplicate elimination: Distinct Basis: If the relation fits in M memory block, -> Reading relation into memory and perform the operations. • 3. Binary case: It applies the operation to each corresponding pair of buckets. • Query operations: union, intersection, difference , and join • If either relations fits in M-1 memory blocks, • -> Reading that relation into main memory M-1 blocks • -> Reading next relation to 1 block at a time into the Mth block • Then performing the operations.
INDUCTION • If Unary and Binary relation does not fit into the main memory buffers. • Hashing each relation into M-1 buckets. • Recursively performing the operation on each bucket or corresponding pair of buffers. • Accumulating the output from each buckets or pair.
Perfermance: Hash-Based Algorithms • R: Realtion. • Operations are like δ and γ • M: Buffers • U(M, k): Number of blocks in largest relation with k-pass hashing algorithm.
Performance:Induction Induction: 1. Assuming that the first step divides relation R into M-1 equal buckets. 2. The buckets for the next pass must be small enough to handle in k-1 passes 3.Since R is divided into M-1 buckets , we need to have (M-1)u(M, k-1).
Sort-Based VS Hash-Based 1. Sort-based can produce output in sorted order. It might be helpful to reduce rotational latency or seek time 2. Hash-based depends on buckets being of equal size. For binary operations, hash-based only limits size of smaller relation. Therefore, hash-based can be faster than sort-based for small size of relation.