140 likes | 263 Views
Two-Pass Algorithms Based on Sorting. Prepared By: Ronak Shah ID: 116. Introduction. In two-pass algorithms, data from the operand relations is read into main memory, processed in some way, written out to disk again, and then reread from disk to complete the operation.
E N D
Two-Pass Algorithms Based on Sorting Prepared By: Ronak Shah ID: 116
Introduction • In two-pass algorithms, data from the operand relations is read into main memory, processed in some way, written out to disk again, and then reread from disk to complete the operation. • In this section, we consider sorting as tool from implementing relational operations. The basic idea is as follows if we have large relation R, where B(R) is larger than M, the number of memory buffers we have available, then we can repeatedly
Read M blocks of R in to main memory • Sort these M blocks in main memory, using efficient, main memory algorithm. • Write sorted list into M blocks of disk, refer this contents of the blocks as one of the sorted sub list of R.
Duplicate elimination using sorting • To perform δ(R) operation in two passes, we sort tuples of R in sublists. Then we use available memory to hold one block from each stored sublists and then repeatedly copy one to the output and ignore all tuples identical to it.
The no. of disk I/O’s performed by this algorithm, 1). B(R) to read each block of R when creating the stored sublists. 2). B(R) to write each of the stored sublists to disk. 3). B(R) to read each block from the sublists at the appropriate time. So , the total cost of this algorithm is 3B(R).
Grouping and aggregation using sorting • Reads the tuples of R into memory, M blocks at a time. Sort each M blocks, using the grouping attributes of L as the sort key. Write each sorted sublists on disk. • Use one main memory buffer for each sublist, and initially load the first block of each sublists into its buffer. • Repeatedly find least value of the sort key present among the first available tuples in the buffers. • As for the δ algorithm, this two phase algorithm for γ takes 3B(R) disk I/O’s and will work as long as B(R) <= M^2
A sort based union algorithm • When bag-union is wanted, one pass algorithm is used in that we simply copy both relation, works regardless of the size of arguments, so there is no need to consider a two pass algorithm for Union bag. • The one pass algorithm for Us only works when at least one relation is smaller than the available main memory. So we should consider two phase algorithm for set union. To compute R Us S, we do the following steps, 1. Repeatedly bring M blocks of R into main memory, sort their tuples and write the resulting sorted sublists back to disk. 2.Do the same for S, to create sorted sublist for relation S.
3.Use one main memory buffer for each sublist of R and S. Initialize each with first block from the corresponding sublist. 4.Repeatedly find the first remaining tuple t among all buffers. Copy t to the output , and remove from the buffers all copies of t.
A simple sort-based join algorithm Given relation R(x,y) and S(y,z) to join, and given M blocks of main memory for buffers, 1. Sort R, using a two phase, multiway merge sort, with y as the sort key. 2. Sort S similarly 3. Merge the sorted R and S. Generally we use only two buffers, one for the current block of R and the other for current block of S. The following steps are done repeatedly. a. Find the least value y of the join attributes Y that is currently at the front of the blocks for R and S. b. If y doesn’t appear at the front of the other relation, then remove the tuples with sort key y.
c. Otherwise identify all the tuples from both relation having sort key y d. Output all the tuples that can be formed by joining tuples from R and S with a common Y value y. e. If either relation has no more unconsidered tuples in main memory reload the buffer for that relation. • The simple sort join uses 5(B(R) + B(S)) disk I/O’s • It requires B(R)<=M^2 and B(S)<=M^2 to work
Summary of sort-based algorithms Main memory and disk I/O requirements for sort based algorithms