LEARNING OBJECTIVES • Sorting of large files • merge sort • performance of merge sort • multi-step merge sort CPSC 231 Sorting Large Files (D.H.)
Sorting of Large Files • If a file is too large to be sorted in main memory, it has to be sorted on disk. • Example: a file of 8,000,000 records, each 100 bytes long, is approximately 800 MB. • If a computer has only 8 MB of RAM available for sorting, only a small part of this file (about 1%) fits into main memory at a time.
Merge Sort • If we do not have enough available RAM to sort the entire file, we can sort parts of the file, save the sorted sub-files (runs) on the disk, and then use a K-way merge to produce the fully sorted file. • A run is a sorted subset of the file that is later merged with the other runs to sort the entire file. Runs can be created using a heap sort. • What is the maximum size of a run in the example on the previous slide?
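The run-then-merge idea above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: records are assumed to be integers stored one per line, the names `create_runs`, `read_run`, `external_sort` and the parameter `run_size` are hypothetical, and Python's built-in `list.sort()` stands in for the heap sort mentioned on the slide. In the slide's example, 8 MB of RAM and 100-byte records would allow at most about 80,000 records per run, giving about 100 runs for the 800 MB file.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def create_runs(records: Iterable[int], run_size: int, directory: str) -> List[str]:
    """Sort successive chunks of at most run_size records; write each chunk to its own run file."""
    paths: List[str] = []
    chunk: List[int] = []

    def flush() -> None:
        chunk.sort()  # in-memory sort of one run (stand-in for heap sort)
        path = os.path.join(directory, f"run{len(paths)}.txt")
        with open(path, "w") as f:
            f.writelines(f"{r}\n" for r in chunk)
        paths.append(path)
        chunk.clear()

    for r in records:
        chunk.append(r)
        if len(chunk) == run_size:
            flush()
    if chunk:  # last, possibly shorter, run
        flush()
    return paths


def read_run(path: str) -> Iterator[int]:
    """Read one sorted run sequentially, one record at a time."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(records: Iterable[int], run_size: int) -> Iterator[int]:
    """Create sorted runs on disk, then K-way merge them into one sorted stream."""
    with tempfile.TemporaryDirectory() as directory:
        paths = create_runs(records, run_size, directory)
        # heapq.merge performs the K-way merge over the K sorted run iterators.
        yield from heapq.merge(*(read_run(p) for p in paths))
```

For example, `list(external_sort([5, 3, 8, 1, 9, 2, 7], run_size=3))` builds three runs and merges them into one sorted list.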
Pros of the Merge Sort • Can sort very large files. • Reading the input file is sequential. • Reading each run and writing the output file are also sequential. • If heap sort is used to sort the runs, we can overlap I/O and sorting. • Since I/O is largely sequential, this method can be used for sorting files on tapes. • See fig. 8.21 p.320
Performance of Merge Sort • Merge sort requires I/O time for the following operations: • reading all records into memory for sorting and forming runs • writing the sorted runs to disk • reading the sorted runs into main memory for merging • writing the sorted file to disk
Merge Sort versus Key Sort • It takes approximately 6 minutes to sort the 800 MB file from our example on a Seagate Cheetah 9 hard disk (track to track seek time = 11 msec). • It would take approximately 24 hours to sort the same file using the Key Sort algorithm, which needs about one seek per record (8,000,000 × 11 msec ≈ 24 hours).
Sorting a File that is Even Larger • To sort a file that is ten times larger, we need many more seeks on the disk: since main memory is the same size, we have to create more runs and perform more seeks to merge them. • It takes approximately 2 hours and 6 minutes to merge sort an 8 GB file on the Seagate Cheetah 9 disk drive.
The cost of merging a bigger file • The number of seeks needed to merge a file that is 10 times larger than the original file is 100 times larger. WHY? • In general, for a K-way merge of K runs, where each run is as large as the available memory, the buffer available for each run is: (1/K) × size of each run
The number of seeks needed to merge a big file • Because each buffer holds only 1/K of a run, K seeks are needed to read all records in each individual run. • Since there are K runs altogether, the merge operation requires K² seeks. • Thus, if a file is N times bigger, N² times as many seeks are needed to merge it.
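The seek count can be checked with a little arithmetic. The sketch below is only an illustration of the K-runs, K-seeks-per-run argument; the function names `runs_needed` and `merge_seeks` are hypothetical, and sizes are taken in decimal units (1 MB = 1,000,000 bytes) as an assumption to match the slide's round numbers.

```python
def runs_needed(file_bytes: int, memory_bytes: int) -> int:
    """Number of memory-sized runs needed to cover the file (ceiling division)."""
    return -(-file_bytes // memory_bytes)


def merge_seeks(num_runs: int) -> int:
    """Seeks to read K runs in a one-step K-way merge: K seeks per run, K runs."""
    return num_runs * num_runs


MB = 1_000_000
small = runs_needed(800 * MB, 8 * MB)    # the 800 MB example file
large = runs_needed(8000 * MB, 8 * MB)   # a file ten times larger

print(small, merge_seeks(small))   # 100 10000
print(large, merge_seeks(large))   # 1000 1000000
print(merge_seeks(large) // merge_seeks(small))   # 100
```

With the same 8 MB of memory, ten times as many runs means each run gets a buffer one tenth the size, so the seek count grows by a factor of 100.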
How to improve the performance of merge sort? • Allocate more hardware: more main memory, multiple disk drives and I/O channels. • Perform the merge in more than one step. • Algorithmically increase the lengths of the initial sorted runs. • Find ways to overlap I/O operations.
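One well-known way to lengthen the initial runs algorithmically is replacement selection; the slide does not name a technique, so treating it as the intended one is an assumption. The sketch below is a minimal in-memory illustration: the function name `replacement_selection` and the parameter `capacity` (the number of records that fit in memory) are hypothetical, and records are assumed to be plain integers.

```python
import heapq
from typing import Iterable, Iterator, List

_END = object()  # sentinel marking the end of the input


def replacement_selection(records: Iterable[int], capacity: int) -> Iterator[List[int]]:
    """Yield sorted runs using a heap of at most `capacity` records (capacity >= 1)."""
    source = iter(records)
    heap: List[int] = []
    for r in source:  # fill memory with the first `capacity` records
        heap.append(r)
        if len(heap) == capacity:
            break
    heapq.heapify(heap)

    frozen: List[int] = []  # records that must wait for the next run
    run: List[int] = []
    while heap:
        smallest = heapq.heappop(heap)
        run.append(smallest)
        nxt = next(source, _END)
        if nxt is not _END:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)  # can still join the current run
            else:
                frozen.append(nxt)         # too small: belongs to the next run
        if not heap:                       # current run is finished
            yield run
            run = []
            heap, frozen = frozen, []
            heapq.heapify(heap)
```

For example, `list(replacement_selection([5, 3, 8, 1, 9, 2, 7], capacity=2))` gives `[[3, 5, 8, 9], [1, 2, 7]]`: with room for only two records in memory, the first run is four records long.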
Multi-Step Merge • A multi-step merge is a merge in which not all runs are merged in one step. Rather, several sets of runs are merged separately, each set producing one long run consisting of the records from all its runs. These new, longer runs are then merged, either all together or in several sets. • See the example of a two-step merge in fig. 8.23 p.330
Pros and Cons of Multi-Step Merge • Con: each record must be read twice (once to form the intermediate runs and again to form the final sorted file). • Pros: Because each merge step handles fewer runs at a time, each run gets a bigger buffer, which reduces the number of disk accesses. In some cases a multi-step merge is the only reasonable way to perform a merge on tape if the number of tape drives is limited.
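The trade-off can be estimated with the same seek-counting argument used for the one-step merge. The sketch below is a rough model under stated assumptions, not a measurement: it counts read seeks only, assumes K initial runs that are each as large as memory, and assumes they split evenly into groups; the function names `one_step_seeks` and `two_step_seeks` are hypothetical.

```python
def one_step_seeks(num_runs: int) -> int:
    """Read seeks for a single K-way merge: K runs, K seeks each."""
    return num_runs * num_runs


def two_step_seeks(num_runs: int, groups: int) -> int:
    """Read seeks for a two-step merge of num_runs runs split evenly into `groups` sets."""
    if num_runs % groups != 0:
        raise ValueError("this sketch assumes the runs split evenly into groups")
    per_group = num_runs // groups
    # Step 1: each group is a (per_group)-way merge, costing per_group**2 seeks.
    step1 = groups * per_group * per_group
    # Step 2: `groups` long runs, each per_group memory-loads long, read through
    # buffers of 1/groups of memory: per_group * groups = num_runs seeks per run.
    step2 = groups * num_runs
    return step1 + step2


print(one_step_seeks(100))       # 10000
print(two_step_seeks(100, 10))   # 2000
```

For the 100 runs of the 800 MB example, merging in 10 sets of 10 cuts the estimated read seeks from 10,000 to 2,000, at the price of reading and writing every record one extra time.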