Efficient Cosequential Processing: Sorting Large Files

Chapter 7: Cosequential Processing and the Sorting of Large Files

Cosequential Operations • Coordinated processing of two or more sequential lists to produce a single output list • Kinds of Operations • merging, union • matching • intersection • combination of above

Matching Operation • Output the names common to the two lists • A match, or intersection • Four steps 1. Initializing 2. Synchronizing 3. Handling end-of-file conditions 4. Recognizing errors

Matching Operation (2) • Algorithm • p261 Figure 7.2 • three-way conditional statement: if NAME_1 < NAME_2, read the next name from LIST_1; else if NAME_1 > NAME_2, read the next name from LIST_2; else output the name and read the next name from both lists

Matching Operation (3) • Key of algorithm • always return to the head of the main loop • End-of-file condition • test MORE_NAMES_EXIST flag • loop until either of the two lists reaches end-of-file
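A minimal Python sketch of the three-way comparison described above. The function name `match` and the sample names are illustrative assumptions, not taken from Figure 7.2; the sketch assumes both lists are sorted and contain no duplicates.

```python
def match(list_1, list_2):
    """Return the names common to two sorted lists."""
    out = []
    i = j = 0
    # Plays the role of the MORE_NAMES_EXIST test: stop as soon as
    # either list reaches its end.
    while i < len(list_1) and j < len(list_2):
        name_1, name_2 = list_1[i], list_2[j]
        if name_1 < name_2:
            i += 1                # read the next name from LIST_1
        elif name_1 > name_2:
            j += 1                # read the next name from LIST_2
        else:
            out.append(name_1)    # names match: output it
            i += 1                # and advance both lists
            j += 1
    return out


print(match(["ADAMS", "CARTER", "DAVIS"], ["BECH", "CARTER", "DAVIS", "ROSE"]))
```

Every branch falls through to the head of the same loop, which is what keeps the two lists synchronized.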

Merging Two Lists • Based on matching operation • p264 Figure 7.5 • Difference • must read each of the lists completely • change MORE_NAMES_EXIST behavior: it stays true until both lists reach end-of-file • HIGH_VALUE • a sentinel that comes after all legal input values in the file’s ordered sequence
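A hedged sketch of the merge using a HIGH_VALUE sentinel. The choice of `"\uffff"` as the sentinel is an assumption that no legal name contains that character; names appearing in both lists are written once, as in a union.

```python
HIGH_VALUE = "\uffff"  # assumed sentinel: sorts after every legal name


def merge(list_1, list_2):
    """Merge two sorted lists of names into one sorted list."""
    it_1, it_2 = iter(list_1), iter(list_2)
    # A list that has run out reports HIGH_VALUE instead of a name.
    name_1 = next(it_1, HIGH_VALUE)
    name_2 = next(it_2, HIGH_VALUE)
    out = []
    # Continue until BOTH lists are exhausted.
    while name_1 != HIGH_VALUE or name_2 != HIGH_VALUE:
        if name_1 < name_2:
            out.append(name_1)
            name_1 = next(it_1, HIGH_VALUE)
        elif name_1 > name_2:
            out.append(name_2)
            name_2 = next(it_2, HIGH_VALUE)
        else:
            out.append(name_1)
            name_1 = next(it_1, HIGH_VALUE)
            name_2 = next(it_2, HIGH_VALUE)
    return out


print(merge(["ADAMS", "CARTER"], ["BECH", "CARTER", "ROSE"]))
```

Because an exhausted list always compares as larger, the remaining names of the other list are copied out by the same three-way test, with no separate end-of-file branch.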

Cosequential Processing Model • Assumption: two or more input files are processed in a parallel fashion • Comment: the output may be the same file as one of the input files • Assumption: each file is sorted • Comment: it is not necessary that all files have the same record structure

Cosequential Processing Model (2) • Assumption: a high key value and a low key value must exist • Comment: not strictly necessary, but it decreases complexity • Assumption: records are processed in logical sorted order • Comment: physical ordering can have a large impact on processing

Cosequential Processing Model (3) • Assumption: for each file, there is only one current record • Comment: this does not prohibit looking ahead or looking back, but such operations should be restricted to subprocedures • Assumption: records should be manipulated only in internal memory • Comment: a program cannot alter a record in place on secondary storage

Cosequential Processing Model (4) • Components • Initialization • read the first record from each file • Synchronization loop • runs as long as relevant records remain • Selection in main synchronization loop • Use high values as end-of-file condition • no special code to deal with end-of-file

Cosequential Processing Model (5) • Components - cont’d • I/O and error detection are to be relegated to subprocesses • hide details • Simple and robust • Example: General Ledger Program • pp. 268~276

Multiway Merging • K-way merge • merge K input lists to create a single, ordered output list • p277 Figure 7.16 • works well when K is less than 8 or so
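A small sketch of a K-way merge that finds the lowest current key by scanning all K lists. This is an illustration under stated assumptions, not the code of Figure 7.16: keys are numbers, `float("inf")` serves as the high value, and a key present in several lists is written once.

```python
HIGH_VALUE = float("inf")  # assumed sentinel: larger than any legal key


def k_way_merge(lists):
    """Merge K sorted lists of numbers into one sorted list."""
    iters = [iter(lst) for lst in lists]
    # One current key per list; an exhausted list reports HIGH_VALUE.
    current = [next(it, HIGH_VALUE) for it in iters]
    out = []
    while True:
        low = min(current, default=HIGH_VALUE)  # scan all K current keys
        if low == HIGH_VALUE:
            break                               # every list is exhausted
        out.append(low)
        # Advance every list whose current key was just written.
        for k, key in enumerate(current):
            if key == low:
                current[k] = next(iters[k], HIGH_VALUE)
    return out


print(k_way_merge([[1, 4, 9], [2, 4, 8], [3, 7]]))
```

Each output key costs a scan over K current keys, which is why this form is attractive only for small K.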

Multiway Merging (2) • Selection Tree • K-way merge • for large K, the set of comparisons needed to find the lowest key becomes expensive • time vs space trade-off • a kind of tournament tree • each higher-level node represents the winner of the two descendant keys • the depth of the tree is about log2 K

Selection Tree
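The sketch below uses Python's `heapq` binary heap as a stand-in for the selection tree. It is not a tournament tree, but it shows the same idea: after writing a key, only about log2 K comparisons are needed to find the next winner. The name `heap_merge` and the `_END` marker are assumptions for illustration; unlike the scan-based merge, duplicate keys are all kept.

```python
import heapq

_END = object()  # private end-of-list marker (hypothetical helper)


def heap_merge(lists):
    """Merge K sorted lists, selecting the lowest key with a heap."""
    iters = [iter(lst) for lst in lists]
    heap = []
    for k, it in enumerate(iters):
        first = next(it, _END)
        if first is not _END:
            heap.append((first, k))   # (key, index of the list it came from)
    heapq.heapify(heap)
    out = []
    while heap:
        key, k = heap[0]              # current winner
        out.append(key)
        nxt = next(iters[k], _END)
        if nxt is _END:
            heapq.heappop(heap)       # that list is exhausted
        else:
            # Replace the winner with the next key from the same list.
            heapq.heapreplace(heap, (nxt, k))
    return out


print(heap_merge([[1, 4, 9], [2, 4, 8], [3, 7]]))
```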

Sorting in RAM • Can we improve on the time of a RAM sort? • perform some of the parts in parallel • a selection tree is good but cannot be used to sort an entire file • Heapsort • sorting and reading can occur in parallel • keeps all of the keys in a heap

Heapsort • Heap • each child node is greater than or equal to its parent node • the children of node i are nodes 2i and 2i+1 • Fig 7.20, Fig 7.21 • Processing overlap with I/O • use more than one buffer • p284 Figure 7.22 • fill buffer while building heap • Procedure for outputting : Fig 7.23
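A minimal sketch of a heap stored in an array so that the children of node i sit at positions 2i and 2i+1. Slot 0 is left unused to keep that arithmetic; the function names are assumptions, and the buffering that overlaps building the heap with I/O is left out.

```python
def heap_insert(heap, key):
    """Append key, then move it up while it is smaller than its parent."""
    heap.append(key)
    i = len(heap) - 1
    while i > 1 and heap[i] < heap[i // 2]:   # parent of node i is i // 2
        heap[i], heap[i // 2] = heap[i // 2], heap[i]
        i //= 2


def heap_remove_min(heap):
    """Remove and return the smallest key (the root, at position 1)."""
    smallest = heap[1]
    last = heap.pop()
    n = len(heap) - 1                          # keys remaining
    if n >= 1:
        heap[1] = last                         # move last key to the root
        i = 1
        while 2 * i <= n:
            child = 2 * i                      # left child
            if child + 1 <= n and heap[child + 1] < heap[child]:
                child += 1                     # right child is smaller
            if heap[i] <= heap[child]:
                break
            heap[i], heap[child] = heap[child], heap[i]
            i = child
    return smallest


def heapsort(keys):
    heap = [None]                              # position 0 is unused
    for key in keys:                           # keys can be inserted as they arrive
        heap_insert(heap, key)
    return [heap_remove_min(heap) for _ in range(len(keys))]


print(heapsort(list("FDCGHIBEA")))
```

Because `heap_insert` accepts keys one at a time, the heap can be built while later blocks of the file are still being read, which is the overlap the slide refers to.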

Sorting Large Files on Disk • Keysort shortcomings • cost of seeking • cannot sort really large files • all key/pointer pairs must fit in RAM • Multiway merge algorithm • run: sorted subfile

Sorting Large Files on Disk (2)

Sorting Large Files on Disk (3) • Multiway merging • can be extended to files of any size • reading during the run creation step • no seeking due to sequential reading • reading and writing during merging • sequential • I/O overlap using heapsort • tape can be used
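The two phases can be sketched as follows. This is a sketch under simplifying assumptions: Python lists stand in for the disk file and for the runs, `memory_capacity` is a hypothetical parameter giving how many records fit in RAM, and the standard-library `heapq.merge` performs the multiway merge.

```python
import heapq


def create_runs(records, memory_capacity):
    """Sort phase: read one memory-load at a time and sort it into a run."""
    runs = []
    for start in range(0, len(records), memory_capacity):
        chunk = records[start:start + memory_capacity]  # sequential read
        runs.append(sorted(chunk))                      # one sorted subfile
    return runs


def merge_sort_file(records, memory_capacity):
    """Merge phase: K-way merge of all runs into the final sorted output."""
    runs = create_runs(records, memory_capacity)
    return list(heapq.merge(*runs))


data = [9, 3, 7, 1, 8, 2, 6, 4, 5]
print(create_runs(data, 3))
print(merge_sort_file(data, 3))
```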

How Much Time Does a Merge Sort Take? • Merge Sort vs Key Sort • pp. 287~290 (10 minutes vs. 5 hours) • 4 Steps • reading records and forming runs • writing sorted runs • reading sorted runs for merging • writing sorted file

Sorting a Very Large File • Kinds of I/O • sort phase • sequential if using heapsort • no improvement possible • merge phase • random access (proportional to the number of runs) • Ways to improve performance • cut down the number of random accesses in the merge phase

Cost of Increasing the File Size • For a K-way merge of K runs, • the buffer size for each of the runs is 1/K × size of RAM = 1/K × size of each run • each run therefore takes K seeks to read, so the merge operation requires K² seeks • measured in seeks, merge sort is an O(K²) operation
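The arithmetic behind the K² figure can be checked with a few lines. The file and RAM sizes below are illustrative assumptions; the sketch counts only the seeks needed to read the runs during the merge and assumes every run is exactly one memory-load long.

```python
def merge_seeks(file_size_mb, ram_mb):
    """Return (K, seeks) for a single-step K-way merge."""
    k = -(-file_size_mb // ram_mb)   # number of runs, rounded up
    # Each run is read through a buffer 1/K the size of the run,
    # so each run needs K seeks and K runs need K * K seeks.
    return k, k * k


print(merge_seeks(800, 10))    # 80 runs
print(merge_seeks(8000, 10))   # 10 times the file size
```

Making the file 10 times larger makes K 10 times larger and the seek count 100 times larger.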

Cost of Increasing the File Size (2) • Ways to reduce time • more hardware • merge in more than one step • reducing the order of each merge • increasing the buffer size for each run • Increase the length of the initial sorted runs • Overlap I/O operations

Hardware-based Improvements • Possible configuration • increasing the amount of RAM • increasing the number of disk drives • increasing the number of I/O channels

Multiple-Step Merging • Break the original set of runs into small groups and merge the runs in these groups separately • Fewer seeks, but extra transmission time in second pass • Read every record twice • to form the intermediate runs and to form the final sorted file

Multiple-Step Merging (2) • Essence of multiple-step merging • increase the available buffer space for each run • an extra pass over the data in exchange for fewer random accesses • More than two steps? • reduced seek and rotational times vs increased transmission times
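A rough seek count comparing a single-step merge with a two-step merge. The figures are illustrative assumptions: every initial run is one memory-load long, the runs divide evenly into groups, and only the seeks for reading are counted.

```python
def single_step_seeks(num_runs):
    """One K-way merge: each of K runs needs K seeks."""
    return num_runs ** 2


def two_step_seeks(num_runs, group_size):
    """Merge groups of runs first, then merge the intermediate runs."""
    if num_runs % group_size != 0:
        raise ValueError("runs must divide evenly into groups")
    groups = num_runs // group_size
    # Step 1: each group is a group_size-way merge of memory-sized runs.
    step_1 = groups * group_size ** 2
    # Step 2: each intermediate run is group_size memory-loads long and
    # is read through a buffer 1/groups of memory in size.
    step_2 = groups * (group_size * groups)
    return step_1 + step_2


print(single_step_seeks(800))
print(two_step_seeks(800, 32))
```

The second pass reads every record again, so the saving in seeks has to be weighed against the extra transmission time.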

Increasing Run Lengths • A longer initial run • fewer total runs • bigger buffers • fewer seeks • Replacement selection

Replacement Selection • Idea • always select the key from memory that has the lowest value • output the key • replace it with a new key from the input list • Implementation: p299 • p300 Figure 7.27

Replacement Selection (2) • What about a key arriving in memory too late to be output into its proper position? • use of second heap • p301 Figure 7.28
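A sketch of replacement selection with a second heap, assuming `p` memory locations and using Python's `heapq`. The function name, the `_END` marker, and the sample keys are illustrative assumptions rather than the book's implementation.

```python
import heapq
from itertools import islice

_END = object()  # private end-of-input marker (hypothetical helper)


def replacement_selection(keys, p):
    """Split keys into sorted runs using p memory locations."""
    if p < 1:
        raise ValueError("p must be at least 1")
    source = iter(keys)
    primary = list(islice(source, p))   # fill memory with the first p keys
    heapq.heapify(primary)
    secondary = []                      # keys that arrived too late for this run
    runs, run = [], []
    while primary:
        smallest = heapq.heappop(primary)
        run.append(smallest)            # output the lowest key in memory
        nxt = next(source, _END)
        if nxt is not _END:
            if nxt >= smallest:
                heapq.heappush(primary, nxt)   # still fits in the current run
            else:
                secondary.append(nxt)          # hold it for the next run
        if not primary:
            runs.append(run)            # current run is complete
            run = []
            primary, secondary = secondary, []
            heapq.heapify(primary)      # held keys start the next run
    return runs


print(replacement_selection([33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16], 3))
```

With only three memory locations, the runs in this example are four or five keys long, longer than the three keys a plain in-memory sort of each memory-load would give.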

Replacement Selection (4) • Two questions • Given P locations in memory, how long a run can we expect replacement selection to produce, on the average? • pp. 301~302 • What are the costs of using replacement selection? • pp. 303~304 • less than 1/3 as many seeks as RAM sorting
