8. External Sorting

Suppose that a file is so large that the whole file cannot be accommodated in the internal memory of a computer. What shall we do? We need to use an EXTERNAL STORAGE DEVICE!!! External sorting: disk sort and tape sort.

lamond

  1. 8. External Sorting. Suppose that a file is so large that the whole file cannot be accommodated in the internal memory of a computer. What shall we do? We need to use an EXTERNAL STORAGE DEVICE!!! External sorting: disk sort and tape sort. What is the major difference between the two external sorts?

  2. Sorting with Disk: k-way merging ("mergesort"). [Figure: an internal sort produces the initial runs, which are then combined by repeated merge steps.]

  3. Example: 4500 records, 250 records/block (18 blocks), available memory = 3 blocks.
     Def'n: A segment of a file is said to be a run if all the records in the segment are sorted.
     [Figure: the input file I is sorted into runs 1 to 6; runs 1, 3, 5 go to disk D1 and runs 2, 4, 6 go to disk D2.]
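The run-creation step in this example can be sketched in Python. This is a minimal illustration, not the textbook's procedure: the function name `make_initial_runs`, the random test keys, and the choice to use the whole 3-block memory (750 records) for each internal sort are assumptions made here.

```python
import random

def make_initial_runs(records, memory_capacity):
    """Cut the input into memory-sized pieces and sort each piece internally."""
    runs = []
    for start in range(0, len(records), memory_capacity):
        chunk = records[start:start + memory_capacity]  # one memory load
        runs.append(sorted(chunk))                      # internal sort -> one run
    return runs

rng = random.Random(0)                                  # hypothetical test data
records = [rng.randrange(1_000_000) for _ in range(4500)]
runs = make_initial_runs(records, memory_capacity=3 * 250)

# Distribute the runs alternately: runs 1, 3, 5 to D1 and runs 2, 4, 6 to D2.
d1, d2 = runs[0::2], runs[1::2]
print(len(runs), len(d1), len(d2))   # 6 3 3
```

With 4500 records and 750 records per memory load, six runs result, which matches the six numbered runs in the slide's figure.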

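  4. [Figure: D1 and D2 hold runs of size 3; merging them onto D3 and D4 gives runs of size 6. The number beside each pair of disks is the size of a run.]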

  5. 2-way merging of 8 initial runs:
       Run size 3:    1 3 5 7 | 2 4 6 8
       Run size 6:    12 34 56 78
       Run size 12:   1256 3478
       Run size 24:   12345678
     How many passes? $1 + \log_2 r$ (r = # of initial runs)

  6. k-way merging. [Figure: the r initial runs are merged k at a time, giving $\log_k r$ levels of merging.] # of passes: $1 + \log_k r$. # of I/O operations? $O(n \log_k r)$, better than 2-way merging!!!
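The pass count can be checked with a small Python sketch. It uses the standard-library `heapq.merge` for the k-way merge itself; the helper names `merge_pass` and `external_merge` and the toy runs are assumptions for illustration, and the runs live in lists rather than on disk.

```python
import heapq

def merge_pass(runs, k):
    """One pass: merge each group of k consecutive runs into one longer run."""
    return [list(heapq.merge(*runs[i:i + k])) for i in range(0, len(runs), k)]

def external_merge(runs, k):
    """Repeat k-way merge passes until a single run remains.

    Returns the sorted output and the number of merge passes
    (run creation is the extra "1 +" in the slide's formula).
    """
    passes = 0
    while len(runs) > 1:
        runs = merge_pass(runs, k)
        passes += 1
    return runs[0], passes

runs = [[i, i + 8, i + 16] for i in range(8)]    # 8 sorted runs of 3 keys each
_, passes2 = external_merge(runs, 2)
_, passes4 = external_merge(runs, 4)
result, passes8 = external_merge(runs, 8)
print(passes2, passes4, passes8)                 # 3 2 1
```

For r = 8 the merge passes number 3, 2, and 1 for k = 2, 4, and 8, that is, $\lceil \log_k r \rceil$ in each case.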

  7. How about # of comparisons? Is k-way merging always better than 2-way merging?

  8. Replacement Selection. # of passes: #(P) = $1 + \log_k r$. #(P) decreases when k increases or when r decreases, and r decreases when the run size increases.

  9. # of comparisons (k-way merge). [Figure: selection tree for an 8-way merge; nodes are numbered 1 to 15 and the first key of each run sits at leaves 8 to 15.]
       Leaves (nodes 8-15):   10   9  20  15   8   9  90  17
       Nodes 4-7:              9  15   8  17
       Nodes 2-3:              9   8
       Root (node 1):          8

  10. How many comparisons in a pass? $n \log_2 k$. Why? Total # of comparisons? (# of passes)(# of comparisons in a pass) = $(\log_k r)(n \log_2 k) = n \log_2 r$, independent of k!!! So #(c) decreases when r decreases.
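The step from $(\log_k r)(n \log_2 k)$ to $n \log_2 r$ is a change of base. Written out (a worked version of the slide's equation, with the same symbols n, k, r):

```latex
(\log_k r)\,(n \log_2 k)
  = \frac{\log_2 r}{\log_2 k}\cdot n \log_2 k
  = n \log_2 r
```

The factor $\log_2 k$ cancels, which is why the comparison count does not depend on k even though the number of passes does.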

  11. How to increase run size (initial run size)? The keys are sorted m at a time: x1, x2, x3, ..., xm | xm+1, xm+2, xm+3, ..., x2m | x2m+1, x2m+2, x2m+3, ... (m keys per run), so r = # of runs = $\lceil n/m \rceil$. Any improvement? Observation: see p.94 in textbook!!!

  12. Input keys: 4, 2, 32, 12, 18, 24, 91, 11. [Figure: selection tree whose nodes hold pointers to the records instead of the records themselves.] (record size >> the size of pointer) Why do we need this?

  13. A tree of losers. [Figure: tree of losers for the keys 4, 2, 32, 12, 18, 24, 91, 11; each internal node has a parent pointer and a loser pointer, and a separate winner pointer is kept.]
      Updating pointers:
        ptr := winner.parent;
        while ptr ≠ nil do
          if (ptr.loser.key < winner.key) then
            interchange(ptr.loser, winner);
          end {if}
          ptr := ptr.parent;
        end {while}
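The pointer-updating loop can be turned into a runnable Python sketch. This version is an assumption-laden illustration, not the textbook's code: it stores the tree in an array (leaf i sits at node i + k, the parent of node t is t // 2) instead of using parent pointers, and it builds the tree with a hypothetical sentinel player whose key is minus infinity.

```python
import math

def loser_tree_merge(runs):
    """Merge sorted lists with a tree of losers kept in an array."""
    k = len(runs)
    if k == 0:
        return []
    pos = [0] * k                        # next unread position in each run

    def key(i):
        if i == k:                       # sentinel player used only while building
            return -math.inf
        return runs[i][pos[i]] if pos[i] < len(runs[i]) else math.inf

    loser = [k] * k                      # loser[1..k-1]: loser at each internal node
                                         # loser[0]: overall winner
    def replay(winner):
        node = (winner + k) // 2         # ptr := winner.parent
        while node > 0:                  # while ptr != nil
            if key(loser[node]) < key(winner):
                # interchange(ptr.loser, winner)
                loser[node], winner = winner, loser[node]
            node //= 2                   # ptr := ptr.parent
        loser[0] = winner

    for i in range(k - 1, -1, -1):       # build: let every run enter the tournament
        replay(i)

    out = []
    while key(loser[0]) != math.inf:     # stop when every run is exhausted
        w = loser[0]
        out.append(runs[w][pos[w]])
        pos[w] += 1                      # read the next record of the winning run
        replay(w)                        # only the path leaf -> root is replayed
    return out

keys = [4, 2, 32, 12, 18, 24, 91, 11]    # the slide's keys, used as 8 one-key runs
print(loser_tree_merge([[x] for x in keys]))
```

After each output only the nodes on the path from the winner's leaf to the root are visited, and each visit needs one comparison against the stored loser.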

  14. Explain p.97-101, textbook !!! Exercise : In a complete 2-tree(T) with n leaf nodes, show that total # of nodes in T = 2n -1

  15. Performance Analysis (average size of runs). m0 = # of records in (real) memory.
      • H. Seward (M.S. Thesis, MIT, 1954) gave a good reason to believe that a run contains more than 1.5 m0 records (no proof).
      • E. Friend (JACM 3, 1956) found by experiment that the run length is about 2 m0.
      • E. Moore (1961) proved that 2 m0 is the expected run length.

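  16. Sketch of Moore's Proof (snowplow). [Figure: a snowplow circling a track while snow falls at a uniform rate; the snow lying on the track at any moment is m0, and the plow removes 2 m0 in one circuit.] Uniform distribution ⇒ expected run length 2 m0.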
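The 2 m0 figure can be observed with a small simulation of replacement selection. This is a sketch under assumptions: it uses Python's `heapq` instead of a tree of losers, tags each key with its run number so that keys too small for the current run wait for the next one, and draws uniformly distributed random keys with a fixed seed. The name `replacement_selection` and the sizes are chosen here for illustration.

```python
import heapq
import random

_DONE = object()                      # marks the end of the input

def replacement_selection(keys, m):
    """Generate runs while holding at most m keys in memory."""
    it = iter(keys)
    heap = [(0, x) for _, x in zip(range(m), it)]   # (run number, key)
    heapq.heapify(heap)
    runs = []
    while heap:
        run_no, x = heapq.heappop(heap)
        if run_no == len(runs):       # first key of a new run
            runs.append([])
        runs[run_no].append(x)
        nxt = next(it, _DONE)
        if nxt is not _DONE:
            # A key smaller than the one just written cannot join the
            # current run, so it is tagged for the next run.
            heapq.heappush(heap, (run_no if nxt >= x else run_no + 1, nxt))
    return runs

small = replacement_selection([4, 2, 32, 12, 18, 24, 91, 11], 3)
print(small)                          # [[2, 4, 12, 18, 24, 32, 91], [11]]

rng = random.Random(1)
m0 = 100
data = [rng.random() for _ in range(20000)]
runs = replacement_selection(data, m0)
average = len(data) / len(runs)
print(round(average / m0, 2))         # close to 2 for uniformly random keys
```

With only 3 keys in memory the eight slide keys already form a run of length 7; sorting 3 keys at a time would have produced three runs.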

  17. Tape Sorting
      • Balanced k-way merging (similar to disk sorting)
      • Polyphase merging
      • Cascade merging

  18. Polyphase Merging (Motivation)
      • (R1, R2, ..., R5000)
      • length(Ri) = 20 bytes
      • Only 1000 records fit in the internal memory at one time (= 20k bytes)
      • 4 tapes available
      Balanced 2-way merge (- is an empty tape):
        Step      T1                              T2                      T3                   T4
        initial   R1,1000 R2001,3000 R4001,5000   R1001,2000 R3001,4000   -                    -
        merge 1   -                               -                       R1,2000 R4001,5000   R2001,4000
        merge 2   R1,4000                         R4001,5000              -                    -
        merge 3   -                               -                       R1,5000              -
      Total # of operations = 15000

  19.   Step      Tape 1               Tape 2                  Tape 3       Tape 4
        initial   R1,1000 R3001,4000   R1001,2000 R4001,5000   R2001,3000   -
        merge 1   R3001,4000           R4001,5000              -            R1,3000   (rewind)
        merge 2   -                    -                       R1,5000      -
      • Total # of I/O operations: 3000 + 5000 = 8000
      Balanced merge is not always best!!!

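  20. What if only 3 tapes are available? With a balanced 2-way merge, a run has to be copied to another tape before each later merge:
        Step                            Runs after the step                                    I/O
        initial (on Tape 1, Tape 2)     R1,1000 R1001,2000 R2001,3000 R3001,4000 R4001,5000
        merge onto Tape 3               R1,2000 R2001,4000 R4001,5000                          5000
        copy a 2000-record run          R1,2000 R2001,4000 R4001,5000                          2000
        merge                           R1,4000 R4001,5000                                     5000
        copy R1,4000                    R4001,5000 R1,4000                                     4000
        merge                           R1,5000                                                5000
      Total # of I/O operations: 5000 + 2000 + 5000 + 4000 + 5000 = 21,000!!!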

  21.   Step      Tape 1                          Tape 2                  Tape 3
        initial   R1,1000 R2001,3000 R4001,5000   R1001,2000 R3001,4000   -
        merge 1   R4001,5000                      -                       R1,2000 R2001,4000   (rewind)
        merge 2   -                               R1,2000; 4001,5000      R2001,4000           (rewind)
        merge 3   R1,5000                         -                       -
      Total # of I/O operations: 4000 + 3000 + 5000 = 12,000!!!

  22. Polyphase merge (1^31 means 31 runs of relative length 1; - is an empty tape)
        T1      T2     T3     T4     T5     T6
        1^31    1^30   1^28   1^24   1^16   -
        1^15    1^14   1^12   1^8    -      5^16
        1^7     1^6    1^4    -      9^8    5^8
        1^3     1^2    -      17^4   9^4    5^4
        1^1     -      33^2   17^2   9^2    5^2
        -       65^1   33^1   17^1   9^1    5^1
        129^1   -      -      -      -      -
      How to assign initial runs?

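  23. Cascade Merge (190 initial runs; 1^55 means 55 runs of relative length 1; - is an empty tape; a row in parentheses shows the tapes after the one-tape copy that ends the pass)
                  T1      T2     T3     T4     T5     T6
        initial   1^55    1^50   1^41   1^29   1^15   -
        Pass 1    1^40    1^35   1^26   1^14   -      5^15
                  1^26    1^21   1^12   -      4^14   5^15
                  1^14    1^9    -      3^12   4^14   5^15
                  1^5     -      2^9    3^12   4^14   5^15
                  (-      1^5    2^9    3^12   4^14   5^15)
        Pass 2    15^5    -      2^4    3^7    4^9    5^10
                  15^5    14^4   -      3^3    4^5    5^6
                  15^5    14^4   12^3   -      4^2    5^3
                  15^5    14^4   12^3   9^2    -      5^1
                  (15^5   14^4   12^3   9^2    5^1    -)
        Pass 3    15^4    14^3   12^2   9^1    -      55^1
                  15^3    14^2   12^1   -      50^1   55^1
                  15^2    14^1   -      41^1   50^1   55^1
                  15^1    -      29^1   41^1   50^1   55^1
                  (-      15^1   29^1   41^1   50^1   55^1)
        Pass 4    190^1   -      -      -      -      -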

  24. Polyphase Merge (Gilstad, 1960)
        phase   T1      T2     T3     T4     T5     T6
        1       1^31    1^30   1^28   1^24   1^16   -
        2       1^15    1^14   1^12   1^8    -      5^16
        3       1^7     1^6    1^4    -      9^8    5^8
        4       1^3     1^2    -      17^4   9^4    5^4
        5       1^1     -      33^2   17^2   9^2    5^2
        6       -       65^1   33^1   17^1   9^1    5^1
        7       129^1   -      -      -      -      -
      {{1,0,0,0,0},{1,1,1,1,1},{2,2,2,2,1},{4,4,4,3,2},{8,8,7,6,4},{16,15,14,12,8},{31,30,28,24,16}}
      Perfect Fibonacci Distribution!!! What is the underlying rule?

  25.   i   a_i   b_i   c_i   d_i   e_i
        0     1     0     0     0     0
        1     1     1     1     1     1
        2     2     2     2     2     1
        3     4     4     4     3     2
        4     8     8     7     6     4
        5    16    15    14    12     8
        6    31    30    28    24    16

  26.   Level   Distribution
        1       (a_0 + b_0, a_0 + c_0, a_0 + d_0, a_0 + e_0, a_0)
        2       (a_1 + b_1, a_1 + c_1, a_1 + d_1, a_1 + e_1, a_1)
        3       (a_2 + b_2, a_2 + c_2, a_2 + d_2, a_2 + e_2, a_2)
        n       (a_n, b_n, c_n, d_n, e_n)
        n+1     (a_n + b_n, a_n + c_n, a_n + d_n, a_n + e_n, a_n)
      where $a_n \ge b_n \ge c_n \ge d_n \ge e_n$.

  27.   i   a_i   b_i   c_i   d_i   e_i   output
        0     1     0     0     0     0   T6
        1     1     1     1     1     1   T1
        2     2     2     2     2     1   T2
        3     4     4     4     3     2   T3
        4     8     8     7     6     4   T4
        5    16    15    14    12     8   T5
        6    31    30    28    24    16   T6
        7    61    59    55    47    31
      (Under level 3 the slide also lists the partial rows 2 2 2 1 0 / 2 1 1 1 0 / 1 1.)

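  28.   Level n-1:  $(a_{n-1},\ b_{n-1},\ c_{n-1},\ d_{n-1},\ e_{n-1})$
        Level n:    $(a_n, b_n, c_n, d_n, e_n) = (a_{n-1}+b_{n-1},\ a_{n-1}+c_{n-1},\ a_{n-1}+d_{n-1},\ a_{n-1}+e_{n-1},\ a_{n-1})$
      Therefore
        $e_n = a_{n-1}$
        $d_n = a_{n-1} + e_{n-1} = a_{n-1} + a_{n-2}$
        $c_n = a_{n-1} + d_{n-1} = a_{n-1} + (a_{n-2} + e_{n-2}) = a_{n-1} + a_{n-2} + a_{n-3}$
        ............
      In summary:
        $e_n = a_{n-1}$
        $d_n = a_{n-1} + a_{n-2}$
        $c_n = a_{n-1} + a_{n-2} + a_{n-3}$
        $b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
        $a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$   ($a_0 = 1$; $a_i = 0$ for $i = -1, -2, -3, -4$)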
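The level-to-level rule can be run directly to regenerate the table of perfect distributions. The sketch below is an illustration only; the function name `polyphase_levels` is made up here, and `tapes` counts input tapes (5 in the slides' six-tape example).

```python
def polyphase_levels(tapes, levels):
    """Perfect polyphase distributions for levels 0..levels."""
    dist = [1] + [0] * (tapes - 1)            # level 0: (1, 0, ..., 0)
    table = [tuple(dist)]
    for _ in range(levels):
        a = dist[0]
        # (a, b, c, d, e) -> (a+b, a+c, a+d, a+e, a)
        dist = [a + x for x in dist[1:]] + [a]
        table.append(tuple(dist))
    return table

for level, dist in enumerate(polyphase_levels(5, 7)):
    print(level, dist, sum(dist))
```

Level 6 comes out as (31, 30, 28, 24, 16), the initial distribution of the 129-run example, and the first column reproduces 1, 1, 2, 4, 8, 16, 31, 61.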

  29.   $e = a_{n-1}$
        $d = a_{n-1} + a_{n-2}$
        $c = a_{n-1} + a_{n-2} + a_{n-3}$
        $b = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
        $a = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$

  30.   i     -4  -3  -2  -1   0   1   2   3   4   5   6   7
        a_i    0   0   0   0   1   1   2   4   8  16  31  61
        b_i                    0
        c_i                    0
        d_i                    0
        e_i                    0

  31.   i      1   2   3   4   5   6   7
        a_i    1   2   4   8  16  31  61
        b_i    1   2   4   8  15  30  59
        c_i    1   2   4   7  14  28  55
        d_i    1   2   3   6  12  24  47
        e_i    1   1   2   4   8  16  31

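  32. $a_i = \langle 0, 0, 0, 0, 1, 1, 2, 4, 8, 16, 31, 61, \ldots \rangle$, $i = -4, -3, -2, -1, 0, 1, 2, \ldots$
      "The kth order Fibonacci number":
        $F^{(k)}_n = F^{(k)}_{n-1} + F^{(k)}_{n-2} + \cdots + F^{(k)}_{n-k}$
        $F^{(k)}_n = 0$ for $0 \le n \le k-2$, and $F^{(k)}_n = 1$ for $n = k-1$
      e.g.) The second order Fibonacci number: 0 1 1 2 3 5 ......
        $F^{(2)}_n = F^{(2)}_{n-1} + F^{(2)}_{n-2}$, with $F^{(2)}_n = 0$ if $n = 0$ and $F^{(2)}_n = 1$ if $n = 1$. Fibonacci number!!!
      $a_n = F^{(k)}_{n+k-1}$ if k tapes (input) are used. Why?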
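A short Python sketch of the kth order Fibonacci numbers, following the definition on this slide. The function name `fib_order_k` is an assumption; the check at the end compares $a_n = F^{(k)}_{n+k-1}$ with the first column of the distribution table for k = 5.

```python
def fib_order_k(k, count):
    """Return F_0, F_1, ..., F_(count-1) of the k-th order Fibonacci sequence."""
    f = [0] * (k - 1) + [1]          # F_0 .. F_(k-2) = 0, F_(k-1) = 1
    while len(f) < count:
        f.append(sum(f[-k:]))        # each term is the sum of the previous k terms
    return f[:count]

print(fib_order_k(2, 7))             # [0, 1, 1, 2, 3, 5, 8]
f5 = fib_order_k(5, 12)
print(f5)                            # [0, 0, 0, 0, 1, 1, 2, 4, 8, 16, 31, 61]

k = 5
a = [f5[n + k - 1] for n in range(8)]    # a_n = F_(n+k-1), n = 0..7
print(a)                             # [1, 1, 2, 4, 8, 16, 31, 61]
```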

  33. What if not a perfect Fib. dist'n? Use dummy runs!!! Example: 5 input tapes and 53 initial runs.
        Level   T1   T2   T3   T4   T5   Total      Added since previous level
        1        1    1    1    1    1    5
        2        2    2    2    2    1    9         1 1 1 1 0
        3        4    4    4    3    2   17         2 2 2 1 1
        4        8    8    7    6    4   33         4 4 3 3 2
        5       16   15   14   12    8   65 > 53    (8 7 7 6 4)
      Runs (34), (35), ..., (53) are distributed over T1 to T5 on the way from level 4 to level 5; the remaining 65 - 53 = 12 places are filled with dummy runs.
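The level search in this example can be sketched in Python: step through the perfect distributions until the total reaches the number of initial runs, then count the shortfall as dummy runs. The function name `distribution_with_dummies` is hypothetical, and the sketch only counts the dummies; it does not decide which tape receives them.

```python
def distribution_with_dummies(initial_runs, tapes):
    """Smallest perfect level that holds `initial_runs` runs on `tapes` input tapes.

    Returns (level, runs per tape at that level, number of dummy runs).
    """
    dist = [1] * tapes                        # level 1
    level = 1
    while sum(dist) < initial_runs:
        a = dist[0]
        dist = [a + x for x in dist[1:]] + [a]   # next perfect distribution
        level += 1
    return level, dist, sum(dist) - initial_runs

print(distribution_with_dummies(53, 5))       # (5, [16, 15, 14, 12, 8], 12)
```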

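  34. [Figure: tapes T1 to T6; the 12 dummy runs are spread over the five input tapes as (2) (2) (2) (3) (3), and the first merge phase leaves 1^8, 1^7, 1^6, 1^4 on four input tapes and the merged runs on the output tape.] Not best but simple and good!!! For a better one, see Knuth!!!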

  35. Example (3 tapes); (k)^8 means 8 runs of size k, and the run counts follow 0, 1, 1, 2, 3, 5, 8.
        T1        T2        T3
        (k)^8     (k)^5     -
        (k)^3     -         (2k)^5
        -         (3k)^3    (2k)^2
        (5k)^2    (3k)^1    -
        (5k)^1    -         (8k)^1
        -         (13k)^1   -

        Runs on two input tapes   Run sizes (k)   # of pairs   # of I/O's (k)
        8, 5                      1, 1            5            10
        5, 3                      2, 1            3             9
        3, 2                      3, 2            2            10
        2, 1                      5, 3            1             8
        1, 1                      8, 5            1            13
        1                         13
      How many passes over the data?
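The bookkeeping of this 3-tape example can be replayed with a small simulation. It is a sketch under assumptions: each tape is a list of run sizes in units of k, the two starting counts are consecutive Fibonacci numbers, and only the records written by merges are counted. The name `polyphase_3tape` is made up for the illustration.

```python
def polyphase_3tape(runs_a, runs_b):
    """Simulate 3-tape polyphase merging of unit-size runs.

    Returns one (number of merges, records moved) pair per phase.
    """
    tapes = [[1] * runs_a, [1] * runs_b, []]
    phases = []
    while sum(len(t) for t in tapes) > 1:
        out = next(i for i, t in enumerate(tapes) if not t)   # the empty tape
        ins = [i for i in range(3) if i != out]
        merges = min(len(tapes[i]) for i in ins)
        if merges == 0:
            raise ValueError("run counts are not a perfect (Fibonacci) distribution")
        moved = 0
        for _ in range(merges):
            size = sum(tapes[i].pop(0) for i in ins)           # merge one run from each input
            tapes[out].append(size)
            moved += size
        phases.append((merges, moved))
    return phases

phases = polyphase_3tape(8, 5)
total_io = sum(moved for _, moved in phases)
print(phases)            # [(5, 10), (3, 9), (2, 10), (1, 8), (1, 13)]
print(total_io / 13)     # merge I/O divided by the 13 units of data
```

The per-phase I/O matches the table (10, 9, 10, 8, 13), for a total of 50 units, or 50/13 of the data, not counting the initial distribution.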

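  36. Total number of initial runs = $F_s$ for some s (the sth Fibonacci number), $F_s = F_{s-1} + F_{s-2}$.
        T1          T2          T3
        F_{s-1}     F_{s-2}     -
        F_{s-3}     -           F_{s-2}
        -           F_{s-3}     F_{s-4}
        ............
      See Fig. p.107, textbook!!!
      Total # of I/O operations = [formula not preserved in the transcript]
      # of passes = [formula not preserved in the transcript]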

  37. Lemma: [statement not preserved in the transcript]
      [proof] (By induction on s)
        (s = 2) LHS = RHS = [not preserved]
        (s = 3) LHS = RHS = [not preserved]
        (s = k) Suppose that [not preserved]
        (s = k+1) Exercise!!!
      See pages 106-107 in textbook!!!

  38. From the previous lemma, # of passes = [formula not preserved in the transcript]. $F_s = r$ (1). Why? [formula not preserved] Golden Ratio!!! From (1), [formula not preserved]

  39. Theorem: For a polyphase merge on 3 tapes that starts with $F_{s-1}$ and $F_{s-2}$ runs on the two input tapes, where $F_s = r$ = # of initial runs, # of passes ≈ $1.04 \log_2 r$.
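One way to see where the constant 1.04 comes from is sketched below. This is not necessarily the textbook's proof: it assumes unit-size initial runs, reads the per-phase cost off the 3-tape example with 8 and 5 runs (phase j performs $F_{s-1-j}$ merges that each write a run of size $F_{j+2}$), and ignores lower-order terms.

```latex
\begin{align*}
F_s &\approx \frac{\varphi^{s}}{\sqrt{5}}, \qquad
      \varphi = \frac{1+\sqrt{5}}{2} \approx 1.618
  &&\Rightarrow\quad s \approx \log_\varphi r
      = \frac{\log_2 r}{\log_2 \varphi} \approx 1.44 \log_2 r \\
\text{records moved in phase } j
  &= F_{s-1-j}\,F_{j+2} \approx \frac{\varphi^{s+1}}{5}
  &&(j = 1, \dots, s-2) \\
\#\text{ of passes}
  &\approx \frac{(s-2)\,\varphi^{s+1}/5}{\varphi^{s}/\sqrt{5}}
   = \frac{\varphi}{\sqrt{5}}\,(s-2)
  &&\approx 0.724\,s \approx 1.04 \log_2 r
\end{align*}
```

For s = 7 (r = 13) the exact per-phase costs are 10, 9, 10, 8, 13, which sum to 50 units of I/O on 13 units of data.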

  40. APPROXIMATED BEHAVIOR OF POLYPHASE MERGE SORTING
        Tapes   Phases               Passes               Pass/phase (percent)   Growth ratio
        3       2.078 lnS + 0.672    1.504 lnS + 0.992    72                     1.6180340
        4       1.641 lnS + 0.364    1.015 lnS + 0.965    62                     1.8392868
        5       1.524 lnS + 0.078    0.863 lnS + 0.921    57                     1.9275620
        6       1.479 lnS + 0.185    0.795 lnS + 0.864    54                     1.9659482
        7       1.460 lnS + 0.424    0.762 lnS + 0.797    52                     1.9835828
        8       1.451 lnS + 0.642    0.744 lnS + 0.723    51                     1.9919642
        9       1.447 lnS + 0.838    0.734 lnS + 0.646    51                     1.9960312
        10      1.445 lnS + 1.017    0.728 lnS + 0.568    50                     1.9980295
        20      1.443 lnS + 2.170    0.721 lnS - 0.030    50                     1.9999981

      APPROXIMATED BEHAVIOR OF CASCADE MERGE SORTING
        Tapes   Phases               Passes               Growth ratio
        3       2.078 lnS + 0.672    1.504 lnS + 0.992    1.6180340
        4       1.235 lnS + 0.754    1.012 lnS + 0.820    2.2469796
        5       0.946 lnS + 0.796    0.897 lnS + 0.800    2.8793852
        6       0.796 lnS + 0.821    0.773 lnS + 0.808    3.5133371
        7       0.703 lnS + 0.839    0.691 lnS + 0.822    4.1481149
        8       0.639 lnS + 0.852    0.632 lnS + 0.834    4.7833861
        9       0.592 lnS + 0.861    0.587 lnS + 0.845    5.4189757
        10      0.555 lnS + 0.869    0.552 lnS + 0.854    6.0547828
        20      0.397 lnS + 0.905    0.397 lnS + 0.901    12.4174426

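  41. Cascade Merge
        Level   a_i   b_i   c_i   d_i   e_i
        0         1     0     0     0     0
        1         1     1     1     1     1
        2         5     4     3     2     1
        3        15    14    12     9     5
        4        55    50    41    29    15
        n       a_n   b_n   c_n   d_n   e_n
        n+1     $a_{n+1} = a_n + b_n + c_n + d_n + e_n$
                $b_{n+1} = a_{n+1} - e_n$
                $c_{n+1} = b_{n+1} - d_n$
                $d_{n+1} = c_{n+1} - c_n$
                $e_{n+1} = d_{n+1} - b_n = a_n$
      Perfect dist'n; for detail see Knuth Vol III!!!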
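The cascade rule says that the next level consists of the prefix sums a, a+b, a+b+c, ... of the current level, listed largest first. A minimal Python sketch (the name `cascade_levels` is hypothetical; `itertools.accumulate` supplies the running sums):

```python
from itertools import accumulate

def cascade_levels(tapes, levels):
    """Perfect cascade distributions for levels 0..levels."""
    dist = [1] + [0] * (tapes - 1)            # level 0: (1, 0, ..., 0)
    table = [tuple(dist)]
    for _ in range(levels):
        prefix = list(accumulate(dist))       # a, a+b, a+b+c, ...
        dist = prefix[::-1]                   # largest total goes on the first tape
        table.append(tuple(dist))
    return table

for level, dist in enumerate(cascade_levels(5, 4)):
    print(level, dist, sum(dist))
```

Level 4 is (55, 50, 41, 29, 15), whose total of 190 is the number of initial runs in the six-tape cascade example.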
