6.3.1 Heapsort
Idea: two phases:
1. Construction of the heap
2. Output of the heap
For ordering numbers in an ascending sequence, use a heap with reverse order: the maximum should be at the root (not the minimum).
Heapsort is an in-situ procedure.
Remembering heaps: change the definition. Heap with reverse order:
• For each node x and each successor y of x the following holds: m(x) ≥ m(y),
• left-complete, which means the levels are filled starting from the root and each level from left to right,
• implementation in an array, where the nodes are stored in this order (level by level, from left to right).
Second phase: output of the heap
2. Output of the heap: n times, take the maximum (at the root, deletemax) and exchange it with the element at the end of the heap. The heap shrinks by one element, and the subsequence of ordered elements at the end of the array grows one element longer.
Cost: O(n log n).
(Figure: the heap region shrinks while the ordered region at the end of the array grows.)
First phase: construction of the heap
1. Construction of the heap. Simple method: n-times insert; cost O(n log n).
Making it better: consider the array a[1 … n] as an already left-complete binary tree and sink the elements in the following sequence:
a[n div 2], …, a[2], a[1]
(The elements a[n], …, a[n div 2 + 1] are already at the leaves.)
Formally: heap segment. An array segment a[i..k] (1 ≤ i ≤ k ≤ n) is called a heap segment when the following holds for all j in {i, …, k}:
m(a[j]) ≥ m(a[2j]) if 2j ≤ k, and
m(a[j]) ≥ m(a[2j+1]) if 2j+1 ≤ k.
If a[i+1..n] is already a heap segment, we can convert a[i..n] into a heap segment by letting a[i] sink.
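The sink operation and both phases can be sketched as follows. This is a minimal sketch in Python, using 0-based indexing (so the children of node i are 2i+1 and 2i+2, rather than 2j and 2j+1 as in the 1-based definition above); the function names are illustrative, not from the slides.

```python
def sink(a, i, n):
    """Let a[i] sink until a[0..n-1] satisfies the max-heap property below i."""
    while 2 * i + 1 < n:
        j = 2 * i + 1                       # left child
        if j + 1 < n and a[j + 1] > a[j]:   # pick the bigger child
            j += 1
        if a[i] >= a[j]:                    # heap property restored
            break
        a[i], a[j] = a[j], a[i]
        i = j

def heapsort(a):
    n = len(a)
    # Phase 1: bottom-up construction, sinking a[n//2 - 1], ..., a[0].
    for i in range(n // 2 - 1, -1, -1):
        sink(a, i, n)
    # Phase 2: n times, swap the maximum (root) with the last heap element
    # and shrink the heap; the ordered tail of the array grows.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sink(a, 0, end)
    return a
```

Because the maximum is moved to the end in each step of phase 2, the reverse-order (max) heap yields an ascending sequence in place, as the slides describe.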
Cost calculation
Let k = ⌈log₂(n+1)⌉ − 1 (the height of the complete portion of the heap).
Cost for an element at level j from the root: k − j sink steps.
Altogether: Σ_{j=0..k} (k−j)·2^j = 2^k · Σ_{i=0..k} i/2^i ≤ 2·2^k = O(n).
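Written out as a sketch, substituting i := k − j and using the standard bound Σ_{i≥0} i/2^i = 2:

```latex
\sum_{j=0}^{k}(k-j)\,2^{j}
  \;=\; 2^{k}\sum_{j=0}^{k}(k-j)\,2^{\,j-k}
  \;=\; 2^{k}\sum_{i=0}^{k}\frac{i}{2^{i}}
  \;\le\; 2^{k}\sum_{i=0}^{\infty}\frac{i}{2^{i}}
  \;=\; 2\cdot 2^{k}
  \;=\; O(n),
```

since a complete tree of height k has at least 2^k nodes, i.e. 2^k ≤ n.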
Advantage: the new construction strategy is more efficient!
Usage, when only the m biggest elements are required:
1. construction in O(n) steps,
2. output of the m biggest elements in O(m·log n) steps.
Total cost: O(n + m·log n).
Addendum: sorting with search trees
Algorithm:
• Construct a search tree (e.g. an AVL tree) from the elements to be sorted by n insert operations.
• Output the elements in in-order sequence → ordered sequence.
Cost: 1. O(n log n) with AVL trees, 2. O(n). In total: O(n log n) — optimal!
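As a sketch of this algorithm — using a plain (unbalanced) binary search tree for brevity; the slides assume an AVL tree, whose rebalancing is what guarantees the O(log n) bound per insert:

```python
class Node:
    """A binary search tree node (no AVL rebalancing in this sketch)."""
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Standard BST insert; returns the (possibly new) subtree root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def tree_sort(keys):
    root = None
    for k in keys:            # step 1: n insert operations
        root = insert(root, k)
    out = []
    def inorder(node):        # step 2: in-order output, O(n)
        if node:
            inorder(node.left)
            out.append(node.key)
            inorder(node.right)
    inorder(root)
    return out
```

With an AVL tree the insert loop costs O(n log n) in the worst case; the unbalanced sketch above degrades to O(n²) on presorted input.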
7.2 External Sorting
Problem: sorting big amounts of data which, as in external searching, are stored in blocks (pages).
Efficiency: the number of page accesses should be kept low!
Strategy: a sorting algorithm that processes the data sequentially (no frequent page exchanges): MergeSort!
General form of MergeSort:

mergesort(S)  # returns the set S in sorted order
{
    if (S is empty or has only 1 element)
        return S;
    else {
        divide S into two halves A and B;
        A' = mergesort(A);
        B' = mergesort(B);
        return merge(A', B');
    }
}
MergeSort on files
Start: we have n data items in a file g1, divided into pages of size b:
Page 1: s1, …, sb
Page 2: s(b+1), …, s(2b)
…
Page k: s((k−1)b+1), …, sn    (k = ⌈n/b⌉)
If the data are processed sequentially, only k page accesses are made, not n.
Variation of MergeSort for external sorting
MergeSort is a divide-and-conquer algorithm. For external sorting we drop the divide step and only merge.
Definition: run := an ordered subsequence within a file.
Strategy: merge increasingly bigger runs until everything is sorted.
Algorithm
1st step: generate "starting runs" from the input file g1 and distribute them over two files f1 and f2, with the same number of runs (±1) in each (there are many strategies for this, discussed later).
From now on: use 4 files f1, f2, g1, g2.
2nd step (main step):
while (number of runs > 1) {
• Merge pairs of runs from f1 and f2 into double-sized runs, written alternately to g1 and g2, until there are no more runs in f1 and f2.
• Merge pairs of runs from g1 and g2 into double-sized runs, written alternately to f1 and f2, until there are no more runs in g1 and g2.
}
Each loop iteration = two phases.
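The main step can be sketched as follows. This is a minimal in-memory sketch: each file is modeled as a Python list of runs, each run a sorted list; a real external sort would read and write pages instead. The function names are illustrative.

```python
def merge(a, b):
    """Standard two-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def merge_pass(src1, src2):
    """Merge run i of src1 with run i of src2, distributing the
    double-sized results alternately over the two output files."""
    dsts = ([], [])
    for idx in range(max(len(src1), len(src2))):
        r1 = src1[idx] if idx < len(src1) else []
        r2 = src2[idx] if idx < len(src2) else []
        dsts[idx % 2].append(merge(r1, r2))
    return dsts

def external_mergesort(f1, f2):
    """Repeat the two-phase loop until a single run remains."""
    while len(f1) + len(f2) > 1:
        g1, g2 = merge_pass(f1, f2)   # phase 1: f -> g
        f1, f2 = merge_pass(g1, g2)   # phase 2: g -> f
    if f1:
        return f1[0]
    return f2[0] if f2 else []
```

Running this on the starting runs of the example below reproduces the final sorted run of the 4th phase.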
Example
Start: g1: 64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50
1st step (length of starting runs = 1):
f1: 64 | 3 | 79 | 19 | 67 | 8 | 50
f2: 17 | 99 | 78 | 13 | 34 | 12
Main step, 1st loop, part 1 (1st phase):
g1: 17, 64 | 78, 79 | 34, 67 | 50
g2: 3, 99 | 13, 19 | 8, 12
1st loop, part 2 (2nd phase):
f1: 3, 17, 64, 99 | 8, 12, 34, 67 |
f2: 13, 19, 78, 79 | 50 |
Example, continuation
1st loop, part 2 (2nd phase):
f1: 3, 17, 64, 99 | 8, 12, 34, 67 |
f2: 13, 19, 78, 79 | 50 |
2nd loop, part 1 (3rd phase):
g1: 3, 13, 17, 19, 64, 78, 79, 99 |
g2: 8, 12, 34, 50, 67 |
2nd loop, part 2 (4th phase):
f1: 3, 8, 12, 13, 17, 19, 34, 50, 64, 67, 78, 79, 99 |
f2: (empty)
Implementation: for each of the files f1, f2, g1, g2, at least one page is kept in main memory (RAM); better still, a second page as buffer. Read/write operations are made page-wise.
Costs
Page accesses during the 1st step and during each phase: O(n/b).
Each phase halves the number of runs, thus:
Total number of page accesses: O((n/b) log n) when starting with runs of length 1.
Internal computing time in the 1st step and in each phase: O(n).
Total internal computing time: O(n log n).
Two variants of the first step: creation of the starting runs
A) Direct mixing: sort in primary memory ("internally") as many data as possible at a time, for example m data sets. This gives starting runs of fixed(!) length m, thus r := ⌈n/m⌉ starting runs.
The total number of page accesses is then: O((n/b) log r).
B) Natural mixing: creates starting runs of variable length.
Advantage: we can take advantage of ordered subsequences the file may already contain.
Noteworthy: the starting runs can be made longer than the primary storage by using the replacement-selection method!
Replacement selection
Read m data from the input file into primary memory (an array).
repeat {
    mark all data in the array as "now";
    start a new run;
    while there is "now"-marked data in the array {
        • select the datum with the smallest key among all "now"-marked data,
        • write it to the output file,
        • replace it in the array with a datum read from the input file (if any remain); mark the new datum "now" if it is bigger than or equal to the last output datum, else mark it "not now".
    }
} until there are no more data in the input file.
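The loop above can be sketched with a min-heap in which each entry carries a run (generation) number in place of the now/not-now marks: a "not now" datum is tagged with the next run's number and therefore sorts after everything in the current run. This encoding is an implementation choice for the sketch, not part of the slides.

```python
import heapq
from itertools import islice

def replacement_selection(data, m):
    """Yield the starting runs of `data` using an array of capacity m."""
    it = iter(data)
    heap = [(0, x) for x in islice(it, m)]   # entries are (run number, key)
    heapq.heapify(heap)
    runs, current, gen = [], [], 0
    while heap:
        g, x = heapq.heappop(heap)           # smallest "now" datum
        if g != gen:                         # no "now" data left: new run
            runs.append(current)
            current, gen = [], g
        current.append(x)                    # write x to the output file
        nxt = next(it, None)
        if nxt is not None:                  # refill from the input file
            # "now" (same run) if it can still follow x, else "not now"
            heapq.heappush(heap, (gen if nxt >= x else gen + 1, nxt))
    if current:
        runs.append(current)
    return runs
```

On the input file of the example below, with m = 3, this produces exactly the runs 3, 17, 64, 78, 79, 99 | 13, 19, 34, 67 | 8, 12, 50.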
Example: array in primary storage with a capacity of 3.
The input file contains: 64, 17, 3, 99, 79, 78, 19, 13, 67, 34, 8, 12, 50
(In the array, "not now" data are written in parentheses.)
Resulting runs: 3, 17, 64, 78, 79, 99 | 13, 19, 34, 67 | 8, 12, 50
Implementation: in an array:
• at the front: a heap of the "now"-marked data,
• at the back: the refilled "not now" data.
Note: all "now" elements go into the currently generated run.
Expected length of the starting runs using replacement selection: 2·m (m = size of the array in primary storage = number of data that fit into primary storage), assuming uniformly distributed keys.
Even bigger if the input is partially presorted!
Multi-way merging
Instead of two input and two output files (alternating f1, f2 and g1, g2), use k input and k output files, in order to be able to merge k runs into one at a time.
In each step: take the smallest number among the heads of the k runs and output it to the current output file.
Cost: in each phase the number of runs is divided by k. Thus, if we have r starting runs, we need only log_k(r) phases (instead of log₂(r)).
Total number of page accesses: O((n/b)·log_k(r)).
Internal computing time per phase: O(n·log₂(k)).
Total internal computing time: O(n·log₂(k)·log_k(r)) = O(n·log₂(r)).
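Selecting the smallest among the k run heads can be sketched with a min-heap over the heads, which provides the O(log₂ k) per-element selection the analysis assumes. This is a sketch over in-memory lists; `kway_merge` is an illustrative name.

```python
import heapq

def kway_merge(runs):
    """Merge k sorted lists into one, always taking the smallest head."""
    # heap entries: (head value, run index, position within run)
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        x, i, j = heapq.heappop(heap)   # smallest among the k heads
        out.append(x)                   # output to the current file
        if j + 1 < len(runs[i]):        # advance within run i
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```

The standard library also offers `heapq.merge`, which implements the same idea as a generator.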