COP 3540 Data Structures with OOP

COP 3540 Data Structures with OOP Chapter 7 - Part 1 Advanced Sorting

Advanced Sorting • Two topics we will cover first. • Shell Sort, an O(n (log2 n)^2) sort in general, which ‘can approach’ O(n) performance! • Partitioning, an O(n) operation that is the key step of the QuickSort (it is not a complete sort by itself). • Then, we’ll cover the QuickSort, an O(n log2 n) sort.

Recall how the Insertion Sort worked. • Took an element out of the ‘array’ and assumed all elements ‘to the left’ were sorted. • We marked this spot. • And we extracted out that element. • We then • compared the element extracted out with the elements ‘to the left’ of this element and • ‘inserted’ this element into its proper place • shifting all elements to the right as needed to make room for this inserted element and fill the vacated spot.

Approach that helped us: • Constraints: • Helped ourselves by: • starting with a single element to the left, so we knew ‘that’ element was sorted (certainly sorted unto itself). • Then we proceeded: • Slowly the elements to the ‘left’ of the marked element grew in sorted number, as new numbers found their proper place in the subarray to the left, while the unsorted elements to the right diminished in number.

Potential Problems with the Insertion Sort • Now, what happens if the new number to be sorted is very small (or very large) and our sort is ‘ascending’ (or ‘descending’)? • This may require a large number of ‘copies’ to the right to make room for this new element. • Can require a number of ‘copies’ close to ‘n’ in fact. • In the worst case, the average number of copies per element is about n/2. • For n elements to be sorted and an average of n/2 copies per element, we have n*n/2 or n^2/2 copies (about half that, n^2/4, for random data). • That may result in a very inefficient sort. • This is why the insertion sort is an O(n^2) sort. • It is this number of copies (comparing and shifting) that decreases its performance.
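The copy counts above can be checked with a small experiment. The following is a minimal sketch, not the course's own code: the class name InsertionSortCopies and the shift counter are assumptions added for illustration.

```java
// Hypothetical helper class; the name and the shift counter are illustrative only.
final class InsertionSortCopies {

    /** Sorts a in ascending order and returns how many one-position shifts were made. */
    static long sortAndCountShifts(long[] a) {
        long shifts = 0;
        for (int outer = 1; outer < a.length; outer++) {
            long temp = a[outer];            // extract the marked element
            int inner = outer;
            while (inner > 0 && a[inner - 1] > temp) {
                a[inner] = a[inner - 1];     // shift a larger element one slot right
                inner--;
                shifts++;
            }
            a[inner] = temp;                 // insert into the vacated spot
        }
        return shifts;
    }

    public static void main(String[] args) {
        int n = 10;
        long[] reversed = new long[n];
        for (int i = 0; i < n; i++) {
            reversed[i] = n - i;             // 10, 9, ..., 1: worst case for an ascending sort
        }
        // Worst case: 0 + 1 + ... + (n-1) = n*(n-1)/2 shifts, roughly n^2/2
        System.out.println(sortAndCountShifts(reversed)); // prints 45
    }
}
```

For n = 10 in reverse order the count is 45, close to n^2/2 = 50; an already sorted array needs no shifts at all.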

Shell Sort Approach • Want to reduce the number of these large shifts • Shell sort does this by sorting a very small subset of numbers, like three or four: • where the numbers themselves might be large distances apart (as in a large array) • and it sorts them with respect to each other. • By sorting a small number of numbers, very small (or very large) numbers can be put much more nearly ‘in place’ much more quickly than with other approaches. • How is this done?

Shell Sort uses the notion of a ‘computed Gap’ • The Shell Sort uses a computed ‘gap’, represented by ‘h’, as the distance between the numbers in each subset to be sorted. • 1. Sorts all numbers in the array that are the same ‘h’ (gap) apart • Like numbers eight apart, or four apart… • Sorts these numbers with respect to each other. • 2. Then, after doing this, the algorithm reduces the gap (or distance) to a smaller number, like maybe 4 apart. • 3. Ultimately the gap has size = 1; then the algorithm ‘1-sorts’ the array, which is an ordinary insertion sort.

Example • Consider: sort three elements at a time with respect to each other, where the numbers are some ‘h’ distance apart • For array size n=10, and if gap size h = 4, we have four sub-arrays (we call this a 4-sort): • Indices: (0,4,8), (1,5,9), (2,6) and (3,7). Each set is sorted with respect to its own members. (Note: all ten elements take part in one of these sorts!) • The sub-arrays are interleaved, but, again, each is sorted internally. • (Note: the integers are not yet in their final spots.)
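The 4-sort of a ten-element array can be tried directly. This is a minimal sketch under stated assumptions: the class HSortDemo, its method names, and the sample data are made up for illustration and do not come from the slides.

```java
import java.util.Arrays;

// Hypothetical demo class; names and sample data are illustrative only.
final class HSortDemo {

    /** One h-sort pass: an insertion sort applied to elements that are h apart. */
    static void hSort(long[] a, int h) {
        for (int outer = h; outer < a.length; outer++) {
            long temp = a[outer];
            int inner = outer;
            while (inner > h - 1 && a[inner - h] > temp) {
                a[inner] = a[inner - h];     // shift within the same interleaved sub-array
                inner -= h;
            }
            a[inner] = temp;
        }
    }

    /** True if every interleaved sub-array (indices i, i+h, i+2h, ...) is ascending. */
    static boolean isHSorted(long[] a, int h) {
        for (int i = h; i < a.length; i++) {
            if (a[i - h] > a[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] data = {7, 10, 1, 9, 2, 5, 8, 6, 4, 3};
        hSort(data, 4);                              // sorts indices (0,4,8), (1,5,9), (2,6), (3,7)
        System.out.println(Arrays.toString(data));   // [2, 3, 1, 6, 4, 5, 8, 9, 7, 10]
        System.out.println(isHSorted(data, 4));      // true: each sub-array is in order
        System.out.println(isHSorted(data, 1));      // false: the whole array is not yet sorted
    }
}
```

After the pass, each of the four sub-arrays is in order, yet the array as a whole still needs the later, smaller-gap passes.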

Consider Improved Performance! • Recall again the Insertion Sort. It is • very efficient for arrays that are nearly sorted (fewer swaps and less movement), and yet can be • very inefficient (due to shifts and copies) if the data are very unsorted. • Particularly true for very large / very small numbers. • Shell sort does ‘h-sorting’ • Capitalizes on initial position of elements, especially if they are far from where they might ultimately end up. • Brings numbers more quickly to final position… (or nearer) • Algorithm moves elements that may be very far from their final position much closer to it more quickly, thus reducing copying and shifting and swapping! • Shell Sort can approach O(n) performance: much better than O(n^2)!

What about Larger Arrays? Gap Size? • Use a carefully researched formula to compute a good gap size. • Don Knuth developed a ‘recursive’ relationship: • h = 3*h+1 to build up the starting gap, and then subsequent gaps at • h = (h-1)/3. • (Note the ‘recursion’ in the formula itself: it uses the value of h to compute the new value of h.) • These h-values are referred to as the • interval sequence or gap sequence • and are recursively computed as functions of h. • In more detail:

Gap sizes

With Don Knuth’s sequence, the first pass sorts groups of about three numbers that are some distance apart. As it turns out (algorithm is a few slides ahead), for an array of roughly 364 to 1093 elements (exactly 363 through 1091, given the integer division in the code), the first pass uses a gap size of 364; after that sort, use a gap size of 121; then gap size = 40; steadily decreasing…

Develop the initial gap size recursively by computing h (algorithm is three slides ahead): start with h = 1 and keep computing h = h*3 + 1 until h <= nElems/3 is false.

    h       3*h+1
    1       4
    4       13
    13      40
    40      121
    121     364
    364     1093
    1093    3280

So, computing h, we see that h increases from 1 to 4 to 13 to 40 to 121 to 364 to …

Once the original gap is determined, the sort continues and the algorithm steadily reduces the gap h from 364 to 121 … until h = 1.

Algorithm (covered in previous slide) • Algorithm first uses a short loop to generate the first (initial) value of h, which depends on the size of the array to be sorted. • Then, once we have an initial value of h: • additional values of h are recursively computed from it. • Gap then starts with largest h-value. • For a 1000-element array, our initial gap size is 364. • After sorting, we would successively decrease the gap using the formula: h = (h-1)/3 as shown.
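The two formulas can be exercised on their own, before looking at the full sort. The sketch below is illustrative: the class KnuthGaps and its method names are assumptions, while the loop conditions follow the h = h*3 + 1 and h = (h-1)/3 rules described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; names are illustrative only.
final class KnuthGaps {

    /** Builds h up through 1, 4, 13, 40, ... while h <= nElems/3 (integer division). */
    static int initialGap(int nElems) {
        int h = 1;
        while (h <= nElems / 3) {
            h = h * 3 + 1;
        }
        return h;
    }

    /** Lists the gaps the sort would use, from the initial gap down to 1. */
    static List<Integer> gapSequence(int nElems) {
        List<Integer> gaps = new ArrayList<>();
        for (int h = initialGap(nElems); h > 0; h = (h - 1) / 3) {
            gaps.add(h);
        }
        return gaps;
    }

    public static void main(String[] args) {
        System.out.println(gapSequence(1000)); // [364, 121, 40, 13, 4, 1]
        System.out.println(gapSequence(100));  // [40, 13, 4, 1]
    }
}
```

Because (h-1)/3 exactly undoes h*3 + 1, the decreasing gaps retrace the same values that the first loop generated.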

Note: • As it turns out, the algorithm actually sorts the first two elements of each group for a given gap first; then it goes back and sorts all three-element groups. This results in better performance time. • You will see this if you look carefully at the algorithm.

    // theArray and nElems are fields of the enclosing array class.
    public void shellSort() {
        int inner, outer;
        long temp;

        // Compute the initial gap size h: 1, 4, 13, 40, 121, 364, ...
        // The value of h depends on the original size of the array, nElems.
        int h = 1;
        while (h <= nElems / 3) {
            h = h * 3 + 1;
        }
        // h is now the first value in the sequence that exceeds nElems/3
        // (for a 1000-element array, h = 364).

        while (h > 0) {
            // h-sort the structure: an insertion sort over elements h apart.
            // outer runs from h up to nElems - 1, incrementing by one.
            for (outer = h; outer < nElems; outer++) {
                temp = theArray[outer];
                inner = outer;
                while (inner > h - 1 && theArray[inner - h] >= temp) {
                    theArray[inner] = theArray[inner - h];
                    inner -= h;
                } // end while
                theArray[inner] = temp;
            } // end for
            h = (h - 1) / 3;   // computes new gap: decreases h
        } // end while (h > 0)
    } // end shellSort()

Google: Shell Sort Applet • Google: applet Lafore • You will get a number of applet choices. • Select and enjoy

Demo of Shell Sort • Run with n=12 and notice how the gap varies across the bars. • You can see when h goes from 4 to 1. • Can see when it compares two in the interval … then three; then 1-sorts. • Run a sort with n=100. • It starts with h = 40. See it compares two of the three in the interval until there are only intervals of two left. • There is a larger number of intervals when it goes to h = 13. • Go to h=4 and see more intervals yet. • Finally, h=1. • Do this.

Shell Sort - Evaluation • Good for medium-sized arrays, up to a few thousand items. • Shell Sort, at O(n (log2 n)^2), is not as fast as the Quick Sort at O(n log2 n) (coming soon) • Not so good for large files, but • Easy to implement • Requires very little extra space. • All sorts have a ‘worst case’ performance. • For Shell Sorts, the • worst case is not much worse than average performance, so this is good! • (Worst case is very different from the average case in a Quick Sort.)

Final Remarks on Shell Sort • Other gap sequences are available. • Many alternatives available. Can experiment… • Ultimately, the sequence needs to end up with a gap of 1 • This forces the last pass to be an insertion sort. • Guideline: • Gaps should be relatively prime (share no common divisor other than 1). • Note that the gap values presented here are not all prime numbers themselves (4, 40…). • Ignoring this guideline led to some inefficiencies in earlier sequences. • Experiments on Shell Sort yield performance mostly between O(n^(3/2)) and O(n^(7/6)), • that is, clearly better than O(n^2) and, at the low end, not far above O(n)! • Quite a difference, and the difference is realized as n increases, which makes sense.
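To experiment with other sequences, the gap list can be passed in as a parameter. The following is a sketch under assumptions, not the textbook's code: the class GapShellSort, the gaps parameter, and the two sample sequences are chosen for illustration (5, 2, 1 is the halving sequence n/2, n/4, ... for ten elements).

```java
import java.util.Arrays;

// Hypothetical class for experimenting with gap sequences; names are illustrative only.
final class GapShellSort {

    /** Shell sort driven by a caller-supplied gap sequence, which must end with 1. */
    static void shellSort(long[] a, int[] gaps) {
        if (gaps.length == 0 || gaps[gaps.length - 1] != 1) {
            throw new IllegalArgumentException("gap sequence must end with 1");
        }
        for (int h : gaps) {
            if (h < 1) {
                throw new IllegalArgumentException("gaps must be positive");
            }
            // h-sort: insertion sort over elements that are h apart
            for (int outer = h; outer < a.length; outer++) {
                long temp = a[outer];
                int inner = outer;
                while (inner > h - 1 && a[inner - h] > temp) {
                    a[inner] = a[inner - h];
                    inner -= h;
                }
                a[inner] = temp;
            }
        }
    }

    public static void main(String[] args) {
        long[] first = {7, 10, 1, 9, 2, 5, 8, 6, 4, 3};
        long[] second = first.clone();
        shellSort(first, new int[]{4, 1});       // the 3h+1 gaps that fit ten elements
        shellSort(second, new int[]{5, 2, 1});   // halving gaps, for comparison
        System.out.println(Arrays.toString(first));   // [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        System.out.println(Arrays.equals(first, second)); // true
    }
}
```

Any sequence ending in 1 produces a sorted array, because the last pass is a plain insertion sort; the sequences differ only in how much work the earlier passes save.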

Partitioning

Partitioning • Partitioning is key to QuickSort thinking. • Partitioning divides data into two groups dependent upon the value of a key. • E.g. Divide students into two groups: gpa < 3.0; gpa >= 3.0 • (Incidentally, why is a gpa of 3.0 important??) • We select a Pivot Value: • value used to separate data items into two groups: • end up with Data < pivot value and Data >= pivot value.

Pivot Values • Note: the pivot value can be any key value. • Need not be a midpoint or value ‘half-way.’ • Would be nice if pivot were half-way point, but we have no way of knowing… • Later we will see how the choice of the pivot impacts performance! • Pivot value is used to separate the array into a left side and a right side. • Ideally, we’d ‘like’ the sub-arrays to be roughly the same size, and we will work toward that reality.

Run Partition Algorithm to build Sub-Arrays • Once pivot value selected, we run the partition algorithm • Once run, • data less than the pivot value ‘belong’ to the left side of the array (whatever number of elements may be on the left) and, • data greater than or equal to (>=) the pivot value belong to the right side, however many elements are on the right side. • Note: Once partitioning is run, data are NOT sorted, • But, the items are a lot ‘closer’ to their final position… • And array is partitioned based on the pivot value.

The Partitioning Algorithm • Pick a pivot value… (more later) • Start with an index at the left side of the range to partition. • Let’s call it the left scan. • Move toward the right. • Compare element to pivot value. • If an element is less than the pivot value, leave it alone. Move to the right. • Advance to the right until element is >= pivot value and then Stop. • Start with an index at the right-most position of the range. • Let’s call it the right scan. • Move toward the left. • Compare element to the pivot value • If an element is >= pivot value, leave it alone; Move to the left. • Advance to the left until element is < pivot value and then Stop. • Swap the two values. • Iterate (back on the left; then right) until the left and right scans meet or cross. • ….
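The steps above can be sketched as follows. This is one possible rendering, not necessarily the textbook's exact code: the class PartitionSketch, the static partition method, and the sample data are assumptions for illustration, and the scans here test the index bounds before reading an element.

```java
import java.util.Arrays;

// Hypothetical sketch of the partitioning steps; names and data are illustrative only.
final class PartitionSketch {

    /**
     * Rearranges a[left..right] so that elements < pivot come first and
     * elements >= pivot follow. Returns the index of the first element of the
     * right-hand group (right + 1 if every element is < pivot).
     */
    static int partition(long[] a, int left, int right, long pivot) {
        int leftPtr = left;
        int rightPtr = right;
        while (true) {
            // left scan: skip elements that already belong on the left
            while (leftPtr <= rightPtr && a[leftPtr] < pivot) {
                leftPtr++;
            }
            // right scan: skip elements that already belong on the right
            while (rightPtr >= leftPtr && a[rightPtr] >= pivot) {
                rightPtr--;
            }
            if (leftPtr >= rightPtr) {
                break;                       // scans have crossed: partition is done
            }
            long temp = a[leftPtr];          // swap the two out-of-place elements
            a[leftPtr] = a[rightPtr];
            a[rightPtr] = temp;
            leftPtr++;
            rightPtr--;
        }
        return leftPtr;
    }

    public static void main(String[] args) {
        long[] data = {42, 89, 63, 12, 94, 27, 78, 3, 50, 36};
        int split = partition(data, 0, data.length - 1, 50);
        System.out.println(split);                  // 5
        System.out.println(Arrays.toString(data));  // [42, 36, 3, 12, 27, 94, 78, 63, 50, 89]
    }
}
```

Neither group is sorted afterward, but every element left of index 5 is below 50 and every element from index 5 on is at least 50.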

Let’s look at the applet

Partition.html • Google: applet Lafore • Run with n=12 with various orderings… • Run with n=40. Notice the partition first and the final ordering… • Note: in running the partitioning algorithm the data are not totally sorted – but they are a good bit closer.

Partitioning and the Pivot Value • Note partitioning is not stable. • As elements on one side are moved to the other side of the pivot value, they are NOT necessarily in the same relative positions in this ‘new’ partition! • In fact, they tend to be in reverse order. • Further, the number of elements on each side need not be the same either: it depends on the pivot value. • Very likely, there is NOT the same number of elements on each side of the pivot.

One (of several) Problems with Partitioning • 1. What if a poor pivot value were chosen such that all elements were < pivot value? • The left scan index keeps advancing. • Without a check, we end up with an array index out of bounds exception. • Ditto the other way. See code below. while (leftPtr < right && theArray[++leftPtr] < pivot) ; // nop • Here the test leftPtr < right stops the scan at the end of the array. • Clearly, as in any program that is to be robust, the scans must be guarded against such an extreme pivot value.

Efficiency of the Partition • Algorithm is pretty efficient too • Runs in O(n) time. • Pointers move from opposite ends, moving and swapping at a constant rate. • If n were doubled, the algorithm would take roughly twice as long. • Thus the algorithm operates in O(n) time, which means time is proportional to the number of items being partitioned.

Efficiency of the Partitioning Algorithm • Nonrandom data can yield the worst case for swaps! • If data are inversely ordered, then every pair will be swapped, so n/2 swaps, the maximum. • This is a total for the whole partition, not a cost per element: the work is still about n comparisons plus at most n/2 swaps. • Random data: yields fewer than n/2 swaps. • Some will already be in the right place. • On average for random data, about half of maximum no. of swaps will take place. • Regardless of random / non-random, both situations result in an efficiency proportional to n.
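These counts can be measured rather than estimated. The sketch below is an illustration under assumptions: the class PartitionCost and the method countWork are hypothetical names, and the scan order follows the left-scan / right-scan description given earlier in the deck.

```java
// Hypothetical instrumented partition; names are illustrative only.
final class PartitionCost {

    /** Partitions a around pivot and returns {comparisons, swaps}. */
    static long[] countWork(long[] a, long pivot) {
        long comparisons = 0;
        long swaps = 0;
        int leftPtr = 0;
        int rightPtr = a.length - 1;
        while (true) {
            while (leftPtr <= rightPtr) {        // left scan
                comparisons++;
                if (a[leftPtr] < pivot) leftPtr++; else break;
            }
            while (rightPtr >= leftPtr) {        // right scan
                comparisons++;
                if (a[rightPtr] >= pivot) rightPtr--; else break;
            }
            if (leftPtr >= rightPtr) {
                break;
            }
            long temp = a[leftPtr];
            a[leftPtr] = a[rightPtr];
            a[rightPtr] = temp;
            swaps++;
            leftPtr++;
            rightPtr--;
        }
        return new long[]{comparisons, swaps};
    }

    public static void main(String[] args) {
        long[] reversed = {10, 9, 8, 7, 6, 5, 4, 3, 2, 1};
        long[] ordered = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        long[] r = countWork(reversed, 6);
        long[] o = countWork(ordered, 6);
        System.out.println(r[1]); // 5 swaps: every pair is exchanged (n/2)
        System.out.println(o[1]); // 0 swaps: everything is already on the correct side
    }
}
```

In both runs the number of comparisons stays close to n, which is why the partition is O(n) whatever the ordering of the data.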
