
CSE 326: Data Structures: Sorting



  1. CSE 326: Data Structures: Sorting Lecture 16: Friday, Feb 14, 2003

  2. Review: QuickSort procedure quickSortRecursive (Array A, int left, int right) { if (left >= right) return; int pivot = choosePivot(A, left, right); /* partition A s.t.: A[left], A[left+1], …, A[i] ≤ pivot and A[i+1], A[i+2], …, A[right] ≥ pivot */ quickSortRecursive(A, left, i); quickSortRecursive(A, i+1, right); }

  3. Review: The Partition i = left; j = right; repeat { while (A[i] < pivot) i++; while (A[j] > pivot) j--; if (i < j) { swap(A[i], A[j]); i++; j--; } else break; } Why do we need i++, j-- ? (Without them the loop would spin forever whenever A[i] = A[j] = pivot.)

  4. Review: The Partition At the end, the indices have crossed (j < i). Q: What about the elements at positions between j and i ? A: They are = pivot, so they are already in their final place ! Hence the recursive calls can skip them: quickSortRecursive(A, left, j); quickSortRecursive(A, i, right);
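The loop on slides 2–4 is Hoare's partition scheme. Below is a minimal runnable C++ sketch of the whole sort, offered as an illustration rather than the lecture's exact code: it uses the common do-while form of the same partition with the middle element as pivot (a stand-in for the slides' abstract choosePivot) and recurses on (left, j) and (j+1, right), a variant that sidesteps the edge cases a literal transcription of the loop above can hit.

    #include <cstdio>
    #include <utility>
    #include <vector>

    // Hoare partition: the scans move inward past keys already on the
    // correct side, out-of-place pairs are swapped, and the returned index
    // j satisfies A[left..j] <= pivot <= A[j+1..right].
    static int partition(std::vector<int>& A, int left, int right) {
        int pivot = A[left + (right - left) / 2];  // assumed pivot choice
        int i = left - 1, j = right + 1;
        while (true) {
            do { i++; } while (A[i] < pivot);
            do { j--; } while (A[j] > pivot);
            if (i >= j) return j;
            std::swap(A[i], A[j]);  // both scans then advance past the pair
        }
    }

    static void quickSortRecursive(std::vector<int>& A, int left, int right) {
        if (left >= right) return;              // 0 or 1 elements: sorted
        int split = partition(A, left, right);
        quickSortRecursive(A, left, split);
        quickSortRecursive(A, split + 1, right);
    }

    int main() {
        std::vector<int> A = {5, 2, 9, 1, 5, 6};
        quickSortRecursive(A, 0, (int)A.size() - 1);
        for (int x : A) printf("%d ", x);       // prints: 1 2 5 5 6 9
        printf("\n");
    }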

  5. Why is QuickSort Faster than Merge Sort? • Quicksort typically performs more comparisons than Mergesort, because partitions are not always perfectly balanced: • Mergesort – n log n comparisons • Quicksort – 1.38 n log n comparisons on average • Quicksort performs many fewer copies, because on average half of the elements are already on the correct side of the partition, while Mergesort copies every element when merging: • Mergesort – 2n log n copies (using a “temp array”), n log n copies (using an “alternating array”) • Quicksort – (n/2) log n copies on average

  6. Stable Sorting Algorithms Typical sorting scenario: • Given N records: R[1], R[2], ..., R[N] • They have N keys: R[1].A, ..., R[N].A • Sort the records s.t.: R[1].A ≤ R[2].A ≤ ... ≤ R[N].A A sorting algorithm is stable if: • If i < j and R[i].A = R[j].A then R[i] comes before R[j] in the output

  7. Stable Sorting Algorithms Which of the following are stable sorting algorithms ? • Bubble sort • Insertion sort • Selection sort • Heap sort • Merge sort • Quick sort

  8. Stable Sorting Algorithms Which of the following are stable sorting algorithms ? • Bubble sort yes • Insertion sort yes • Selection sort yes • Heap sort no • Merge sort no • Quick sort no We can always transform a non-stable sorting algorithm into a stable one. How ?
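One standard answer to the “How ?” (a sketch, not necessarily the scheme the lecture had in mind): extend every key with the record's original position and break ties on that index; any comparison sort then behaves stably. In C++:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Stabilizing trick: carry each key's original position and use it as
    // a tie-breaker, so equal keys keep their input order even under an
    // unstable sort (std::sort makes no stability guarantee).
    struct Rec { int key; int pos; };

    int main() {
        std::vector<Rec> R = {{5, 0}, {1, 1}, {5, 2}, {3, 3}};
        std::sort(R.begin(), R.end(), [](const Rec& a, const Rec& b) {
            if (a.key != b.key) return a.key < b.key;
            return a.pos < b.pos;   // tie-break on original position
        });
        for (const Rec& r : R) printf("(%d,%d) ", r.key, r.pos);
        printf("\n");  // prints: (1,1) (3,3) (5,0) (5,2)
    }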

  9. Detour: Computing the Median • The median of A[1], A[2], …, A[N] is some A[k] s.t.: • There exist N/2 elements ≤ A[k] • There exist N/2 elements ≥ A[k] • Think of it as the perfect pivot ! • Very important in applications: • Median income vs. average income • Median grade vs. average grade • To compute: sort A[1], …, A[N], then median = A[N/2] • Time O(N log N) • Can we do it in O(N) time ?

  10. Detour: Computing the Median int medianRecursive(Array A, int left, int right) { if (left == right) return A[left]; . . . Partition . . . if (N/2 ≤ j) return medianRecursive(A, left, j); if (N/2 ≥ i) return medianRecursive(A, i, right); return pivot; } int median(Array A, int N) { return medianRecursive(A, 0, N-1); } Why ?
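A runnable version of this idea (quickselect), with the partition written out; selectKth and the middle-element pivot are illustrative choices, not from the slides:

    #include <cstdio>
    #include <utility>
    #include <vector>

    // Quickselect: partition as in quicksort, but recurse only into the
    // side that contains the k-th smallest element (k is 0-based).
    static int selectKth(std::vector<int>& A, int left, int right, int k) {
        if (left == right) return A[left];
        int pivot = A[left + (right - left) / 2];   // assumed pivot choice
        int i = left - 1, j = right + 1;
        while (true) {                              // Hoare partition again
            do { i++; } while (A[i] < pivot);
            do { j--; } while (A[j] > pivot);
            if (i >= j) break;
            std::swap(A[i], A[j]);
        }
        if (k <= j) return selectKth(A, left, j, k);   // k-th is on the left
        return selectKth(A, j + 1, right, k);          // otherwise the right
    }

    int main() {
        std::vector<int> A = {7, 1, 5, 3, 9, 4, 8};
        int N = (int)A.size();
        printf("median = %d\n", selectKth(A, 0, N - 1, N / 2));  // prints 5
    }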

  11. Detour: Computing the Median • Best case running time: T(N) = T(N/2) + cN = T(N/4) + cN(1 + 1/2) = T(N/8) + cN(1 + 1/2 + 1/4) = . . . = T(1) + cN(1 + 1/2 + 1/4 + … + 1/2^k) = O(N) • Worst case = O(N²) • Average case = O(N) • Question: how can you compute the median in O(N) worst case time ? Note: it’s tricky.

  12. Back to Sorting • Naïve sorting algorithms: • Bubble sort, insertion sort, selection sort • Time = O(N2) • Clever sorting algorithms: • Merge sort, heap sort, quick sort • Time = O(N log N) • I want to sort in O(N) ! • Is this possible ?

  13. Could We Do Better? • Consider any sorting algorithm based on comparisons • Run it on A[1], A[2], ..., A[N] • Assume they are distinct • At each step it compares some A[i] with some A[j] • If A[i] < A[j] then it does something... • If A[i] > A[j] then it does something else... ⇒ Decision Tree !

  14. Decision tree to sort list A,B,C Every possible execution of the algorithm corresponds to a root-to-leaf path in the tree. [Figure: the decision tree of comparisons, with one leaf per ordering of A, B, C]

  15. Max depth of the decision tree • How many permutations are there of N numbers? • How many leaves does the tree have? • What’s the shallowest tree with a given number of leaves? • What is therefore the worst running time (number of comparisons) by the best possible sorting algorithm?

  16. Max depth of the decision tree • How many permutations are there of N numbers? N! • How many leaves does the tree have? N! • What’s the shallowest tree with a given number of leaves? log(N!) • What is therefore the worst running time (number of comparisons) by the best possible sorting algorithm? log(N!)

  17. Stirling’s approximation N! ≈ √(2πN) (N/e)^N, hence log₂(N!) = N log₂ N − Θ(N) = Ω(N log N). At least one branch in the tree has this depth.

  18. If you forget Stirling’s formula... Theorem: Every algorithm that sorts by comparing keys takes Ω(n log n) time
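Written out, the counting argument behind the theorem (the standard derivation, independent of Stirling):

    depth ≥ log₂(#leaves) = log₂(N!)
    N! ≥ (N/2 + 1)(N/2 + 2) · · · N ≥ (N/2)^(N/2)
    so  log₂(N!) ≥ (N/2) log₂(N/2) = Ω(N log N)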

  19. Bucket Sort • Now let’s sort in O(N) • Assume: A[0], A[1], …, A[N-1] ∈ {0, 1, …, M-1}, where M = not too big • Example: sort 1,000,000 person records on the first character of their last names: • Hence M = 128 (in practice: M = 27)

  20. Bucket Sort Queue bucketSort(Array A, int N) { for k = 0 to M-1 Q[k] = new Queue; for j = 0 to N-1 Q[A[j]].enqueue(A[j]); Result = new Queue; for k = 0 to M-1 Result = Result.append(Q[k]); return Result; } Stable sorting !
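A runnable C++ rendering of this pseudocode (one queue per key value, appended in key order):

    #include <cstdio>
    #include <queue>
    #include <vector>

    // Bucket sort for integer keys in {0, ..., M-1}: O(M + N) time and
    // space. Buckets are FIFO queues, so records with equal keys come out
    // in their input order -- this is the stability claim on the slide.
    static std::vector<int> bucketSort(const std::vector<int>& A, int M) {
        std::vector<std::queue<int>> Q(M);
        for (int x : A) Q[x].push(x);              // distribute into buckets
        std::vector<int> result;
        result.reserve(A.size());
        for (int k = 0; k < M; k++)                // append buckets in order
            while (!Q[k].empty()) { result.push_back(Q[k].front()); Q[k].pop(); }
        return result;
    }

    int main() {
        std::vector<int> A = {3, 1, 4, 1, 5, 2};
        for (int x : bucketSort(A, 8)) printf("%d ", x);  // 1 1 2 3 4 5
        printf("\n");
    }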

  21. Bucket Sort • Running time: O(M+N) • Space: O(M+N) • Recall that M << N, hence time = O(N) • What about the Theorem that says sorting takes Ω(N log N) ?? This is not real sorting, because it’s for trivial keys

  22. Radix Sort • I still want to sort in time O(N): non-trivial keys • A[0], A[1], …, A[N-1] are strings • Very common in practice • Each string is: c_{d-1} c_{d-2} … c_1 c_0, where c_0, c_1, …, c_{d-1} ∈ {0, 1, …, M-1} and M = 128 • Other example: decimal numbers

  23. RadixSort • Radix = “The base of a number system” (Webster’s dictionary) • alternate terminology: radix is the number of bits needed to represent 0 to base-1, so one can say “base 8” or “radix 3” • Used in the 1890 U.S. census by Hollerith • Idea: BucketSort on each digit, bottom up.

  24. The Magic of RadixSort • Input list: 126, 328, 636, 341, 416, 131, 328 • BucketSort on lower digit:341, 131, 126, 636, 416, 328, 328 • BucketSort result on next-higher digit:416, 126, 328, 328, 131, 636, 341 • BucketSort that result on highest digit:126, 131, 328, 328, 341, 416, 636

  25. Inductive Proof that RadixSort Works • Keys: d-digit numbers, base B • Claim: after the ith BucketSort, the least significant i digits are sorted. • Base case: i=0. 0 digits are sorted (that wasn’t hard!) • Inductive step: assume for i, prove for i+1. Consider two numbers X, Y, and let X_i denote the ith digit of X: • X_{i+1} < Y_{i+1}: the (i+1)th BucketSort puts them in order • X_{i+1} > Y_{i+1}: same thing • X_{i+1} = Y_{i+1}: their order depends on the last i digits, which the induction hypothesis says are already sorted; BucketSort is stable, so it preserves that order

  26. Radix Sort int radixSort(Array A, int N) { for k = 0 to d-1 A = bucketSort(A, on position k) } Running time: T = O(d(M+N)) = O(dN) = O(Size)
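A runnable LSD radix sort matching this loop, for base-M integers with d digits; each pass replays the stable bucket sort above on digit k:

    #include <cstdio>
    #include <queue>
    #include <vector>

    // LSD radix sort: d stable bucket-sort passes, least significant digit
    // first. Each pass costs O(M + N), so the total is O(d(M + N)).
    static void radixSort(std::vector<int>& A, int d, int M) {
        std::vector<std::queue<int>> Q(M);
        int divisor = 1;
        for (int k = 0; k < d; k++, divisor *= M) {
            for (int x : A) Q[(x / divisor) % M].push(x);  // digit k bucket
            int out = 0;
            for (int b = 0; b < M; b++)                    // stable collect
                while (!Q[b].empty()) { A[out++] = Q[b].front(); Q[b].pop(); }
        }
    }

    int main() {
        std::vector<int> A = {126, 328, 636, 341, 416, 131, 328};  // slide 24
        radixSort(A, 3, 10);
        for (int x : A) printf("%d ", x);  // 126 131 328 328 341 416 636
        printf("\n");
    }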

  27. Radix Sort [Figure: the contents of A after each bucket-sort pass]

  28. Running time of Radixsort • N items, D digit keys of max value M • How many passes? • How much work per pass? • Total time?

  29. Running time of Radixsort • N items, D digit keys of max value M • How many passes? D • How much work per pass? N + M • just in case M>N, need to account for time to empty out buckets between passes • Total time? O( D(N+M) )

  30. Radix Sort • What is the size of the input ? Size = DN • Radix sort takes time O(Size) !!

  31. Radix Sort • Variable length strings: • Can adapt Radix Sort to sort in time O(Size) ! • What about our Theorem ??

  32. Radix Sort • Suppose we want to sort N distinct numbers • Represent them in decimal: • Need D = log N digits • Hence RadixSort takes time O(DN) = O(N log N) • The total Size of N keys is O(N log N) ! • No conflict with the theory !

  33. Sorting HUGE Data Sets • US Telephone Directory: • 300,000,000 records • 96 bytes per record: • Name: 32 characters • Address: 54 characters • Telephone number: 10 characters • About 29 gigabytes of data • Sort this on a machine with 128 MB RAM… • Other examples?

  34. Merge Sort Good for Something! • Basis for most external sorting routines • Can sort any number of records using a tiny amount of main memory • in extreme case, only need to keep 2 records in memory at any one time!

  35. External MergeSort • Split input into two “tapes” (or areas of disk) • Merge tapes so that each group of 2 records is sorted • Split again • Merge tapes so that each group of 4 records is sorted • Repeat until the data is entirely sorted: log N passes

  36. Better External MergeSort • Suppose main memory can hold M records. • Initially read in groups of M records and sort them (e.g. with QuickSort). • Number of passes reduced to log(N/M)
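A sketch of the improved scheme, with disk runs simulated by in-memory vectors (file I/O omitted; makeRuns and mergeRuns are illustrative names, and a single multi-way merge with a priority queue stands in for the repeated merge passes):

    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <tuple>
    #include <vector>

    // Phase 1: read M records at a time, sort each group in memory
    // (the slide suggests QuickSort), and emit it as a sorted run.
    static std::vector<std::vector<int>> makeRuns(const std::vector<int>& in,
                                                  size_t M) {
        std::vector<std::vector<int>> runs;
        for (size_t i = 0; i < in.size(); i += M) {
            std::vector<int> run(in.begin() + i,
                                 in.begin() + std::min(i + M, in.size()));
            std::sort(run.begin(), run.end());
            runs.push_back(run);
        }
        return runs;
    }

    // Phase 2: merge all runs at once, keeping one cursor per run in a
    // min-priority queue of (value, run index, position) triples.
    static std::vector<int> mergeRuns(const std::vector<std::vector<int>>& runs) {
        using Item = std::tuple<int, size_t, size_t>;
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
        for (size_t r = 0; r < runs.size(); r++)
            if (!runs[r].empty()) pq.push(Item(runs[r][0], r, 0));
        std::vector<int> out;
        while (!pq.empty()) {
            auto [v, r, p] = pq.top(); pq.pop();
            out.push_back(v);
            if (p + 1 < runs[r].size()) pq.push(Item(runs[r][p + 1], r, p + 1));
        }
        return out;
    }

    int main() {
        std::vector<int> data = {9, 4, 7, 1, 8, 2, 6, 3, 5};
        for (int x : mergeRuns(makeRuns(data, 3))) printf("%d ", x);
        printf("\n");  // prints: 1 2 3 4 5 6 7 8 9
    }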
