Priority Queues and Heapsort (9.1-9.4)

Priority Queues and Heapsort (9.1-9.4) • Priority queues are used for many purposes, such as job scheduling, shortest path, file compression … • Recall the definition of a Priority Queue: • operations insert(), delete_max() • also max(), change_priority(), join() • How would I sort a list, using a priority queue? for (i=0; i<n; i++) insert(A[i]); for (i=0; i<n; i++) cout << delete_max(); • How would I implement a priority queue? • how fast a sorting alg would your implementation yield? • can we do better?

Priority Queue Implementations insert max delete change join priority sorted array n 1 n n n unsorted array 1 n 1 1 n heap lg n lg n lg n lg n n binomial queue lg n lg n lg n lg n lg n (best) 1 lg n lg n 1 1

Heaps • How can we build a data structure to do this? • Hints: • we want to find the smallest element quickly • we want to be able to remove an element quickly • Tree of some sort? • Heap: • a full binary tree (all leaves at the same level, on left) • each element is at least as large as its children • (note: this is not a BST!) • How to delete the maximum? • How to add a number to a heap? • How to build a heap out of a list of numbers?

Insert • Implicit representationXTOGSMNAERAI (children of i at 2i and 2i+1) • How would I do an insert()? • add to the end of the array • repeat: if larger than parent, swap XTOGSMNAERIP XTOGSPNAERIM XTPGSONAERIM template <class Item> void insert(Item a[], Item newItem, int items) { n = ++items; a[n] = newItem; while (n>1 && a[n/2] < a[n]) { exch(a[n], a[n/2]); n/=2; } } Runtime? Q(log n)

DeleteMax • How would I delete X? • Move last element to root • If larger than either child, swap with larger child XTOGSMNAERAI ITOGSMNAERAX TIOGSMNAERAX TSOGIMNAERAX TSOGRMNAEIAX Item DeleteMax(Item a[], int items) { exch(a[1], a[items--]); reHeapify(a, items); return a[items+1]; } void reHeapify(Item a[], int items) { int n=1; while (2*n <= items) { int j = 2*n; if (j<items && a[j] < a[j+1]) j++; if (a[n] >= a[j]) break; exch(a[n],a[j]); n=j; } } Runtime? Q(log n)

BuildHeap (top down) • Given an array, e.g. ASORTINGEXAMPLE, how do I make it a heap? • Top-down: • for (i=2; i<=items; i++) • insert(a,a[i],i-1) • Runtime: • Q(n log n) • Can we do better?

BuildHeap (bottom up) • Suppose we use the reHeapify() function instead and work bottom-up. • For (i=items/2; i>=1; i--) • reHeapify(a) ASORTINGEX ASORXINGET AXORSINGET AXORTINGES XAORTINGES XTORAINGES XTORSINGEA Runtime? 1+1+1+1+…+2+2+..+4+.. n/4 + 2(n/8) + 3(n/16) + 4(n/32) + … n(1/4 + 2/8 + 3/16 + 4/32 + …) n * 1 Q(n) ! Top-down was Q(n log n); bottom up is Q(n)! cool!

Heapsort • BuildHeap() • for (i=1; i<=n; i++) DeleteMax(); • Runtime? Q(n log n) • Almost competitive with quicksort

Priority Queue • Operations insert(), max(), deleteMax() • Could implement with heap • Runtime for each operation? • insert(), deleteMax() – O(log n) • max() – O(1)

Example Application • Suppose you have a text, abracadabra. Want to compress it. • How many bits required? at 3 bits per letter, 33 bits. • Can we do better? • How about variable length codes? • In order to be able to decode the file again, we would need a prefix code: no code is the prefix of another. • How do we make a prefix code that compresses the text?

Huffman Coding • Note: Put the letters at the leaves of a binary tree. Left=0, Right=1. Voila! A prefix code. • Huffman coding: an optimal prefix code • Algorithm: use a priority queue. insert all letters according to frequency if there is only one tree left, done. else, a=deleteMin(); b=deleteMin(); make tree t out of a and b with weight a.weight() + b.weight(); insert(t)

Huffman coding example • abracadabra frequencies: • a: 5, b: 2, c: 1, d: 1, r: 2 • Huffman code: • a: 0, b: 100, c: 1010, d: 1011, r: 11 • bits: 5 * 1 + 2 * 3 + 1 * 4 + 1 * 4 + 2 * 2 = 23 • Finite automaton to decode – Q(n) • Time to encode? • Compute frequencies – O(n) • Build heap – O(1) assuming alphabet has constant size • Encode – O(n)

Huffman coding summary • Huffman coding is very frequently used • (You use it every time you watch HTDV or listen to mp3, for example) • Text files often compress to 60% of original size • In real life, Huffman coding is usually used in conjunction with a modeling algorithm… • E.g. jpeg compression: DCT, quantization, and Huffman coding • Text compression: dictionary + Huffman coding

Finite Automata and Regular Expressions • How can I decode some Huffman-encoded text efficiently? (hand-design a dfa to recognize) • Or: how can I find all instances of aardvark, aaardvark, aaaardvark, etc. or zyzzyva, zyzzzyva, zyzzzzyva, etc. in Microsoft Word? Unix? (grep) • All words with 2 or more As or Zs? • Important topic: regular expressions and finite automata. • theoretician: regular expressions are grammars that define regular languages • programmer: compact patterns for matching and replacing

DFA for abracadabra • Huffman code: A=0, B=100, C=1010, D=1011, E=11 • DFA: state read out new state 0 0 A 0 0 1 1 1 0 2 1 1 R 0 2 0 B 0 2 1 3 3 0 C 0 3 1 D 0 • (Actually, this looks just like the original tree, doesn’t it.)

Regular Expressions • Regular expressions are one of • a literal character • a (regular expression) – in parentheses • a concatenation of two R.E.s • the alternation (“or”) of two R.E.s, denoted + • the closure of an R.E., denoted * (i.e 0 or more occurrences) • Possibly additional syntactic sugar • Examples abracadabra abra(cadabra)* = {abra, abracadabra, abracadabracadabra, … } (a*b + ac)d (a(a+b)b*)* t(w+o)?o [? means 0 or 1 occurrences] aa+rdvark [+ means 1 or more occurrences]

Finite Automata • Regular language: any language defined by a R.E. • Finite automata: machines that recognize regular languages. • Deterministic Finite Automaton (DFA): • a set of states including a start state and one or more accepting states • a transition function: given current state and input letter, what’s the new state? • Non-deterministic Finite Automaton (NDFA): • like a DFA, but there may be • more than one transition out of a state on the same letter (Pick the right one non-deterministically, i.e. via lucky guess!) • epsilon-transitions, i.e optional transitions on no input letter

RE  NDFA • Given a Regular Expression, how can I build a DFA? • Work bottom up. • Letter: • Concatenation: • Or: Closure:

RE -> NDFA Ex • Construct an NDFA for the RE(A*B + AC)D A A* A*B A*B + AC (A*B + AC)D

NDFA -> DFA • Keep track of the set of states you are in. • On each new input letter, compute the new set of states you could be in. • The set of states for the DFA is the power set of the NDFA states. • I.e. up to 2n states, where there were n in the DFA.

Recognizing Regular Languages • Suppose your language is given by a DFA. How to recognize? • Build a table. One row for every (state,input letter) pair. Give resulting state. • For each letter of input string, compute new state • When done, check whether the last state is an accepting state. • Runtime? O(n), where n is the number of input letters • Another approach: use a C program to simulate NDFA with backtracking. Less space, more time. (egrep vs. fgrep?)

Examples • Unix grep • Perl $input =~ s/t[wo]?o/2; $input =~ s|<link[^>]*>\s*||gs; $input =~ s|\s*\@font-face\s*{.*?}||gs; $input =~ s|\s*mso-[^>"]*"|"|gis; $input =~ s/([^ ]+) +([^ ]+)/$2 $1/; $input =~ m/^[0-9]+\.?[0-9]*|\.[0-9]+$/; ($word1,$word2,$rest) = ($foo =~ m/^ *([^ ]+) +([^ ]+) +(.*)$/); $input=~s|<span[^>]*>\s*<br\s+clear="?all[^>]*>\s*</span>|<br clear="all"/>|gis;

Priority Queues and Heapsort (9.1-9.4)