LECTURE 8
• Greedy Algorithms V
• Huffman Codes

Algorithm Design and Analysis
CSE 565
Adam Smith

A. Smith; based on slides by E. Demaine, C. Leiserson, S. Raskhodnikova, K. Wayne
Review Questions

Let G be a connected undirected graph with distinct edge weights. Answer true or false:
• Let e be the cheapest edge in G. Some MST of G contains e?
• Let e be the most expensive edge in G. No MST of G contains e?
Review

Exercise: given an undirected graph G, consider the spanning trees produced by four algorithms:
• BFS tree
• DFS tree
• shortest-paths tree (Dijkstra)
• MST

Find a graph where
• all four trees are the same
• all four trees must be different (note: the BFS/DFS trees may depend on exploration order)
Non-distinct edge weights?

• Read in the text.
Implementing MST algorithms

• Prim: similar to Dijkstra
• Kruskal:
  • requires an efficient data structure to keep track of the "islands": the Union-Find data structure
  • we may revisit this later in the course
• You should know how to implement Prim in O(m log_{m/n} n) time.
Implementation of Prim(G, w)

IDEA: Maintain V − S as a priority queue Q (as in Dijkstra). Key each vertex in Q with the weight of the least-weight edge connecting it to a vertex in S.

    Q ← V
    key[v] ← ∞ for all v ∈ V
    key[s] ← 0 for some arbitrary s ∈ V
    while Q ≠ ∅
        do u ← EXTRACT-MIN(Q)
           for each v ∈ Adjacency-list[u]
               do if v ∈ Q and w(u, v) < key[v]
                      then key[v] ← w(u, v)    ⊳ DECREASE-KEY
                           π[v] ← u

At the end, {(v, π[v])} forms the MST.
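For concreteness, here is a minimal Python sketch of the same idea, using heapq as the priority queue. Since heapq has no DECREASE-KEY, the sketch pushes duplicate entries and skips stale ones on extraction ("lazy deletion"); the adjacency-dict graph format and the name prim_mst are assumptions made for illustration, not part of the slides.

    import heapq

    def prim_mst(graph, s):
        """Prim's algorithm on an undirected graph given as
        {u: {v: weight, ...}}. Returns the list of MST edges."""
        in_tree = set()
        pq = [(0, s, None)]            # (key, vertex, parent); parent plays pi[v]'s role
        mst_edges = []
        while pq:
            key, u, p = heapq.heappop(pq)
            if u in in_tree:           # stale entry: u's key was decreased later
                continue
            in_tree.add(u)
            if p is not None:
                mst_edges.append((p, u, key))
            for v, w in graph[u].items():
                if v not in in_tree:   # the implicit DECREASE-KEY
                    heapq.heappush(pq, (w, v, u))
        return mst_edges

    # Example: a 4-cycle with one chord
    g = {'a': {'b': 1, 'c': 4}, 'b': {'a': 1, 'c': 2, 'd': 6},
         'c': {'a': 4, 'b': 2, 'd': 3}, 'd': {'b': 6, 'c': 3}}
    print(prim_mst(g, 'a'))            # [('a', 'b', 1), ('b', 'c', 2), ('c', 'd', 3)]

Each push and pop costs O(log m), and at most O(m) entries are ever pushed, matching the binary-heap bound in the table two slides below.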
Analysis of Prim

    Q ← V                                   ⊳ Θ(n) total for initialization
    key[v] ← ∞ for all v ∈ V
    key[s] ← 0 for some arbitrary s ∈ V
    while Q ≠ ∅                             ⊳ n iterations
        do u ← EXTRACT-MIN(Q)
           for each v ∈ Adj[u]              ⊳ degree(u) iterations
               do if v ∈ Q and w(u, v) < key[v]
                      then key[v] ← w(u, v)
                           π[v] ← u

By the Handshaking Lemma, there are Θ(m) implicit DECREASE-KEYs.
Time: as in Dijkstra.
Analysis of Prim (continued)

The while loop runs n times and the inner loop runs degree(u) times per iteration, so by the Handshaking Lemma there are Θ(m) implicit DECREASE-KEYs.

    PQ Operation | calls in Prim | Array | Binary heap | d-way heap    | Fib heap †
    ExtractMin   | n             | n     | log n       | HW3           | log n
    DecreaseKey  | m             | 1     | log n       | HW3           | 1
    Total        |               | n²    | m log n     | m log_{m/n} n | m + n log n

† Individual ops are amortized bounds.
Greedy Algorithms for MST

• Kruskal's: Start with T = ∅. Consider edges in ascending order of weight. Insert edge e into T unless doing so would create a cycle.
• Reverse-Delete: Start with T = E. Consider edges in descending order of weight. Delete edge e from T unless doing so would disconnect T.
• Prim's: Start with some root node s. Grow a tree T from s outward. At each step, add to T the cheapest edge e with exactly one endpoint in T.
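As a hedged illustration of Kruskal's rule, here is a short Python sketch; the (weight, u, v) edge-list format is an assumption, and the union-find inside is deliberately bare-bones (path compression only; the full data structure is on the next slide).

    def kruskal_mst(n, edges):
        """Kruskal's algorithm. Vertices are 0..n-1; edges is a list of
        (weight, u, v) triples. Returns the list of MST edges."""
        parent = list(range(n))        # minimal union-find over the "islands"

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression (halving)
                x = parent[x]
            return x

        mst = []
        for w, u, v in sorted(edges):  # ascending order of weight
            ru, rv = find(u), find(v)
            if ru != rv:               # distinct islands, so e creates no cycle
                parent[ru] = rv        # merge the two islands
                mst.append((u, v, w))
        return mst

    # Same example graph as before, with vertices a=0, b=1, c=2, d=3
    edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (6, 1, 3), (3, 2, 3)]
    print(kruskal_mst(4, edges))       # [(0, 1, 1), (1, 2, 2), (2, 3, 3)]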
Union-Find Data Structures

• With modifications (union by rank plus path compression), the amortized cost of a sequence of n tree operations is O(n · α(n)), where α(n), the inverse Ackermann function, grows much more slowly than log n.
• See KT Chapter 4.6.
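A minimal sketch of such a structure, with the standard union-by-rank and path-compression modifications (the class name and method signatures are illustrative assumptions):

    class UnionFind:
        """Union-find with union by rank and path compression; a sequence
        of n operations costs O(n * alpha(n)) amortized."""

        def __init__(self, n):
            self.parent = list(range(n))
            self.rank = [0] * n

        def find(self, x):
            root = x
            while self.parent[root] != root:
                root = self.parent[root]
            while self.parent[x] != root:                  # path compression:
                self.parent[x], x = root, self.parent[x]   # repoint path to root
            return root

        def union(self, x, y):
            rx, ry = self.find(x), self.find(y)
            if rx == ry:
                return False                       # already the same island
            if self.rank[rx] < self.rank[ry]:      # union by rank: hang the
                rx, ry = ry, rx                    # shallower tree underneath
            self.parent[ry] = rx
            if self.rank[rx] == self.rank[ry]:
                self.rank[rx] += 1
            return True

    uf = UnionFind(4)
    uf.union(0, 1); uf.union(2, 3)
    print(uf.find(0) == uf.find(1), uf.find(0) == uf.find(2))   # True False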
Huffman codes
Prefix-free codes

• A binary code maps characters in an alphabet (say {A, …, Z}) to binary strings.
• Prefix-free code: no codeword is a prefix of any other.
• ASCII: prefix-free (all symbols have the same length).
• Not prefix-free:
  • a → 0
  • b → 1
  • c → 00
  • d → 01
  • …
• Why is prefix-free good?
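One answer, as a small hypothetical sketch: with a prefix-free code, the receiver can decode a bit stream left to right with no lookahead, emitting a symbol at the first codeword match. With the non-prefix-free code above, "00" could mean either "aa" or "c".

    def decode(bits, codebook):
        """Decode a bit string using a prefix-free codebook {codeword: symbol}.
        Since no codeword is a prefix of another, the first match is the
        only possible one, so greedy left-to-right decoding is unambiguous."""
        out, cur = [], ""
        for b in bits:
            cur += b
            if cur in codebook:        # first (and only possible) match
                out.append(codebook[cur])
                cur = ""
        assert cur == "", "bit string ended in the middle of a codeword"
        return "".join(out)

    # A prefix-free code on four symbols
    code = {"0": "a", "10": "b", "110": "c", "111": "d"}
    print(decode("0101100111", code))  # abcad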
A prefix-free code for a few letters

• e.g. e → 00, p → 10011

[Figure: code tree. Source: Wikipedia]
A prefix-free code

• e.g. T → 1001, U → 1000011

[Figure: code tree. Source: Jeff Erickson's notes]
How good is a prefix-free code?

• Given a text, let f[i] = the number of occurrences of letter i.
• Total number of bits needed: Σ_i f[i] · depth(i), where depth(i) is the length of letter i's codeword (its depth in the code tree).
• How do we pick the best prefix-free code?
Huffman's Algorithm (1952)

Given individual letter frequencies f[1, …, n]:
• Find the two least frequent letters i, j.
• Merge them into a symbol with frequency f[i] + f[j].
• Repeat.

e.g.
• a: 6
• b: 6
• c: 4
• d: 3
• e: 2

Theorem: Huffman's algorithm finds an optimal prefix-free code.
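Below is a minimal Python sketch of this merge loop, using heapq for the repeated extract-min; the tie-breaking counter and the nested-tuple tree representation are implementation choices, not part of the algorithm as stated.

    import heapq
    from itertools import count

    def huffman(freqs):
        """Build a Huffman code from {symbol: frequency}: repeatedly merge
        the two least frequent symbols, then read the codewords off the
        resulting binary tree. Returns {symbol: codeword}."""
        tie = count()                  # tie-breaker so heap entries never compare trees
        heap = [(f, next(tie), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # two least frequent symbols
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
        [(_, _, tree)] = heap

        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: two children
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                                # leaf: an original symbol
                codes[node] = prefix or "0"      # handle a one-symbol alphabet
        walk(tree, "")
        return codes

    print(huffman({'a': 6, 'b': 6, 'c': 4, 'd': 3, 'e': 2}))

On the example frequencies, the merges are e+d (5), then c with that (9), then a+b (12): a, b, and c get 2-bit codewords while d and e get 3-bit ones, for a total cost of 47 bits.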
Warming up

• Lemma 0: Every optimal prefix-free code corresponds to a full binary tree.
  (Full = every node has 0 or 2 children.)
• Lemma 1: Let x and y be two least frequent characters. There is an optimal code in which x and y are siblings.
Huffman codes are optimal

Proof by induction!
• Base case: two symbols; there is only one full tree.
• Induction step:
  • Suppose f[1], f[2] are the smallest frequencies in f[1, …, n].
  • Let T be an optimal code for {1, …, n}.
  • Lemma 1 ⟹ we can choose T so that 1 and 2 are siblings.
  • Introduce a new symbol numbered n+1, with f[n+1] = f[1] + f[2].
  • Let T′ be the code for {3, …, n+1} obtained by merging 1, 2 into n+1.
• Cost of T in terms of T′: cost(T) = cost(T′) + f[1] + f[2], since symbols 1 and 2 each sit one level below the leaf for n+1.
• Let H be the Huffman code for {1, …, n}.
• Let H′ be the Huffman code for {3, …, n+1}; H is obtained from H′ by splitting n+1 into 1 and 2.
• Induction hypothesis: cost(H′) ≤ cost(T′).
• cost(H) = cost(H′) + f[1] + f[2] ≤ cost(T′) + f[1] + f[2] = cost(T). QED
Notes

• See Jeff Erickson's lecture notes on greedy algorithms:
  http://theory.cs.uiuc.edu/~jeffe/teaching/algorithms/
Data compression for real?

• Generally, we don't use letter-by-letter encoding.
• Instead, find frequently repeated substrings.
• The Lempel-Ziv algorithm is extremely common
  • and also has deep connections to entropy.
• If we have time for string algorithms, we'll cover this…
Huffman codes and entropy

• Given a set of frequencies, consider the probabilities p[i] = f[i] / (f[1] + … + f[n]).
• Entropy(p) = Σ_i p[i] · log(1/p[i])
• A Huffman code has expected depth
      Entropy(p) ≤ Σ_i p[i] · depth(i) ≤ Entropy(p) + 1.
• To prove the upper bound, find some prefix-free code where
      depth(i) ≤ log(1/p[i]) + 1 for every symbol i.
  • Exercise!
• The bound then applies to Huffman codes too, by optimality.
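As a numerical sanity check of this sandwich bound, one can compare Entropy(p) with the expected depth of the code produced by the huffman() sketch from the earlier slide (this snippet assumes that function is in scope):

    from math import log2

    def entropy_vs_expected_depth(freqs):
        """Return (Entropy(p), sum_i p[i] * depth(i)) for the Huffman code."""
        total = sum(freqs.values())
        p = {s: f / total for s, f in freqs.items()}
        codes = huffman(freqs)                   # sketch from the Huffman slide
        H = sum(pi * log2(1 / pi) for pi in p.values())
        D = sum(p[s] * len(codes[s]) for s in p)
        return H, D

    H, D = entropy_vs_expected_depth({'a': 6, 'b': 6, 'c': 4, 'd': 3, 'e': 2})
    print(H <= D <= H + 1)                       # True: the bound above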
Prefix-free encoding of arbitrary-length strings

• What if I want to send you a text, but you don't know ahead of time how long it is?
• Scheme 1: put the length at the beginning: n + log(n) bits.
  • requires me to know the length in advance
• Scheme 2: after every B bits, put a special bit indicating whether or not we're done: n(1 + 1/B) + B − 1 bits.
• Can we do better?
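Scheme 2 is essentially the framing used by varint-style encodings. Here is a hypothetical sketch over explicit bit strings (the names frame/unframe and the zero-padding of the last block are assumptions, not from the slides):

    def frame(bits, B=7):
        """Scheme 2: after each block of B payload bits, emit one flag bit
        (1 = more blocks follow, 0 = last block). The final block is
        zero-padded to B bits, giving the n(1 + 1/B) + B - 1 bound."""
        blocks = [bits[i:i + B] for i in range(0, len(bits), B)] or [""]
        out = []
        for k, block in enumerate(blocks):
            out.append(block.ljust(B, "0"))      # pad the last block
            out.append("1" if k < len(blocks) - 1 else "0")
        return "".join(out)

    def unframe(framed, B=7):
        """Inverse of frame(). The receiver recovers the payload only up
        to the zero padding, which is harmless when the payload itself is
        self-delimiting (e.g., a sequence of prefix-free codewords)."""
        out, i, more = [], 0, True
        while more:
            out.append(framed[i:i + B])
            more = framed[i + B] == "1"
            i += B + 1
        return "".join(out)

    msg = "110100111000101"
    print(unframe(frame(msg)).startswith(msg))   # True (up to padding)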